1. Introduction
Optimisation problems are ubiquitous in artificial intelligence, operations research, and a wide range of application areas. In their simplest form, they require us to make a set of decisions in order to optimise an objective function, possibly while satisfying some constraints. A decision is usually represented by the assignment of a domain value to a decision variable, where the domain might be a set of real numbers, integers, categories or data structures. Different domains, constraints and objectives are supported by different methods. Mixed integer linear programming (MILP) and more generally mathematical programming, constraint programming (CP), dynamic programming and metaheuristics (such as local search and genetic algorithms) can be applied to such problems.
However, many optimisation problems are more complex than this. In particular, they might involve uncertainty, which can be modelled by chance variables that are not controlled by a decision maker. Instead, they take values according to some probability distribution. Stochastic programming is an extension of MILP that models such problems. In multi-stage stochastic programs, decision and chance variables might be interleaved so that we must make decisions without knowing the values of some chance variables. Having made decisions, we can then observe some chance variables, but must then make next-stage decisions, and so on. A solution to a multi-stage problem is not a simple set of decisions, but a policy taking the form of a tree to allow reactions to random events. Probability distributions are usually assumed to be unaffected by decisions (exogenous uncertainty), but some applications have decision-dependent distributions (endogenous uncertainty). Stochastic dynamic programming is often used in the latter case.
Another possible complication is that a decision maker, or agent, might have multiple objectives: for example, to maximise profit while minimising environmental damage. Some form of compromise must be reached for such problems. Yet another complication is that a problem might involve multiple agents. In multi-level programming, an agent must make a decision in the knowledge of how another agent will react, who might in turn have to take other agents into account. Knowledge might be partial: an agent might not know exactly how an adversary will act or has acted, and must make a decision based on assumptions.
Influence diagrams (IDs) are a particularly expressive formalism that can model multi-stage decision problems with endogenous uncertainty. Chance variables may be independent or related by conditional probability tables, forming a Bayesian network. Observations of chance variables are also allowed, and Bayesian inference can be used to infer the distributions of unobserved chance variables from these observations. Several exact and approximate methods are available for solving IDs.
A machine learning approach to complex optimisation problems is reinforcement learning (RL) [1], which can be used to solve multi-stage decision problems in stochastic environments. Multi-agent RL has famously been used to learn to play games such as Go and Chess to superhuman levels via self-play [2]. Multi-objective RL algorithms have also been devised, and these extensions have been combined in the field of multi-objective multi-agent RL. Similar developments have occurred in the evolutionary computation literature.
Thus, we have a wide range of available technologies, each able to solve certain types of problem efficiently. However, faced with a new optimisation problem with hybrid features, there might be no available solver. Some hybrids have been investigated, for example, multi-objective reinforcement learning with constraints [3], and bi-level multi-objective stochastic integer linear programming [4]. IDs have been extended to handle nonlinear utilities [5], multiple agents [6,7,8,9,10], multiple objectives [11,12], hard constraints [13], and partially observed variables [14], though not all in one system. But no method exists for complex hybrids in general. To take a hypothetical example, there is no obvious way of modelling and solving a discrete bi-level optimisation problem whose leader is a non-linear chance-constrained stochastic program, and whose followers are a bi-objective constraint satisfaction problem and a weighted Max-SAT problem. A researcher faced with such a problem must invent a new approach or make simplifying assumptions to fit it into an existing framework. A general-purpose method able to handle all these complications would be very useful, at least for a preliminary investigation, though a more efficient specialised method might eventually be needed.
Another drawback of the fragmentation of technologies is that it is difficult to compare different approaches. Suppose we are faced with a new application with two decision makers but we are unsure how best to model it. Should we treat it as a bi-level program (see Section 3.8) or a Bayesian game (see Section 3.9), use level-k thinking (see Section 3.10), model one decision maker as a random variable and implement a stochastic program (see Section 3.6), or take an expected value of one agent and reduce the problem to a constraint program (see Section 3.1)? Some simplifications might make little difference to the objective while others might make feasibility impossible. Each approach requires a different technology that takes considerable time to master, and possibly financial cost. A method that is general enough to solve all these models would be of great benefit for rapid prototyping, to explore the consequences of model design choices.
In this paper, we describe a very general class of discrete optimisation problem we call an Influence Program (IP) that generalises a wide range of problems from operations research, artificial intelligence and game theory (the name is inspired by the flexibility of IDs). We also present a simple solver for IPs based on a combination of multi-agent multi-objective reinforcement learning and Markov chain Monte Carlo sampling. RL has been proposed as a unifying approach to sequential decision-making under uncertainty [15] and has solved complex adversarial games, making it a natural candidate for a general-purpose solver.
This paper extends our previous work. An early version using logic programming was described in [16], and a more recent C-based version appears in [17], where our problem class was called a “mixed influence diagram”. This paper extends [17] by formalising the problem, providing more related work, replacing simple rejection sampling by Gibbs sampling, and solving a wider range of problems.
Section 2 formalises the problem and describes the algorithm. Section 3 applies it to a variety of problems from different optimisation literatures to demonstrate its flexibility and ease of use. Finally, Section 4 summarises the results and discusses future work.
2. The Influence Programming Framework
We now introduce our IP framework and an algorithm, after discussing related work.
2.1. Related Work
Considerable work has been performed on multi-objective RL (MORL). The MORL literature is too large to survey here, but a recent survey of MORL algorithms, with a discussion of these issues, is provided in [18]. Some methods use scalarisation to convert a MORL problem into a single-objective RL problem, making them single-policy methods. Scalarisation has been criticised because it yields only a single policy, whereas some MORL algorithms approximate Pareto fronts and learn policies for a range of weightings. It has also been criticised on the grounds that it can be hard to choose appropriate weights, and that the choice puts decision power in the hands of the engineers running the algorithm. Most such methods use linear scalarisation. Despite the shortcomings of scalarisation, we shall show that it can give good results on a range of problems. It is also easy to implement and has little runtime overhead, which is important for our lightweight approach. Results using weighted metrics are also less sensitive to the choice of weights [19].
Much work has also been performed on multi-agent RL (MARL), and a recent MARL survey can be found in [20]. Less work has been performed on the intersection of MARL and MORL: multi-objective multi-agent RL (MOMARL). A survey of MOMARL and related algorithms is provided in [21]. Because of the complexity of MOMARL problems, there is not even a single agreed definition of what constitutes a solution. MO-MIX [22] is a MOMARL algorithm using linear scalarisation and deep RL, in which an artificial neural network is used for state aggregation.
2.2. The Problem Class
We define a (discrete) influence program (IP) as a tuple $(\mathcal{V}, D, A, U, L, O, P)$ where:
$\mathcal{V}$ is an ordered list of variables $v_1, \ldots, v_n$;
$D$ is a list of their corresponding finite domains $D_1, \ldots, D_n$ of possible values, typically ranges of integers or sets of symbolic names;
$A$ is a list of their corresponding agents (decision makers) $a_1, \ldots, a_n$, which have symbolic names (in bold);
$U$ is a function mapping a total variable assignment to a utility vector for each agent;
$L$ is a set of directed links $(v_i, v_j)$ between variables;
$O$ is a set of observations, where an observation is an assignment of a value to a chance variable;
$P$ is a function assigning a probability to each chance variable assignment, given an assignment for each variable linked to it in $L$: it is typically represented by a (conditional) probability table.
Each variable is associated with exactly one agent, and we allow the possibility of chance variables whose agent has the name chance. The links define which variables are visible to later decisions and chance variable distributions. A utility is a function mapping a total variable assignment (one value per variable) to a real value: we allow utilities to be programmable, requiring the user to write a small function to compute utilities from a total variable assignment. Each agent has at least one utility, apart from chance. The aim of an IP is to find a policy for each agent that Pareto-optimises its utilities in the context of observations and inter-variable visibility.
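To make the definition concrete, the following is a minimal C sketch of how an IP instance might be represented in memory; the struct layout and field names are illustrative assumptions of ours, not InfProg’s actual input format (which is a small text file plus a user-written utility function, as described in Section 3.12).

#include <stddef.h>

#define CHANCE 0   /* agent index reserved for chance variables (assumption, matching Section 3.12) */

/* Hypothetical in-memory representation of an IP (V, D, A, U, L, O, P). */
typedef struct {
    int n;                 /* number of variables, indexed 0..n-1 in decision order          */
    int *dom_size;         /* D: dom_size[i] = size of variable i's finite domain            */
    int *agent;            /* A: agent[i] = owning agent, or CHANCE                          */
    int num_agents;        /* decision-making agents are numbered 1..num_agents              */
    int *num_utils;        /* number of utilities per agent (at least 1, except for chance)  */
    unsigned char **link;  /* L: link[i][j] = 1 if variable i is visible to variable j       */
    int *observed;         /* O: observed[i] = observed value of chance variable i, or -1    */
    double **cpt;          /* P: flattened (conditional) probability table per chance variable */
} IP;

/* U: the user-supplied programmable utility, called on a total assignment v at the
   end of each episode; u[a][k] receives agent a's k-th utility. */
void utility(int *v, double **u);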
2.3. An Algorithm
We now describe the InfProg algorithm shown in Algorithm 1. It is a lightweight research prototype built from known techniques, intended only to demonstrate the flexibility of our approach; many other combinations of methods are possible.
InfProg is based on a simple RL algorithm: infinite-step tabular SARSA with $\epsilon$-greedy action selection and learning rate $\alpha$ [1]. Infinite-step indicates that the reward is backed up equally to all values in the episode, which is more robust than Q-learning in the presence of unobserved variables [1].
Tabular indicates that state–action pairs have values in a table. The discount factor $\gamma$ is set to 1 as our RL problem is episodic. If an agent has multiple utilities, these are scalarised so that there is one utility per agent (see Section 2.6). The scalarised objectives are used as rewards, computed at the end of each episode when all variables have been assigned values, and backed up to earlier states for the corresponding agents ($a(v)$ denotes the agent $a$ corresponding to variable $v$): the user must provide code for this step. However, the user is probably not interested in the value of the scalarised objective, so smoothed versions of the original objectives are printed out: these are the values we report.
Algorithm 1: The InfProg algorithm for solving IPs

Require: integers $E$, $H$ and a utility hyperparameter
  initialise the value array $V$, $\epsilon \leftarrow 1$, $\alpha \leftarrow 1$
  for episode $e = 1, \ldots, E$ do
    for each variable $v$ in order do
      if $v$ is chance then sample $v$ from its distribution
      else if $\mathrm{rand}(0,1) < \epsilon$ then randomly sample $v$
      else assign $v$ greedily using $V$
      end if
    end for
    for each agent $a$ do compute its scalarised objective $r_a$ using the utility hyperparameter
    end for
    for each decision variable $v$ do back up $r_{a(v)}$ to the states visited when assigning $v$
    end for
    update $\epsilon$ and $\alpha$
  end for
The utility is computed at the end of an episode: in RL terms, this is a sparse reward, which can make learning harder. In future work, we could allow multiple value nodes as in IDs, or compensate for sparse rewards by adding RL techniques such as hindsight experience replay [23]. That technique was designed for off-policy RL algorithms but SARSA is on-policy, so some algorithm modification would be needed: for example, replacing SARSA by Q-learning or a more recent deep RL algorithm. There are many possibilities and this area is ripe for exploration.
The $\epsilon$ and $\alpha$ parameters start at 1 and decay to 0. Many decay schemes have been proposed in RL, and we arbitrarily choose one. In our use of RL, a state is an assignment to the variables assigned so far, an action is the assignment of a value to a decision variable, an episode assigns all the variables, and the reward is the utility (objective function value) computed at the end of an episode. The table entries are state–action values used to define the policy, which should optimise the expected reward.
Note that for some problems it would likely be better to use SARSA with an eligibility trace and $\lambda < 1$ ($\lambda$ is a hyperparameter used with eligibility traces [1]), and we shall investigate this in future work. However, for our research prototype, we effectively set $\lambda = 1$ by choosing infinite-step SARSA, thus simplifying the algorithm by removing a hyperparameter that requires tuning.
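As a rough illustration of the decision step (not InfProg’s actual source code), the sketch below shows how a value for one decision variable might be chosen $\epsilon$-greedily from a hashed state–action value table; hash_state and the array V are hypothetical names standing in for the Zobrist-hashed lookup described in Section 2.4.

#include <stdlib.h>

extern double V[];                                  /* hashed state-action values (see Section 2.4) */
extern unsigned long hash_state(int var, int val);  /* hypothetical: hash of the visible state extended with var=val */

/* epsilon-greedy selection of a value for decision variable `var` with domain size `dom`:
   explore with probability eps, otherwise take the value with the largest stored estimate. */
int select_value(int var, int dom, double eps)
{
    if ((double)rand() / RAND_MAX < eps)
        return rand() % dom;                        /* explore: uniform random value */
    int best = 0;
    double best_v = V[hash_state(var, 0)];
    for (int val = 1; val < dom; val++) {           /* exploit: argmax over the domain */
        double v = V[hash_state(var, val)];
        if (v > best_v) { best_v = v; best = val; }
    }
    return best;
}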
2.4. State Aggregation
Each decision might depend on all previous decisions and random events (but see Section 3.5), so a policy might involve a huge number of distinct states. To combat this problem, RL algorithms group together states via state aggregation methods. We choose a simple form called random tile coding [1], specifically Zobrist hashing [24] with H hash table entries for some large integer H. This works as follows. To each (decision or chance) variable–value pair $(v, x)$, we assign a random integer $z_{v,x}$ which remains fixed. At any point during an episode, we have some set $S$ of assignments $v = x$, and we take the exclusive-or $h(S)$ of the $z_{v,x}$ values (that is, their bit patterns) associated with the assignments in $S$. Finally, we use $h(S)$ to index an array $V$ with $H$ entries: the value of $S$ is stored in $V[h(S) \bmod H]$ (in all our experiments, we fix $H$ to the same large value).
The InfProg algorithm takes two numerical hyperparameters: an integer H used for state aggregation and an integer E, which is the number of iterations (episodes) used by the solver. If H is sufficiently large then hash collisions are unlikely, and we will have a unique array element for each state encountered. It might be expected that Zobrist hashing will perform poorly when the number of states approaches or exceeds the size of the hash table, because hash collisions will confuse the values of different state–action pairs. Surprisingly, it can perform well even when hash collisions are frequent, and it has been used in chess programming [25].
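The following compilable sketch illustrates the scheme just described; the array sizes and helper names are our own illustrative choices rather than InfProg’s.

#include <stdlib.h>

#define MAX_VARS 64
#define MAX_VALS 128
#define H (1u << 20)            /* number of hash table entries; the actual value of H is a solver setting */

static unsigned int zobrist[MAX_VARS][MAX_VALS];   /* one fixed random integer per variable-value pair */
static double V[H];                                /* aggregated state(-action) values */

void init_zobrist(void)
{
    for (int v = 0; v < MAX_VARS; v++)
        for (int x = 0; x < MAX_VALS; x++)
            zobrist[v][x] = (unsigned int)rand();
}

/* Hash the set S of currently visible assignments: XOR the random codes of the assignments,
   then reduce modulo H to obtain an index into V.  value[v] == -1 marks an unassigned
   variable, and visible[v] says whether v is linked to the current decision. */
unsigned int hash_assignments(const int *value, const unsigned char *visible, int n)
{
    unsigned int h = 0;
    for (int v = 0; v < n; v++)
        if (visible[v] && value[v] >= 0)
            h ^= zobrist[v][value[v]];
    return h % H;
}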
2.5. Sampling
InfProg samples chance variables using a Markov chain Monte Carlo (MCMC) algorithm, as in probabilistic programming. Thus, each SARSA episode is also a sweep through the chance variables. Our earlier work [16,17] used rejection sampling: at the end of a SARSA episode, if the chance variables did not match the observations, then the episode was rejected, in the sense that rewards were not backed up to earlier states. This worked on some problems but is impractical when probabilities are very low, and the use of an MCMC algorithm was proposed as future work. InfProg uses Gibbs sampling and can handle such cases.
In the experiments, we found that some conditional probability table values that were set to 0 or 1 required adjustment: deterministic entries are a known failure mode of Gibbs sampling because they can make the Markov chain non-ergodic. In fact, for optimisation problems without observations (which is true of most of our problems), we replace Gibbs sampling with simple inverse transform sampling.
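For the no-observation case, a minimal sketch of inverse transform sampling for one discrete chance variable might look as follows, assuming the relevant row of the (conditional) probability table has already been selected using the parents’ assigned values.

#include <stdlib.h>

/* Inverse transform sampling: draw u ~ U(0,1) and return the first domain value whose
   cumulative probability exceeds u.  `prob` holds the (conditional) probabilities of the
   `dom` domain values. */
int sample_discrete(const double *prob, int dom)
{
    double u = (double)rand() / ((double)RAND_MAX + 1.0);
    double cum = 0.0;
    for (int val = 0; val < dom; val++) {
        cum += prob[val];
        if (u < cum)
            return val;
    }
    return dom - 1;   /* guard against floating-point rounding in the table */
}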
2.6. Multiple Objectives
Many real-world applications have multiple objectives and we must find a trade-off. Objectives can be combined in more than one way in RL, and a simple and popular approach is linear scalarisation: take a linear combination of the objectives, giving a single objective that can be used as an RL reward. This has the drawback that it cannot generate any solutions in a non-convex region of the Pareto front. It can also be hard to choose appropriate weights, especially when the objectives use different units.
Another approach is to rank the objectives by importance, then search for the lexicographically best result. This has the same drawback as linear scalarisation (though recent work addresses this [26]), which can be mitigated by thresholding all but the least important objective, a method called thresholded lexicographic ordering [27]. However, that method has the drawback of requiring a specific RL algorithm. Moreover, in some of our applications, the most important objective is to minimise constraint violations, and this should not be thresholded.
The method we choose is to reduce the multiple objectives to a single objective via weighted metric scalarisation [19], in which we minimise the distance between the vector of values $u_o$ for objectives $o$ and a utopian point $z^*$ in the multi-objective space:
$$\min \Big( \sum_o w_o \, |u_o - z^*_o|^p \Big)^{1/p}$$
for some $p \ge 1$ and weights $w_o$ chosen by the user. The need to choose weights is a disadvantage in terms of user-friendliness, but an advantage is that we can tune them to find different points on the Pareto front. The special case of Chebyshev scalarisation ($p = \infty$) is theoretically guaranteed to make the entire Pareto front reachable, but the $\ell_p$-norms with finite $p$ may perform better in practice [28]. In our experiments, we found better results with small finite values of $p$ than with Chebyshev, despite Chebyshev’s theoretical advantages, so we use a finite $p$-norm. However, in future work, it might be better to use Chebyshev and find ways of improving its results.
The utopian point is often adjusted during search to be just beyond the best point found so far, but InfProg uses a fixed utopian point provided by the user. Depending on whether each objective is to be maximised or minimised, we choose a value that is optimal, or high/low enough to be unattainable.
Thus, to apply InfProg to an optimisation problem, we must also provide a utility hyperparameter to guide scalarisation: a list of sublists of pairs. Each pair is a utopian value expressing a desired utility plus a weight, and each sublist corresponds to an agent and contains a pair for each of the agent’s utilities. We shall always choose weights that sum to 1 for each agent, though this is not strictly necessary. The use of this hyperparameter will be illustrated below via examples.
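Under these assumptions, the scalarisation step can be sketched as below; negating the distance so that it can be maximised directly as an RL reward is our own presentational choice rather than a detail taken from InfProg.

#include <math.h>

/* Weighted metric scalarisation: reduce an agent's utility vector u[0..n-1] to a single
   reward as the negated weighted L_p distance to the user-supplied utopian point z[0..n-1]
   with weights w[0..n-1].  Small finite p gives an l_p metric; large p approaches Chebyshev. */
double scalarise(const double *u, const double *z, const double *w, int n, double p)
{
    double dist = 0.0;
    for (int o = 0; o < n; o++)
        dist += w[o] * pow(fabs(u[o] - z[o]), p);
    return -pow(dist, 1.0 / p);   /* smaller distance to the utopian point = larger reward */
}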
2.7. An Intuitive Explanation
To aid the reader, we now provide an intuitive explanation of how the algorithm works. The outer loop of the algorithm performs E episodes (a term taken from RL), where E is an integer hyperparameter chosen by the user. Each episode sweeps through the variables in the order specified by the list $\mathcal{V}$, assigning each a value.
A decision variable is assigned a value as in the SARSA algorithm: during early episodes, assignment is largely random, but in later episodes, values are assigned more greedily in order to maximise the estimated expected reward (another RL term, corresponding to utility in influence diagrams and to objective function value in general optimisation). The degree of greediness is controlled by the $\epsilon$ parameter, which decays from 1 (completely random) in the first episode to 0 (completely greedy) in the last episode. In RL algorithms this strategy theoretically leads to an optimal policy.
A chance variable is assigned a value via sampling, thus an episode interleaves SARSA with sampling. In RL terms, the chance variable assignment is simply part of the environment in which the agent operates. In the special case where all variables are chance variables, the IP is a Bayesian network and an episode reduces to an MCMC sweep through the variables.
At the end of an episode, a reward is computed for each agent, and backed up to all earlier states as in SARSA, with the slight complication that in our multi-agent version each reward must be backed up to the appropriate agent.
The choice of infinite-step tabular SARSA was largely arbitrary, and has both advantages and disadvantages. An on-policy algorithm like SARSA is considered to be more consistent and stable than an off-policy algorithm like Q-learning, but also slower and less efficient. The choice of a tabular algorithm was made for its simplicity of implementation. The infinite-step choice corresponds to setting $\lambda = 1$ in an RL eligibility trace, making it a Monte Carlo method rather than a temporal difference method. The former are considered more robust in the presence of unobserved variables, as they do not rely on the Markov property.
The restriction to rewards that occur only at the end of an episode was designed to simplify the implementation and user interface. In some cases, it would be preferable to allow intermediate rewards, if these occur naturally in the problem. However, this would require the user to write several pieces of code and to specify when each is to be executed. For our research prototype we took the simpler path.
Regarding the necessity for the user to provide programmable utility functions: although this might seem less user-friendly than some form of specification language, it is extremely flexible. We note that a similar approach is taken in the field of probabilistic programming [29,30].
3. Applications
In this section, we take small problems from a range of studies in the literature, model them as IPs, and solve them using InfProg. We shall not compare our method with others in terms of efficiency, as this is not the goal of the paper. For the same reason, we report few runtimes, though they are quite short—typically a few seconds and at most a few minutes. In fact, we do not expect it to be competitive on any particular class of problem, though as an RL-based method, it should perform reasonably well on applications from the RL literature. Our aim here is only to demonstrate that a single solver can solve a wide variety of optimisation problems, thus filling a gap in optimisation technology. We intend to implement faster, more scalable, and more user-friendly IP solvers in future work, and we hope that other researchers will also find better algorithms.
3.1. Constraint Programming
First we consider a “simple” optimisation problem: one agent and one objective. We take a well-known constraint satisfaction problem (CSP) known as eight queens: place eight chess queens on a standard 8 × 8 chessboard in such a way that no two queens attack each other (by being on the same row, column, or diagonal). This is a popular problem in constraint programming [31], and the smaller four queens problem is illustrated in Figure 1. The first example is a solution because no two queens are in the same row, column, or diagonal. The second example is a non-solution because it violates three constraints: the queens in the first two rows are in the same column, those in the last two rows lie on the same diagonal, and those in the first and third rows lie on another diagonal.
We treat eight queens as a Max-CSP problem in which the objective is to maximise the number of satisfied constraints (in this case, minimising the number of constraint violations to 0). We shall consider two possible IP models.
3.1.1. A Sparse Model
The problem can be modelled as follows: the variables are $v_1, \ldots, v_8$, where $v_i$ is the column of the queen in row $i$; each domain contains the eight column positions; all variables belong to the same agent $a$; and the utility is $\mathrm{viol}$, which counts constraint violations:
$$\mathrm{viol}(v_1, \ldots, v_8) = \sum_{1 \le i < j \le 8} \big( [v_i = v_j] + [\,|v_i - v_j| = j - i\,] \big),$$
where $[\cdot]$ is the Iverson bracket that takes the value 1 if its argument is true and 0 if it is false. $P$ is the empty function (there are no probability tables because there are no chance variables), and $L = O = \emptyset$. The utility hyperparameter specifies one agent with one utility whose utopian point is 0 and associated weight is 1.
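Following the programmable-utility interface described in Section 3.12, a sketch of what the utility function for this model might look like is given below; the agent index and variable layout are illustrative assumptions.

#include <stdlib.h>

/* Hypothetical programmable utility for eight queens: v[i] is the column of the queen in
   row i, and the single agent's single utility u[1][0] is the number of violated pairwise
   constraints (same column or same diagonal; rows differ by construction).  With a utopian
   point of 0, InfProg drives this count towards zero. */
void utility(int *v, double **u)
{
    int violations = 0;
    for (int i = 0; i < 8; i++)
        for (int j = i + 1; j < 8; j++)
            if (v[i] == v[j] || abs(v[i] - v[j]) == j - i)
                violations++;
    u[1][0] = (double)violations;
}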
Though we shall not report runtimes in general, it might be of interest to the reader to see one example. Our solver is implemented in C and executed on an Intel(R) Core(TM) i5-6500 CPU @ 3.20 GHz with 8 GB of system RAM (Intel, Santa Clara, CA, USA). With the episode budget we used, the solver takes 10.3 s per run, and in 100 runs it found a correct solution 70 times. The other 30 times, the solution contained one constraint violation, illustrating that InfProg can become trapped in local optima. It is also quite successful with a smaller episode budget, finding 61 correct solutions.
This result is not competitive with methods such as constraint programming, integer programming or local search, which can all solve the problem more quickly and more reliably. But, as stated earlier, our aim is flexibility across a wide range of problems, not efficiency on any particular problem class.
3.1.2. A Dense Model
We referred to the previous model as sparse because it has no links. As an alternative dense model, we can add links to the IP from each decision variable to each earlier one. These are not logically necessary but they provide more information to the decision variables. With the dense IP, we found 100% success on eight queens using relatively few episodes, and even with only 30,000 episodes, it achieved 69% success. This suggests that making decisions visible to all subsequent decisions will improve performance. However, adding these links creates a number of states that is exponential in the number of variables, causing many hash collisions in InfProg, which in turn might degrade performance.
This phenomenon is well known in RL. As an example, consider the elevator dispatching problem in which we must control a set of elevators in order to minimise the expected waiting times. Even if we restrict the problem to discrete time, it has a vast number of states, because of the many possible combinations of elevator positions, the direction of movement, and which buttons have been pressed. In principle, it would be best to treat every state separately, and the superior performance of our dense model supports this view, but combinatorial explosion makes it impractical for large problems.
One way of avoiding the explosion is to treat each elevator independently in a distributed RL approach; see, for example, [32]. There is a decision-making agent for each elevator and they do not communicate, but they have a common objective and they indirectly learn to cooperate. A recent survey of distributed RL with applications is given in [33]. Our sparse IP works in the same way, though we only used one agent: we could instead use a separate agent for each variable, but it makes no difference because they have the same objective.
Controlling the links between variables makes it easy for InfProg to emulate RL methods that are fully-, partially-, and non-distributed. For subsequent problems, we shall use a mixture of sparse and dense models.
3.2. Bayesian Networks
From a problem with all decision variables, we move to a problem with all chance variables: Pearl’s alarm example [34]. This is an example of probabilistic inference, as performed in the field of probabilistic programming: inferring a conditional probability from a probabilistic program.
In this small problem, a house has an alarm that can be set off by an earthquake or a burglary, each with a prior probability. We also have conditional probabilities for the alarm under different circumstances: when there is/is not a burglary and there is/is not an earthquake. Given that the alarm has been activated, what is the probability that a burglary has occurred? The answer affects our actions.
We can model this as an IP. The variables are (burglary, earthquake, alarm); the domains are all {yes, no}; the agents are all chance; the links are (burglary, alarm) and (earthquake, alarm), as in the Bayesian network; the probabilities $P$ comprise the priors for burglary and earthquake and the conditional probability table for the alarm; there is one observation, alarm = yes; and the utility is 1 if a burglary has occurred and 0 otherwise.
Hence, the task is to compute the conditional probability Pr(burglary | alarm). We can treat the burglary indicator as a utility, though the only agent is chance, so we are not trying to optimise it by taking decisions. There are no decision-making agents, so the utility hyperparameter is empty. InfProg applies Gibbs sampling and finds an expected “utility” of approximately 0.23, which is the correct probability.
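To make the inference step concrete, here is a self-contained Gibbs-sampling sketch for this kind of query. The prior and conditional probability values below are illustrative placeholders rather than the values used in the paper, so the printed estimate will not be 0.23.

#include <stdio.h>
#include <stdlib.h>

/* Illustrative placeholder probabilities, NOT the values used in the paper. */
static double pB = 0.01, pE = 0.02;         /* priors of burglary and earthquake       */
static double pA[2][2] = { {0.001, 0.29},   /* Pr(alarm | B, E), indexed as pA[B][E]   */
                           {0.94,  0.95} };

static int bernoulli(double p) { return ((double)rand() / RAND_MAX) < p; }

int main(void)
{
    int B = 0, E = 0;                       /* the alarm is observed as 1 and never resampled */
    long hits = 0, sweeps = 1000000;
    for (long s = 0; s < sweeps; s++) {
        /* resample B from Pr(B | E, alarm=1), proportional to Pr(B) * Pr(alarm=1 | B, E) */
        double b1 = pB * pA[1][E], b0 = (1.0 - pB) * pA[0][E];
        B = bernoulli(b1 / (b1 + b0));
        /* resample E from Pr(E | B, alarm=1), proportional to Pr(E) * Pr(alarm=1 | B, E) */
        double e1 = pE * pA[B][1], e0 = (1.0 - pE) * pA[B][0];
        E = bernoulli(e1 / (e1 + e0));
        hits += B;                          /* the "utility" is the burglary indicator */
    }
    printf("estimated Pr(burglary | alarm) = %.3f\n", (double)hits / sweeps);
    return 0;
}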
We shall not use observations in subsequent examples as they are not a feature of most optimisation problems. However, they can be used to compute the value of information to determine which variables are most likely to reduce the uncertainty in a variable of interest.
3.3. Influence Diagrams
We now move to problems containing both decision and chance variables. IDs [35] are popular graphical models in decision analysis, and they can model important relationships between uncertainties, decisions, and values. They were initially conceived as tools for formulating problems, but they have also emerged as efficient computational tools. They have many applications, including medical diagnosis [36], cybersecurity [37], and risk management [38].
An ID is a directed acyclic graph with three types of node: decision nodes correspond to decision variables and are drawn as rectangles; chance (or uncertain) nodes correspond to chance variables and are drawn as ovals; value nodes correspond to preferences or objectives and are drawn as rounded rectangles, or polygons such as diamonds.
Each chance variable is associated with a conditional probability table that specifies its distribution for every combination of values for its parent variables in the graph. Each decision variable also has a set of parent variables in the graph, and its value depends only on their values. The decision variables are usually considered to be temporally ordered, and chance variables are observed at different points in the ordering. A standard ID assumption is non-forgetting, which means that the parents of any decision variable are all its ancestors in the graph: thus, a decision may depend on everything that has occurred before. All variables are discrete.
Each value node is associated with a table showing the utility (a real number) of each combination of parent variable values. A decision policy is a rule for each decision variable indicating how to choose its value from those of its parents. Any policy has a total expected utility (it is an expectation because of the chance variables) and solving an ID means computing its optimal policy, which has maximum expected utility.
Several methods exist for solving IDs. Some are exact and based on variable elimination [39,40,41,42,43], while others are approximate [44,45,46,47,48]. However, apart from our work ([16,17] and this paper), we know of no other applications of RL to solving IDs, which seems surprising as both IDs and RL can be used to model and solve sequential decision problems. The main connection usually made between the two is that IDs can model problems that can be tackled by RL; for example, causal IDs have been used to model artificial general intelligence safety frameworks, which often use RL [49]. Our RL-based approach lies somewhere between exact and approximate methods: given sufficient memory and training time, it has the potential to find an optimal solution but is not guaranteed to do so. We do not expect it to be as efficient as specialised algorithms, but it can tackle IDs with extensions such as limited memory (see Section 3.5), multiple agents, and multiple objectives. Moreover, the recent successes of deep RL make its application to large IDs a promising research direction. However, in this paper, we restrict ourselves to showing that RL can indeed solve IDs.
We use the Oil Wildcatter ID shown in Figure 2, a well-known problem published in [50]. An oil wildcatter must decide either to drill or not to drill for oil at a specific site. Before drilling, they may perform a seismic test that will help determine the geological structure of the site. The test result can be closed (indicating significant oil), open (indicating some oil), or diffuse (probably no oil). The special value notest means that test results are unavailable if the seismic test is not performed. The test decision does not depend on any other variable, but the drill decision depends on whether a test was made and, if so, on its result. The oil variable is unobservable, so no decision depends on it (this is distinct from forgetting its value, which is addressed in Section 3.5).
An IP model is as follows. The variables are (oil, test, result, drill); the domains are {dry, wet, soaking} for oil, {yes, no} for test and drill, and {closed, open, diffuse, notest} for result; the agents are (chance, company, chance, company); the links are (oil, result), (test, result), (test, drill) and (result, drill) (ID links to utilities are not part of an IP); there are no observations ($O = \emptyset$); the utility is $t + d$, where $t$ is the test payoff and $d$ is the drill payoff; and the probabilities are as shown in Figure 2. The utility hyperparameter specifies one agent with one objective to be maximised.
The known optimal policy given the above utilities and probabilities is as follows: apply the seismic test, and drill if the test result is open or closed. InfProg finds this solution and reports a close approximation to the correct expected utility: 22.5.
3.4. Multi-Objective Influence Diagrams
As a first example of multi-objective optimisation, we use a bi-objective oil wildcatter ID from [12]. In addition to maximising payoff, the aim is to minimise environmental damage. The IP is as in Section 3.3, except for additional payoffs: the utility of testing now has a second component, as do the utilities of drilling for dry, wet, and soak, where in each case the first value is the original payoff and the second is the new environmental-damage value (the values are given in [12]). Instead of one optimal solution, there are four Pareto-optimal solutions, each with two utility values: (1) test, then drill if the result is closed or open; (2) do not test but drill; (3) test, then drill if the result is closed; (4) do not test or drill.
The utilities are now $(p, d)$, where $p$ is the payoff as before and $d$ is the environmental damage. With a suitable utility hyperparameter, in multiple runs, InfProg found policies (1), (2), and (4), but not (3).
3.5. Limited Memory Influence Diagrams
Standard IDs are designed to handle situations involving a single, non-forgetful agent. Limited memory influence diagrams (LIMIDs) [14] are generalisations of IDs that allow decision making with limited information and simultaneous decisions, and can have much smaller policies. They relax the regularity (total variable ordering) and non-forgetting assumptions of IDs. LIMIDs are considered harder to solve optimally than IDs. We handle the limited memory feature in a simple way: during Zobrist hashing, the set $S$ contains only assignments that are visible to the decision variable (as specified by the IP links $L$).
For example, we use a pig breeding problem from [14]. A pig breeder grows pigs for four months and then sells them. During this period, a pig may or may not develop a disease. If it has the disease when it must be sold, then it must be sold for slaughter and its expected market price is 300. If it is disease-free, then its expected market price is 1000. Once a month, a veterinary surgeon tests the pig for the disease. If it is ill, then the test indicates this with a probability of 0.80, and if it is healthy, then the test indicates this with a probability of 0.90. At each monthly visit, the surgeon may or may not treat the pig, and the treatment costs 100. A pig has the disease in month 1 with a probability of 0.10. A healthy pig develops the disease in the next month with a probability of 0.20 without treatment and 0.10 with treatment. An unhealthy pig remains unhealthy in the next month with a probability of 0.90 without treatment, and 0.50 with treatment. The ID is shown in Figure 3.
An IP to model this problem is as follows. The variables are the monthly health states, test results, and treatment decisions; the domains are {healthy, diseased} for the health states, {positive, negative} for the test results, and {treat, do not treat} for the treatment decisions; there is one agent in $A$, which we call breeder; the links are those shown in Figure 3 (excluding links into the value nodes); $P$ contains the probabilities given above; $O = \emptyset$; and the utility is the sale price minus the total treatment cost.
Using a suitable utility hyperparameter, InfProg almost always finds the optimal policy with an expected utility of approximately 727: ignore the first test and do not treat, then follow the results of the other two tests (treat if positive). As noted in [16], a different policy is given in [14]: treat in month 3 if tests 1 and 2, or 3, are positive. We find that their policy has an expected utility of 725.884 while ours is optimal. They report the same expected utility as we do, so we believe this was simply a typographical error.
3.6. Multi-Stage Stochastic Programming
Stochastic programming [51] models and solves problems involving decision and chance variables, with known distributions for the latter. Problems may have one or more stages: in each stage, decisions are taken, then chance variables are observed. A solution is not a simple assignment of variables, but a policy that tells us how to make decisions given the assignments to variables from earlier stages. Stochastic programming dates back to the 1950s and is now a major area of research in mathematical programming and operations research.
We now show that discrete stochastic programs can be modelled and solved as IPs. We use a two-stage stochastic program from [52]. A solution to this problem consists of fixed assignments to the first-stage decision variables, and assignments to the second-stage decision variables that depend on the random values. The optimal policy has an objective of 61.32.
We model the problem as an IP as follows. The variables are the first-stage decision variables, followed by the random variables, followed by the second-stage decision variables; the agents are opt for the decision variables and chance for the random variables (we choose the name opt for the non-random agent); the domains are finite integer ranges taken from the formulation in [52]; $P$ assigns equal probabilities to each random variable's domain values; and $O = \emptyset$.
As in Section 3.1, we model constraints by minimising their violations as an additional objective, choosing a large weight to prioritise feasibility. However, in this problem, we also have an objective, so we use two utilities: the first to minimise the number of constraint violations to 0, and the second to maximise the objective z. Hence, the utility vector contains the violation count and z, and the utility hyperparameter contains a utopian point and weight for each.
3.6.1. A Sparse Model
In this IP, the links run only from the random variables to the second-stage decision variables. In multiple runs, InfProg found policies with an expected utility of approximately 55. This is quite good but short of the known optimum of 61.32. Increasing the number of episodes made no difference.
3.6.2. A Dense Model
In an attempt to improve the results, we added further links to the sparse IP so that each decision variable can also see the earlier decision variables. In multiple runs, InfProg found policies with an expected reward of approximately 60, which is close to the optimum.
3.7. Chance-Constrained Programming
A chance (or probabilistic) constraint is a constraint that should be satisfied with some probability threshold: $\Pr[C(x, r)] \ge \theta$ for a constraint $C$ on decision variables $x$ and chance variables $r$. Chance-constrained programming [53] is a method of optimising under uncertainty and has many applications, as it is a natural way of modelling uncertainty. Chance constraints are usually inequalities but we allow any form of constraint. Chance constraints have also been used in the area of safe (or constrained) RL using various approaches [54,55], and our approach is related to that of [3]. They do not seem to have been added to IDs.
We model chance constraints by adding a new objective for each, with a reward of 1 for satisfaction, and 0 for violation, and setting the utopian value for that objective to the desired probability threshold. The weight attached to the objective should be high.
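For example, inside the programmable utility a chance constraint $\Pr[C(x, r)] \ge \theta$ might be encoded as sketched below; constraint_holds is a hypothetical user-written check of $C$ on the episode's total assignment, and the agent and utility indices are illustrative.

extern int constraint_holds(int *v);   /* hypothetical user-written check of C(x, r) */

/* Sketch: the extra objective is 1 when the constraint holds in this episode and 0
   otherwise, so its expected value estimates the satisfaction probability.  The utopian
   value supplied for u[1][1] in the utility hyperparameter would be the threshold theta,
   with a high weight. */
void add_chance_constraint_objective(int *v, double **u)
{
    u[1][1] = constraint_holds(v) ? 1.0 : 0.0;
}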
For example, we modify the stochastic program of Section 3.6 by attaching probability thresholds to the two hard constraints. Notice that the chance constraints are then effectively of the form $\Pr[C(x, r)] \approx \theta$ instead of the usual $\Pr[C(x, r)] \ge \theta$: InfProg is not guaranteed to satisfy the probability thresholds because of its multi-objective approach; instead of forcing the satisfaction probability of a chance constraint to exceed a threshold, it tries to match the threshold in a trade-off with other objectives. However, it can be used as an exploratory tool to find a policy with the desired characteristics, and the user can iteratively increase thresholds.
To model this problem, we modify the dense model from Section 3.6, adding a new 0/1 objective for each chance constraint. The utility vector now contains the objective z plus one satisfaction indicator per chance constraint, and the utility hyperparameter sets the utopian value of each indicator to its probability threshold. To solve this tri-objective IP, we experimented with different weights and found a variety of policies, with different compromises between the chance constraints and the original objective. Not all solutions were useful, but with a suitable choice of weights we found a policy that satisfies the requirements. Relaxing the hard constraints to chance constraints allowed us to increase the objective z from 60 to 71.
3.8. Multi-Level Programming
Many problems in economics, diplomacy, war, politics, industry, gaming, and other areas involve multiple agents, which form part of each other’s environment. Multi-agent RL can be applied to these problems, as can several forms of ID: bi-agent IDs [7], multi-agent IDs [8], game theory-based IDs [10], networks of IDs [6], and interactive dynamic IDs [9]. LIMIDs can model multiple agents, but only the cooperative case in which they all have the same objective.
The most common case is a bi-level program or Stackelberg game. These usually involve continuous variables, and relatively little work has been performed on discrete bi-level programs [56], so new methods are of interest. However, as an even more challenging case, we use a discrete tri-level program studied in at least two publications [57,58] and shown in Figure 4.
Tri-level programs have been used to model problems in supply chain management, network defence, planning, logistics, and economics. They involve three agents: the first is the (top-level) leader, the second is the (middle-level) follower, and the third is the (bottom-level) follower. They are organised hierarchically: the leader makes decisions first, the middle-level follower reacts, then the bottom-level follower reacts. A survey on multi-level programming is provided in [59], including a section on the tri-level case. Decentralised decision-making problems in a hierarchical management system often contain more than two levels. As a different application, we mention [60], which uses tri-level programming to model a defender–attacker–defender problem in defending an electrical power grid.
This problem is also of interest because its objectives are quadratic for the leader, linear for the middle-level follower, and fractional for the bottom-level follower. Non-linear objectives were added to IDs in [5] using non-linear optimal control approximations.
We model the problem using the following IP. The variables are the eight integer decision variables of Figure 4, and their domains are ranges of integer values. The agents are the three decision makers: the first variable is a top-level leader variable, the second is a middle-level variable, and the remainder are bottom-level follower variables. The links make every variable visible to every later one (a dense model in the sense of Section 3.1), and each agent has two utilities. The first objective used by all agents is the number of constraint violations, which should be 0: the constraints ensure that any solution is tri-level feasible, so all agents should be penalised for violations (in this problem, no agent is allowed to make a decision that leads to a constraint violation). The secondary objectives are the three problem objectives shown in Figure 4. We use a utility hyperparameter with three utopian points, one per agent. This is the most complex utility hyperparameter we used, so we provide some explanation. The tri-level problem has been modelled using three agents, each with two objectives (one for its problem objective and one to maximise constraint satisfaction). We use weighted metric scalarisation to reduce this to three agents with one objective each, so we must specify three utopian points (one per agent). Each utopian point has two ideal objective values and an associated weight for each value. So, for example, agent 1 has a utopian point consisting of an ideal value for each of its two objectives, each with an associated weight.
In repeated runs with sufficient episodes, InfProg reliably finds the correct policy (0, 0, 9, 4, 1, 6, 0, 0). As pointed out in [17], the first objective is given as 612 in [58], but it can be verified that the solution yields a different value. The suboptimal solutions that InfProg sometimes finds can be recognised by their lower objective values.
3.9. Bayesian Games
The assumptions of classical game theory require complete information, which is often unrealistic. Bayesian games were a development that allowed incomplete (private or secret) information while avoiding infinite calculations [61]. An important feature is that a player may have a “type” that is not known by other players, representing their state of mind.
As a simple example, we take a well-known problem called the Sheriff’s dilemma. There are two players: a sheriff and an armed suspect. The suspect has two possible types, a criminal or a civilian, and only he knows which type he is. The sheriff must decide whether to shoot without knowing the suspect’s type, while the suspect is allowed to use that information. Payoff matrices for the various possibilities are shown in Figure 5, where in each case, the first figure is a suspect payoff and the second a sheriff payoff.
We can model this as an IP as follows. The variables are the suspect’s type, the suspect’s decision, and the sheriff’s decision, with domains {criminal, civilian} for the type and {shoot, do not shoot} for the two decisions; the agents are (chance, suspect, sheriff). The only link is from the type to the suspect’s decision, allowing the suspect but not the sheriff to know the suspect’s type. The type probabilities are p and 1 − p for the two types, the utilities are the payoffs shown in Figure 5, and $O = \emptyset$.
Using a suitable utility hyperparameter and 1000 episodes, InfProg reliably finds the correct policy: the suspect shoots if and only if he is a criminal, and the sheriff shoots if and only if p exceeds a threshold determined by the payoffs. If p is close to this threshold, then more episodes are needed to reduce sampling error.
3.10. Level-k Reasoning
Bayesian games have also been criticised for making unrealistic assumptions that do not always predict real-world behaviour. Another game-theoretic approach to incomplete information is level-k reasoning [62,63], which assumes that players play strategically but with bounded rationality. It has recently been used in adversarial risk analysis for modelling problems in counter-terrorism [64] and other fields.
We consider a simple level-k problem: the Keynesian beauty contest. Contest participants are asked to choose a number that they hope will be as close as possible to some fraction p of the average of all participants’ choices. Classical game theory predicts that all players will choose 0, which is the Nash equilibrium. The reasoning is that, if all players choose randomly, it is best to choose p times the mean, or approximately 33 when p = 2/3 (a popular value). But all players know this, so they should instead choose p times that value, approximately 22. But all players know this too, so the logical conclusion is that 0 is the only reasonable choice; yet in experiments, few people choose 0.
In level-k reasoning, we make assumptions about the other players. We might assume that all other players are level-0 thinkers who do not think strategically: they simply choose a number uniformly at random between 0 and 100 (we restrict the numbers to integers as our method is discrete). Under this assumption, we should use level-1 thinking and choose p times the expected level-0 choice: 33. If we instead assume that all other players are level-1, we should use level-2 thinking and choose 22. Similarly, if we assume that all others are level-2, we should use level-3 thinking and choose 15.
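For concreteness, assuming p = 2/3 (consistent with the figures quoted above), these level-k targets follow from repeatedly multiplying the expected level-0 choice by p:
$$\mathbb{E}[x_0] = 50, \qquad x_1 = \tfrac{2}{3}\cdot 50 \approx 33, \qquad x_2 = \left(\tfrac{2}{3}\right)^{2}\cdot 50 \approx 22, \qquad x_3 = \left(\tfrac{2}{3}\right)^{3}\cdot 50 \approx 15,$$
where $x_k$ denotes the level-k choice and $x_0$ is the uniformly random level-0 choice.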
We can model and solve these problems as IPs, as we now show using the beauty contest. The variables $x_0, x_1, x_2, x_3$ represent the choices of the hypothetical level-k players (k = 0, 1, 2) and the real level-3 player. The domains are all $\{0, \ldots, 100\}$. The agents are chance for $x_0$ and a separate agent for each of the other players. The chance probabilities are all equal, giving a uniform distribution for $x_0$. There are no links, as all choices are made simultaneously without knowing the other choices: $L = \emptyset$. There are no observations, so $O = \emptyset$. The level-i player should guess as close as possible to p times the value they expect from the level-(i − 1) player, i.e., $|x_i - p\,x_{i-1}|$ should be minimised, so these are the utilities. Applying InfProg with sufficient episodes and a utility hyperparameter to minimise all utilities, we find $x_1$, $x_2$ and $x_3$ taking values approximately 29–37, 19–25 and 13–15. Using more episodes reduces the variance in these numbers.
3.11. Practical Implications
We have demonstrated that a wide range of problems from diverse areas of the literature can be modelled as IPs, and that these can be solved by the InfProg solver. Of course, we do not claim that every problem from operations research, game theory and so on can be tackled by our approach, especially as it can currently handle only discrete variables. However, we believe that it has interesting practical implications.
Firstly, our approach is useful for rapid prototyping. Few researchers are masters of stochastic programming, dynamic programming, game theory, machine learning, and multiple other areas. When faced with a new complex optimisation problem, it is not always clear how to model it, and in practice, we often make simplifying assumptions based on our background. A stochastic programmer might simplify a multi-agent problem by modelling an adversary using random variables, or pretend that decisions cannot affect chance variable distributions, or approximate a multi-stage problem by a two-stage one. A constraint programmer or integer programmer might approximate a chance variable by an expectation. A game theorist might assume a zero-sum game or complete knowledge, even when this is unrealistic. In general, it is hard to know whether our design choices are reasonable. We believe that a tool such as InfProg is useful during the early stages of modelling, as it enables us to explore the consequences of various simplifications without the need to master multiple research areas: we can simply change the problem formulation and observe the result. Some simplifications might cause little change in solution quality, while others might greatly reduce quality or even make a problem infeasible. The eight-queens and stochastic program models can be seen as examples: in both cases, we implemented different models and compared the resulting objective values, leading to the conclusion that dense models give better results.
Secondly, we hope that our approach will be useful in its own right. It is based on multi-agent multi-objective RL, a very general paradigm with wide applications. For specific problems, we do not expect InfProg to be competitive with specialised algorithms, but for problems that are inherently complex, it might be a useful tool, as there is little available software able to tackle problems involving multiple agents, multiple objectives, random events, and partial knowledge.
It should be noted that InfProg is merely a research prototype and we certainly do not consider it to be a finished product. We hope that more competitive IP algorithms will emerge in the future, especially based on deep RL and implemented on highly parallel hardware. This should greatly enhance the scalability of the IP approach to complex optimisation problems, as RL has already shown its ability to tackle large multi-agent problems such as learning to play Go.
3.12. A Small Case Study
To illustrate our approach in more detail, we show the programmable utility for the Oil Wildcatter problem of Section 3.3 in Figure 6. The user must provide a C function called utility with two arguments: a one-dimensional array v of integer chance and decision variables and a two-dimensional array u of utilities. This function is called at the end of an episode, so we can assume that all variables in v have been assigned values. (The * and ** are C notations indicating array dimensionality.)
For readability, we have created integer variables to name the IP variables (for example, oil=v[0]), their possible values (for example, yes=0), and the only agent (company; note that chance variables are assigned agent number 0). The total payoff is the sum of the drill and test payoffs, and this sum is assigned to the only utility u[1][0] in the problem. The second index is 0 because an agent’s utilities are numbered from 0: for a bi-objective problem such as that in Section 3.4, we would also need to assign a value to u[company][1].
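Since Figure 6 is not reproduced here, the sketch below shows roughly what such a utility function looks like; the variable ordering, value encodings, and payoff numbers are assumptions based on the standard Oil Wildcatter data rather than a copy of the figure.

/* Approximate sketch of an Oil Wildcatter programmable utility (assumed layout and payoffs). */
void utility(int *v, double **u)
{
    int oil     = v[0];   /* 0 = dry, 1 = wet, 2 = soaking            (chance)   */
    int test    = v[1];   /* 0 = yes, 1 = no                          (decision) */
    int result  = v[2];   /* seismic result; not needed for the payoff (chance)  */
    int drill   = v[3];   /* 0 = yes, 1 = no                          (decision) */
    int yes     = 0;
    int company = 1;      /* chance variables are agent 0 */
    (void)result;

    double t = (test == yes) ? -10.0 : 0.0;    /* test payoff (assumed cost of 10)           */
    double d = 0.0;                            /* drill payoff: -70 dry, 50 wet, 200 soaking */
    if (drill == yes)
        d = (oil == 0) ? -70.0 : (oil == 1 ? 50.0 : 200.0);

    u[company][0] = t + d;                     /* the company's single utility */
}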
The user must also provide a small text file describing the parameters of the problem (number of agents, variables, domain sizes, and so on). We do not show an example as InfProg is currently a research prototype and lacks a user-friendly specification language.