2.1. Texas Hold’em and the Complexity of Multi-Agent Dynamics
In many-agent systems, simple interactions can give rise to complex adaptive systems due to agent behavior, as the game of poker shows. Solutions to simplified models of two-player poker predate game theory as a field [29], and, for simplified variants of two-player draw poker, a fairly simple optimal strategy exists [30]. These early, manually computed solutions were made possible both by limiting the complexity of the cards and, more importantly, by limiting the betting to a single fixed bet size, with no raising and little further interaction between the players. In the more general case of heads-up limit Texas Hold’em, significantly more work was needed, given the multiplicity of card combinations, the existence of hidden information, and the interaction between players, but this multi-stage interactive game is “now essentially weakly solved” [31]. Still, this game involves only two players. In the no-limit version of the game, Brown and Sandholm recently unveiled a superhuman AI [32], but this result again restricts the game to heads-up play, involving only two players per game, and still falls far short of a full solution to the game.
The complex adaptive nature of multi-agent systems means that each agent needs to model not only the system itself, but also the actions of the other player(s). The multiplicity of potential game states, betting strategies, and outcomes rapidly becomes infeasible to represent other than heuristically. In limit Texas Hold’em poker, for example, the number of card combinations is immense, but the branching possibilities for betting are the more difficult challenge. In a no-betting game of Hold’em with $P$ players, there are $\binom{52}{2}\binom{50}{2}\cdots\binom{52-2(P-1)}{2}\binom{52-2P}{5}$ possible situations. This is approximately $2.8\times10^{12}$ hands in the two-player case, approximately $2.5\times10^{15}$ in the three-player case, and growing by a similar factor when expanded to the four-, five-, or six-player case. The probability of winning is the probability that a player’s own two cards, together with the five cards on the table, form a better hand than can be made from the two unknown cards any other player holds. In Texas Hold’em, there are four betting stages, one after each stage of cards is revealed. Billings et al. use a reduced-complexity game (limiting betting to three rounds per stage) and find a complexity of $O(10^{18})$ in the two-player case [33]. That means the two-player, three-round game complexity is comparable in size to a no-betting four-player game, with roughly $2\times10^{18}$ card combinations possible.
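These counts follow directly from the combinatorics of dealing two hole cards to each player plus a five-card board. The short sketch below (in Python; the function name and output formatting are illustrative choices, not taken from the original analysis) computes them for two through six players.

```python
from math import comb

def no_betting_deals(players: int) -> int:
    """Count distinct deals in a no-betting Hold'em game: two hole cards for
    each seat (dealt in seat order) plus a five-card community board."""
    remaining, total = 52, 1
    for _ in range(players):
        total *= comb(remaining, 2)   # hole cards for this seat
        remaining -= 2
    total *= comb(remaining, 5)       # the five community cards
    return total

for p in range(2, 7):
    print(p, f"{no_betting_deals(p):.2e}")
# 2 players ~2.8e12, 3 players ~2.5e15, 4 players ~2.1e18:
# each additional player multiplies the count by roughly three orders of magnitude.
```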
Unlike in a no-betting game, however, a player must consider much more than the simple probability that the hand held is better than those held by other players. That calculation itself is unchanged by the additional branching due to player choices. The somewhat more difficult issue is that the additional branching requires Bayesian updates to estimate the probable distribution of hand strengths held by other players based on their decisions, which significantly increases the complexity of solving the game. The most critical challenge, however, is that each player bets based not only on the hidden information provided by their own cards, but also on the betting behavior of the other players. Opponents make betting decisions based on non-public information (in Texas Hold’em, their hole cards), and a strategy for betting requires a meta-update that takes advantage of the information the other players reveal by betting. The players must also update based on potential strategic betting by other players, which occurs when a player bets in a way calculated to deceive. To deal with this, poker players need to model not just the cards, but also the strategic decisions of other players. This complex model of strategic decisions must be re-run for all the possible combinations at each decision point to arrive at a conclusion about what other players are doing. Even after this is complete, an advanced poker player, or an effective AI, must then decide not just how likely they are to win, but also how to play strategically, optimizing based on how other players will react to the different choices available.
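As a minimal sketch of the Bayesian update described above (the likelihood table and the crude hand-strength buckets are invented for illustration; a real agent would use far richer models), the following shows how a uniform prior over an opponent's 1326 possible holdings shifts toward strong hands after observing a single raise.

```python
import itertools

# Hypothetical likelihoods: probability an opponent raises given a crude strength bucket.
P_RAISE = {"weak": 0.1, "medium": 0.4, "strong": 0.8}

def strength_bucket(hole):
    """Toy strength measure (illustrative only): pairs are strong, high cards medium."""
    (r1, _), (r2, _) = hole
    if r1 == r2:
        return "strong"
    return "medium" if max(r1, r2) >= 11 else "weak"

def update_range(prior, raised):
    """One Bayesian update over the opponent's possible hole cards, given one action."""
    posterior = {}
    for hole, p in prior.items():
        like = P_RAISE[strength_bucket(hole)]
        posterior[hole] = p * (like if raised else 1 - like)
    z = sum(posterior.values())
    return {h: p / z for h, p in posterior.items()}

deck = [(rank, suit) for rank in range(2, 15) for suit in "shdc"]
holes = list(itertools.combinations(deck, 2))     # 1326 possible opponent holdings
beliefs = {h: 1 / len(holes) for h in holes}      # uniform prior
beliefs = update_range(beliefs, raised=True)      # mass shifts toward strong holdings

pair_mass = sum(p for h, p in beliefs.items() if strength_bucket(h) == "strong")
# Under these assumed likelihoods, mass on pairs roughly triples (about 0.06 to 0.16).
print("posterior probability the opponent holds a pair:", round(pair_mass, 3))
```

A full agent would repeat this update at every betting decision, across every opponent, while also accounting for the possibility that the observed bets are themselves strategic deception.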
Behaviors such as bluffing and slow play are based on these dynamics, which become much more complex as the number of rounds of betting and the number of players increase. For example, slow play involves under-betting relative to the strength of one's hand. This requires that the players will later be able to raise the stakes, and allows a player to lure others into committing additional money. The complexity of the required modeling of other agents’ decision processes grows as a function of the number of choices and stages at which each agent makes a decision. This type of complexity is common in multi-agent systems. In general, however, the problem is much broader in scope than what can be illustrated by a rigidly structured game such as poker.
2.2. Limited Complexity Models versus the Real World
In machine learning systems, the underlying system is approximated by implicitly or explicitly learning a multidimensional transformation between inputs and outputs. This transformation approximates a combination of the relationships between inputs and the underlying system, and between the system state and the outputs. The complexity of the model learned is limited by the computational complexity of the underlying structure, and while the number of possible states for the input is large, it is typically dwarfed by the number of possible states of the system.
The critical feature of machine learning that allows such systems to be successful is that most relationships can be approximated without inspecting every available state. (All models simplify the systems they represent.) The implicit simplification done by machine learning is often quite impressive, picking up on clues present in the input that humans might not notice, but it comes at the cost of implicit models of the system that are difficult to understand and difficult to interpret.
Any intelligence, whether based on machine learning, human cognition, or other forms of AI, requires similar implicit simplification, since the branching complexity of even a relatively simple game such as Go dwarfs the number of atoms in the universe. Because even moderately complex systems cannot be fully represented, as discussed by Soares [
34], the types of optimization failures discussed above are inevitable. The contrapositive to Conant and Ashby’s theorem [
35] is that if a system is more complex than the model, any attempt to control the system will be imperfect. Learning, whether human or machine, builds approximate models based on observations, or input data. This implies that the behavior of the approximation in regions far from those covered by the training data is more likely to markedly differ from reality. The more systems change over time, the more difficult prediction becomes—and the more optimization is performed on a system, the more it will change. Worsening this problem, the learning that occurs in ML systems fails to account for the embedded agency issues discussed by Demski and Garrabrant [
36], and interaction between agents with implicit models of each other and themselves amplifies many of these concerns.
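As a minimal numerical illustration of this point (the true function, the polynomial approximator, and the training region are all arbitrary choices made for the sketch), a model fit on a narrow slice of input space can track reality closely there while diverging badly in regions it never observed.

```python
import numpy as np

rng = np.random.default_rng(0)

f = lambda x: np.sin(x)                      # the "true" system being approximated

# Observations cover only a narrow region of the input space.
x_train = rng.uniform(0, 2, size=200)
y_train = f(x_train) + rng.normal(0, 0.05, size=200)

# Fit a flexible approximator (a degree-7 polynomial as a stand-in for any learned model).
model = np.poly1d(np.polyfit(x_train, y_train, deg=7))

x_near, x_far = np.linspace(0, 2, 100), np.linspace(4, 6, 100)
print("mean error near training data:", np.abs(model(x_near) - f(x_near)).mean())
print("mean error far from training data:", np.abs(model(x_far) - f(x_far)).mean())
# The second error is typically orders of magnitude larger: the approximation
# is only trustworthy where observations constrained it.
```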
2.3. Failure Modes
Because opponent modeling is an essential part of modeling multi-agent dynamic systems, opponent models are a central part of any machine learning model operating in such settings. These opponent models may be implicit in the overall model, or they may be explicitly represented, but either way they remain approximations. In many cases, opponent behavior is ignored entirely—by implicitly simplifying other agents’ behavior to noise, or by assuming no adversarial agents exist. Because these models are imperfect, they will be vulnerable to the overoptimization failures discussed above.
The list below is conceptually complete, but limited in at least three ways. First, the examples given primarily discuss failures that occur between two parties, such as a malicious actor and a victim, or failures induced by multiple individually benign agents. This excludes strategies where agents manipulate others indirectly, and those where coordinated interaction between agents is used to manipulate the system. It is possible that when more agents are involved, more specific classes of failure will become relevant.
Second, the list below does not address how other factors can compound metric failures. These factors are critical, but may involve overoptimization, or multiple-agent interaction, only indirectly. For example, O’Neil discusses a class of failure involving the interaction between the system, the inputs, and validation of outputs [37]. These failures occur when a system’s metrics are validated in part based on outputs it contributes towards. For example, a system predicting greater crime rates in areas with high minority concentrations leads to more police presence, which in turn leads to a higher rate of crime found. This higher rate of crime in those areas is used to train the model, which leads it to reinforce the earlier unjustified assumption. Such cases are both likely to occur, and especially hard to recognize, when the interaction between multiple systems is complex and it is unclear whether the system’s effects are due in part to its own actions. (This class of failure seems particularly likely in systems that are trained via “self-play,” where failures in the model of the system get reinforced by incorrect feedback on the basis of the models; this is also a case of model insufficiency failure.)
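A deliberately simplified sketch of this feedback loop (the functional forms, with recorded crime proportional to patrol allocation and retraining as direct substitution, are assumptions made for illustration, not a claim about any deployed system) shows how an initial unjustified gap in predictions can persist indefinitely even when the underlying rates are identical.

```python
# Two areas with identical true crime rates; the model starts with a biased prediction.
true_rate = [1.0, 1.0]
predicted = [1.0, 2.0]        # area 1 is (wrongly) predicted to have twice the crime

for step in range(5):
    total = sum(predicted)
    patrols = [p / total for p in predicted]                        # police allocated by prediction
    observed = [t * share for t, share in zip(true_rate, patrols)]  # more patrols, more recorded crime
    predicted = observed                                            # "retrain" on the new observations
    print(step, [round(p, 3) for p in predicted])
# The 2:1 gap is reproduced at every step: the validation data the model sees
# is itself a product of the model's earlier predictions.
```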
Third and finally, the failure modes exclude cases that do not directly involve metric overoptimization, such as systems learning unacceptable behavior implicitly due to training data that contains unanticipated biases, or failing to attempt to optimize for social preferences such as fairness. These are again important, but they are more basic failures of system design.
With those caveats, we propose the following classes of multi-agent overoptimization failures. For each, a general definition is provided, followed by one or more toy models that demonstrate the failure mode. Each agent attempts to achieve their goal by optimizing for the metric, but the optimization is performed by different agents without any explicit coordination or a priori knowledge about the other agents. The specifics of the strategies that can be constructed and the structure of the system can be arbitrarily complex, but as explored below, the ways in which these models fail can still be understood generally.
These models are deliberately simplified, but where possible, real-world examples of the failures exhibited in the model are suggested. These examples come both from human systems where parallel dynamics exist, and from extant systems with automated agents. In the toy models, $M_i$ and $G_i$ stand for the metric and goal, respectively, for agent $i$. The metric is an imperfect proxy for the goal, and will typically be defined in relation to a goal. (The goal itself is often left unspecified, since the model applies to arbitrary systems and agent goals.) In some cases, the failure is non-adversarial, but where relevant, there is a victim agent V and an opponent agent O that attempts to exploit it. Please note that the failures could also be shown with examples formulated in game-theoretic notation, but doing so would require more complex specifications of the system and interactions than the characterization of agent goals and systems used below allows.
Failure Mode 1. Accidental Steering is when multiple agents alter the system in ways not anticipated by at least one agent, creating one of the above-mentioned single-party overoptimization failures.
Remark 1. This failure mode manifests similarly to the single-agent case and differs only in that agents do not anticipate the actions of other agents. When agents have closely related goals, even if those goals are aligned, this can exacerbate the types of failures that occur in single-agent cases.
Because the failing agent alone does not (or cannot) trigger the failure, this differs from the single-agent case. The distributional shift can occur due to a combination of the actors’ otherwise potentially positive influences, either by putting the system in an extremal state where the previously learned relationship decays, or by triggering a regime change in which previously beneficial actions become harmful.
Model 1.1—Group Overoptimization. A set of agents each have goals which affect the system in related ways, and the metric-goal relationship changes in the extremal region where $x > a$. As noted above, $M_i$ and $G_i$ stand for the metric and goal, respectively, for agent $i$. This extremal region is one where single-agent failure modes will occur for some or all agents. Each agent $i$ can influence the metric by an amount $\alpha_i$, where each $\alpha_i < a$, but $\sum_i \alpha_i > a$. In the extremal subspace where $x > a$, the metric reverses direction, making further optimization of the metric harm the agent’s goal.
Remark 2. In the presence of multiple agents without coordination, manipulation of factors not already being manipulated by other agents is likely to be easier and more rewarding, potentially leading to inadvertent steering due to model inadequacy, as discussed in Manheim and Garrabrant’s categorization of single-agent cases [3]. As shown there, overoptimization can lead to perverse outcomes, and the failing agent(s) can hurt both their own goals and, in similar ways, the goals of other agents.
Model 1.2—Catastrophic Threshold Failure. Each agent manipulates their own variable, unaware of the overall impact. Even though the agents are collaborating, because they cannot see other agents’ variables, there is no obvious way to limit the combined impact on the system to stay below the catastrophic threshold T. Because each agent is exploring a different variable, they each are potentially optimizing different parts of the system.
Remark 3. This type of catastrophic threshold is commonly discussed in relation to complex adaptive systems, but can occur even in systems where the catastrophic threshold is simple. The case discussed by Michael Eisen, in which a pair of deterministic linear price-setting bots on Amazon interacted to set the price of an otherwise unremarkable biology book at tens of millions of dollars, shows that runaway dynamics are possible even in the simplest cases [38]. This phenomenon is also expected whenever exceeding some constraint breaks the system, and such constraints are often not identified until a failure occurs.
Example 1. This type of coordination failure can occur in situations such as overfishing across multiple regions, where each group catches local fish, which they can see, but at a given threshold across regions the fish population collapses, and recovery is very slow. (In this case, the groups typically are selfish rather than collaborating, making the dynamics even more extreme.)
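As a minimal sketch of the dynamics in Model 1.2 (the system response, step size, and threshold below are assumptions made for the illustration), each agent repeatedly nudges only its own variable, no agent observes the others, and the system collapses once the unobserved sum crosses the threshold T.

```python
import random

N_AGENTS, THRESHOLD, STEP = 5, 10.0, 0.5
x = [0.0] * N_AGENTS              # each agent's own control variable

def system_value(xs):
    """Assumed response: benefits accrue with combined load until it crosses T."""
    load = sum(xs)
    return load if load <= THRESHOLD else -100.0   # catastrophic collapse past T

for t in range(40):
    i = random.randrange(N_AGENTS)
    x[i] += STEP                  # agent i optimizes its own variable, blind to the rest
    if system_value(x) < 0:
        print(f"collapse at step {t}: every individual x_i is modest "
              f"(max {max(x):.1f}), but the sum {sum(x):.1f} exceeds T = {THRESHOLD}")
        break
```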
Example 2. Smaldino and McElreath [39] show this failure mode occurring specifically with statistical methodology in academia, where academics find novel ways to degrade statistical rigor. The more general “Mutable Practices” model presented by Braganza [8], based in part on Smaldino and McElreath, has each agent attempting both to outperform the other agents on a metric and to fulfill a shared societal goal; it allows agents to evolve and find new strategies that combine to subvert the societal goal.
Failure Mode 2. Coordination Failure occurs when multiple agents clash despite having potentially compatible goals.
Remark 4. Coordination is an inherently difficult task, and can in general be considered impossible [40]. In practice, coordination is especially difficult when the goals of other agents are incompletely known or not fully understood. Coordination failures such as Yudkowsky’s inadequate equilibria are stable, and coordination to escape from such an equilibrium can be problematic even when agents share goals [41].
Model 2.1—Unintended Resource Contention. A fixed resource R is split between uses by different agents. Each agent $i$ has limited funds $f_i$, and the resource is allocated to agent $i$ for exploitation in proportion to its bid for the resources. The agents choose amounts to spend on acquiring resources, and then choose amounts to spend exploiting the share they receive, with utility derived from that exploitation. The agent goals are based on the overall exploitation of the resources by all agents. In this case, we see that conflicting instrumental goals that neither side anticipates will cause wasted funds due to contention. The more funds spent on resource capture, which is zero-sum, the less remain for exploitation, which can be positive-sum. Above-nominal spending on capturing resources from aligned competitor-agents reduces the funds available for exploiting those resources, even though less resource contention would benefit all agents.
Remark 5. Where preferences and gains from different uses are homogeneous, so that no agent has a marginal gain from affecting the allocation, funds will nonetheless be wasted on resource contention. More generally, heterogeneous preferences can lead to contention to control the allocation, with sub-optimal individual outcomes, and heterogeneous abilities can lead to less-capable agents harming their goals by capturing and then ineffectively exploiting resources.
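A minimal sketch of Model 2.1's core tradeoff (the proportional allocation rule, the fund level, and the multiplicative payoff are assumptions chosen for the illustration) shows that symmetric escalation of bids leaves the allocation unchanged while destroying value that could have gone to exploitation.

```python
def outcome(bids, funds=10.0, resource=100.0):
    """Allocate the resource in proportion to bids; value is produced only by
    spending the *remaining* funds on exploiting the allocated share."""
    total_bid = sum(bids)
    utilities = []
    for b in bids:
        share = resource * (b / total_bid)   # zero-sum capture
        exploit_budget = funds - b           # money spent bidding is gone
        utilities.append(share * exploit_budget)
    return utilities

print(outcome([1.0, 1.0]))   # low symmetric bids:       [450.0, 450.0]
print(outcome([8.0, 8.0]))   # escalated symmetric bids: [100.0, 100.0]
# The split of the resource is identical in both cases; the extra spending on
# capture is pure waste, which coordination (Model 2.2) could avoid.
```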
Example 3. Different forms of scientific research benefit different goals differently. Even if spending in every area benefits everyone, a fixed pool of resources implies that with different preferences, contention between projects with different positive impacts will occur. To the extent that effort must be directed towards grant-seeking instead of scientific work, the resources available for the projects themselves are reduced, sometimes enough to cause a net loss.
Remark 6. Coordination limiting overuse of public goods is a major area of research in economics. Ostrom explains how such coordination is only possible when conflicts are anticipated or noticed and where a reliable mechanism can be devised [42].
Model 2.2—Unnecessary Resource Contention. As above, but each agent has an identical reward function. Even though all goals are shared, a lack of coordination in the above case leads to overspending, as has been shown for simple systems and for specified algebraic objective functions in the welfare economics literature. That literature explores many mechanisms by which gains are possible; in the simplest examples, gains occur when agents coordinate to minimize overall spending on resource acquisition.
Remark 7. Coordination mechanisms themselves can be exploited by agents. The field of algorithmic game theory has several results on why such exploitation is only sometimes possible, and on how mechanisms can be built to avoid it [43].
Failure Mode 3. Adversarial optimization can occur when a victim agent has an incomplete model of how an opponent can influence the system. The opponent’s model of the victim allows it to intentionally select for cases where the victim’s model performs poorly and/or promotes the opponent’s goal [3].
Model 3.1—Adversarial Goal Poisoning. In this case, the opponent O can see the victim’s metric, and can select for cases where $y$ is large and $X$ is small, so that V chooses what appear to be maximal values of $X$, to the marginal benefit of O.
Example 4. A victim’s model can be learned by “Stealing” models using techniques such as those explored by Tramèr et al. [44]. In such a case, the information gained can be used for model evasion and other attacks mentioned there.
Example 5. Chess and other game engines may adaptively learn and choose openings or strategies for which the victim is weakest.
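A minimal numeric sketch in the spirit of Model 3.1 (the Gaussian distributions, the selection rule, and the specific thresholds are assumptions for the illustration): the victim's metric is its goal plus a noise term the opponent can observe, and the opponent nominates exactly those states where the noise, not the goal, makes the metric look good.

```python
import numpy as np

rng = np.random.default_rng(1)

n = 100_000
goal = rng.normal(size=n)      # X: what the victim actually cares about
noise = rng.normal(size=n)     # y: a component the opponent can see and exploit
metric = goal + noise          # the victim's proxy for X

# Opponent proposes only states where the noise is large and the goal is small.
proposed = (noise > 1.0) & (goal < 0.0)
shown = np.where(proposed)[0]

# Victim picks the top-scoring states, by its own metric, from what it is shown.
chosen = shown[np.argsort(metric[shown])[-100:]]

print("mean goal over all states:      ", round(float(goal.mean()), 3))
print("mean goal over victim's choices:", round(float(goal[chosen].mean()), 3))
# Selecting hard on the metric within the opponent-curated set yields
# below-average goal values for the victim.
```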
Example 6. Sophisticated financial actors can make trades to dupe victims into buying or selling an asset (“Momentum Ignition”) in order to exploit the resulting price changes [45], leading to a failure of the exploited agent due to an actual change in the system which it misinterprets.
Remark 8. The probability of exploitable reward functions increases with the complexity of the system the agents manipulate [5], and with the simplicity of the agent and its reward function. The potential for exploitation by other agents seems to follow the same pattern, where simple agents will be manipulated by agents with more accurate opponent models.
Model 3.2—Adversarial Optimization Theft. An attacker can discover exploitable quirks in the goal function to make the victim agent optimize for a new goal, as in Manheim and Garrabrant’s Campbell’s law example, slightly adapted here [3]. O selects its metric after seeing V’s choice of metric. In this case, we can assume the opponent chooses a metric to maximize based on the system and the victim’s goal, which is known to the attacker. The opponent can choose their metric so that the victim’s later selection then induces a relationship between X and the opponent’s goal, especially at the extremes. Here, the opponent selects a metric such that even weak selection on it hijacks the victim’s selection on the victim’s own metric to achieve the opponent’s goal, because the states where the victim’s metric is high have changed. In the example given, the correlation between X and the opponent’s goal is zero over the full set of states, but becomes positive on the subspace selected by the victim. (Please note that the opponent’s choice of metric is not itself a useful proxy for their goal absent the victim’s actions—it is a purely parasitic choice.)
Failure Mode 4. Input spoofing and filtering—Filtered evidence can be provided, or false evidence can be manufactured and put into the training data stream of a victim agent.
Model 4.1—Input Spoofing. The victim agent receives public data about the present world-state, and builds a model to choose actions that return rewards. The opponent can generate events to poison the victim’s learned model.
Remark 9. See the classes of data poisoning attacks explored by Wang and Chaudhuri [46] against online learning, and by Chen et al. [47] for creating backdoors in deep-learning verification systems.
Example 7. Financial market participants can (illegally) spoof by posting orders that will quickly be canceled in a “momentum ignition” strategy to lure others into buying or selling, as has been alleged to occur in high-frequency trading [45]. This differs from the earlier example in that these are not bona fide transactions that fool other agents, but are actually false evidence.
Example 8. Rating systems can be attacked by inputting false reviews into a system, or by discouraging reviews by those likely to be the least or most satisfied reviewers.
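A minimal sketch of the rating-system attack in Example 8 (the scoring rule as a plain running mean, the assumed true quality, and the attack size are all illustrative assumptions): a modest volume of fabricated reviews injected into the data stream shifts the learned score.

```python
import random

random.seed(0)

def rating(reviews):
    """Assumed scoring rule: the system's learned score is just the mean review."""
    return sum(reviews) / len(reviews)

# Honest reviews drawn around an assumed true quality of 3 stars, clipped to 1..5.
honest = [min(5, max(1, round(random.gauss(3, 1)))) for _ in range(200)]

# Opponent injects 40 fabricated five-star reviews into the training stream.
poisoned = honest + [5] * 40

print("score from honest data:", round(rating(honest), 2))
print("score after poisoning: ", round(rating(poisoned), 2))
# Filtering out likely-negative honest reviews (Model 4.3) shifts the score
# in the same direction without fabricating any data at all.
```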
Model 4.2—Active Input Spoofing. As in (4.1), but where the victim agent employs active learning. In this case, the opponent can potentially fool the system into collecting data from crafted, poisoned sources that appear very useful to the victim.
Example 9. Honeypots can be placed, or Sybil attacks mounted, by opponents to fool victims into learning from examples that systematically differ from the true distribution.
Example 10. Comments by users "Max" and "Vincent DeBacco" on Eisen's blog post about Amazon pricing suggested that it is very possible to abuse badly built linear pricing models on Amazon to receive discounts, if the algorithms choose prices based on other quoted prices [38].
Model 4.3—Input Filtering. As in (4.1), but instead of generating false evidence, true evidence is hidden to systematically alter the distribution of events seen.
Example 11. Financial actors can filter the evidence available to other agents by executing transactions they do not want seen as private transactions or as dark pool transactions.
Remark 10. There are classes of system where it is impossible to generate arbitrary false data points, but selective filtering can have similar effects.
Failure Mode 5. Goal co-option is when an opponent controls the system the Victim runs on, or relies on, and can therefore make changes to affect the victim’s actions.
Remark 11. Whenever the computer systems running AI and ML systems are themselves insecure, this presents a very tempting weak point, one that potentially requires much less effort to exploit than the methods of fooling the system discussed earlier.
Model 5.1—External Reward Function Modification. Opponent O directly modifies Victim V's reward function to achieve a different objective than the one originally specified.
Remark 12. Slight changes in a reward function may have non-obvious impacts until after the system is deployed.
Model 5.2—Output Interception. Opponent O intercepts and modifies Victim V's output.
Model 5.3—Data or Label Interception. Opponent O modifies externally stored scoring rules (labels) or data inputs provided to Victim V.
Example 12. Xiao, Xiao, and Eckert explore a "label flipping" attack against support vector machines [48], in which modifying a limited number of labels used in the training set can cause performance to deteriorate severely.
Remark 13. As noted above, there are cases where generating false data may be impossible or easily detected. Modifying the inputs during training may leave less obvious traces that an attack has occurred. Where this is impossible, access can also allow pure observation which, while not itself an attack, can allow an opponent to engage in the various other exploits discussed earlier.
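A minimal sketch in the spirit of Example 12, using a least-squares linear classifier on synthetic data rather than the support vector machines studied in that work (the data distribution, the 15% flip budget, and the targeting rule are assumptions for the illustration): flipping the labels of a limited, well-chosen subset of training points sharply degrades test accuracy without touching any feature values.

```python
import numpy as np

rng = np.random.default_rng(3)

def make_data(n=1000):
    X = rng.normal(size=(n, 2))
    y = np.where(X[:, 0] + X[:, 1] > 0, 1.0, -1.0)   # linearly separable ground truth
    return X, y

def fit_least_squares(X, y):
    """Regress the +/-1 labels on the features (with intercept); classify by sign."""
    A = np.hstack([X, np.ones((len(X), 1))])
    w, *_ = np.linalg.lstsq(A, y, rcond=None)
    return w

def accuracy(w, X, y):
    A = np.hstack([X, np.ones((len(X), 1))])
    return float(np.mean(np.sign(A @ w) == y))

X_train, y_train = make_data()
X_test, y_test = make_data()

w_clean = fit_least_squares(X_train, y_train)
print("clean test accuracy:   ", accuracy(w_clean, X_test, y_test))

# Attacker flips the labels of the 15% of training points that are most
# confidently positive; the feature values themselves are untouched.
margin = X_train[:, 0] + X_train[:, 1]
flip_idx = np.argsort(margin)[-int(0.15 * len(margin)):]
y_poisoned = y_train.copy()
y_poisoned[flip_idx] = -1.0

w_poisoned = fit_least_squares(X_train, y_poisoned)
print("poisoned test accuracy:", accuracy(w_poisoned, X_test, y_test))
# The poisoned labels drag the learned boundary toward the positive region,
# so a large share of genuinely positive test points are now misclassified.
```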
To conclude the list of failure modes, it is useful to note a few areas where the failures are induced or amplified. The first is when agents explicitly incentivize certain behaviors on the part of other agents, perhaps by providing payments. These public interactions and incentive payments are not fundamentally different from other failure modes, but can create or magnify any of the other modes. This is discussed in the literature on the evolution of collusion, such as Dixon's treatment [49]. Contra Dixon, however, the failure modes discussed here can prevent the collusion from being beneficial. A second, related case is when an agent creating incentives fails to anticipate either the ways in which the other agents can achieve the incentivized target, or the systemic changes that are induced. These so-called "Cobra effects" [3] can lead both to the simpler failures of the single-agent cases explored in Manheim and Garrabrant, and to the failures described above. Lastly, as noted by Sandberg [50], agents with different "speeds" (or, equivalently, different processing power per unit time) can exacerbate victimization, since older and slower systems are more susceptible, and susceptibility to attacks only grows as new methods of exploitation are found.