Article

Learning Optimal Strategies in a Duel Game

Department of Electronics and Computer Engineering, Aristotle University of Thessaloniki, 54124 Thessaloniki, Greece
* Author to whom correspondence should be addressed.
Games 2025, 16(1), 8; https://doi.org/10.3390/g16010008
Submission received: 9 November 2024 / Revised: 17 December 2024 / Accepted: 20 January 2025 / Published: 5 February 2025
(This article belongs to the Section Learning and Evolution in Games)

Abstract

We study a duel game in which each player has incomplete knowledge of the game parameters. We present a simple, heuristically motivated and easily implemented algorithm by which, in the course of repeated plays, each player estimates the missing parameters and consequently learns his optimal strategy.

1. Introduction

We study a two-player zero-sum duel game in which certain game parameters are unknown, and we present an algorithm by which each player can estimate the opponent’s parameters and consequently learn his optimal strategy.
The duel game with which we are concerned has been presented in Polak (2007), and a similar one was examined in Prisner (2014). The game is a variation of a duel game and, more generally, games of timing, as presented in Fox and Kimeldorf (1969); Garnaev (2000); Radzik (1988).
We call the two players $P_1$ and $P_2$; for $n \in \{1, 2\}$, $P_n$ has a kill probability vector $p_n$, with the component $p_{n,d}$ (for $d \in \mathbb{N}$) giving the probability that $P_n$ kills his opponent when he shoots at him from distance $d$. In our analysis, we make the following assumptions:
  • The kill probabilities are given by a function $f$:
    $$\forall n, d: \quad p_{n,d} = f(d, \theta_n),$$
    where $\theta_n$ is a parameter vector.
  • Each $P_n$ knows the general form $f(d, \theta)$ and his own parameter vector $\theta_n$, but not that of his opponent.
  • The duel will be repeated a large number of times.
Each $P_n$'s goal is to compute a strategy that optimizes his payoff, which (as will be seen in Section 2) depends on both $p_1$ and $p_2$; since $P_n$ does not know his opponent's kill probability, he knows neither his own nor his opponent's payoff. In short, we have an incomplete information game.
Games of incomplete information are those in which each player has only partial information about the payoff parameters. Clearly, the players must utilize some form of learning, namely the use of information (i.e., data) collected from previous plays of a game to adjust their strategies through belief update, imitation, reinforcement etc.
The first work on learning in games was, arguably, the introduction of the fictitious play algorithm in Brown (1949); Robinson (1951). In later years, several approaches to learning were proposed in the literature.
The Bayesian game approach introduced by Harsanyi (1962) has led to intensive research on the subject of Bayesian games; for a recent review, see Zamir (2020). However, the main corpus of this literature is concerned with Bayesian reasoning for a single play of a game, rather than learning from repeated plays, and this point of view is not particularly relevant to our approach.
Another approach, which can be understood as a form of “intergenerational” learning, is that of evolutionary game theory, as presented in Alexander (2023); Tanimoto (2015); Weibull (1997), but this is not directly concerned with games of incomplete information and hence is also not very relevant to our approach. However, for some interesting remarks on the connection between learning and evolutionary game theory, see the book by Cressman (2003).
The main approach to learning in games is the one exemplified by Fudenberg and Levine (1998)’s seminal book The Theory Of Learning In Games. This has spawned a tremendous amount of research; a recent overview of the field appears in Fudenberg and Levine (2016).
Fudenberg and Levine’s approach is directly relevant to learning in the duel game, but we believe it is too general. In particular, we have not been able to find any works which address the specific case in which the functional form of payoff functions is known to the players and it is only required to estimate parameter values.
This is essentially a problem of parameter estimation, and there is another research corpus in which this problem is addressed explicitly: that of machine learning. Particularly relevant to our needs is the literature on multi-agent reinforcement learning, such as Alonso et al. (2001); Gronauer and Diepold (2022). Reinforcement learning is a basic component of machine learning and, in the machine learning community, the connection between multi-agent problems and game theory has long been recognized (Nowe et al., 2012; Rezek et al., 2008); recent and very extensive reviews can be found in Jain et al. (2024); Yang and Wang (2020).
In particular, the issue of convergence of parameter estimation has been comprehensively studied (Hussain et al., 2023; Leonardos et al., 2021; Martin & Sandholm, 2021; Mertikopoulos & Sandholm, 2016). An important conclusion of these and related works is the trade-off between exploration and exploitation. Quoting from Hussain et al. (2023): “… understanding exploration remains an important endeavour as it allows agents to avoid suboptimal, or potentially unsafe areas of their state space … In addition, it is empirically known that the choice of exploration rate impacts the expected total reward”.
In short, extensive literature on learning and incomplete information games is available but, while some ingenious and highly technical approaches have been proposed, we believe that much simpler techniques suffice for our duel learning problem. This is mainly due to the fact that each player knows the general form of the kill probabilities and only needs to estimate his opponent’s parameter vector. Hence, in the current paper, we propose a simple, heuristically motivated, and easily implemented algorithm which allows each player to estimate his optimal strategy, using the information collected from repeated plays of the duel.
The paper is organized as follows. In Section 2 we present the rules of the game. In Section 3 we solve the game under the assumption of complete information. In Section 4 we present our algorithm for solving the game when the players have incomplete information. In Section 5 we evaluate the algorithm by numerical experiments. Finally, in Section 6 we summarize our results and present our conclusions.

2. Game Description

The duel with which we are concerned is a zero-sum game played between players $P_1$ and $P_2$, under the following rules.
  • It is played in discrete rounds (time steps) $t \in \{1, 2, \ldots\}$.
  • In the first round, the players are at distance $D$.
  • $P_1$ (resp. $P_2$) plays on odd (resp. even) rounds.
  • On his turn, each player has two choices: (i) he can shoot his opponent or (ii) he can move one step forward, reducing their distance by one.
  • If $P_n$ shoots (with $n \in \{1, 2\}$), he has a kill probability $p_n(d)$ of hitting (and killing) his opponent, where $d$ is their current distance. If he misses, the opponent can walk right next to him and shoot him for a certain kill.
  • Each player's payoff is $+1$ if he kills the opponent and $-1$ if he is killed. Note that a player who misses his shot has payoff $-1$, since the opponent will approach him to distance one and shoot him with kill probability one.
For $n \in \{1, 2\}$, we will denote by $x_n(t)$ the position of $P_n$ at round $t$. The starting positions are $x_1(0) = 0$ and $x_2(0) = D$, with $D = 2N$, $N \in \mathbb{N}$. The distance between the players at time $t$ is
$$d(t) = |x_1(t) - x_2(t)|.$$
For $n \in \{1, 2\}$, the kill probability is a decreasing function $p_n : \{1, 2, \ldots, D\} \to [0, 1]$ with $p_n(1) = 1$. It is convenient to describe the kill probabilities as vectors:
$$p_n = (p_{n,1}, \ldots, p_{n,D}) = (p_n(1), \ldots, p_n(D)).$$
This duel can be modeled as an extensive form game or tree game. The game tree is a directed graph $G = (V, E)$ with vertex set
$$V = \{1, 2, \ldots, 2D\},$$
where
  • the vertex $v = d \in \{1, 2, \ldots, D\}$ corresponds to a game state in which the players are at distance $d$, and
  • the vertex $v = d + D \in \{D+1, D+2, \ldots, 2D\}$ is a terminal vertex, in which the “active” player has fired at his opponent.
The edges correspond to state transitions; it is easy to see that the edge set is
$$E = \{(1, D+1), (2, 1), (2, D+2), (3, 2), (3, D+3), \ldots, (D, D-1), (D, 2D)\}.$$
An example of the game tree, for $D = 6$, appears in Figure 1. The circular (resp. square) vertices are the ones in which $P_1$ (resp. $P_2$) is active, and the rhombic vertices are the terminal ones.
To complete the description of the game, we will define the expected payoff for the terminal vertices. Note that the terminal vertex $d + D$ is the child of the nonterminal vertex $d$ in which:
  • The distance of the players is $d$ and, assuming $P_n$ to be the active player, his probability of hitting his opponent is $p_{n,d}$.
  • The active player is $P_1$ (resp. $P_2$) if $d$ is even (resp. odd).
Keeping the above in mind, we see that the payoff (to $P_1$) of vertex $d + D$ is
$$\forall d \in \{1, \ldots, D\}: \quad Q(d+D) = \begin{cases} p_{1,d} \cdot 1 + (1 - p_{1,d}) \cdot (-1) = 2 p_{1,d} - 1, & \text{when } d \text{ is even}; \\ p_{2,d} \cdot (-1) + (1 - p_{2,d}) \cdot 1 = 1 - 2 p_{2,d}, & \text{when } d \text{ is odd}. \end{cases}$$
Since this is a zero-sum game, the payoff to $P_2$ at vertex $d + D$ is $-Q(d+D)$. This completes the description of the duel.
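As a small illustration (our own sketch, not part of the paper's formalism; the function name is ours), the terminal payoff above translates directly into code:

```python
def terminal_payoff(d, p1, p2):
    # Payoff to P1 at terminal vertex d + D: the active player
    # (P1 if d is even, P2 if d is odd) has just fired from distance d.
    # p1 and p2 map distance -> kill probability.
    return 2 * p1[d] - 1 if d % 2 == 0 else 1 - 2 * p2[d]
```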

3. Solution with Complete Information

It is easy to solve the above duel when each player knows $D$ and both $p_1$ and $p_2$. We construct the game tree as described in Section 2 and we solve it by backward induction. Since the method is standard, we simply give an example of its application. Suppose that
$$\forall n, d: \quad p_{n,d} = \begin{cases} 1 & \text{when } d = 1, \\ \min\left(1, \dfrac{c_n}{d^{k_n}}\right) & \text{when } d > 1. \end{cases}$$
We take $c_1 = 1$, $k_1 = 1$, $c_2 = 1$, $k_2 = \tfrac{1}{2}$. The kill probabilities are as seen in Table 1.
The game tree with terminal payoffs is illustrated in Figure 2.
By the standard backward induction procedure, we select the optimal action at each vertex and also compute the values of the nonterminal vertices. These are indicated in Figure 3 (optimal actions correspond to thick edges). We see that the game value is $-0.1547$, attained by $P_2$ shooting when the players are at distance 3 (which happens in round 4).
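For concreteness, here is a minimal Python sketch of the backward induction step (our own illustration, not the authors' code), reusing the terminal_payoff helper sketched in Section 2; it reproduces the game value of the example above.

```python
def solve_duel(D, p1, p2):
    """Backward induction: V[d] is the game value (payoff to P1) at vertex d."""
    V = {1: terminal_payoff(1, p1, p2)}        # at distance 1 the only option is to shoot
    for d in range(2, D + 1):
        q_shoot, q_move = terminal_payoff(d, p1, p2), V[d - 1]
        # P1 (active at even d) maximizes the payoff to P1; P2 (odd d) minimizes it.
        V[d] = max(q_shoot, q_move) if d % 2 == 0 else min(q_shoot, q_move)
    return V

# Example of Section 3: c1 = 1, k1 = 1, c2 = 1, k2 = 1/2, D = 6 (Table 1).
D = 6
p1 = {d: min(1.0, 1.0 / d ** 1.0) for d in range(1, D + 1)}
p2 = {d: min(1.0, 1.0 / d ** 0.5) for d in range(1, D + 1)}
V = solve_duel(D, p1, p2)
print(round(V[D], 4))   # -0.1547: P2 shoots at distance 3
```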
We next present a proposition which characterizes each player’s optimal strategy in terms of a shooting threshold.¹ In the following, we use the standard notation by which “$-n$” denotes the “other player”, i.e., $p_{-1} = p_2$ and $p_{-2} = p_1$.²
Theorem 1. 
We define for $n \in \{1, 2\}$ the shooting criterion vectors $K_n = (K_{n,1}, \ldots, K_{n,D})$, where
$$K_{n,1} = 1 \quad \text{and, for } d \ge 2: \quad K_{n,d} = p_{n,d} + p_{-n,d-1}.$$
Then, the optimal strategy for $P_n$ is to shoot as soon as it is his turn and the distance of the players is less than or equal to the shooting threshold $d_n$, where:
$$d_1 = \max\{d : K_{1,d} \ge 1\}, \qquad d_2 = \max\{d : K_{2,d} \ge 1\}.$$
Proof. 
Suppose the players are at distance $d$ and the active player is $P_n$.
  • If $P_{-n}$ will not shoot in the next round, when their distance will be $d - 1$, then $P_n$ must also not shoot in the current round, because he will have a higher kill probability in his next turn, when they will be at distance $d - 2$.
  • If $P_{-n}$ will shoot in the next round, when their distance will be $d - 1$, then $P_n$ should shoot if $p_{n,d}$ (his kill probability now) is higher than $1 - p_{-n,d-1}$ ($P_{-n}$’s miss probability in the next round). In other words, $P_n$ must shoot if
    $$p_{n,d} \ge 1 - p_{-n,d-1}$$
    or, equivalently, if
    $$K_{n,d} = p_{n,d} + p_{-n,d-1} \ge 1.$$
Hence, we can reason as follows:
  • At vertex 1, $P_2$ is active and his only choice is to shoot.
  • At vertex 2, $P_1$ is active and he knows $P_2$ will certainly shoot in the next round. Hence, $P_1$ will shoot if he has an advantage, i.e., if
    $$Q_1(2) \ge Q_1(1) \iff 2 p_{1,2} - 1 \ge 1 - 2 p_{2,1} \iff p_{1,2} + p_{2,1} \ge 1.$$
    This is equivalent to $K_{1,2} \ge 1$, which is always true (since $p_{2,1} = 1$).
  • Hence, at vertex 3, $P_2$ is active and he knows that $P_1$ will certainly shoot in the next round (at vertex 2). So, $P_2$ will shoot if
    $$Q_1(3) \le Q_1(2) \iff 1 - 2 p_{2,3} \le 2 p_{1,2} - 1 \iff p_{2,3} + p_{1,2} \ge 1,$$
    which is equivalent to $K_{2,3} \ge 1$. Also, if $K_{2,3} < 1$, then $P_1$ will know, when the game is at vertex 4, that $P_2$ will not shoot when at 3. So, $P_1$ will not shoot when at 4. But then, when at 5, $P_2$ knows that $P_1$ will not shoot when at 4. Continuing in this manner, we see that $K_{2,3} < 1$ implies that firing will take place exactly at vertex 2.
  • On the other hand, if $K_{2,3} \ge 1$, then $P_1$ knows when at 4 that $P_2$ will shoot in the next round. So, when at 4, $P_1$ should shoot if $K_{1,4} \ge 1$. If, on the other hand, $K_{1,4} < 1$, then $P_1$ will not shoot when at 4 and $P_2$ will shoot when at 3.
  • We continue in this manner for increasing values of $d$. Since both $K_{1,d}$ and $K_{2,d}$ are decreasing in $d$, there will exist a maximum value $d_n$ (it could equal $D$) at which $K_{n,d} \ge 1$ and $P_n$ is active; then, $P_n$ must shoot as soon as the game reaches or passes vertex $d_n$ and he “has the action”.
This completes the proof.    □
Returning to our previous example, we compute the vectors $K_n$ for $n \in \{1, 2\}$ and list them in Table 2.
For $P_1$ the shooting criterion is last satisfied when the distance is $d = 3$; this happens at round 4, in which $P_1$ is inactive, so he should shoot at round 5. However, for $P_2$ the shooting criterion is also last satisfied at distance $d = 3$ and round 4, in which $P_2$ is active; so, he should shoot at round 4. This is the same result we obtained with backward induction.
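The shooting criterion is equally simple to compute. The sketch below (again our own illustration, reusing p1, p2, and D from the previous listing) reproduces the values of Table 2 and the resulting thresholds.

```python
def shooting_criterion(p_self, p_other, D):
    # K[1] = 1 and K[d] = p_self[d] + p_other[d-1] for d >= 2 (Theorem 1).
    K = {1: 1.0}
    K.update({d: p_self[d] + p_other[d - 1] for d in range(2, D + 1)})
    return K

K1 = shooting_criterion(p1, p2, D)
K2 = shooting_criterion(p2, p1, D)
d1 = max(d for d in K1 if K1[d] >= 1)   # P1's shooting threshold
d2 = max(d for d in K2 if K2[d] >= 1)   # P2's shooting threshold
print(d1, d2)                           # 3 3, as in Table 2
```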

4. Solution with Incomplete Information

As already mentioned, the implementation of either the backward induction or the shooting criterion requires complete information, i.e., knowledge by both players of all game parameters: $D$, $p_1$, and $p_2$.
In what follows, we will assume that the players’ kill probabilities have the functional form (where $\theta$ is a parameter vector):
$$\forall d: \quad p_{1,d} = f(d; \theta_1), \qquad p_{2,d} = f(d; \theta_2).$$
Suppose now that both players know $D$ and $f(d; \theta)$ but, for $n \in \{1, 2\}$, $P_n$ only knows his own parameter vector $\theta_n$ and is ignorant of the opponent’s $\theta_{-n}$. Hence, each $P_n$ knows $p_n$, but not $p_{-n}$. Consequently, neither player knows his payoff function, which depends on both $p_1$ and $p_2$.
In this case, obviously, the players cannot perform the computations of either backward induction or the shooting criterion. Instead, assuming that the players will engage in multiple duels, we propose the use of a heuristic “exploration-and-exploitation” approach in which each player initially adopts a random strategy and, using information collected from played games, gradually builds and refines an estimate of his optimal strategy. Our approach is implemented by Algorithm 1, presented below in pseudocode.
Algorithm 1 Learning the Optimal Duel Strategy.
 1: Input: duel parameters $D$, $\theta_1$, $\theta_2$; learning parameters $\lambda$, $\sigma^0$; number of plays $R$
 2: $p_1$ = CompKillProb($\theta_1$)
 3: $p_2$ = CompKillProb($\theta_2$)
 4: Randomly initialize parameter estimates $\theta_1^0$, $\theta_2^0$
 5: for $r \in \{1, 2, \ldots, R\}$ do
 6:     $p_1^r$ = CompKillProb($\theta_1^{r-1}$)
 7:     $p_2^r$ = CompKillProb($\theta_2^{r-1}$)
 8:     $d_1^r$ = CompShootDist($p_1$, $p_2^r$)
 9:     $d_2^r$ = CompShootDist($p_1^r$, $p_2$)
10:     $\sigma^r = \sigma^{r-1} / \lambda$
11:     $X$ = PlayDuel($p_1$, $p_2$, $d_1^r$, $d_2^r$, $\sigma^r$, $X$)
12:     $(\hat{p}_1^r, \hat{p}_2^r)$ = EstKillProb($X$)
13:     $\theta_1^r$ = EstPars($\hat{p}_1^r$, 1)
14:     $\theta_2^r$ = EstPars($\hat{p}_2^r$, 2)
15: end for
16: return $d_1^R$, $d_2^R$, $\theta_1^R$, $\theta_2^R$
The following remarks explain the operation of the algorithm.
  • In line 1, the algorithm takes as input: (i) the duel parameters $D$, $\theta_1$, and $\theta_2$; (ii) two learning parameters $\lambda$, $\sigma^0$; and (iii) the number $R$ of duels used in the learning process.
  • Then, in lines 2–3, the true kill probability $p_n$ (for $n \in \{1, 2\}$) is computed by the function CompKillProb($\theta_n$), which simply computes
    $$\forall n, d: \quad p_n(d) = f(d; \theta_n).$$
    We emphasize that these are the true kill probabilities.
  • In line 4, randomly selected parameter vector estimates $\theta_1^0$, $\theta_2^0$ are generated.
  • Then, the algorithm enters the loop of lines 5–15 (executed for R iterations), which constitutes the main learning process.
    (a) In lines 6–7 we compute new estimates of the kill probabilities $p_n^r$, by the function CompKillProb, based on the parameter estimates $\theta_n^{r-1}$:
    $$\forall n, d: \quad p_n^r(d) = f(d; \theta_n^{r-1}).$$
    We emphasize that these are estimates of the kill probabilities, based on the parameter estimates $\theta_n^{r-1}$.
    (b) In lines 8–9 we compute new estimates of the shooting thresholds $d_n^r$, by the function CompShootDist. For $P_n$, this is achieved by computing the shooting criterion $K_n$ using the (known to $P_n$) $p_n$ and the (estimated by $P_n$) $p_{-n}^r$.
    (c) In line 10, $\sigma^r$ (which will be used as a standard deviation parameter) is obtained by dividing $\sigma^{r-1}$ by the factor $\lambda > 1$.
    (d) In line 11, the result of the duel is computed by the function PlayDuel. This is achieved as follows:
    • For $n \in \{1, 2\}$, $P_n$ selects a random shooting distance $\hat{d}_n$ from the discrete normal distribution (Roy, 2003) with mean $d_n^r$ and standard deviation $\sigma^r$.
    • With both $\hat{d}_1$, $\hat{d}_2$ selected, it is clear which player will shoot first; the outcome of the shot (hit or miss) is a Bernoulli random variable with success probability $p_{n,\hat{d}_n}$, where $P_n$ is the shooting player. Note that $p_n$ is the true kill probability.
    The result is stored in a table $X$, which contains the data (shooting distance, shooting player, hit or miss) of every duel played up to the $r$-th iteration.
    (e) In line 12, the entire game record $X$ is used by EstKillProb($X$) to obtain empirical estimates $\hat{p}_1^r$, $\hat{p}_2^r$ of the kill probabilities. These estimates are as follows:
    $$\forall n, \ \forall d \in D_n: \quad \hat{p}_{n,d}^r = \frac{\sum_{r \in R_{n,d}} Z_r}{|R_{n,d}|},$$
    where
    $$D_n = \{d : P_n \text{ may shoot from distance } d\}, \qquad R_{n,d} = \{r : \text{in the } r\text{-th game } P_n \text{ actually shot from distance } d\},$$
    $$Z_r = \begin{cases} 1 & \text{if the shot in the } r\text{-th game hit the target}, \\ 0 & \text{if the shot in the } r\text{-th game missed the target}. \end{cases}$$
    (f) In lines 13–14 the function EstPars uses a least squares algorithm to find (only for the $P_n$ who currently has the action) the $\theta_n^r$ values which minimize the squared error
    $$J(\theta_n) = \sum_{d \in D_n} \left( f(d; \theta_n) - \hat{p}_{n,d}^r \right)^2.$$
    (g) In line 16, the algorithm returns the final estimates of the optimal shooting distances $d_1^R$, $d_2^R$ and of the parameters $\theta_1^R$, $\theta_2^R$. (A self-contained sketch of one possible implementation of this loop is given after these remarks.)
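To make the above concrete, here is a minimal end-to-end Python sketch of the learning loop, under the Group A power-law kill probabilities of Section 5. This is our own illustration, not the authors' code: the function names mirror the pseudocode, but we replace the discrete normal distribution of Roy (2003) by a rounded-and-clipped normal sample, refit both players' parameters at every iteration, and use scipy's curve_fit as the least squares routine in EstPars.

```python
import numpy as np
from scipy.optimize import curve_fit

rng = np.random.default_rng(0)
D = 10                                         # initial distance
theta_true = {1: (1.0, 0.5), 2: (1.0, 1.0)}    # true (c_n, k_n); each is unknown to the opponent

def comp_kill_prob(c, k):                      # CompKillProb, Group A power-law form
    return {d: min(1.0, c / d ** k) for d in range(1, D + 1)}

def comp_shoot_dist(p_self, p_other):          # CompShootDist via the Theorem 1 criterion
    K = {1: 1.0, **{d: p_self[d] + p_other[d - 1] for d in range(2, D + 1)}}
    return max(d for d in K if K[d] >= 1)

def play_duel(p1, p2, d1, d2, sigma):          # PlayDuel (simplified threshold sampling)
    h = {1: int(np.clip(round(rng.normal(d1, sigma)), 1, D)),
         2: int(np.clip(round(rng.normal(d2, sigma)), 1, D))}
    for d in range(D, 0, -1):                  # walk the players towards each other
        n = 1 if d % 2 == 0 else 2             # active player: P1 at even distances
        if d <= h[n] or d == 1:                # shoot at or below the perturbed threshold
            p = p1[d] if n == 1 else p2[d]
            return (n, d, rng.random() < p)    # (shooter, distance, hit?)

def est_pars(X, n):                            # EstKillProb + EstPars for player n
    data = [(d, hit) for (m, d, hit) in X if m == n and d > 1]
    ds = sorted(set(d for d, _ in data))
    if len(ds) < 2:
        return None                            # not enough distinct distances observed yet
    phat = [np.mean([h for dd, h in data if dd == d]) for d in ds]
    f = lambda d, c, k: np.minimum(1.0, c / d ** k)
    (c, k), _ = curve_fit(f, np.array(ds, float), np.array(phat),
                          p0=(1.0, 1.0), bounds=([0.01, 0.01], [5.0, 5.0]))
    return (c, k)

p_true = {n: comp_kill_prob(*theta_true[n]) for n in (1, 2)}
theta_hat = {n: (rng.uniform(0.5, 2.0), rng.uniform(0.5, 2.0)) for n in (1, 2)}
sigma, lam, X = 6 * D, 1.01, []

for r in range(1500):
    p_hat = {n: comp_kill_prob(*theta_hat[n]) for n in (1, 2)}
    d1 = comp_shoot_dist(p_true[1], p_hat[2])  # P1 uses his true p1 and his estimate of p2
    d2 = comp_shoot_dist(p_hat[1], p_true[2])  # P2 uses his estimate of p1 and his true p2
    sigma /= lam                               # exploration decays geometrically
    X.append(play_duel(p_true[1], p_true[2], d1, d2, sigma))
    for n in (1, 2):
        fit = est_pars(X, n)
        if fit is not None:
            theta_hat[n] = fit

print(d1, d2, theta_hat)                       # final threshold and parameter estimates
```

The geometric decay of sigma in the loop is what realizes the exploration-exploitation trade-off discussed next.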
The core of the algorithm is the exploration-exploitation approach, which is implemented by the gradual reduction of $\sigma^r$ in line 10. Since $1/\lambda \in (0, 1)$, we have $\lim_{r \to \infty} \sigma^r = 0$.
Exploration is predominant in the initial iterations of the algorithm, when the relatively large value of $\sigma^r$ implies that the players use essentially random shooting thresholds. In this phase, $P_n$’s shooting threshold and, consequently, his payoff, is suboptimal. However, since $P_{-n}$ is also using various, randomly selected shooting thresholds, $P_n$ collects information about $P_{-n}$’s kill probability at various distances. This information can be used to estimate $\theta_{-n}$; the fact that the functional form of the kill probabilities is known results in a tractable parameter estimation problem.
As $r$ increases, we gradually enter the exploitation phase. Namely, $\sigma^r$ tends to zero, which means that each $P_n$ uses a shooting threshold that is, with high probability, very close to the one which is optimal with respect to $\hat{p}_{-n}^r$, his current estimate of $p_{-n}$, the opponent’s kill probability. Provided a sufficient amount of information was collected in the exploration phase, $\hat{p}_{-n}^r$ is sufficiently close to $p_{-n}$ to ensure that the used shooting threshold is close to the optimal one. The key issue is to choose a learning rate $\lambda$ that is “sufficiently slow” to ensure that the exploration phase is long enough to provide enough information for convergence of the parameter estimates to the true parameter values, but not “too slow”, because this will result in slow convergence. This is the exploration-exploitation dilemma, which has been extensively studied in the reinforcement learning literature. For several Q-learning algorithms it has been established that convergence to the optimal policy is guaranteed, provided the learning rate is sufficiently slow (Ishii et al., 2002; Osband & Van Roy, 2017; Singh et al., 2000); in this manner, a balance is achieved between exploration, which collects data on long-term payoff, and exploitation, which maximizes near-term payoff. Similar conclusions have been established for multi-agent reinforcement learning (Hussain et al., 2023; Martin & Sandholm, 2021).
Since our approach is heuristic, the evaluation of an “appropriately slow” learning rate is performed by the numerical experiments in Section 5. Before proceeding, let us note that “methods with the best theoretical bounds are not necessarily those that perform best in practice” (Martin & Sandholm, 2021).

5. Experiments

Since the motivation for our proposed algorithm is heuristic, in this section, we give an empirical evaluation of its performance. In the following numerical experiments, we will investigate the influence of both the duel parameters ($D$, $p_1$, $p_2$) and the algorithm parameters ($\lambda$, $\sigma^0$, $R$).

5.1. Experiments Setup

In the following subsections, we present several experiment groups, all of which share the same structure. Each experiment group corresponds to a particular form of the kill probability functions. In each case, the kill probability parameters, along with the initial player distance $D$, are the game parameters. For each choice of game parameters we proceed as follows.
First, we select the learning parameters $\lambda$, $\sigma^0$ and the number of learning steps $R$. These, together with the game parameters, are the experiment parameters. Then, we select a number $J$ of estimation experiments to run for each choice of experiment parameters. For each of the $J$ experiments, we compute the following quantities:
  • The relative error of the final kill probability parameter estimates. For a given parameter $\theta_{n,i}$, this error is defined to be
    $$\Delta \theta_{n,i} = \frac{\left| \theta_{n,i} - \theta_{n,i}^R \right|}{\left| \theta_{n,i} \right|}.$$
  • The relative error of the shooting threshold estimates. Letting $d_n^R$ be the estimate of the shooting threshold based on the true kill probability vector $p_n$ and the kill probability vector estimate $p_{-n}^R$, this error is defined to be
    $$\Delta d_n = \frac{\left| d_n - d_n^R \right|}{d_n}.$$
  • The relative error of the optimal payoff estimates. Letting $Q_n^R$ be the estimate of the optimal payoff (computed from the estimated shooting thresholds $d_1^R$, $d_2^R$), this error is defined to be
    $$\Delta Q_n = \begin{cases} \dfrac{\left| Q_n - Q_n^R \right|}{\left| Q_n \right|} & \text{if } Q_n \ne 0, \\ 0 & \text{if } Q_n = 0 \text{ and } Q_n^R = Q_n, \\ 1 & \text{if } Q_n = 0 \text{ and } Q_n^R \ne Q_n. \end{cases}$$
    Note that $\Delta Q_2 = \Delta Q_1$, because the game is zero-sum.
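A direct transcription of these error measures (our own sketch; the variable names are hypothetical) is straightforward:

```python
def rel_error(true_value, estimate):
    # Relative error for parameters and shooting thresholds (absolute values assumed).
    return abs(true_value - estimate) / abs(true_value)

def payoff_error(Q_true, Q_est):
    # Relative payoff error with the special-case convention for Q_true = 0.
    if Q_true != 0:
        return abs(Q_true - Q_est) / abs(Q_true)
    return 0.0 if Q_est == Q_true else 1.0
```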

5.2. Experiment Group A

In this group, the kill probability function has the form:
$$\forall n \in \{1, 2\}: \quad p_{n,d} = \min\left( \frac{c_n}{d^{k_n}}, 1 \right).$$
Let us look at the final results of a representative run of the learning algorithm. With $c_1 = 1$, $k_1 = 0.5$, $c_2 = 1$, $k_2 = 1$ and $D = 10$, we run the learning algorithm with $R = 1500$, $\sigma^0 = 6D$, and for three different values $\lambda \in \{1.001, 1.01, 1.05\}$. In Figure 4 we plot the logarithm (with base 10) of the relative payoff error $\Delta Q_1 + \epsilon$ (we have added $\epsilon = 10^{-3}$ to deal with the logarithm of zero error). The three curves plotted correspond to the $\lambda$ values 1.001, 1.01, and 1.05. We see that, for all $\lambda$ values, the algorithm achieves zero relative error; in other words, it learns the optimal strategy for both players. Furthermore, convergence is achieved by the 1500th iteration of the algorithm (1500th duel played), as seen by the achieved logarithm value $-3$ (recall that we have added $\epsilon = 10^{-3}$ to the error, hence the true error is zero). Convergence is fastest for the largest $\lambda$ value, i.e., $\lambda = 1.05$, and slowest for the smallest value, $\lambda = 1.001$.
In Figure 5, we plot the logarithmic relative errors $\log_{10} \Delta d_n$. These also, as expected, have converged to zero by the 1500th iteration.
The fact that the estimates of the optimal shooting thresholds and strategies achieve zero error does not imply that the same is true of the kill probability parameter estimates. In Figure 6 we plot the relative errors $\Delta c_1$ and $\Delta k_1$.
It can be seen that these errors do not converge to zero; in fact, for $\lambda \in \{1.01, 1.05\}$, the errors converge to fixed nonzero values, which indicates that the algorithm obtains wrong estimates. However, the error is sufficiently small to still result in zero-error estimates of the shooting thresholds. The picture is similar for the errors $\Delta c_2$ and $\Delta k_2$; hence, their plots are omitted.
In the above, we have given results for a particular run of the learning algorithm. This was a successful run, in the sense that it obtained zero-error estimates of the optimal strategies (and shooting thresholds). However, since our algorithm is stochastic, it is not guaranteed that every run will result in zero-error estimates. To better evaluate the algorithm, we ran it J = 10 times and averaged the obtained results. In particular, in Figure 7, we plot the average of ten curves of the type plotted in Figure 4. Note that now we plot the curve for R = 5000 plays of the duel.
Several observations can be made regarding Figure 7.
  • For the smallest $\lambda$ value, namely, $\lambda = 1.001$, the respective curve reaches $-3$ at $r = 4521$. This corresponds to zero average error, which means that, in some algorithm runs, it took more than 4500 iterations (duel plays) to reach zero error.
  • For $\lambda = 1.01$, all runs of the algorithm reached zero error by $r = 656$ iterations.
  • Finally, for λ = 1.05 , the average error never reached zero; in fact, 3 out of 10 runs converged to nonzero-error estimates, i.e., to non-optimal strategies.
The above observations corroborate the remarks at the end of Section 4 regarding reinforcement learning. Namely, a small learning rate (in our case small $\lambda$) results in a higher probability of converging to the true parameter values, but also in slower convergence. This can be explained as follows: a small $\lambda$ results in higher $\sigma^r$ values for a large proportion of the duels played by the algorithm, i.e., in more extensive exploration, which however results in slower exploitation (convergence).
We concluded this group of experiments by running the learning algorithm for various combinations of game parameters; for each combination, we recorded the average error attained at the end of the algorithm (i.e., at $r = R = 5000$). The results are summarized in Table 3, Table 4 and Table 5.
From Table 3 and Table 4, we see that for $\lambda = 1.001$, almost all learning sessions conclude in zero $\Delta Q_1$, while increasing the value of $\lambda$ results in more sessions concluding with non-zero error estimates. Furthermore, we observe that when the average $\Delta Q_1$ converges to zero for multiple values of $\lambda$, the convergence is faster for bigger $\lambda$. These results highlight the trade-off between exploration and exploitation discussed above.
Finally, in Table 5, we see how many learning sessions were run and how many converged to the zero-error estimate ($\Delta Q_1 = 0$) for different values of $D$ and $\lambda$.

5.3. Experiment Group B

In this group, the kill probability function is piecewise linear:
$$p_{n,d} = \begin{cases} 1 & \text{when } d \in [1, d_{n1}], \\ \dfrac{-1}{d_{n2} - d_{n1}} \, d + \dfrac{d_{n2}}{d_{n2} - d_{n1}} & \text{when } d \in [d_{n1}, d_{n2}], \\ 0 & \text{when } d \in [d_{n2}, D]. \end{cases}$$
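In code, this piecewise-linear form (our own sketch; the helper name is ours) amounts to a linear decay from 1 at $d = d_{n1}$ to 0 at $d = d_{n2}$:

```python
def kill_prob_piecewise(d, dn1, dn2):
    # Group B kill probability: 1 up to dn1, linear decay on [dn1, dn2], 0 beyond dn2.
    if d <= dn1:
        return 1.0
    if d >= dn2:
        return 0.0
    return (dn2 - d) / (dn2 - dn1)   # equals -d/(dn2 - dn1) + dn2/(dn2 - dn1)
```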
Let us look again at the final results of a representative run of the learning algorithm. With $d_{11} = D/3$, $d_{12} = D$, $d_{21} = 1$, $d_{22} = D$, and $D = 8$, we run the learning algorithm with $R = 500$, $\sigma^0 = 6D$, and for the values $\lambda \in \{1.001, 1.01, 1.05\}$. In Figure 8 we plot the logarithm (with base 10) of the relative payoff error $\Delta Q_1 + \epsilon$. We see results similar to those of Group A: for all $\lambda$ values, the algorithm achieves zero relative error. Convergence is achieved by the 300th iteration of the algorithm and it is fastest for $\lambda = 1.05$ and slowest for $\lambda = 1.001$. The plots of the errors $\log_{10} \Delta d_1$, $\log_{10} \Delta d_2$ are omitted, since they are similar to the ones given in Figure 5.
As in Group A, the fact that the estimates of the optimal shooting thresholds and strategies achieve zero error does not imply that the same is true for the kill probability parameter estimates. For example, in this particular run, the relative error Δ d 11 converges to a nonzero value, i.e., the algorithm obtains a wrong estimate of d 11 . However, the error is sufficiently small to still result in a zero-error estimate of the shooting thresholds.
As in Group A, to better evaluate the algorithm, we ran it J = 10 times and averaged the obtained results. In particular, in Figure 9, we plot the average of ten curves of the type plotted in Figure 4. Note that we now plot the curve for R = 500 plays of the duel.
We again ran the learning algorithm for various combinations of game parameters and recorded the average error attained at the end of the algorithm (again at r = R = 5000 ) for each combination. The results are summarized in the following tables.
From Table 6 and Table 7, we observe that for most parameter combinations, all learning sessions concluded with a zero Δ Q 1 for the smaller λ values. However, for λ = 1.05 , the algorithm failed to converge and exhibited a high relative error. Notably, increasing λ from 1.001 to 1.01 generally accelerated convergence, although this is not guaranteed in every case. In one instance, a higher λ (specifically 1.01 ) led to a slower average convergence of Δ Q 1 to zero, indicating the importance of initial random shooting choices. We also observed that the results for the highest λ were suboptimal, with the algorithm failing to converge in varying numbers of sessions, such as in 1 out of 10 or even 5 out of 10 cases. Notably, when convergence did occur, it happened relatively quickly, with sessions typically completing in under 1000 iterations. These findings also highlight the trade-off between exploration and exploitation, as discussed earlier.
Finally, in Table 8, we see how many learning sessions were run and how many converged to the zero-error estimate Δ Q 1 for different values of D and λ .

6. Discussion

We proposed an algorithm for estimating unknown game parameters and optimal strategies in a duel game through the course of repeated plays. We tested the algorithm on two models of the kill probability function and found that it converged for the majority of tests. Furthermore, we observed the established relationship between higher learning rates and reduced convergence quality, underscoring the trade-off between learning speed and stability of convergence. Future research could investigate additional models of the kill probability function, including scenarios where the two players employ distinct models, and work toward establishing theoretical bounds on the algorithm’s probabilistic convergence.

Author Contributions

All authors contributed equally to all parts of this work, namely conceptualization, methodology, software, validation, formal analysis, writing. All authors have read and agreed to the published version of the manuscript.

Funding

This research received no external funding.

Data Availability Statement

The original contributions presented in the study are included in the article; further inquiries can be directed to the corresponding author.

Conflicts of Interest

The authors declare no conflicts of interest.

Notes

1. This proposition is stated informally in Polak (2007).
2. We use the same notation for several other quantities, as will be seen in the sequel.

References

  1. Alexander, J. M. (2023). Evolutionary game theory. Cambridge University Press.
  2. Alonso, E., D’Inverno, M., Kudenko, D., Luck, M., & Noble, J. (2001). Learning in multi-agent systems. The Knowledge Engineering Review, 16, 277–284.
  3. Brown, G. W. (1949). Some notes on computation of games solutions (Rand Corporation Report). Rand Corporation.
  4. Cressman, R. (2003). Evolutionary dynamics and extensive form games. MIT Press.
  5. Fox, M., & Kimeldorf, G. S. (1969). Noisy duels. SIAM Journal on Applied Mathematics, 17, 353–361.
  6. Fudenberg, D., & Levine, D. K. (1998). The theory of learning in games. MIT Press.
  7. Fudenberg, D., & Levine, D. K. (2016). Whither game theory? Towards a theory of learning in games. Journal of Economic Perspectives, 30, 151–170.
  8. Garnaev, A. (2000). Games of timing. In Search games and other applications of game theory (pp. 81–120). Springer.
  9. Gronauer, S., & Diepold, K. (2022). Multi-agent deep reinforcement learning: A survey. Artificial Intelligence Review, 55, 895–943.
  10. Harsanyi, J. C. (1962). Bargaining in ignorance of the opponent’s utility function. Journal of Conflict Resolution, 6, 29–38.
  11. Hussain, A., Belardinelli, F., & Paccagnan, D. (2023). The impact of exploration on convergence and performance of multi-agent Q-learning dynamics. In International conference on machine learning (Vol. 1). PMLR.
  12. Ishii, S., Yoshida, W., & Yoshimoto, J. (2002). Control of exploitation–exploration meta-parameter in reinforcement learning. Neural Networks, 15, 665–687.
  13. Jain, G., Kumar, A., & Bhat, S. A. (2024). Recent developments of game theory and reinforcement learning approaches: A systematic review. IEEE Access, 12, 9999–10011.
  14. Leonardos, S., Piliouras, G., & Spendlove, K. (2021). Exploration-exploitation in multi-agent competition: Convergence with bounded rationality. Advances in Neural Information Processing Systems, 34, 26318–26331.
  15. Martin, C., & Sandholm, T. (2021, February 8–9). Efficient exploration of zero-sum stochastic games [Conference session]. AAAI Workshop on Reinforcement Learning in Games, Virtual Workshop.
  16. Mertikopoulos, P., & Sandholm, W. H. (2016). Learning in games via reinforcement and regularization. Mathematics of Operations Research, 41, 1297–1324.
  17. Nowe, A., Vrancx, P., & De Hauwere, Y.-M. (2012). Game theory and multi-agent reinforcement learning. In Reinforcement learning: State-of-the-art (pp. 441–470). Springer.
  18. Osband, I., & Van Roy, B. (2017). Why is posterior sampling better than optimism for reinforcement learning? In International conference on machine learning (pp. 2701–2710). PMLR.
  19. Polak, B. (2007). Backward induction: Reputation and duels. Open Yale Courses. Available online: https://oyc.yale.edu/economics/econ-159/lecture-16 (accessed on 19 January 2025).
  20. Prisner, E. (2014). Game theory through examples (Vol. 46). American Mathematical Society.
  21. Radzik, T. (1988). Games of timing related to distribution of resources. Journal of Optimization Theory and Applications, 58, 443–471.
  22. Rezek, I., Leslie, D. S., Reece, S., Roberts, S. J., Rogers, A., Dash, R. K., & Jennings, N. R. (2008). On similarities between inference in game theory and machine learning. Journal of Artificial Intelligence Research, 33, 259–283.
  23. Robinson, J. (1951). An iterative method of solving a game. Annals of Mathematics, 54, 296–301.
  24. Roy, D. (2003). The discrete normal distribution. Communications in Statistics-Theory and Methods, 32, 1871–1883.
  25. Singh, S., Jaakkola, T., Littman, M. L., & Szepesvari, C. (2000). Convergence results for single-step on-policy reinforcement learning algorithms. Machine Learning, 38, 287–308.
  26. Tanimoto, J. (2015). Fundamentals of evolutionary game theory and its applications. Springer.
  27. Weibull, J. W. (1997). Evolutionary game theory. MIT Press.
  28. Yang, Y., & Wang, J. (2020). An overview of multi-agent reinforcement learning from game theoretical perspective. arXiv, arXiv:2011.00583.
  29. Zamir, S. (2020). Bayesian games: Games with incomplete information. Springer.
Figure 1. Game tree example.
Figure 2. Game tree example with values of terminal vertices.
Figure 3. Game tree example with values of all vertices.
Figure 4. Plot of logarithmic relative error $\log_{10} \Delta Q_1$ of $P_1$’s payoff for a representative run of the learning process.
Figure 5. Plot of logarithmic relative errors $\log_{10} \Delta d_1$ and $\log_{10} \Delta d_2$ for a representative run of the learning process.
Figure 6. Plot of relative parameter errors $\Delta c_1$ and $\Delta k_1$ for a representative run of the learning process.
Figure 7. Plot of $Q_1$ ($P_1$’s payoff) for a representative run of the learning process.
Figure 8. Plot of logarithmic relative error $\log_{10} \Delta Q_1$ of $P_1$’s payoff for a representative run of the learning process.
Figure 9. Plot of $Q_1$ ($P_1$’s payoff) for a representative run of the learning process.
Table 1. Kill probabilities.
d           1        2        3        4        5        6
$p_{1,d}$   1.0000   0.5000   0.3333   0.2500   0.2000   0.1667
$p_{2,d}$   1.0000   0.7071   0.5774   0.5000   0.4472   0.4082
Table 2. Shooting criterion.
d           1        2        3        4        5        6
Round       6        5        4        3        2        1
$K_{1,d}$   —        1.5000   1.0404   0.8274   0.7000   0.6139
$K_{2,d}$   —        1.7071   1.0774   0.8333   0.6972   0.6082
Table 3. Values of final average relative error $\Delta Q_1$ for $c_2 = 1$, $k_2 = 1$ and various values of $c_1$, $k_1$, $\lambda$. $D$ is fixed at $D = 10$.
                 λ = 1.001                 λ = 1.01                  λ = 1.05
$c_1$ \ $k_1$    0.50    1.00    1.50      0.50    1.00    1.50      0.50    1.00    1.50
1.00             0.000   0.000   0.000     0.000   0.100   0.482     0.224   0.500   1.207
1.50             0.000   0.000   0.000     0.059   0.049   1.748     0.215   0.480   3.777
2.00             0.000   0.000   0.048     0.074   0.199   0.097     0.144   0.980   0.072
Table 4. Round at which $\Delta Q_1$ converged to zero for all $J = 10$ sessions, for $c_2 = 1$, $k_2 = 1$ and various values of $c_1$, $k_1$, $\lambda$. $D$ is fixed at $D = 10$. If $\Delta Q_1$ did not converge for all sessions, we note for how many sessions it converged.
                 λ = 1.001                 λ = 1.01                  λ = 1.05
$c_1$ \ $k_1$    0.50    1.00    1.50      0.50    1.00    1.50      0.50    1.00    1.50
1.00             4521    1456    2983      656     9/10    8/10      7/10    5/10    5/10
1.50             3238    2939    1754      7/10    9/10    9/10      0/10    8/10    6/10
2.00             4153    423     8/10      2/10    9/10    6/10      3/10    4/10    6/10
Table 5. Fraction of learning sessions that converged to $\Delta Q_1 = 0$ for different values of $D$ and $\lambda$.
$D$ \ $\lambda$   1.001     1.01      1.05
8                 163/170   144/170   89/170
10                165/170   139/170   81/170
12                161/170   123/170   68/170
14                164/170   134/170   73/170
total             653/680   540/680   311/680
Table 6. Values of final average relative error $\Delta Q_1$ for $d_{21} = 1$, $d_{22} = D$ and various values of $d_{11}$, $d_{12}$, $\lambda$. $D$ is fixed at $D = 10$.
$d_{11}$   $d_{12}$   λ = 1.001            λ = 1.01             λ = 1.05
1          D/2        0.000                0.233                1.000
1          2D/3       0.000                0.000                0.799
1          D          $6 \cdot 10^{-16}$   $2 \cdot 10^{-16}$   0.800
D/3        2D/3       0.000                0.000                2.799
D/3        D          0.000                0.000                0.777
D/2        D          0.000                0.000                0.480
Table 7. Round at which $\Delta Q_1$ converged to zero for all $J = 10$ sessions, for $d_{21} = 1$, $d_{22} = D$ and various values of $d_{11}$, $d_{12}$, $\lambda$. $D$ is fixed at $D = 10$. If $\Delta Q_1$ did not converge for all sessions, we note for how many sessions it converged.
$d_{11}$   $d_{12}$   λ = 1.001   λ = 1.01   λ = 1.05
1          D/2        287         7/10       0/10
1          2D/3       812         333        9/10
1          D          7/10        9/10       9/10
D/3        2D/3       326         139        5/10
D/3        D          225         397        5/10
D/2        D          711         464        6/10
Table 8. Fraction of learning sessions that converged to $\Delta Q_1 = 0$ for different values of $D$ and $\lambda$.
$D$ \ $\lambda$   1.001     1.01      1.05
8                 76/110    80/110    49/110
10                97/110    95/110    56/110
12                94/110    78/110    39/110
14                100/110   84/110    40/110
total             367/440   337/440   184/440
