Evolutionary Dynamics of Stochastic Q Learning in Multi-Agent Systems

Liu, Luping; Sun, Gang

doi:10.3390/axioms14040311

Open AccessArticle

Evolutionary Dynamics of Stochastic Q Learning in Multi-Agent Systems

by

Luping Liu

and

Gang Sun

^*

Computer and Information Engineering College, Guizhou University of Commerce, Guiyang 550014, China

^*

Author to whom correspondence should be addressed.

Axioms 2025, 14(4), 311; https://doi.org/10.3390/axioms14040311

Submission received: 18 November 2024 / Revised: 8 December 2024 / Accepted: 11 December 2024 / Published: 18 April 2025

Download

Browse Figures

Versions Notes

Abstract

:

Since high complexity and uncertainty is inherent in real-world environments that can influence the strategies choices of agents, we introduce a stochastic perturbation term to characterize the interference caused by uncertain factors on multi-agent systems (MASs). Firstly, the stochastic Q learning is designed by introducing stochastic perturbation term into Q learning, and the corresponding replicator dynamic equations of stochastic Q learning are derived. Secondly, we focus on two-agent games with two and three action scenarios, analyzing the impact of learning parameters on agents’ strategy selection and demonstrating how the learning process converges to its Nash equilibria. Finally, we also conduct a sensitivity analysis on exploration parameters, demonstrating how exploration rates affect the convergence process in potential games. The analysis and numerical experiments offer insights into the effectiveness of different exploration parameters in scenarios involving uncertainty.

Keywords:

multi-agent systems; stochastic perturbation term; Nash equilibrium; stochastic Q learning

MSC:

91A22; 91A25; 37D35; 93E35

1. Introduction

Multi-agent systems (MASs) are a method for solving distributed problems, which are composed of multiple agents interacting with each other in the same environment, and mainly rely on centralized and distributed frameworks to realize equilibrium learning of agents. MASs have been widely used in various fields including financial markets [1], auctions [2], cloud computing [3], smart grids [4], machine learning [5] and formation control [6]. In MASs, the capability of learning is crucial for an agent to behave appropriately in face of unknown opponents and dynamic environments in order to optimize the performance of the agent. Reinforcement learning (RL) has provided driving forces for modeling MASs interaction problems [7,8], which has a certain value and significance for research [9,10,11,12,13,14,15].

The research of RL in a single-agent framework has been notably investigated; however, the understanding of learning ability in MASs has remained an open problem. When multiple agents simultaneously interact with each other, the reward of each agent depends not only on their own actions but also on the actions of other agents [16,17]. Evolutionary game theory (EGT) plays an important role in better understanding the development of social economy and nature. In recent years, RL has recently attracted a lot of attention due to its connection to EGT, and various Q learning variants have been widely studied in game and economics [18,19], respectively. Tuyls et al. [20] first considered the replicator dynamic of Q learning with Boltzmann distribution, and analyzed the asymptotic behavior of a two-agent game. Hart [21] studied the dynamic process of learning with adaptive heuristics by introducing a large class of simple behavior rules. Gomes et al. [22] proposed multi-agent Q learning with

ε

-greedy exploration, and analyzed the expected behavior of two-agent game with the help of different replicator dynamic equations. Barfuss et al. [23] derived the deterministic limit of SARSA-learning, Q learning, and actor–critic learning for stochastic games, respectively, and their dynamics diagrams on multi-agent, multi-state environments reveal a variety of different dynamics regimes. Tsitsiklis [24] introduced asynchronous stochastic approximation into a Q learning algorithm, which was applied to solve Markov decision problems and provided some convergence results under more general condition. Liu et al. [25] studied stochastic evolutionary game between special committees and chief executive officer (CEO) incentive and supervision, and analyzed the boundary conditions of the stability based on stochastic replicator dynamic equations. In addition, the research results between replicator dynamic and evolutionary game have been extended to a number of algorithms such as collective learning [26], regret minimization learning [27,28,29], learning automata and Q learning [20], and evolution of cooperation [29,30,31,32]. Besides, the dynamics of RL have been established, which provided a useful theoretical framework for achieving the equilibrium learning behavior of MASs [20,22,26,33,34,35,36,37], and multi-agent learning algorithm is designed to prevent collective learning from falling into local optimum [17,26]. The high complexity and uncertainty inherent in real-world environments that influence agents’ strategy options, however, introduce few uncertain factors into the decision-making behavior of multi-agent dynamic systems.

In recent years, game theory has provided a reliable mathematical framework for the research of multi-agent interaction problems. Furthermore, the optimal balance problem between exploring and exploiting in MASs which has been providing the basic driving factor of RL, deep learning and EGT [38]. From the perspective of behavioral economics, the learning of agents use time-varying parameters to explore optimal solutions and bounded rational decision-making agents, who try to coordinate with other agents to maximize their profits [39]. In other words, exploration and exploitation are critical to balancing the learning speed and quality of strategies obtained in the learning algorithm. Also, Kianercy et al. [40] considered the dynamics of Boltzman Q learning in a two-action two agents game, and studied the sensitivity of rest-point structure with regard to exploration parameters. Leonardos et al. [41] studied the smooth Q learning in MASs, which is used for balancing exploration and exploitation costs in potential games. The balance between exploring and exploiting in multi-agent RL has always been a challenging issue, but there is little research on this topic. How to understand the balance between exploration and exploitation has a certain value and significance for research.

Inspired by above research work, the capability of learning is crucial for an agent to behave appropriately in face of unknown opponents and dynamic environments. This paper analyzes the decision behavior of MAS from the viewpoint of EGT. We study a stochastic Q learning by introducing stochastic factors into Q learning. The stochastic Q learning has theoretical foundation as the optimal model for studying exploration and exploitation, which captures the cost between game and exploration rate, and it ensures convergence of the set of Nash equilibrium in MAS with heterogeneous learning agents. Besides, stochastic Q learning reveals an interesting connection between exploring–exploiting in RL and selecting–mutating in EGT, and the stochastic Q learning method is used to analyze the equilibria of MASs. Furthermore, we take the modified potential games as an example to investigate sensitivity analysis on exploration parameters, and further study the relationship between the exploration rate of agents and the game equilibria. The ideas of stochastic Q learning provide a new perspective for us to understand and fine-tune the learning process of MASs.

The rest of this paper is structured as follows. Section 2 shows the model of MAS and some necessary prerequisites. Section 3 introduces the stochastic Q learning method and derives the replicator dynamic of stochastic Q learning. Section 4 takes the two-agent game using some examples to realize the equilibrium of MAS; we then extend the two-action game to a three-action game, and further analyze the replicator dynamic of the stochastic Q learning algorithm. Section 5 studies the relationship between the exploration rate of agents and the game equilibrium by showing the results of sensitivity analysis. Section 6 contains a brief and concise summary.

2. Model and Prerequisites

In this section, we introduce the game model of MAS and some prerequisites.

In the framewrk of EGT, an MAS model is described as a tuple

Γ = < N, X_{i}, f_{i} >_{i \in N}

, where

N = {1, \dots, N}

is a finite set of all agents and N stands for the number of all agents. The strategy or action set of the i-th agent is expressed as

A_{i} = {1, \dots, n_{i}}_{i \in N}

. Also, we denote

A = \prod_{i = 1}^{N} A_{i}

as the strategies set of all agents, and

a = (a_{i}, a_{- i})

,

\forall i \in N

. We will denote with

A_{- i} = \prod_{k = 1, k \neq i}^{N} A_{k}

the strategies set of all other agents apart from the i-th agent. To analyze the evolutionary track of strategies choices of agents, let

X_{i} = {x_{i} = (x_{i 1}, \dots, x_{i n_{i}}) \in R^{n_{i}} : x_{i l} \geq 0, \sum_{l = 1}^{n_{i}} x_{i l} = 1, l = 1, \dots, n_{i}}

denote the set of mixed strategy for the i-th agent.

X = \prod_{i = 1}^{N} X_{i}

is all mixed strategies spaces for all agents, with

x = (x_{i}, x_{- i})

,

\forall i \in N

. Furthermore,

X_{- i} = \prod_{k = 1, k \neq i}^{N} X_{k}

represents all mixed strategies spaces of agents apart from the i-th agent, and X is dependent on the simplex of total number of actions

n = n_{1} + \dots + n_{N}

. In time-dependent scenarios, the index t will be used to represent the probability distribution

x_{i} (t) = {(x_{i l} (t))}_{l \in A_{i}}

,

t \geq 0

of the selection action l for the agent i.

Let

f_{i}^{'} : A_{i} \times A_{- i} \to R

be the payoff function of the i-th agent, while

f_{i}^{'} (l, a_{- i})

is dependent on the selection

a_{- i} \in A_{- i}

. The notation

(l, a_{- i})

denotes the strategy where the agent i selects the pure strategy

l \in A_{i}

, and all other agents choose their components of

a_{- i}

. Furthermore, the expected payoff of the i-th agent for a mixed strategy profile

x \in X

is defined by

f_{i} : X_{i} \times X_{- i} \to R

, and

f_{i} (x) = \sum_{a \in A} (x_{i l} f_{i}^{'} (l, a_{- i}) \cdot \prod_{k = 1, k \neq i}^{N} x_{k a_{k}})

, giving the expected payoff received by the agent under joint strategies in each state. Moreover,

r_{i l} (x) : = f_{i} (l, x_{- i})

, and let

r_{i l} (x_{- l})

denote the selecting action

l \in A_{i}

at the joint policy profile x for agent i, while

r_{i} (x) : = r_{i l} {(x)}_{l \in A_{i}}

, where

r_{i} (x)

denotes the payoff vector value of agent i. We use the notation

f_{i} (x) = 〈 x_{i}, r_{i} (x) 〉

, where

〈 \cdot, \cdot 〉

denotes the inner product of

R^{n_{i}}

, i.e.,

〈 x_{i}, r_{i} (x) 〉 = \sum_{l^{'} \in A_{i}} x_{i l^{'}} r_{i l^{'}} (x)

. It is easy to see that

\frac{\partial f_{i} (x)}{\partial x_{i l}} = r_{i l} (x)

. When the game is repeatedly played, it results in a sequential decision problem involving a state.

Next, the definitions of the Nash equilibrium in a multi-agent game system are given.

Definition 1

(see [42]). A strategy profile

x^{*} \in X

is the Nash equilibrium of Γ if for any

i \in N

, there exists

x_{i}^{'} \in X_{i}

such that:

f_{i} (x_{i}^{*}, x_{- i}^{*}) \geq f_{i} (x_{i}^{'}, x_{- i}^{*}),

which means that each agent chooses a strategy according to its own optimal response to the other agent’s strategy. In a Nash equilibrium, an individual agent can obtain no incremental benefit from changing strategies, assuming other agents remain constant in their strategies.

Definition 2

(see [43]). A strategy profile

x \in X

is an evolutionary stable strategy (ESS) if for any

x^{'} \in X

,

x^{'} \neq x

, there exists a constant

{\bar{ε}}_{x^{'}} \in (0, 1)

that satisfies the following inequality:

f (x, ε x^{'} + (1 - ε) x) > f (x^{'}, ε x^{'} + (1 - ε) x), \forall ε \in (0, {\bar{ε}}_{x^{'}}),

where f denotes the payoff of mixed strategy,

x^{'}

denotes the mutant strategy,

{\bar{ε}}_{x^{'}}

represents a constant related to

x^{'}

, and

ε x^{'} + (1 - ε) x

represents a mixed strategy containing mutation and stable strategies. Meanwhile, ESS refers to a strategy adopted by most members of the population, and this strategy is superior to other strategies. From a game perspective, this means that when using ESS, the average payoff in the population is higher than the average payoff after invading a population. Generally speaking, we show that a strategy is an ESS if it can resist evolutionary pressure from any mutation strategy that appears.

According to the definition of Nash equilibrium, an individual agent can obtain no incremental benefit from changing strategies, assuming other agents remain constant in their strategies. The relationships between Nash equilibrium and ESS are that

ESS

is contained in

Nash equilibrium

, and vice versa [44,45]. ESS is an asymptotically stable fixed point of replicator dynamics. In particular, in the process of game learning, replicator dynamics [43] describe the evolution of the frequencies of strategies in one population. In this paper, each agent is regarded as a learner, which is described in the following section.

3. Evolutionary Dynamic of Stochastic Q-Learning

In this section, stochastic Q learning is proposed by considering random disturbance factors in the real world, and we derive the evolutionary dynamic equation of stochastic Q learning.

3.1. Replicator Dynamic of Q Learning

RL is a learning method that utilizes experience for constant trial and error, and mutli-agent RL is an extension of this learning method in multi-agent scenarios. RL is also a computational method to learn whereby the agent efforts to maximize the total amount of payoff it receives while interacting with a complex and uncertain environment. It is employed by various software and machines to find the optimal possible behavior or path in a specific situation. Common RL approaches that can be found in [9] are built around an estimated value function. The Q learning consists of the five ingredients of an update sequence of state, action, reward, next state, and next action. The Q learning update equation is as follows [46]:

\begin{matrix} Q_{t + 1} (s, a) \leftarrow (1 - ν) Q_{t} (s, a) + ν (f_{t + 1} + γ max_{a^{'}} Q_{t} (s^{'}, a^{'})), \end{matrix}

(1)

where

Q_{t + 1} (s, a)

denotes the state-value estimate at time

t + 1

,

Q_{t} (s, a)

shows the state-value estimate at time t of the current state,

max_{a^{'}} Q_{t + 1} (s^{'}, a^{'})

denotes the maximum state-value estimate at step

t + 1

with state

s^{'}

and action

a^{'}

,

f_{t + 1}

represents the payoff of state visited at the next step

t + 1

,

ν

is a step size parameter, and

γ

represents the discount factor.

Next, we derive the replicator dynamic equation of Q learning in a continuous time, and Q-values are modeled as Boltzmann probabilities for the selection of action. With an opportunity of revision, the agent randomly selects a strategy and changes its strategy with the probability given by the Boltzmann distribution:

x_{l} (t) = \frac{e^{ς Q_{l} (t)}}{\sum_{k = 1}^{n} e^{ς Q_{k} (t)}},

(2)

where

x_{l} (t)

indicates the probability of selecting strategy l at time t,

ς

represents the exploration parameter, and

Q_{l} (t)

and

Q_{k} (t)

denote the Q-value that selecting action l and k at time t, respectively.

We are more concerned with the continuous time limit of the learning method. Thus, we divide the time into intervals

δ t

. Afterward, suppose that in each interval, the agent samples their actions and computes the average payoff of the action. Next, we use Equation (1) to update the Q-value at the end of each interval. In a continuous time limit

δ t \to 0

, by [20,40], the target dynamics of Q learning with a large population is denoted by:

\begin{matrix} d x_{l} (t) & = & ς ({\dot{Q}}_{l} - \sum_{k = 1}^{n} {\dot{Q}}_{k} x_{k} (t)) x_{l} (t) d t (and {\dot{Q}}_{l} = \frac{d Q_{l}}{d t}) . \end{matrix}

(3)

In order to change from a discrete step to a continuous version, we assume that the amount of time that passes between two repetitions of the Q-value update are given by

δ

with

0 < δ \leq 1

. The variable

Q_{l} (t δ)

denotes the Q-value that selecting action l at time

t δ

. By Equation (1), then, we have:

\begin{matrix} ▵ Q_{l} (t δ) & = ν (f_{l} ((t + 1) δ) + γ max Q_{l} ((t + 1) δ) - Q_{l} (t δ)) \times ((t + 1) δ - t δ) \\ = ν (f_{l} ((t + 1) δ) + γ max Q_{l} ((t + 1) δ) - Q_{l} (t δ)) \times δ, 0 < δ \leq 1 . \end{matrix}

(4)

Similarly, we are interested in the limit

δ \to 0

. By taking the limit of

Q_{l} (t δ)

with

δ \to 0

, we obtain the state of the limit process at some time

t \geq 0

. If use

δ

divide the both side of Equation (4), and take the limit for

δ \to 0

, then we have:

\begin{matrix} d Q = ν (f_{l} + γ max Q_{l} - Q_{l}) d t . \end{matrix}

(5)

Finally, by substituting Equation (5) into Equation (3), it can be obtained that:

\begin{array}{l} d x (t) & = ς (ν f_{l} + ν γ max Q_{l} - ν Q_{l} - \sum_{k = 1}^{n} x_{k} (t) ν (f_{k} + γ max Q_{l} - Q_{k})) x_{l} (t) d t \\ = ς ν (f_{l} - \bar{f} - Q_{l} + \sum_{k = 1}^{n} Q_{k} x_{k} (t)) x_{l} (t) d t (by \bar{f} = \sum_{k = 1}^{n} x_{k} (t) f_{k}) \\ = ς ν (f_{l} - \bar{f} - \sum_{k = 1}^{n} x_{k} (t) Q_{l} + \sum_{k = 1}^{n} Q_{k} x_{k} (t)) x_{l} (t) d t \\ = ς ν (f_{l} - \bar{f} + \sum_{k = 1}^{n} x_{k} (t) (Q_{k} - Q_{l})) x_{l} (t) d t, \end{array}

where

f_{i}

and

\bar{f}

denote, respectively, the agent’s payoff from action l and average payoff in the population. By Equation (2),

\frac{x_{k} (t)}{x_{l} (t)} = \frac{e^{ς Q_{k} (t)}}{e^{ς Q_{l} (t)}}

, and then we obtain:

\begin{matrix} ς ν \sum_{k = 1}^{n} x_{k} (t) (Q_{k} - Q_{l}) = ν \sum_{k = 1}^{n} x_{k} (t) \ln \frac{x_{k} (t)}{x_{l} (t)} . \end{matrix}

Thus, the above deduction gives us:

\begin{matrix} d x_{l} (t) & = & \underset{Selection operator}{\underset{︸}{(ς ν (f_{l} - \bar{f})}} - \underset{Mutation operator}{\underset{︸}{ν (\ln x_{l} (t) - \sum_{k = 1}^{n} x_{k} (t) \ln x_{k} (t))}}) x_{l} (t) d t, \end{matrix}

(6)

where

x_{l}

denotes the probability distribution of action or strategy l. The first term represents the selection mechanism, and the second term denotes the mutation mechanism. The selection operator is prior to some strategies, and the mutation operator ensures the diversity of the MAS. Furthermore, the learning process not only includes selection mechanism but also mutation mechanism. The selection operator favors certain varieties over others, and the mutation operator guarantees diversity in a population. The iterative process of Q learning shows the process of balancing exploration and exploitation, where exploration is essentially a variation, and exploitation is the selection.

3.2. Replicator Dynamic of Stochastic Q Learning

In order to account for the complexity and uncertainty inherent in real-world environments that can influence agents’ strategy choices, we introduce a stochastic perturbation term to characterize the interference caused by uncertain factors on MASs. On the one hand, the agents have the possibility of different strategic choices because of their own interests; on the other hand, the agent has a great speculative psychology to take self-interested behavior. In addition, the emotional changes and moral hazard of the participants will also affect their strategic behavior. Therefore, it is necessary to consider the interference of random disturbance on the multi-agent game. This paper introduces Gaussian white noise to characterize the replicator dynamic equations of multi-agent game [47] by:

\begin{matrix} \begin{matrix} d x_{l} (t) & = & (ς ν (f_{l} - \bar{f}) - ν (\ln x_{l} (t) - \sum_{k = 1}^{n} x_{k} (t) \ln x_{k} (t))) x_{l} (t) d t + σ x_{l} (t) d w (t), \end{matrix} \end{matrix}

(7)

where

w (t)

is standard one-dimensional Brownian motion, a kind of random fluctuation phenomenon, which can reflect how the game subject is affected by random interference factors. Also,

d w (t)

denotes Gaussian white noise, where

t > 0

is satisfied, and

h > 0

is the step. Furthermore, if

Δ w (t) = w (t + h) - w (t)

obeys normal distribution

N (0, \sqrt{h})

and

σ

is noise intensity, then we obtain that the exploration rate is

\frac{ς ν}{ν} = ς

. Therefore, the above Equation (7) is a one-dimensional Brownian motion, representing the evolutionary replicator dynamics formula of all agents under random disturbance.

3.3. Analysis of the Existence and Stability of Equilibrium Solutions

For the Itô stochastic equation in the above Equation (7), assume that the initial time

t = 0

, i.e., there exists

x (0) = 0

at the initial time of the game such that:

\begin{matrix} d x_{l} (t) = (ς ν (f_{l} - \bar{f}) - ν (\ln x_{l} (t) - \sum_{k = 1}^{n} x_{k} (t) \ln x_{k} ∣ (t))) \times 0 + σ x_{l} (t) d w (t) = 0 . \end{matrix}

(8)

We obtain

d w (t) ∣_{t = 0} = w^{'} (t) d t ∣_{t = 0} = 0

, and there exists at least one zero solution, which means that the system will stay in this state without the interference of external white noise. Therefore, the zero solution is the equilibrium point of Equation (8).

However, the system is bound to be disturbed by the internal and external environment, which will affect the stability of the system. Thus, the influence of random factors on system stability must be considered.

Definition 3.

Let the stochastic process

{x (t), t \geq 0}

satisfy the following stochastic differential equation:

d x_{l} (t) = f (t, x_{l} (t)) d t + g (t, x (t)) d w (t), x (t_{0}) = x_{0} .

(9)

Suppose there exist a function

V (t, x)

and a normal number

c_{1}, c_{2}

, such that

c_{1} {| x |}^{p} \leq V (t, x) \leq c_{2} {| x |}^{p}, t \geq 0

.

(a) If there is a normal number

λ

, such that

L V (t, x) \leq - λ V (t, x) \geq 0

, then the zero solution of Equation (9) is exponentially stable at the p-th moment, and:

E ∣ x (t, x_{0}) ∣^{p} < \frac{c_{2}}{c_{1}} ∣ x_{0} ∣^{p} e^{- λ t}, t \geq 0 .

(b) If there is a normal number

λ

, such that

L V (t, x) \geq λ V (t, x) \geq 0

, then the zero solution of Equation (9) is exponentially unstable at the p-th moment, and:

E ∣ x (t, x_{0}) ∣^{p} \geq \frac{c_{2}}{c_{1}} ∣ x_{0} ∣^{p} e^{- λ t}, t \geq 0 .

For Equation (8), let

V_{t} (t, x (t)) = x (t)

,

c_{1} = c_{2} = 1

,

p = 1

,

λ = 1

. By Definition 3, if the zero solution moment index of the equation is stable, then the following conditions are satisfied:

ς ν (f_{l} - \bar{f}) - ν (\ln x_{l} (t) - \sum_{k = 1}^{n} x_{k} (t) \ln x_{k} (t)) x_{l} (t) \leq - x_{l} (t) .

4. Numerical Experiment

In this section, two agents are regarded as randomly selected from the population to play a game, and we take some examples of a two-agent game to analyze the stochastic Q learning replicator dynamic and understand the decision behavior of the agent in MASs.

Assume that the payoff matrices of two-agent game are

C

and

D

, respectively. The replicator dynamics of stochastic Q learning for the first and second agent are, respectively:

\begin{matrix} \{\begin{matrix} \begin{matrix} {\dot{x}}_{l} & = x_{l} ς ν ({(C y)}_{l} - x C y) - x_{l} ν (\ln x_{l} - \sum_{k = 1}^{n} x_{k} \ln x_{k}) + σ_{1} x_{l} (t) d w (t), \\ {\dot{y}}_{l} & = y_{l} ς ν ({(B x)}_{l} - y B x) - y_{l} ν (\ln x_{l} - \sum_{k = 1}^{n} x_{k} \ln x_{k}) + σ_{2} y_{l} (t) d w (t), \end{matrix} \end{matrix} \end{matrix}

(10)

where

x_{l}

represents the probability of selecting the action l of the first agent,

y_{l}

denotes the probability of selecting the action l of the second agent, and

C

and

D

are the payoff matrices for agents 1 and 2, respectively.

The payoff matrices of agents 1 and 2 with two actions are

C

and

D

, respectively, where:

\begin{matrix} C = (\begin{matrix} a_{11} & a_{12} \\ a_{21} & a_{22} \end{matrix}), D = (\begin{matrix} d_{11} & d_{12} \\ d_{21} & d_{22} \end{matrix}) . \end{matrix}

Assume that two agents select the probability of the first action are

x_{1}

and

y_{1}

, and choose the probability of the second action are

x_{2} = 1 - x_{1}

and

y_{2} = 1 - y_{1}

, respectively. Let

x_{1} = x

,

x_{2} = 1 - x

,

y_{1} = y

,

y_{2} = 1 - y

. By Equation (10), the replicator dynamic equations of stochastic Q learning are as follows:

\begin{matrix} \{\begin{matrix} \begin{matrix} \dot{x} & = ν ς x (1 - x) ({(C y)}_{1} - {(C y)}_{2}) - ν x (1 - x) (\ln x - \ln (1 - x)) + σ_{1} x (t) d w (t), \\ \dot{y} & = ν ς y (1 - y) ({(D x)}_{1} - {(D x)}_{2}) - ν y (1 - y) (\ln y - \ln (1 - y)) + σ_{2} y (t) d w (t) . \end{matrix} \end{matrix} \end{matrix}

A two-agent game can be categorized into three classes by characterizing the payoff matrix.

Example 1.

The first class game has at least one strictly dominant equilibrium when:

\begin{matrix} \begin{matrix} (a_{11} - a_{21}) (a_{12} - a_{22}) > 0 or \\ (d_{11} - d_{12}) (d_{21} - d_{22}) > 0 . \end{matrix} \end{matrix}

The prisoner’s dilemma (PD) game belongs to the first subclass, where the two actions of the two agents are Defection or Cooperation, respectively, and the payoff matrices are as follows:

\begin{matrix} C_{1} = (\begin{matrix} 1 & 5 \\ 0 & 3 \end{matrix}), D_{1} = (\begin{matrix} 1 & 0 \\ 5 & 3 \end{matrix}) . \end{matrix}

Then, the stochastic Q learning replicator dynamic equations of the PD game are as follows:

\begin{matrix} \{\begin{matrix} \begin{matrix} \dot{x} & = ς ν x (1 - x) (2 - y) - ν x (1 - x) (\ln x - \ln (1 - x)) + σ_{1} x (t) d w (t), \\ \dot{y} & = ς ν y (1 - y) (- 3 - x) - ν y (1 - y) (\ln y - \ln (1 - y)) + σ_{2} y (t) d w (t) . \end{matrix} \end{matrix} \end{matrix}

In Figure 1a–c, the convergence plot of stochastic Q learning replicator dynamic is plotted in the PD game by filling in three values of

ς

, specifically

3, 5, 8

. When

ς = 3

, the learning paths converge to different coordinates since lower exploration level lead to the agent deviate from these high reward value actions of seeking strategies. When

ς = 5

, the learning paths are not closer to the coordinates

(1, 0)

. When

ς = 8

, the learning paths converge to coordinates

(1, 0)

corresponding to the strategy profile (Defection, Defection), which means that the PD game converges to the Nash equilibrium (i.e.,

[{(1, 0)}^{T}, {(1, 0)}^{T}]

). Figure 1 shows that the convergence of stochastic Q learning dynamic does not depend on the selection of initial point. Furthermore, we can better understand and predict the trajectory path of Q-learner evolution process, the replicator dynamic of the stochastic Q learning can converge to Nash equilibrium for PD game.

Example 2.

The second class game has two pure strategies equilibria and a mixed strategy equilibrium when:

\begin{matrix} \begin{matrix} (a_{11} - a_{21}) (a_{12} - a_{22}) < 0 \\ (d_{11} - d_{12}) (d_{21} - d_{22}) < 0 and \\ (a_{11} - a_{21}) (d_{11} - d_{12}) > 0 . \end{matrix} \end{matrix}

The battle of sexes (BoS) game belongs to the second subclass, where the two actions of the population are Soccer or Battle, separately, and the payoff matrices are as follows:

\begin{matrix} C_{2} = (\begin{matrix} 2 & 0 \\ 0 & 1 \end{matrix}), D_{2} = (\begin{matrix} 1 & 0 \\ 0 & 2 \end{matrix}) . \end{matrix}

Then, the stochastic Q learning replicator dynamic equations of the BoS game are described by:

\begin{matrix} \{\begin{matrix} \begin{matrix} \dot{x} & = ς ν x (1 - x) (3 y - 1) - ν x (1 - x) (\ln x - \ln (1 - x)) + σ_{1} x (t) d w (t), \\ \dot{y} & = ς ν y (1 - y) (3 x - 2) - ν y (1 - y) (\ln y - \ln (1 - y)) + σ_{2} y (t) d w (t) . \end{matrix} \end{matrix} \end{matrix}

As shown in Figure 2a–c, the convergence plot of stochastic Q learning replicator dynamic in the BoS game is plotted by filling in three values of

ς

, specifically 1, 3, and 5. When

ς = 1

, two agents will not converge to any Nash equilibrium. When

ς = 3

, learning paths converge to coordinate

(1, 0)

or

(0, 1)

, corresponding to the Nash equilibrium (i.e.,

[{(1, 0)}^{T}, {(1, 0)}^{T}]

) of the BoS game. When

ς = 5

, those learning paths can converge to two pure strategies corresponding to two actions (Soccer, Soccer) and (Battle, Battle).

Example 3.

The third class game has one mixed equilibrium when:

\begin{matrix} \begin{matrix} (a_{11} - a_{21}) (a_{12} - a_{22}) < 0 \\ (d_{11} - d_{12}) (d_{21} - d_{22}) < 0 and \\ (a_{11} - a_{21}) (d_{11} - d_{12}) < 0 . \end{matrix} \end{matrix}

The matching pennies (MP) game belongs to the third subclass, where the two strategies of the two agents are Head or Tail. It is worth noting that if the two payoff matrices of MP are not transposed against each other, then two agents in the MP are selected from two different populations to play the game. The payoff matrices are as follows:

\begin{matrix} C_{3} = (\begin{matrix} 1 & - 1 \\ - 1 & 1 \end{matrix}), D_{3} = (\begin{matrix} - 1 & 1 \\ 1 & - 1 \end{matrix}) . \end{matrix}

Then, the stochastic Q learning replicator dynamics equations can be obtained by:

\begin{matrix} \{\begin{matrix} \begin{matrix} \dot{x} & = ς ν x (1 - x) (2 y - 1) - ν x (1 - x) (\ln x - \ln (1 - x)) + σ_{1} x (t) d w (t), \\ \dot{y} & = ς ν y (1 - y) (- 2 x + 1) - ν y (1 - y) (\ln y - \ln (1 - y)) + σ_{2} y (t) d w (t) . \end{matrix} \end{matrix} \end{matrix}

From Figure 3a–c, the characteristic of MP is that the internal trajectory defines closed trajectory around the fixed point. In the first two plots,

ς

is not large enough compared to the internal trajectory closed orbits around the coordinate (0.5, 0.5). It can be seen from the above third plot that at each point, a learning path starts and converges to a fixed point. It is feasible to use stochastic Q learning algorithms to realize the Nash equilibrium (i.e.,

[{(\frac{1}{2}, \frac{1}{2})}^{T}, {(\frac{1}{2}, \frac{1}{2})}^{T}]

) of the MP.

Next, we extend two actions to three-action situations in two-agent games, and analyze and realize the equilibrium of the game. The payoff matrices of agents 1 and 2 with three actions are, respectively, as follows:

\begin{matrix} C = (\begin{matrix} a_{11} & a_{12} & a_{13} \\ a_{21} & a_{22} & a_{23} \\ a_{31} & a_{32} & a_{33} \end{matrix}), D = (\begin{matrix} d_{11} & d_{12} & d_{13} \\ d_{21} & d_{22} & d_{23} \\ d_{31} & d_{32} & d_{33} \end{matrix}) . \end{matrix}

Suppose that two agents select the probability of three actions, which are

x_{1}

,

y_{1}

,

z_{1}

and

x_{2}

,

y_{2}

,

z_{3}

, respectively. Let

x_{1} = x

,

y_{1} = y

,

z_{1} = z

and

x_{2} = 1 - y - z

,

y_{2} = 1 - x - z

,

z_{2} = 1 - x - y

. By Equation (10), the replicator dynamic equations of stochastic Q learning are as follows:

\begin{matrix} \{\begin{matrix} \begin{matrix} \dot{x} & = ς ν x (E_{x} - \bar{E}) - ν x (\ln x - \ln (1 - y - z)) + σ_{1} x (t) d w (t), \\ \dot{y} & = ς ν y (E_{y} - \bar{E}) - ν y (\ln y - \ln (1 - x - z)) + σ_{2} y (t) d w (t), \\ \dot{z} & = ς ν z (E_{z} - \bar{E}) - ν z (\ln z - \ln (1 - x - y)) + σ_{3} z (t) d w (t), \end{matrix} \end{matrix} \end{matrix}

where x, y, and z, respectively, represent the probability distribution of agents on three pure actions, and

x + y + z = 1

;

E_{x}

,

E_{y}

,

E_{z}

are, respectively, the payoff of the first, second, and third actions. The term

\bar{E}

is the average expected payoff in the population.

Example 4.

For the two-agent game with symmetrical

3 \times 3

actions, the payoff matrices are as follows [48]:

\begin{matrix} C_{4} = (\begin{matrix} 0 & 1 & 0 \\ 0 & 0 & 2 \\ 0 & 0 & 1 \end{matrix}), D_{4} = (\begin{matrix} 0 & 0 & 0 \\ 1 & 0 & 0 \\ 0 & 2 & 1 \end{matrix}) . \end{matrix}

Then, the stochastic Q learning replicator dynamic equations can be obtained by:

\begin{matrix} \{\begin{matrix} \begin{matrix} \dot{x} & = ς ν x (y - x y - 2 y z - z^{2}) - ν x (\ln (x) - \ln (1 - y - z)) + σ_{1} x (t) d w (t), \\ \dot{y} & = ς ν y (2 z - x y - 2 y z - z^{2}) - ν y (\ln (y) - \ln (1 - x - z)) + σ_{2} y (t) d w (t), \\ \dot{z} & = ς ν z (z - x y - 2 y z - z^{2}) - ν z (\ln (z) - \ln (1 - x - y)) + σ_{3} z (t) d w (t) . \end{matrix} \end{matrix} \end{matrix}

The simulation results in Figure 4 demonstrate the effectiveness of the stability conditions, which means that the conditional boundaries of each agent in the multi-agent game system under a random interference environment are given. The trajectory path of the stochastic Q learning replicator dynamics converges to the strategy

{[x (t), y (t), z (t)]}^{T} = {[1, 0, 0]}^{T}

by setting the appropriate random coefficients, which corresponds to the Nash equilibrium of the game. Hence, the replicator dynamics of stochastic Q learning can converge to Nash equilibrium, where simulation results are consistent with the theoretical results [48].

Remark 1.

In Section 4, Examples 1–3 show that the stochastic Q learning replicator can be used to analyze the equilibrium realization progress of all two-agent game types with two actions. Besides, these results are consistent with theoretical results. On the other hand, Example 4 extends the two-agent game with two actions to three actions, which provides the necessary theory for further extending it to a multi-agent multi-state game.

5. Sensitivity Analysis

In this section, we take modified potential games as an example to investigate sensitivity analysis on exploration parameters. Potential games were introduced by Smeed [49] to analyze congestion games. In potential games, when an agent unilaterally deviates from its action, the change of the potential function value is equal to the cost change of the agent’s deviation. In order to facilitate the numerical simulation experiment, we consider the modified potential game in the stochastic Q learning replicator dynamics of Equation (6), where the agent and the possible action have the same set, and each agent has a modified utility.

Lemma 1.

Given

Γ = < N, X_{i}, f_{i} >_{i \in N}

, considering a modified utility

f_{i}^{L} (x) = ς ν 〈 x_{i}, r_{i} (x) 〉 - ν 〈 x_{i}, \ln x_{i} 〉

,

\forall x \in X

, then the dynamics described by the differential equation

\frac{{\dot{x}}_{i l}}{x_{i l}}

in Equation (6) can be written as:

\frac{{\dot{x}}_{i l}}{x_{i l}} = r_{i l}^{L} (x) - 〈 x_{i}, r_{i}^{L} (x) 〉,

(11)

where

r_{i l}^{L} (x) = \frac{\partial f_{i}^{L} (x)}{\partial x_{i l}} : = ς ν r_{i l} (x) - ν (\ln x_{i l} + 1)

with

x_{l} (t)

cannot be all 0 for any action l since the sum over all actions should be 1, as this is a probability distribution. In particular, the dynamic of Equation (6) is described as the replicator dynamic of the modified setting

Γ^{L} = < N, X_{i}, f_{i}^{L} >_{i \in N}

. The superscript L represents the regularizing term, and

L (x_{i}) = - 〈 x_{i}, \ln x_{i} 〉 : = - \sum_{l^{'} \in A_{i}} x_{i l^{'}} \ln x_{i l^{'}}

,

L (x_{i})

denotes the Shannon entropy of selection distribution

x_{i} \in X_{i}

.

Let

Γ = < N, X_{i}, f_{i} >_{i \in N}

be a potential game. Next, we discuss more about the limited behavior of the stochastic Q learning dynamics. If there exists a function

ψ : A \to R

and a positive weight vector

w = {(w_{i})}_{i \in N}

such that

\forall i \in N

satisfying:

f_{i}^{'} (l, a_{- i}) - f_{i}^{'} (l^{'}, a_{- i}) = w_{i} (ψ (l, a_{- i}) - ψ (l^{'}, a_{- i})), \forall l \neq l^{'} \in A_{i}, a_{- i} \in A_{- i},

then

Γ

is called modified potential games. If

w_{i} = 1, \forall i \in N

, then

Γ

is called exact potential games. Let

Ψ : X \to R

represent the multi-linear extension of

ψ

defined by

Ψ (x) = \sum_{a \in A} ψ (a) \prod_{i = 1}^{N} x_{i a_{i}}, \forall x \in X

, then

Ψ

is called the potential function of

Γ

. In the framework of bounded rationality, with modified potential games and heterogeneous agents, the solution concept corresponds to quantum response equilibrium, which is the prototypical extension of the Nash equilibrium [50].

Theorem 1

(see [41]). If

Γ = < N, X_{i}, f_{i} >_{i \in N}

has a potential function,

Ψ : X \to R

, then the replicator dynamics of Equation (6) converges to a compact connected set of quantum response equilibrium of Γ.

Intuitively, the first term of Equation (6) corresponds to the replicator dynamics of agent i in potential games, which may absorb the weight of agent i; therefore, it is controlled by a potential function. The second term of Equation (6) is independent of the environment, and it is independent of the selection probability distribution of other agents. Thus, the structure of potential game is preserved, and the multiplicative constant for each agent which denotes the agent’s exploration rate is

ς

.

Lemma 2

(see [51]). Let

Γ = < N, X_{i}, f_{i} >_{i \in N}

be a MAS and

Ψ : X \to R

denote a potential function of Γ. Furthermore, consider a modified utility

f_{i}^{L} (x) = ς ν 〈 x_{i}, r_{i} (x) 〉 - ν 〈 x_{i}, \ln x_{i} 〉, \forall x \in X

, with potential function

Ψ^{L}

is defined as:

Ψ^{L} (x) = Ψ (x) + \sum_{i \in N} ς L (x_{i}), \forall x \in X,

(12)

then

Ψ^{L} (x)

is a potential function of modified game

Γ^{L} = < N, X_{i}, f_{i}^{L} >_{i \in N}

. The time derivative

{\dot{Ψ}}^{L} (x)

of the potential function is positive along any sequence of selection distribution generated by the dynamics of Equation (11) apart from the fixed point at point 0.

Although some ideal topological properties of stochastic Q learning dynamics have been established, the effectiveness of exploration in practice is still unclear from the perspective of balanced selection and individual agent performance (utility). Meanwhile, we consider a representative exploration-exploitation method, a cyclical learning rate with a cycle (CLR-1) method, which means that low exploration starts, increases to high exploration in the middle of the cycle, and decays to 0 (i.e., pure exploitation) [52]. In potential games, it is natural to consider the payoff impact on the agent currently selecting action l after adding a new agent that selects action

l^{'}

. In general, this effect can be obtained through the partial derivative

\frac{\partial f_{i}}{\partial x_{j}}

. However, since the payoff in the potential game is only defined on the simplex, this partial derivative may not exist. Furthermore, to visualize the modified potential game of Equation (12), we use the two-dimensional projection technique [53]. Next, we need to embed their selection distribution into

R^{n_{1} + n_{2} - 2}

and delete the simplex restrictions, where agent 1 and agent 2 have

n_{1}

and

n_{2}

actions, respectively. Let the transformation function be

x_{l} : R^{n_{1}} \to R^{n_{1} - 1}

with

y_{l} : = \frac{\log x_{l}}{x_{n_{1}}}

(i.e.,

\sum_{l = 1}^{n_{1}} x_{l} = 1

) for the first agent and

y_{l^{'}} : R^{n_{2}} \to R^{n_{2} - 1}

with

x_{l^{'}} : = \frac{\log y_{l^{'}}}{y_{n_{2}}}

(i.e.,

\sum_{l^{'} = 1}^{n_{2}} y_{l^{'}} = 1

) for the second agent. Then, we select two arbitrary directions in

R^{n_{1} + n_{2} - 2}

to draw the modified potential along the plot the modified potential

Ψ^{H} (x), \forall x \in X

.

Assume that the potential game is generated by a symmetric two-agent potential game, the exploration interval of two agents both are [−15, 15]. From Figure 5, we see that without exploration,

ς = 0

, the potential has different local maximum. As exploration increases, a unique common maximum value is formed in the transformed coordinates near the uniform distribution on (0, 0) when

ς > 1

. Specifically, when the agents modify their exploration speed, the stochastic Q learning dynamic converges to various vertex point of these changing surface, which corresponds to the local maximum of the potential game. However, when the exploration ratio is large, there is still on attractor, which corresponds to the quantum response equilibrium of a potential game.

As shown in Figure 6, we draw up the stochastic Q learning dynamics in a potential game with

n = 11

actions and random payoff in [0, 11]. The number of total iterations sets as 1500. The top two figure shows modified selection distributions, where different colors correspond to different optimal actions. The bottom left figure shows the average potential of a group of

11 \times 11

different trajectories; meanwhile, the shadow region represents a standard deviation that disappears when all trajectories paths converge to the same selection distribution. The bottom right figure indicates that the strategy is chosen according to the CLR-1 method. Starting from a grid of initial conditions close to each pure action profile, the stochastic Q learning dynamics rest at different local optimal before the exploration, converges to the uniform distribution when the exploration rates reach a peak, and then converges to the same optimal value when the exploration is gradually decreased to 0, where the transition point corresponds to the horizontal line and vanishing shaded areas.

6. Summary

In this paper, we have studied a stochastic Q learning method by introducing stochastic factors into Q learning, and derived the replicator dynamic equations of stochastic Q learning. On the one hand, in order to verify our theoretical results, we take the two-agent game with two actions or three actions as some examples to achieve and analyze the Nash equilibrium of multi-agent game systems, and then the convergence of stochastic Q learning does not depend on the selection of initial point. Besides, Q-learner converges to Nash equilibrium points, and the sample path of the learning process is approximated by the path of the differential equation in MAS. Furthermore, by combining RL and EGT, we can deduce that the Q learning replicator dynamics equation includes the mutation and selection mechanism, which not only enhances the diversity of the population, but also facilitates the learning ability of agents. On the other hand, we have investigated the sensitivity analysis on exploration parameter, which will affect the convergence speed of the learning trajectory path for the Q-learner. When the exploration parameter

ς

is smaller, the learning trajectory path does not converge easily. When the exploration parameter

ς

is larger, the learning trajectory path converges to quantum response equilibrium in the potential game. In conclusion, the learning method combining RL and EGT can be used to realize Nash equilibrium in multi-agent games systems, and the trajectory path of Nash equilibrium in multi-agent game systems is analyzed by the trajectory path of stochastic Q learning replicator dynamic equations.

In future research, the dynamics analysis of the decision-making behavior of agents based on stochastic Q learning will be applied to other games, such as multi-state and multi-agent stochastic games, multi-agent games with a leader–follower structure, consensus multi-agent games, and some real-life specific multi-agent game scenarios. In addition, we plan to design different heuristic learning algorithms, extend our research to a variety of learning algorithms, and perform an in-depth analysis of the differences between the mathematical models of various learning algorithms and replicator dynamics. Moreover, future work will also strengthen the theoretical guarantees and their impact on other application fields.

Author Contributions

Conceptualization, L.L. and G.S.; methodology, L.L. and G.S.; software, L.L. and G.S.; validation, L.L. and G.S.; writing—original draft preparation, L.L.; writing review and editing, L.L. and G.S.; visualization, L.L. and G.S.; supervision, L.L. and G.S.; project administration, L.L. and G.S.; funding acquisition, G.S. All authors have read and agreed to the published version of the manuscript.

Funding

Natural Science Foundation of Guizhou Province (Grant No. MS[2025]047); Construction of Engineering Research Centers in Higher Education Institutions in Guizhou Province (Grant No. Qian Education and Technology [2023]041).

Data Availability Statement

For the convenience of readers, we have added a Github Weblink: https://github.com/liuluping123/Stochasticevolutionary1212.git, accessed on 17 November 2024.

Conflicts of Interest

The authors declare no conflicts of interest.

References

Said, Y.B.; Kanzari, D.; Bezzine, M. A behavioral and rational investor modeling to explain subprime crisis: Multi-agent systems simulation in artificial financial markets. In Financial Decision Aid Using Multiple Criteria. Multiple Criteria Decision Making; Masri, H., Pérez-Gladish, B., Zopounidis, C., Eds.; Springer: Cham, Switzerland, 2018; pp. 131–147. [Google Scholar] [CrossRef]
Hon-Snir, S.; Monderer, D.; Sela, A. A learning approach to auctions. J. Econ. Theory 1998, 82, 65–88. [Google Scholar] [CrossRef]
Achbarou, O.; Kiram, M.A.E.; Bourkoukou, O.; Elbouanani, S. A Multi-Agent System Based Distributed Intrusion Detection System for a Cloud Computing, International Conference on Model and Data Engineering; Springer: Cham, Switzerland, 2018; Volume 929, pp. 98–107. [Google Scholar] [CrossRef]
Zhang, D.; Feng, G.; Shi, Y.; Srinivasan, D. Physical safety and cyber security analysis of multi-agent systems: A survey of recent advances. IEEE/CAA J. Autom. Sin. 2021, 8, 319–333. [Google Scholar] [CrossRef]
Yong, B.X.; Brintrup, A. Multi-Agent System for Machine Learning Under Uncertainty in Cyber Physical Manufacturing System, International Workshop on Service Orientation in Holonic and Multi-Agent Manufacturing; Springer: Cham, Switzerland, 2019; Volume 853, pp. 244–257. [Google Scholar] [CrossRef]
Liu, Y.; Huang, P.; Zhang, F.; Zhao, Y. Distributed formation control using artificial potentials and neural network for constrained multiagent systems. IEEE Trans. Control Syst. Technol. 2018, 28, 697–704. [Google Scholar] [CrossRef]
Vinyals, O.; Babuschkin, I.; Czarnecki, W.M.; Mathieu, M.; Dudzik, A.; Chung, J.; Choi, D.H.; Powell, R.; Ewalds, T.; Georgiev, P.; et al. Grandmaster level in StarCraft II using multi-agent reinforcement learning. Nature 2019, 575, 350–354. [Google Scholar] [CrossRef]
Zhang, H.G.; Jiang, H.; Luo, Y.L.; Xiao, G. Data-driven optimal consensus control for discrete-time multi-agent systems with unknown dynamics using reinforcement learning method. IEEE Trans. Ind. Electron. 2016, 64, 4091–4100. [Google Scholar] [CrossRef]
Sutton, R.; Barto, A. Reinforcement Learning: An Introduction; MIT Press: Cambridge, UK, 1998; ISBN 9780262257053. [Google Scholar]
Gu, S.; Yang, L.; Du, Y.; Chen, G.; Walter, F.; Wang, J.; Knoll, A. A review of safe reinforcement learning: Methods, theories and applications. IEEE Trans. Pattern Anal. Mach. Intell. 2024, 46, 11216–11235. [Google Scholar] [CrossRef] [PubMed]
Albrecht, S.V.; Christianos, F.; Schäfer, L. Multi-Agent Reinforcement Learning: Foundations and Modern Approaches; MIT Press: Cambridge, UK, 2024; ISBN 9780262049375. [Google Scholar]
Chu, T.S.; Wang, J.; Codecà, L.; Li, Z. Multi-agent deep reinforcement learning for large-scale traffic signal control. IEEE Trans. Intell. Transp. Syst. 2019, 21, 1086–1095. [Google Scholar] [CrossRef]
Nguyen, C.L.; Hoang, D.T.; Gong, S.M.; Niyato, D.; Wang, P.; Liang, Y.C.; Kim, D.I. Applications of deep reinforcement learning in communications and networking: A survey. IEEE Commun. Surv. Tutor. 2019, 21, 3133–3174. [Google Scholar] [CrossRef]
Wang, Y.; Cui, Y.; Li, Y.; Xu, Y. Collaborative optimization of multi-microgrids system with shared energy storage based on multi-agent stochastic game and reinforcement learning. Energy 2023, 280, 128182. [Google Scholar] [CrossRef]
Vázquez-Canteli, J.R.; Nagy, Z. Reinforcement learning for demand response: A review of algorithms and modeling techniques. Appl. Energy 2019, 235, 1072–1089. [Google Scholar] [CrossRef]
Busoniu, L.; Babuska, R.; Schutter, B.D. A comprehensive survey of multiagent reinforcement learning. IEEE Trans. Syst. Man Cybern. Part C Appl. Rev. 2008, 38, 156–172. [Google Scholar] [CrossRef]
Tuyls, K.; Weiss, G. Multiagent learning: Basics, challenges, and prospects. AI Mag. 2012, 33, 41–52. [Google Scholar] [CrossRef]
Mertikopoulos, P.; Sandholm, W.H. Learning in games via reinforcement and regularization. Math. Oper. Res. 2016, 41, 1297–1324. [Google Scholar] [CrossRef]
Sanders, J.B.T.; Farmer, J.D.; Galla, T. The prevalence of chaotic dynamics in games with many players. Sci. Rep. 2018, 8, 4902. [Google Scholar] [CrossRef]
Tuyls, K.; Verbeeck, K.; Lenaerts, T. A selection-mutation model for Q learning in multi-agent systems. In Proceedings of the 2th International Joint Conference on Autonomous Agents and Multiagent Systems, Melbourne, Australia, 14–18 July 2003; pp. 693–700. [Google Scholar] [CrossRef]
Hart, S. Adaptive heuristics. Econometrica 2004, 73, 1401–1430. [Google Scholar] [CrossRef]
Gomes, E.R.; Kowalczyk, R. Dynamic analysis of multiagent Q learning with ε-greedy exploration. In Proceedings of the 26th Annual International Conference on Machine Learning, Montreal, Canada, 14–18 June 2009; pp. 369–376. [Google Scholar] [CrossRef]
Barfuss, W.; Donges, J.F.; Kurths, J. Deterministic limit of temporal difference reinforcement learning for stochastic games. Phys. Rev. E 2019, 99, 043305. [Google Scholar] [CrossRef] [PubMed]
Tsitsiklis, J.N. Asynchronous stochastic approximation and Q learning. Mach. Learn. 1994, 16, 185–202. [Google Scholar] [CrossRef]
Liu, X.M.; Lin, K.K.; Wang, L.; Zhang, H. Stochastic evolutionary game analysis between special committees and CEO: Incentive and Supervision. Dyn. Games Appl. 2021, 11, 538–555. [Google Scholar] [CrossRef]
Sato, Y.; Crutchfield, J.P. Coupled replicator equations for the dynamics of learning in multiagent systems. Phys. Rev. E 2003, 67, 015206. [Google Scholar] [CrossRef] [PubMed]
Blum, A.; Monsour, Y. Learning, regret minimization and equilibria. In Algorithmic Game Theory; Cambridge University Press: Cambridge, UK, 2007; ISBN 9780511800481. [Google Scholar]
Klos, T.; van Ahee, G.J.; Tuyls, K. Evolutionary Dynamics of Regret Minimization, Joint European Conference on Machine Learning and Knowledge Discovery in Databases; Springer: Berlin/Heidelberg, Germany, 2010; pp. 82–96. [Google Scholar] [CrossRef]
Wang, Z.; Mu, C.J.; Hu, S.Y.; Chu, C.; Li, X. Modelling the dynamics of regret minimization in large agent populations: A master equation approach. IJCAI 2022, 22, 534–540. [Google Scholar] [CrossRef]
Wang, Z.; Marko, J.; Lei, S.; Lee, J.H.; Iwasa, Y.; Boccaletti, S. Exploiting a cognitive bias promotes cooperation in social dilemma experiments. Nat. Commun. 2018, 9, 2954. [Google Scholar] [CrossRef]
Wang, Z.; Jusup, M.; Wang, R.W.; Shi, L.; Iwasa, Y.; Moreno, Y.; Kurths, J. Onymity promotes cooperation in social dilemma experiments. Sci. Adv. 2017, 3, e1601444. [Google Scholar] [CrossRef]
Wang, Z.; Jusup, M.; Guo, H.; Shi, L.; Geček, S.; Anand, M.; Perc, M.; Bauch, C.T.; Kurths, J.; Boccalett, S.; et al. Communicating sentiment and outlook reverses inaction against collective risks. Proc. Natl. Acad. Sci. USA 2020, 117, 17650–17655. [Google Scholar] [CrossRef] [PubMed]
Das, S.; Suganthan, P.N. Differential evolution: A survey of the state-of-the-art. IEEE Trans. Evol. Comput. 2010, 15, 4–31. [Google Scholar] [CrossRef]
Bloembergen, D.; Tuyls, K.; Hennes, D.; Kaisers, M. Evolutionary dynamics of multi-agent learning: A survey. Artif. Intell. 2015, 53, 659–697. [Google Scholar] [CrossRef]
Börgers, T.; Sarin, R. Learning through reinforcement and replicator dynamics. J. Econ. Theory 1997, 77, 1–14. [Google Scholar] [CrossRef]
Tuyls, K.; Hoen, P.J.T.; Vanschoenwinkel, B. An evolutionary dynamical analysis of multi-agent learning in iterated games. Auton. Agents Multi-Agent Syst. 2006, 12, 115–153. [Google Scholar] [CrossRef]
Tuyls, K.; Parsons, S. What evolutionary game theory tells us about multiagent learning. Artif. Intell. 2007, 171, 406–416. [Google Scholar] [CrossRef]
Panait, L.; Luke, S. Cooperative multi-agent learning: The state of the art. Auton. Agents Multi-Agent Syst. 2005, 11, 387–434. [Google Scholar] [CrossRef]
Bowling, M.; Veloso, M. Multi-agent learning using a variable learning rate. Artif. Intell. 2002, 136, 215–250. [Google Scholar] [CrossRef]
Kianercy, A.; Galstyan, A. Dynamics of boltzmann Q learning in two-player two-action games. Phys. Rev. E 2012, 85, 041145. [Google Scholar] [CrossRef] [PubMed]
Leonardos, S.; Piliouras, G. Exploration-exploitation in multi-agent learning: Catastrophe theory meets game theory. Artif. Intell. 2022, 304, 103653. [Google Scholar] [CrossRef]
Ma, G.Y.; Zheng, Y.S.; Wang, L. Nash equilibrium topology of multi-agent systems with competitive groups. IEEE Trans. Ind. Electron. 2017, 64, 4956–4966. [Google Scholar] [CrossRef]
Sandholm, W.H. Population Games and Evolutionary Dynamics; MIT Press: Cambridge, UK, 2010; ISBN 978-0-262-19587-4. [Google Scholar]
György, S.; Gabor, F. Evolutionary games on graphs. Phys. Rep. 2007, 446, 97–216. [Google Scholar] [CrossRef]
Martin, A.N. Evolutionary Dynamics: Exploring the Equations of Life; Harvard University Press: Cambridge, UK, 2006; ISBN 978-0-674-41774-8. [Google Scholar]
Wiering, M.A.; van Otterlo, M. Reinforcement Learning: State-of-the-Art; Springer: Berlin/Heidelberg, Germany, 2012. [Google Scholar]
Jumarie, G. Approximate solution for some stochastic differential equations involving both Gaussian and Poissonian white noises. Appl. Math. Lett. 2003, 16, 1171–1177. [Google Scholar] [CrossRef]
Ritzberger, K.; Weibull, J.W. Evolutionary selection in normal-form games. Econom. J. Econom. Soc. 1995, 63, 1371–1399. [Google Scholar] [CrossRef]
Smeed, R.J. Studies in the economics of transportation. Econ. J. 1957, 67, 116–118. [Google Scholar] [CrossRef]
McKelvey, R.D.; Palfrey, T.R. Quantal response equilibria for normal form games. Games Econ. Behav. 1995, 10, 6–38. [Google Scholar] [CrossRef]
Coucheney, P.; Gaujal, B.; Mertikopoulos, P. Penalty-regulated dynamics and robust learning procedures in games. Math. Oper. Res. 2015, 40, 611–633. [Google Scholar] [CrossRef]
Smith, L.N.; Topin, N. Super-convergence: Very fast training of neural networks using large learning rates. In Artificial Intelligence and Machine Learning for Multi-Domain Operations Applications; SPIE: Paris, France, 2019; Volume 11006, pp. 369–386. [Google Scholar] [CrossRef]
Li, H.; Xu, Z.; Taylor, G.; Studer, C.; Goldstein, T. Visualizing the loss landscape of neural nets. In Proceedings of the 32nd International Conference on Neural Information Processing Systems, Montreal, Canada, 3–8 December 2018; Volume 31, pp. 6391–6401. [Google Scholar]

Figure 1. The stochastic Q learning trajectory paths of the PD game with

ς = 3

, 5, and 8 under random interference. The x-axis represents the probability with which the first agent plays Defection and the y-axis represents the probability with which the second agent plays Cooperation. (a)

ς = 3

,

ν = 1

,

σ_{1} = 0.1

,

σ = 0.1

,

w = 0.01

. (b)

ς = 5

,

ν = 1

,

σ_{1} = 0.1

,

σ = 0.1

,

w = 0.01

. (c)

ς = 8

,

ν = 1

,

σ_{1} = 0.1

,

σ = 0.1

,

w = 0.01

.

Figure 1. The stochastic Q learning trajectory paths of the PD game with

ς = 3

, 5, and 8 under random interference. The x-axis represents the probability with which the first agent plays Defection and the y-axis represents the probability with which the second agent plays Cooperation. (a)

ς = 3

,

ν = 1

,

σ_{1} = 0.1

,

σ = 0.1

,

w = 0.01

. (b)

ς = 5

,

ν = 1

,

σ_{1} = 0.1

,

σ = 0.1

,

w = 0.01

. (c)

ς = 8

,

ν = 1

,

σ_{1} = 0.1

,

σ = 0.1

,

w = 0.01

.

Figure 2. The stochastic Q learning trajectory paths of the BoS game with

ς = 1, 3,

and 5 under random interference. The x-axis denotes the probability of the first agent playing Soccer and the y-axis denotes the probability of the second agent playing Ballet. (a)

ς = 1

,

ν = 1

,

σ_{1} = 0.1

,

σ_{2} = 0.5

,

w = 0.01

. (b)

ς = 3

,

ν = 1

,

σ_{1} = 0.1

,

σ_{2} = 0.5

,

w = 0.01

. (c)

ς = 5

,

ν = 1

,

σ_{1} = 0.1

,

σ_{2} = 0.5

,

w = 0.01

.

Figure 2. The stochastic Q learning trajectory paths of the BoS game with

ς = 1, 3,

and 5 under random interference. The x-axis denotes the probability of the first agent playing Soccer and the y-axis denotes the probability of the second agent playing Ballet. (a)

ς = 1

,

ν = 1

,

σ_{1} = 0.1

,

σ_{2} = 0.5

,

w = 0.01

. (b)

ς = 3

,

ν = 1

,

σ_{1} = 0.1

,

σ_{2} = 0.5

,

w = 0.01

. (c)

ς = 5

,

ν = 1

,

σ_{1} = 0.1

,

σ_{2} = 0.5

,

w = 0.01

.

Figure 3. The stochastic Q learning trajectory paths of the MP game with

ς = 0.1, 3

, and 5 under random interference. The x-axis denotes the probability of the first agent playing Head and the y-axis denotes the probability of the second agent playing Tail. (a)

ς = 6

,

ν = 1

,

σ_{1} = 0.1

,

σ_{2} = 0.5

,

w = 0.01

. (b)

ς = 8

,

ν = 1

,

σ_{1} = 0.1

,

σ_{2} = 0.5

,

w = 0.01

. (c)

ς = 12

,

ν = 1

,

σ_{1} = 0.1

,

σ_{2} = 0.5

,

w = 0.01

.

Figure 3. The stochastic Q learning trajectory paths of the MP game with

ς = 0.1, 3

, and 5 under random interference. The x-axis denotes the probability of the first agent playing Head and the y-axis denotes the probability of the second agent playing Tail. (a)

ς = 6

,

ν = 1

,

σ_{1} = 0.1

,

σ_{2} = 0.5

,

w = 0.01

. (b)

ς = 8

,

ν = 1

,

σ_{1} = 0.1

,

σ_{2} = 0.5

,

w = 0.01

. (c)

ς = 12

,

ν = 1

,

σ_{1} = 0.1

,

σ_{2} = 0.5

,

w = 0.01

.

Figure 4. Three-action dynamic evolution path with different noise intensity

σ

. (a)

σ_{1} = 0.2

,

σ_{2} = 0.6

,

σ_{3} = 0.8

,

ς = 6

,

ν = 1

. (b)

σ_{1} = 0.5

,

σ_{2} = 0.8

,

σ_{3} = 0.3

,

ς = 6

,

ν = 1

. (c)

σ_{1} = 0.3

,

σ_{2} = 0.8

,

σ_{3} = 0.7

,

ς = 6

,

ν = 1

.

Figure 4. Three-action dynamic evolution path with different noise intensity

σ

. (a)

σ_{1} = 0.2

,

σ_{2} = 0.6

,

σ_{3} = 0.8

,

ς = 6

,

ν = 1

. (b)

σ_{1} = 0.5

,

σ_{2} = 0.8

,

σ_{3} = 0.3

,

ς = 6

,

ν = 1

. (c)

σ_{1} = 0.3

,

σ_{2} = 0.8

,

σ_{3} = 0.7

,

ς = 6

,

ν = 1

.

Figure 5. When

ς = 0

,

ς = 1

, and

ς = 5

, the difference exploration rates of stochastic Q learning method with random payoff in [0, 1] for the game, the phase transition of

Ψ^{L}

of the potential game with different exploration rates of the stochastic Q learning method. (a)

ς = 0

. (b)

ς = 1

. (c)

ς = 5

.

Figure 5. When

ς = 0

,

ς = 1

, and

ς = 5

, the difference exploration rates of stochastic Q learning method with random payoff in [0, 1] for the game, the phase transition of

Ψ^{L}

of the potential game with different exploration rates of the stochastic Q learning method. (a)

ς = 0

. (b)

ς = 1

. (c)

ς = 5

.

Figure 6. When

n = 11

actions in the potential game, the stochastic Q learning dynamic is used for exploration and exploitation with

n = 11

actions and random payoff in [0, 11].

Figure 6. When

n = 11

actions in the potential game, the stochastic Q learning dynamic is used for exploration and exploitation with

n = 11

actions and random payoff in [0, 11].

Disclaimer/Publisher’s Note: The statements, opinions and data contained in all publications are solely those of the individual author(s) and contributor(s) and not of MDPI and/or the editor(s). MDPI and/or the editor(s) disclaim responsibility for any injury to people or property resulting from any ideas, methods, instructions or products referred to in the content.

© 2025 by the authors. Licensee MDPI, Basel, Switzerland. This article is an open access article distributed under the terms and conditions of the Creative Commons Attribution (CC BY) license (https://creativecommons.org/licenses/by/4.0/).

Share and Cite

MDPI and ACS Style

Liu, L.; Sun, G. Evolutionary Dynamics of Stochastic Q Learning in Multi-Agent Systems. Axioms 2025, 14, 311. https://doi.org/10.3390/axioms14040311

AMA Style

Liu L, Sun G. Evolutionary Dynamics of Stochastic Q Learning in Multi-Agent Systems. Axioms. 2025; 14(4):311. https://doi.org/10.3390/axioms14040311

Chicago/Turabian Style

Liu, Luping, and Gang Sun. 2025. "Evolutionary Dynamics of Stochastic Q Learning in Multi-Agent Systems" Axioms 14, no. 4: 311. https://doi.org/10.3390/axioms14040311

APA Style

Liu, L., & Sun, G. (2025). Evolutionary Dynamics of Stochastic Q Learning in Multi-Agent Systems. Axioms, 14(4), 311. https://doi.org/10.3390/axioms14040311

Note that from the first issue of 2016, this journal uses article numbers instead of page numbers. See further details here.

Article Menu

Evolutionary Dynamics of Stochastic Q Learning in Multi-Agent Systems

Abstract

1. Introduction

2. Model and Prerequisites

3. Evolutionary Dynamic of Stochastic Q-Learning

3.1. Replicator Dynamic of Q Learning

3.2. Replicator Dynamic of Stochastic Q Learning

3.3. Analysis of the Existence and Stability of Equilibrium Solutions

4. Numerical Experiment

5. Sensitivity Analysis

6. Summary

Author Contributions

Funding

Data Availability Statement

Conflicts of Interest

References

Share and Cite

Article Metrics

Article Access Statistics

Further Information

Guidelines

MDPI Initiatives

Follow MDPI