1. Introduction
Multi-agent systems (MASs) are an approach to solving distributed problems: they consist of multiple agents that interact with each other in a shared environment and rely mainly on centralized or distributed frameworks to realize equilibrium learning among the agents. MASs have been widely used in various fields, including financial markets [1], auctions [2], cloud computing [3], smart grids [4], machine learning [5], and formation control [6]. In MASs, the capability of learning is crucial for an agent to behave appropriately in the face of unknown opponents and dynamic environments in order to optimize its performance. Reinforcement learning (RL) has provided a driving force for modeling interaction problems in MASs [7,8] and remains an active and significant topic of research [9,10,11,12,13,14,15].
RL in the single-agent framework has been investigated extensively; however, understanding learning in MASs remains an open problem. When multiple agents interact simultaneously, the reward of each agent depends not only on its own actions but also on the actions of the other agents [16,17]. Evolutionary game theory (EGT) plays an important role in better understanding developments in social economy and nature. In recent years, RL has attracted considerable attention due to its connection to EGT, and various Q learning variants have been studied widely in games and economics [18,19]. Tuyls et al. [20] first considered the replicator dynamics of Q learning with the Boltzmann distribution and analyzed the asymptotic behavior of two-agent games. Hart [21] studied the dynamic process of learning with adaptive heuristics by introducing a large class of simple behavior rules. Gomes et al. [22] proposed multi-agent Q learning with ε-greedy exploration and analyzed the expected behavior of two-agent games with the help of different replicator dynamic equations. Barfuss et al. [23] derived the deterministic limits of SARSA learning, Q learning, and actor–critic learning for stochastic games, and their dynamics diagrams in multi-agent, multi-state environments reveal a variety of dynamic regimes. Tsitsiklis [24] introduced asynchronous stochastic approximation into a Q learning algorithm, applied it to solve Markov decision problems, and provided convergence results under fairly general conditions. Liu et al. [25] studied the stochastic evolutionary game between special committees and chief executive officer (CEO) incentive and supervision, and analyzed the boundary conditions for stability based on stochastic replicator dynamic equations. In addition, results connecting replicator dynamics and evolutionary games have been extended to a number of algorithms, such as collective learning [26], regret-minimization learning [27,28,29], learning automata and Q learning [20], and the evolution of cooperation [29,30,31,32]. Moreover, the dynamics of RL have been established, providing a useful theoretical framework for achieving equilibrium learning behavior in MASs [20,22,26,33,34,35,36,37], and multi-agent learning algorithms have been designed to prevent collective learning from falling into local optima [17,26]. The high complexity and uncertainty inherent in real-world environments, however, influence agents' strategy choices and introduce uncertain factors into the decision-making behavior of multi-agent dynamic systems.
In recent years, game theory has provided a reliable mathematical framework for the research of multi-agent interaction problems. Furthermore, the problem of optimally balancing exploration and exploitation in MASs has been a basic driving factor of RL, deep learning, and EGT [38]. From the perspective of behavioral economics, learning agents use time-varying parameters to explore optimal solutions, and boundedly rational decision-making agents try to coordinate with other agents to maximize their profits [39]. In other words, exploration and exploitation are critical to balancing the learning speed and the quality of the strategies obtained by the learning algorithm. Kianercy et al. [40] considered the dynamics of Boltzmann Q learning in two-agent two-action games and studied the sensitivity of the rest-point structure with respect to exploration parameters. Leonardos et al. [41] studied smooth Q learning in MASs, which balances exploration and exploitation costs in potential games. The balance between exploration and exploitation in multi-agent RL has always been a challenging issue, yet there is little research on this topic; understanding this balance therefore has clear value and significance for research.
Inspired by the above research, and since the capability of learning is crucial for an agent to behave appropriately in the face of unknown opponents and dynamic environments, this paper analyzes the decision behavior of MASs from the viewpoint of EGT. We study a stochastic Q learning method obtained by introducing stochastic factors into Q learning. Stochastic Q learning has a theoretical foundation as a model for studying exploration and exploitation: it captures the trade-off between the game payoff and the exploration rate, and it ensures convergence to the set of Nash equilibria in MASs with heterogeneous learning agents. Besides, stochastic Q learning reveals an interesting connection between exploring–exploiting in RL and selecting–mutating in EGT, and the method is used to analyze the equilibria of MASs. Furthermore, we take modified potential games as an example to carry out a sensitivity analysis on the exploration parameters and further study the relationship between the exploration rate of agents and the game equilibria. The ideas behind stochastic Q learning provide a new perspective for understanding and fine-tuning the learning process of MASs.
The rest of this paper is structured as follows. Section 2 presents the MAS model and some necessary prerequisites. Section 3 introduces the stochastic Q learning method and derives the replicator dynamics of stochastic Q learning. Section 4 uses several two-agent games as examples to realize the equilibria of MASs; we then extend the two-action game to a three-action game and further analyze the replicator dynamics of the stochastic Q learning algorithm. Section 5 studies the relationship between the exploration rate of agents and the game equilibrium through a sensitivity analysis. Section 6 contains a brief summary.
2. Model and Prerequisites
In this section, we introduce the game model of MAS and some prerequisites.
In the framework of EGT, an MAS model is described as a tuple $\Gamma = (\mathcal{N}, \{A_i\}_{i \in \mathcal{N}}, \{u_i\}_{i \in \mathcal{N}})$, where $\mathcal{N} = \{1, 2, \ldots, N\}$ is the finite set of all agents and $N$ stands for the number of agents. The strategy or action set of the $i$-th agent is expressed as $A_i$. Also, we denote by $A = A_1 \times \cdots \times A_N$ the strategy set of all agents, with $a = (a_1, \ldots, a_N) \in A$ and $a_i \in A_i$. We denote by $A_{-i}$ the strategy set of all agents other than the $i$-th agent. To analyze the evolutionary track of the agents' strategy choices, let $X_i = \Delta(A_i)$ denote the set of mixed strategies of the $i$-th agent. $X = X_1 \times \cdots \times X_N$ is the space of mixed strategies of all agents, with $x = (x_1, \ldots, x_N) \in X$ and $x_i \in X_i$. Furthermore, $X_{-i}$ represents the mixed strategy space of all agents other than the $i$-th agent, and $X$ is the product of the simplices over the agents' action sets. In time-dependent scenarios, the index $t$ will be used, so that $x_{il}(t)$ denotes the probability with which agent $i$ selects action $l$ at time $t$.
Let $u_i: A \to \mathbb{R}$ be the payoff function of the $i$-th agent, which depends on the joint selection $a = (a_i, a_{-i})$. The notation $(l, a_{-i})$ denotes the strategy profile in which agent $i$ selects the pure strategy $l \in A_i$ and all other agents choose their components of $a_{-i}$. Furthermore, the expected payoff of the $i$-th agent for a mixed strategy profile $x = (x_i, x_{-i}) \in X$ is defined by $u_i(x) = \mathbb{E}_{a \sim x}\,[u_i(a)]$, giving the expected payoff received by the agent under the joint strategies. Moreover, let $r_{il}(x) = u_i(l, x_{-i})$ denote the expected payoff of agent $i$ when selecting action $l$ at the joint policy profile $x$, and let $r_i(x) = (r_{il}(x))_{l \in A_i}$ denote the payoff vector of agent $i$. We use the notation $\langle x_i, r_i(x) \rangle$ for the inner product of $x_i$ and $r_i(x)$, i.e., $\langle x_i, r_i(x) \rangle = \sum_{l \in A_i} x_{il}\, r_{il}(x)$. It is easy to see that $u_i(x) = \langle x_i, r_i(x) \rangle$. When the game is played repeatedly, it results in a sequential decision problem involving a state.
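For concreteness, in a two-agent game the expected payoffs above reduce to bilinear forms. The following minimal sketch (Python with NumPy, hypothetical payoff matrices and mixed strategies) computes the payoff vector $r_1(x)$ and the expected payoffs $u_1(x) = \langle x_1, r_1(x) \rangle$ and $u_2(x)$ under these assumptions.

```python
import numpy as np

# Hypothetical 2x2 bimatrix game: A is agent 1's payoff matrix, B is agent 2's.
A = np.array([[3.0, 0.0],
              [5.0, 1.0]])
B = A.T  # symmetric example; any matrix of the same shape works

x1 = np.array([0.6, 0.4])   # mixed strategy of agent 1
x2 = np.array([0.3, 0.7])   # mixed strategy of agent 2

# Payoff vector of agent 1: expected payoff of each pure action l against x2.
r1 = A @ x2                  # r_{1l}(x) = u_1(l, x_{-1})
# Expected payoffs of the two agents under the mixed profile x = (x1, x2).
u1 = x1 @ r1                 # u_1(x) = <x1, r_1(x)>
u2 = x1 @ B @ x2             # analogous bilinear form for agent 2

print(r1, u1, u2)
```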
Next, the definitions of Nash equilibrium and evolutionarily stable strategy in a multi-agent game system are given.
Definition 1 (see [42]). A strategy profile $x^* = (x_i^*, x_{-i}^*) \in X$ is a Nash equilibrium of $\Gamma$ if, for every agent $i$ and every $x_i \in X_i$,
$$u_i(x_i^*, x_{-i}^*) \geq u_i(x_i, x_{-i}^*),$$
which means that each agent chooses a strategy that is a best response to the other agents' strategies. In a Nash equilibrium, no individual agent can obtain an incremental benefit from changing its strategy while the other agents keep their strategies fixed.

Definition 2 (see [43]). A strategy profile $x^* \in X$ is an evolutionarily stable strategy (ESS) if, for any mutant strategy $x \neq x^*$, there exists a constant $\bar{\varepsilon} \in (0, 1)$ such that for all $\varepsilon \in (0, \bar{\varepsilon})$ the following inequality holds:
$$f\big(x^*, \varepsilon x + (1 - \varepsilon) x^*\big) > f\big(x, \varepsilon x + (1 - \varepsilon) x^*\big),$$
where $f$ denotes the payoff of a mixed strategy, $x$ denotes the mutant strategy, $\bar{\varepsilon}$ is a constant related to $x$, and $\varepsilon x + (1 - \varepsilon) x^*$ represents the mixed population containing the mutant and the stable strategy. An ESS is a strategy that, once adopted by most members of a population, cannot be displaced: from a game perspective, agents playing the ESS earn a higher average payoff than the mutants invading the population. Generally speaking, a strategy is an ESS if it can resist the evolutionary pressure of any mutant strategy that appears. According to the definition of Nash equilibrium, an individual agent can obtain no incremental benefit from changing strategies while the other agents keep theirs fixed. The relationship between Nash equilibria and ESSs is that every ESS is a Nash equilibrium, but not every Nash equilibrium is an ESS [44,45]. An ESS is an asymptotically stable fixed point of the replicator dynamics. In particular, in the process of game learning, the replicator dynamics [43] describe the evolution of the frequencies of strategies within a population. In this paper, each agent is regarded as a learner, as described in the following section.
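As a minimal illustration of Definition 1, the sketch below checks whether a pure strategy profile of a bimatrix game is a Nash equilibrium by testing unilateral deviations; the payoff matrices are hypothetical and only serve to exercise the definition.

```python
import numpy as np

def is_pure_nash(A, B, i, j):
    """Check whether the pure profile (i, j) is a Nash equilibrium of the
    bimatrix game (A, B): no agent gains by a unilateral deviation."""
    best_for_1 = A[i, j] >= A[:, j].max()   # agent 1 cannot improve given column j
    best_for_2 = B[i, j] >= B[i, :].max()   # agent 2 cannot improve given row i
    return best_for_1 and best_for_2

# Hypothetical prisoner's-dilemma-style payoffs (index 0 = Cooperate, 1 = Defect).
A = np.array([[3, 0],
              [5, 1]])
B = A.T

print([(i, j) for i in range(2) for j in range(2) if is_pure_nash(A, B, i, j)])
# -> [(1, 1)], i.e., mutual defection is the unique pure Nash equilibrium
```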
3. Evolutionary Dynamic of Stochastic Q-Learning
In this section, stochastic Q learning is proposed by considering random disturbance factors in the real world, and we derive the evolutionary dynamic equation of stochastic Q learning.
3.1. Replicator Dynamic of Q Learning
RL is a learning method that uses experience gathered through constant trial and error, and multi-agent RL is its extension to multi-agent scenarios. RL is also a computational approach in which the agent strives to maximize the total payoff it receives while interacting with a complex and uncertain environment. It is employed by various software systems and machines to find the best possible behavior or path in a specific situation. Common RL approaches, which can be found in [9], are built around an estimated value function. Q learning operates on an update sequence whose five ingredients are the state, action, reward, next state, and next action. The Q learning update equation is as follows [46]:
$$Q_{t+1}(s_t, a_t) = Q_t(s_t, a_t) + \alpha \Big[ r_{t+1} + \gamma \max_{a} Q_t(s_{t+1}, a) - Q_t(s_t, a_t) \Big], \qquad (1)$$
where $Q_{t+1}(s_t, a_t)$ denotes the value estimate at time $t+1$, $Q_t(s_t, a_t)$ is the value estimate at time $t$ of the current state–action pair, $\max_{a} Q_t(s_{t+1}, a)$ denotes the maximum value estimate at step $t+1$ with state $s_{t+1}$ over the available actions $a$, $r_{t+1}$ represents the payoff received at the next step $t+1$, $\alpha$ is a step-size parameter, and $\gamma$ represents the discount factor.
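A minimal tabular sketch of the update in Equation (1); the state and action spaces are hypothetical, and `alpha` and `gamma` correspond to the step-size parameter and discount factor above.

```python
import numpy as np

def q_update(Q, s, a, r, s_next, alpha=0.1, gamma=0.9):
    """One application of Equation (1):
    Q(s,a) <- Q(s,a) + alpha * [r + gamma * max_a' Q(s',a') - Q(s,a)]."""
    td_target = r + gamma * Q[s_next].max()
    Q[s, a] += alpha * (td_target - Q[s, a])
    return Q

# Hypothetical problem with 4 states and 2 actions.
Q = np.zeros((4, 2))
Q = q_update(Q, s=0, a=1, r=1.0, s_next=2)
print(Q)
```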
Next, we derive the replicator dynamic equation of Q learning in continuous time, where the Q-values are mapped to action-selection probabilities through the Boltzmann distribution. Given a revision opportunity, the agent randomly selects a strategy and changes its strategy with the probability given by the Boltzmann distribution:
$$x_l(t) = \frac{e^{\tau Q_l(t)}}{\sum_{k} e^{\tau Q_k(t)}}, \qquad (2)$$
where $x_l(t)$ indicates the probability of selecting strategy $l$ at time $t$, $\tau$ represents the exploration parameter, and $Q_l(t)$ and $Q_k(t)$ denote the Q-values of selecting actions $l$ and $k$ at time $t$, respectively.
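A short sketch of the Boltzmann selection rule in Equation (2); the exploration parameter `tau` weights the Q-values, and the softmax is computed in a numerically stable way. Larger values of `tau` concentrate probability on the highest Q-values.

```python
import numpy as np

def boltzmann_policy(q_values, tau):
    """Equation (2): probability of action l proportional to exp(tau * Q_l)."""
    z = tau * np.asarray(q_values, dtype=float)
    z -= z.max()                      # numerical stability; does not change the distribution
    p = np.exp(z)
    return p / p.sum()

print(boltzmann_policy([1.0, 0.5, 0.0], tau=2.0))   # larger tau -> greedier selection
print(boltzmann_policy([1.0, 0.5, 0.0], tau=0.1))   # smaller tau -> closer to uniform
```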
We are mainly concerned with the continuous time limit of the learning method. Thus, we divide time into intervals of length $\delta$. Suppose that in each interval the agent samples its actions and computes the average payoff of each action; Equation (1) is then used to update the Q-values at the end of each interval. In the continuous time limit $\delta \to 0$, following [20,40], the target dynamics of Q learning with a large population are obtained by differentiating Equation (2) with respect to time:
$$\frac{dx_l(t)}{dt} = \tau x_l(t) \Big[ \frac{dQ_l(t)}{dt} - \sum_{k} x_k(t) \frac{dQ_k(t)}{dt} \Big]. \qquad (3)$$
In order to pass from discrete steps to a continuous version, we assume that the amount of time that passes between two repetitions of the Q-value update is given by $\delta$ with $0 < \delta \leq 1$. The variable $Q_l(k\delta)$ denotes the Q-value of selecting action $l$ at time $k\delta$. By Equation (1), we then have:
$$Q_l\big((k+1)\delta\big) - Q_l(k\delta) = \delta \alpha \Big[ u_l(k\delta) + \gamma \max_{j} Q_j(k\delta) - Q_l(k\delta) \Big]. \qquad (4)$$
Similarly, we are interested in the limit $\delta \to 0$. By taking the limit of $Q_l(k\delta)$ with $k\delta \to t$, we obtain the state of the limit process at time $t$. Dividing both sides of Equation (4) by $\delta$ and taking the limit as $\delta \to 0$, we have:
$$\frac{dQ_l(t)}{dt} = \alpha \Big[ u_l(t) + \gamma \max_{j} Q_j(t) - Q_l(t) \Big]. \qquad (5)$$
Finally, by substituting Equation (5) into Equation (3), it can be obtained that:
$$\frac{dx_l(t)}{dt} = \tau \alpha x_l(t) \Big[ u_l(t) - \bar{u}(t) - Q_l(t) + \sum_{k} x_k(t) Q_k(t) \Big],$$
where $u_l$ and $\bar{u} = \sum_k x_k u_k$ denote, respectively, the agent's payoff from action $l$ and the average payoff in the population, and the term $\gamma \max_j Q_j$ cancels because it is common to all actions. By Equation (2), $Q_l(t) - \sum_k x_k(t) Q_k(t) = \frac{1}{\tau}\big(\ln x_l(t) - \sum_k x_k(t) \ln x_k(t)\big)$, and then we obtain:
$$\frac{dx_l(t)}{dt} = \tau \alpha x_l(t) \big[ u_l(t) - \bar{u}(t) \big] - \alpha x_l(t) \Big[ \ln x_l(t) - \sum_{k} x_k(t) \ln x_k(t) \Big].$$
Thus, the above deduction gives us:
$$\frac{dx_l(t)}{dt} = \tau \alpha x_l(t) \big[ u_l(t) - \bar{u}(t) \big] + \alpha x_l(t) \sum_{k} x_k(t) \ln \frac{x_k(t)}{x_l(t)}, \qquad (6)$$
where $x_l(t)$ denotes the probability of choosing action (or strategy) $l$. The first term represents the selection mechanism and the second term the mutation mechanism: the selection operator favors certain strategies over others, while the mutation operator guarantees diversity in the population. Hence, the learning process includes both a selection and a mutation mechanism. The iterative process of Q learning thus reflects the balance between exploration and exploitation, where exploration essentially acts as mutation and exploitation acts as selection.
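A minimal sketch that integrates the selection–mutation dynamics of Equation (6) with a simple Euler scheme for both agents of a symmetric two-action game; the payoff matrix and the values of `alpha` and `tau` are illustrative assumptions, not those used in the experiments below.

```python
import numpy as np

def q_replicator_rhs(x, y, A, alpha=0.5, tau=2.0):
    """Right-hand side of Equation (6) for one agent with policy x facing opponent policy y."""
    u = A @ y                                          # payoff of each pure action against y
    avg = x @ u                                        # average payoff under x
    selection = tau * alpha * x * (u - avg)            # selection (exploitation) term
    mutation = alpha * x * (x @ np.log(x) - np.log(x)) # mutation (exploration) term
    return selection + mutation

# Hypothetical symmetric 2x2 game with a dominant second action (PD-like ordering).
A = np.array([[3.0, 0.0],
              [5.0, 1.0]])
x = np.array([0.6, 0.4])
y = np.array([0.4, 0.6])
dt = 0.01
for _ in range(5000):
    dx, dy = q_replicator_rhs(x, y, A), q_replicator_rhs(y, x, A)  # B = A^T, so A serves both agents
    x, y = np.clip(x + dt * dx, 1e-12, None), np.clip(y + dt * dy, 1e-12, None)
    x, y = x / x.sum(), y / y.sum()                    # keep the iterates on the simplex
print(x, y)   # probability mass shifts toward the dominant second action
```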
3.2. Replicator Dynamic of Stochastic Q Learning
In order to account for the complexity and uncertainty inherent in real-world environments, which can influence agents' strategy choices, we introduce a stochastic perturbation term to characterize the interference caused by uncertain factors on MASs. On the one hand, agents may make different strategic choices because of their own interests; on the other hand, an agent may act speculatively and take self-interested actions. In addition, the emotional changes and moral hazard of the participants also affect their strategic behavior. Therefore, it is necessary to consider the interference of random disturbances on the multi-agent game. This paper introduces Gaussian white noise into the replicator dynamic equations of the multi-agent game [47]:
$$dx_l(t) = \Big\{ \tau \alpha x_l(t) \big[ u_l(t) - \bar{u}(t) \big] + \alpha x_l(t) \sum_{k} x_k(t) \ln \frac{x_k(t)}{x_l(t)} \Big\}\,dt + \sigma x_l(t)\, dB(t), \qquad (7)$$
where $B(t)$ is a standard one-dimensional Brownian motion, a kind of random fluctuation that reflects how the game participants are affected by random interference factors. Its formal derivative $\dot{B}(t)$ denotes Gaussian white noise, and over a step $h$ the increment $B(t+h) - B(t)$ obeys the normal distribution $N(0, h)$; $\sigma$ denotes the noise intensity, which governs the strength of the random exploration experienced by the agents. Therefore, Equation (7), driven by a one-dimensional Brownian motion, represents the evolutionary replicator dynamics of the agents under random disturbance.
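A minimal Euler–Maruyama sketch of the stochastic replicator dynamics in Equation (7), adding a Gaussian perturbation of intensity `sigma` to the deterministic drift of Equation (6); the multiplicative, per-component form of the noise and all parameter values are assumptions made for illustration.

```python
import numpy as np

def drift(x, y, A, alpha=0.5, tau=2.0):
    """Deterministic drift of Equation (7), i.e., the right-hand side of Equation (6)."""
    u = A @ y
    avg = x @ u
    return tau * alpha * x * (u - avg) + alpha * x * (x @ np.log(x) - np.log(x))

def em_step(x, y, A, sigma, dt, rng, alpha=0.5, tau=2.0):
    """One Euler-Maruyama step of Equation (7); the multiplicative noise term
    sigma * x * dB is an assumed form of the perturbation."""
    dB = rng.normal(0.0, np.sqrt(dt), size=x.shape)
    x_new = x + drift(x, y, A, alpha, tau) * dt + sigma * x * dB
    x_new = np.clip(x_new, 1e-12, None)          # keep the iterate a valid distribution
    return x_new / x_new.sum()

rng = np.random.default_rng(0)
A = np.array([[3.0, 0.0],                        # hypothetical payoffs, not those of the paper
              [5.0, 1.0]])
x, y = np.array([0.5, 0.5]), np.array([0.5, 0.5])
for _ in range(5000):
    x, y = em_step(x, y, A, 0.05, 0.01, rng), em_step(y, x, A, 0.05, 0.01, rng)
print(x, y)
```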
3.3. Analysis of the Existence and Stability of Equilibrium Solutions
For the Itô stochastic equation given in Equation (7) above, assume that at the initial time $t = 0$ the probability of the considered action is zero, i.e., $x_l(0) = 0$, so that the equation takes the Itô form
$$dx_l(t) = f\big(x_l(t)\big)\,dt + g\big(x_l(t)\big)\,dB(t), \qquad f(0) = g(0) = 0, \qquad (8)$$
where $f$ collects the drift terms of Equation (7) and $g(x_l) = \sigma x_l$ is the diffusion term. We obtain $x_l(t) \equiv 0$, so there exists at least one zero solution, which means that the system stays in this state in the absence of external white-noise interference. Therefore, the zero solution is an equilibrium point of Equation (8).
However, the system is bound to be disturbed by the internal and external environment, which will affect the stability of the system. Thus, the influence of random factors on system stability must be considered.
Definition 3. Let the stochastic process $x(t)$ satisfy the following stochastic differential equation:
$$dx(t) = f\big(x(t), t\big)\,dt + g\big(x(t), t\big)\,dB(t), \qquad x(0) = x_0. \qquad (9)$$
Suppose there exist a function $V(x, t)$ and positive constants $c_1$, $c_2$ such that $c_1 |x|^p \leq V(x, t) \leq c_2 |x|^p$.
(a) If there is a positive constant $c_3$ such that $\mathcal{L}V(x, t) \leq -c_3 V(x, t)$, then the zero solution of Equation (9) is exponentially stable in the $p$-th moment, and:
$$\mathbb{E}\,|x(t)|^p \leq \frac{c_2}{c_1}\,|x_0|^p\, e^{-c_3 t}.$$
(b) If there is a positive constant $c_3$ such that $\mathcal{L}V(x, t) \geq c_3 V(x, t)$, then the zero solution of Equation (9) is exponentially unstable in the $p$-th moment, and:
$$\mathbb{E}\,|x(t)|^p \geq \frac{c_1}{c_2}\,|x_0|^p\, e^{c_3 t}.$$
For Equation (8), choose the Lyapunov function $V(x) = |x|^p$, so that the bounds of Definition 3 hold with $c_1 = c_2 = 1$. By Definition 3, if the zero solution of the equation is to be exponentially stable in the $p$-th moment, then the following condition must be satisfied: there exists a positive constant $c_3$ such that $\mathcal{L}V(x) \leq -c_3 V(x)$.
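As a hedged illustration of how Definition 3 is applied in the one-dimensional case (an assumed linearization; the exact coefficients of Equation (8) are not reproduced here), consider a scalar linear Itô equation with multiplicative noise. The Lyapunov function $V(x) = |x|^p$ then yields an explicit moment-stability condition.

```latex
% Assumed scalar linearization of Equation (8):
\[
  dx(t) = a\,x(t)\,dt + \sigma\,x(t)\,dB(t), \qquad V(x) = |x|^{p},\ p > 0 .
\]
% Ito's formula gives the infinitesimal generator acting on V:
\[
  \mathcal{L}V(x) = p\,a\,|x|^{p} + \tfrac{1}{2}\,p(p-1)\,\sigma^{2}\,|x|^{p}
                  = \Big( p\,a + \tfrac{p(p-1)}{2}\,\sigma^{2} \Big) V(x).
\]
% Hence, if  p a + p(p-1)\sigma^{2}/2 \le -c_{3} < 0, case (a) of Definition 3 applies and the
% zero solution is exponentially stable in the p-th moment; with the reversed inequality,
% case (b) applies and it is exponentially unstable.
```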
4. Numerical Experiment
In this section, two agents are regarded as being randomly selected from a population to play a game, and we use several two-agent games as examples to analyze the stochastic Q learning replicator dynamics and to understand the decision behavior of agents in MASs.
Assume that the payoff matrices of the two-agent game are $A$ and $B$, respectively. The replicator dynamics of stochastic Q learning for the first and second agent are, respectively:
$$dx_l(t) = \Big\{ \tau \alpha x_l(t)\big[(A y)_l - x^{\top} A y\big] + \alpha x_l(t) \sum_{k} x_k(t) \ln\frac{x_k(t)}{x_l(t)} \Big\}\,dt + \sigma x_l(t)\,dB_1(t),$$
$$dy_l(t) = \Big\{ \tau \alpha y_l(t)\big[(B^{\top} x)_l - x^{\top} B y\big] + \alpha y_l(t) \sum_{k} y_k(t) \ln\frac{y_k(t)}{y_l(t)} \Big\}\,dt + \sigma y_l(t)\,dB_2(t), \qquad (10)$$
where $x_l$ represents the probability of the first agent selecting action $l$, $y_l$ denotes the probability of the second agent selecting action $l$, $A$ and $B$ are the payoff matrices of agents 1 and 2, respectively, and $B_1(t)$ and $B_2(t)$ are independent standard Brownian motions. The payoff matrices of agents 1 and 2 with two actions are
$$A = \begin{pmatrix} a_{11} & a_{12} \\ a_{21} & a_{22} \end{pmatrix}, \qquad B = \begin{pmatrix} b_{11} & b_{12} \\ b_{21} & b_{22} \end{pmatrix},$$
respectively.
Assume that the two agents select the first action with probabilities $x_1$ and $y_1$ and the second action with probabilities $x_2$ and $y_2$, respectively. Let $x_1 = x$, $x_2 = 1 - x$, $y_1 = y$, $y_2 = 1 - y$. By Equation (10), the replicator dynamic equations of stochastic Q learning reduce to:
$$dx(t) = \Big\{ \tau\alpha x(1-x)\big[(a_{11}-a_{21})y + (a_{12}-a_{22})(1-y)\big] + \alpha x(1-x)\ln\frac{1-x}{x} \Big\}\,dt + \sigma x\,dB_1(t),$$
$$dy(t) = \Big\{ \tau\alpha y(1-y)\big[(b_{11}-b_{12})x + (b_{21}-b_{22})(1-x)\big] + \alpha y(1-y)\ln\frac{1-y}{y} \Big\}\,dt + \sigma y\,dB_2(t).$$
Two-agent two-action games can be categorized into three classes according to their payoff matrices.
Example 1. The first class of games has at least one strictly dominant equilibrium; this occurs when at least one agent has a strictly dominant action, i.e., when $(a_{11}-a_{21})(a_{12}-a_{22}) > 0$ or $(b_{11}-b_{12})(b_{21}-b_{22}) > 0$. The prisoner's dilemma (PD) game belongs to this first subclass, where the two actions of the two agents are Cooperation and Defection, and the payoff matrices take the standard PD form
$$A = \begin{pmatrix} R & S \\ T & P \end{pmatrix}, \qquad B = A^{\top}, \qquad T > R > P > S,$$
with the first action being Cooperation and the second Defection. Then, the stochastic Q learning replicator dynamic equations of the PD game are obtained by substituting these payoff matrices into the reduced equations above. In Figure 1a–c, the convergence of the stochastic Q learning replicator dynamics in the PD game is plotted for three values of the exploration parameter $\tau$. For the first value, the learning paths converge to scattered coordinates, since the corresponding exploration level drives the agents away from the high-reward actions. For the second value, the learning paths move closer to the equilibrium coordinates. For the third value, the learning paths converge to the corner corresponding to the strategy profile (Defection, Defection), which means that the PD game converges to its Nash equilibrium. Figure 1 also shows that the convergence of the stochastic Q learning dynamics does not depend on the selection of the initial point. Furthermore, we can better understand and predict the trajectories of the Q-learners' evolution, and the replicator dynamics of stochastic Q learning converge to the Nash equilibrium of the PD game.
Example 2. The second class of games has two pure strategy equilibria and one mixed strategy equilibrium; this occurs when neither agent has a strictly dominant action and the agents' incentives are aligned, i.e., when $(a_{11}-a_{21})(a_{12}-a_{22}) < 0$, $(b_{11}-b_{12})(b_{21}-b_{22}) < 0$, and $(a_{11}-a_{21})(b_{11}-b_{12}) > 0$. The battle of the sexes (BoS) game belongs to this second subclass, where the two actions of both agents are Soccer and Battle, and the payoff matrices are of coordination type: both agents prefer matching actions but rank the two matching outcomes differently. Then, the stochastic Q learning replicator dynamic equations of the BoS game are obtained by substituting these payoff matrices into the reduced equations above. As shown in Figure 2a–c, the convergence of the stochastic Q learning replicator dynamics in the BoS game is plotted for three values of $\tau$, specifically 1, 3, and 5. When $\tau = 1$, the two agents do not converge to any Nash equilibrium. When $\tau = 3$, the learning paths converge to one of two corner coordinates, corresponding to the pure Nash equilibria of the BoS game. When $\tau = 5$, the learning paths converge to the two pure strategy profiles (Soccer, Soccer) and (Battle, Battle).
Example 3. The third class of games has one mixed equilibrium; this occurs when $(a_{11}-a_{21})(a_{12}-a_{22}) < 0$, $(b_{11}-b_{12})(b_{21}-b_{22}) < 0$, and $(a_{11}-a_{21})(b_{11}-b_{12}) < 0$. The matching pennies (MP) game belongs to this third subclass, where the two strategies of the two agents are Head and Tail. It is worth noting that if the two payoff matrices of MP are not each other's transpose, then the two agents in MP are selected from two different populations to play the game. The payoff matrices form a zero-sum pair with $B = -A$, and the stochastic Q learning replicator dynamics equations are obtained by substituting these payoff matrices into the reduced equations above. From Figure 3a–c, the characteristic of MP is that the interior trajectories form closed orbits around the fixed point. In the first two plots, $\tau$ is not large enough, and the interior trajectories remain closed orbits around the coordinate (0.5, 0.5). The third plot shows that from every starting point the learning path converges to the fixed point. It is therefore feasible to use the stochastic Q learning algorithm to realize the Nash equilibrium of MP, i.e., the mixed strategy (0.5, 0.5).
Next, we extend the two-action case to a three-action case in two-agent games, and analyze and realize the equilibrium of the game. The payoff matrices of agents 1 and 2 with three actions are, respectively, the $3 \times 3$ matrices $A = (a_{ij})$ and $B = (b_{ij})$. Suppose that the two agents select the three actions with probabilities $x_1, x_2, x_3$ and $y_1, y_2, y_3$, respectively, with $x_1 + x_2 + x_3 = 1$ and $y_1 + y_2 + y_3 = 1$; since the game considered below is symmetric, we write the common action distribution as $x_1 = x$, $x_2 = y$, $x_3 = z$. By Equation (10), the replicator dynamic equations of stochastic Q learning are as follows:
$$dx(t) = \big\{ \tau\alpha x \,[u_1 - \bar{u}\,] + \alpha x\,(x\ln x + y\ln y + z\ln z - \ln x) \big\}\,dt + \sigma x\,dB_1(t),$$
$$dy(t) = \big\{ \tau\alpha y \,[u_2 - \bar{u}\,] + \alpha y\,(x\ln x + y\ln y + z\ln z - \ln y) \big\}\,dt + \sigma y\,dB_2(t),$$
$$dz(t) = \big\{ \tau\alpha z \,[u_3 - \bar{u}\,] + \alpha z\,(x\ln x + y\ln y + z\ln z - \ln z) \big\}\,dt + \sigma z\,dB_3(t),$$
where $x$, $y$, and $z$ respectively represent the probabilities assigned by an agent to the three pure actions, with $x + y + z = 1$; $u_1$, $u_2$, and $u_3$ are, respectively, the payoffs of the first, second, and third actions; and $\bar{u} = x u_1 + y u_2 + z u_3$ is the average expected payoff in the population.
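A compact sketch of the three-action dynamics just described, integrating the deterministic drift for both agents of a symmetric game; the 3×3 payoff matrix below is a hypothetical placeholder rather than the matrix of Example 4 / reference [48].

```python
import numpy as np

def three_action_rhs(p, q, A, alpha=0.5, tau=2.0):
    """Deterministic part of the three-action stochastic Q-learning dynamics.
    p = (x, y, z) is the agent's own distribution, q is the opponent's distribution."""
    u = A @ q                                   # payoffs u1, u2, u3 of the three actions
    avg = p @ u                                 # average expected payoff
    return tau * alpha * p * (u - avg) + alpha * p * (p @ np.log(p) - np.log(p))

# Hypothetical symmetric 3x3 payoff matrix (not the matrix used in Example 4).
A = np.array([[2.0, 0.0, 1.0],
              [1.0, 2.0, 0.0],
              [0.0, 1.0, 2.0]])
p = np.array([0.5, 0.3, 0.2])
q = np.array([1/3, 1/3, 1/3])
dt = 0.01
for _ in range(3000):
    dp, dq = three_action_rhs(p, q, A), three_action_rhs(q, p, A)
    p, q = np.clip(p + dt * dp, 1e-12, None), np.clip(q + dt * dq, 1e-12, None)
    p, q = p / p.sum(), q / q.sum()
print(p, q)        # the distributions stay on the simplex throughout
```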
Example 4. For the two-agent game with symmetric actions, the payoff matrices are taken from [48]. Then, the stochastic Q learning replicator dynamic equations are obtained by substituting these payoff matrices into the three-action dynamics above. The simulation results in Figure 4 demonstrate the effectiveness of the stability conditions; that is, they exhibit the conditional boundaries for each agent in the multi-agent game system under a random interference environment. By choosing appropriate noise coefficients, the trajectories of the stochastic Q learning replicator dynamics converge to the strategy corresponding to the Nash equilibrium of the game. Hence, the replicator dynamics of stochastic Q learning can converge to the Nash equilibrium, and the simulation results are consistent with the theoretical results of [48].
Remark 1. In Section 4, Examples 1–3 show that the stochastic Q learning replicator dynamics can be used to analyze the equilibrium realization process for all types of two-agent games with two actions, and these results are consistent with the theoretical results. Example 4 then extends the two-agent game from two actions to three actions, which provides the necessary groundwork for a further extension to multi-agent, multi-state games.

5. Sensitivity Analysis
In this section, we take modified potential games as an example and carry out a sensitivity analysis on the exploration parameters. Potential games were originally introduced to analyze congestion games [49]. In potential games, when an agent unilaterally deviates from its action, the change in the value of the potential function equals the change in the deviating agent's cost. To facilitate the numerical simulation experiments, we consider a modified potential game within the stochastic Q learning replicator dynamics of Equation (6), where all agents share the same action set and each agent has a modified utility.
Lemma 1. Given the game $\Gamma$, consider for each agent $i$ the modified utility $u_i^{L}(x) = u_i(x) + \tau_i^{-1} H(x_i)$, where $H(x_i) = -\sum_{l} x_{il}\ln x_{il}$ denotes the Shannon entropy of the selection distribution $x_i$, and denote by $\tilde{u}_{il}(x) = u_i(l, x_{-i}) - \tau_i^{-1}\ln x_{il}$ the corresponding modified payoff of action $l$. Then the dynamics described by the differential equation in Equation (6) can be written as:
$$\frac{dx_{il}(t)}{dt} = \tau_i \alpha\, x_{il}(t) \Big[ \tilde{u}_{il}(x) - \sum_{k} x_{ik}(t)\, \tilde{u}_{ik}(x) \Big], \qquad (11)$$
where the probabilities $x_{il}$ cannot all be 0 for any agent, since they sum to 1 over all actions $l$, as $x_i$ is a probability distribution. In particular, the dynamics of Equation (6) are described as the replicator dynamics of the modified setting $\Gamma^{L} = (\mathcal{N}, \{A_i\}, \{u_i^{L}\})$. The superscript $L$ represents the regularizing term.

Let $\Gamma$ be a potential game. Next, we discuss the limiting behavior of the stochastic Q learning dynamics. If there exist a function $\varphi: A \to \mathbb{R}$ and a positive weight vector $w = (w_i)_{i \in \mathcal{N}}$ such that, for every agent $i$, every $a_{-i} \in A_{-i}$, and all $a_i, a_i' \in A_i$,
$$u_i(a_i, a_{-i}) - u_i(a_i', a_{-i}) = w_i \big[ \varphi(a_i, a_{-i}) - \varphi(a_i', a_{-i}) \big], \qquad (12)$$
then $\Gamma$ is called a modified (weighted) potential game. If $w_i = 1$ for all $i$, then $\Gamma$ is called an exact potential game. Let $\Phi$ represent the multi-linear extension of $\varphi$, defined by $\Phi(x) = \mathbb{E}_{a \sim x}\,[\varphi(a)]$; then $\Phi$ is called the potential function of $\Gamma$. In the framework of bounded rationality, with modified potential games and heterogeneous agents, the relevant solution concept is the quantal response equilibrium, which is the prototypical extension of the Nash equilibrium [50].
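As a quick sanity check of condition (12), the following sketch verifies whether a candidate potential function and weight vector satisfy the weighted-potential property for a small two-agent game; the identical-interest example used here is hypothetical.

```python
import numpy as np

def is_weighted_potential(A, B, phi, w, tol=1e-9):
    """Check condition (12) for a two-agent game with payoff matrices A (row agent)
    and B (column agent), candidate potential phi (same shape), and weights w = (w1, w2)."""
    n1, n2 = A.shape
    for j in range(n2):                      # row agent's unilateral deviations
        for i in range(n1):
            for k in range(n1):
                if abs((A[i, j] - A[k, j]) - w[0] * (phi[i, j] - phi[k, j])) > tol:
                    return False
    for i in range(n1):                      # column agent's unilateral deviations
        for j in range(n2):
            for k in range(n2):
                if abs((B[i, j] - B[i, k]) - w[1] * (phi[i, j] - phi[i, k])) > tol:
                    return False
    return True

# Hypothetical exact potential game (w = (1, 1)): identical-interest game, phi = A = B.
A = np.array([[4.0, 1.0], [1.0, 2.0]])
print(is_weighted_potential(A, A, phi=A, w=(1.0, 1.0)))   # -> True
```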
Theorem 1 (see [41]). If $\Gamma$ has a potential function $\Phi$, then the replicator dynamics of Equation (6) converge to a compact connected set of quantal response equilibria of $\Gamma$. Intuitively, the first term of Equation (6) corresponds to the replicator dynamics of agent $i$ in the potential game, which may absorb the weight of agent $i$ and is therefore controlled by the potential function. The second term of Equation (6) is independent of the environment and of the selection probability distributions of the other agents. Thus, the structure of the potential game is preserved, and the multiplicative constant attached to each agent, which denotes the agent's exploration rate, is $\tau_i^{-1}$.
Lemma 2 (see [51]). Let $\Gamma$ be an MAS and let $\Phi$ denote a potential function of $\Gamma$. Furthermore, consider the modified utilities $u_i^{L}$, with the modified potential function defined as:
$$\Phi^{L}(x) = \Phi(x) + \sum_{i \in \mathcal{N}} \frac{1}{w_i \tau_i}\, H(x_i);$$
then $\Phi^{L}$ is a potential function of the modified game $\Gamma^{L}$. The time derivative of the potential function is positive along any trajectory of selection distributions generated by the dynamics of Equation (11), apart from the fixed points, at which it equals 0.

Although some ideal topological properties of the stochastic Q learning dynamics have been established, the effectiveness of exploration in practice is still unclear from the perspective of equilibrium selection and individual agent performance (utility). Meanwhile, we consider a representative exploration–exploitation method, the cyclical learning rate with one cycle (CLR-1) method, in which exploration starts low, increases to a peak in the middle of the cycle, and then decays to 0 (i.e., pure exploitation) [52]. In potential games, it is natural to consider the payoff impact on an agent currently selecting action $l$ when a new agent selecting action $l'$ is added. In general, this effect can be obtained through the partial derivative of the potential with respect to the corresponding selection probabilities. However, since the payoff in the potential game is only defined on the simplex, this partial derivative may not exist. Furthermore, to visualize the modified potential game of Equation (12), we use a two-dimensional projection technique [53]. Next, we embed the selection distributions into Euclidean space and drop the simplex restrictions, where agent 1 and agent 2 have $n_1$ and $n_2$ actions, respectively, and a transformation function maps the unconstrained coordinates of the first agent and of the second agent back onto their respective simplices. Then, we select two arbitrary directions in the embedding space and plot the modified potential $\Phi^{L}$ along the plane they span.
Assume that the potential game is generated as a symmetric two-agent potential game, and that the exploration interval of both agents is [−15, 15]. From Figure 5, we see that without exploration the potential has several distinct local maxima. As exploration increases, a unique common maximum forms in the transformed coordinates near the uniform distribution at (0, 0). Specifically, when the agents modify their exploration rates, the stochastic Q learning dynamics converge to various vertices of the changing surface, which correspond to local maxima of the potential game. However, when the exploration rate is large, there is only one attractor, which corresponds to the quantal response equilibrium of the potential game.
As shown in Figure 6, we plot the stochastic Q learning dynamics in a potential game with random payoffs in [0, 11]. The total number of iterations is set to 1500. The top two panels show the modified selection distributions, where different colors correspond to different optimal actions. The bottom-left panel shows the average potential over a group of different trajectories, where the shaded region represents one standard deviation, which vanishes once all trajectories converge to the same selection distribution. The bottom-right panel shows the exploration rate, which is scheduled according to the CLR-1 method. Starting from a grid of initial conditions close to each pure action profile, the stochastic Q learning dynamics rest at different local optima before exploration, converge to the uniform distribution when the exploration rate reaches its peak, and then converge to the same optimum as exploration gradually decreases to 0; the transition points correspond to the horizontal line and the vanishing shaded areas.
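A sketch of a CLR-1 style exploration schedule as described above, with one cycle that starts at low exploration, peaks mid-cycle, and decays to zero (pure exploitation) by the end; the triangular shape and parameter values are assumptions for illustration.

```python
import numpy as np

def clr1_exploration(step, total_steps, peak_rate=5.0, base_rate=0.0):
    """Single-cycle (CLR-1 style) exploration-rate schedule: ramp up to a peak at
    mid-cycle, then decay back to pure exploitation (rate 0) at the end."""
    half = total_steps / 2.0
    if step <= half:
        frac = step / half                  # 0 -> 1 over the first half of the cycle
    else:
        frac = (total_steps - step) / half  # 1 -> 0 over the second half
    return base_rate + (peak_rate - base_rate) * frac

schedule = [clr1_exploration(t, total_steps=1500) for t in range(1501)]
print(schedule[0], schedule[750], schedule[1500])   # 0.0, 5.0, 0.0
```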
6. Summary
In this paper, we have studied a stochastic Q learning method obtained by introducing stochastic factors into Q learning, and we have derived the replicator dynamic equations of stochastic Q learning. On the one hand, in order to verify our theoretical results, we took two-agent games with two or three actions as examples to achieve and analyze the Nash equilibria of multi-agent game systems, showing that the convergence of stochastic Q learning does not depend on the selection of the initial point. Moreover, the Q-learners converge to Nash equilibrium points, and the sample path of the learning process is approximated by the path of the differential equation in the MAS. Furthermore, by combining RL and EGT, we can deduce that the Q learning replicator dynamic equation includes both mutation and selection mechanisms, which not only enhances the diversity of the population but also strengthens the learning ability of the agents. On the other hand, we have carried out a sensitivity analysis on the exploration parameter, which affects the convergence speed of the learning trajectory of the Q-learner. When the exploration parameter is smaller, the learning trajectory does not converge easily; when the exploration parameter is larger, the learning trajectory converges to the quantal response equilibrium of the potential game. In conclusion, the learning method combining RL and EGT can be used to realize Nash equilibria in multi-agent game systems, and the trajectory toward Nash equilibrium in multi-agent game systems can be analyzed through the trajectories of the stochastic Q learning replicator dynamic equations.
In future research, the dynamics analysis of the decision-making behavior of agents based on stochastic Q learning will be applied to other games, such as multi-state and multi-agent stochastic games, multi-agent games with a leader–follower structure, consensus multi-agent games, and some real-life specific multi-agent game scenarios. In addition, we plan to design different heuristic learning algorithms, extend our research to a variety of learning algorithms, and perform an in-depth analysis of the differences between the mathematical models of various learning algorithms and replicator dynamics. Moreover, future work will also strengthen the theoretical guarantees and their impact on other application fields.