1. Introduction
Multi-agent reinforcement learning (MARL) has seen revolutionary breakthroughs with its successful applications to multi-agent cooperative tasks, such as robot swarm control [1], coordination of robots [2], autonomous vehicle coordination [3], computer games [4], and multi-agent differential games [5,6,7]. Since agents need to make a series of proper decisions to cooperate and complete such tasks, many researchers have extended deep reinforcement learning methods to the multi-agent field. An intuitive approach is to treat each agent as a separate individual and train decentralized policies with deep Q-learning [8], i.e., independent Q-learning (IQL) [9]; however, IQL cannot address the non-stationarity introduced by the fast-changing policies of the other agents during training, and these policies may fail to converge even with sufficient exploration, given only local observations. Agents need additional information about each other to be aware of the others' policy changes. Therefore, the centralized training and decentralized execution (CTDE) paradigm [10], which allows agents to access global information during training while executing based only on their local histories, was proposed to alleviate the non-stationarity problem and stabilize training in multi-agent cooperation scenarios.
Many CTDE algorithms, such as MADDPG [11], MAAC [12], and QMIX [13], have been proposed for different multi-agent tasks. These algorithms enable agents to treat the concatenated local observations as global state information, merge them with an attention mechanism, or incorporate them into the weights and biases of delicately designed neural networks [14]. All these methods aim to enhance the observational representation of each individual agent by fusing more global information. In this way, agents can integrate more information about each other's policies and local observations and thus make more appropriate decisions to cooperate better; however, the size of the joint action space grows with the number of agents. Even with more accurate state information, it is extremely difficult to directly search for the optimal joint policy in such a huge state–action space with multiple agents, which may require millions of samples to train the policies.
On the other hand, some methods ensure that training progresses steadily from the perspective of policy updating. Exploiting the property of monotonic policy improvement, IPPO extends PPO [15] to the multi-agent setting to train agents efficiently. Under the maximum-entropy optimization target, IPPO ensures that each agent's policy can be improved monotonically toward the optimal joint policy; however, MAPPO [16], which incorporates global information into IPPO, may not perform as well as IPPO. Although global information can enrich the representation of observations, it also brings information redundancy to the centralized critic. This counter-intuitive phenomenon is described as policy overfitting, which misleads the policy update in the wrong direction when all agents share network parameters. The inherent defect of policy overfitting raised by the centralized critic is difficult to remedy by modifying the traditional Markov decision process (MDP) or changing the exploration method of the policy networks. It also makes it harder to train agents in an actor–critic architecture, since both policy and value networks need to be trained.
To deal with policy overfitting and reduce the information redundancy brought by the centralized critic, we propose a novel noise-injection method to regularize the policies of actor–critic MARL algorithms. Focusing on MAPPO, we develop two patterns of noise injection applied to the advantage function, inspired by the noisy net [17] and parameter space noise [18], respectively. Unlike replacing the exploration mechanism with a noisy policy network, as in [17,18], we inject the noise directly into the centralized value network to enrich its representation ability. Furthermore, we theoretically analyze the reason for policy overfitting in multi-agent actor–critic methods and show that the problem stems from updating the agents' policies with batch-sampled data. Our experimental results on the Matrix Game and the challenging StarCraft II micromanagement benchmark (SMAC) [4] show that the injected noise can augment the variance of the centralized value function and indirectly increase the entropy of the agents' policies to obtain more exploration during training. In general, our main contributions are summarized in the following:
We analyze the reason for policy overfitting in actor–critic MARL algorithms with a centralized value function, showing that it is caused by the batch-sampling mechanism in the training stage;
We propose two patterns of noise injection to deal with the policy overfitting problem, and experimentally show that the noise injected into the centralized value function can maintain the entropy of agents' policies during training, alleviating the information redundancy and enhancing performance;
The experiments show that our proposed method achieves performance comparable to or much better than the state-of-the-art actor–critic MARL methods in most hard scenarios of SMAC. Our code is open source and can be found at https://github.com/hijkzzz/noisy-mappo (accessed on 7 January 2022) for experimental verification and future work.
2. Related Works
Multi-agent reinforcement learning approaches are mainly divided into four categories: analysis of emergent behaviors, learning communication, learning cooperation, and agents modeling agents [19]. Among them, agent cooperation is the top priority in this field. Recently, many MARL algorithms under the CTDE paradigm have been proposed to alleviate the non-stationarity during training. The straightforward idea is to consider all the agents in the environment as a whole, given all local observations and additional state information. This concept has brought out plenty of algorithms under the CTDE paradigm. Value decomposition networks (VDN) [20] is the first attempt to factorize the joint critic into individual agent utilities. QMIX [13] takes a step forward and factorizes the centralized critic by ensuring the consistency of the argmax operator between the joint value $Q_{tot}$ and each individual value $Q_i$, which effectively reduces the search space of the joint policy. QTRAN [21] learns the discrepancy between $Q_{tot}$ and the sum of individual values $\sum_i Q_i$ and compensates for it through a state-based value function, which elaborately factorizes the centralized critic and trains all the agents in an end-to-end fashion. Then, QPLEX [22] factorizes the centralized critic into an advantage value for each agent with a transformed dueling architecture, and LICA [23] extends QMIX to continuous action spaces, using the entropy of the joint policy to constrain the training of agents.
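For reference, the argmax consistency that QMIX enforces is commonly written as the Individual-Global-Max (IGM) condition; the notation below ($\boldsymbol{\tau}$ for joint action–observation histories) follows the value-decomposition literature rather than symbols defined in this paper:
$$\arg\max_{\mathbf{u}} Q_{tot}(\boldsymbol{\tau}, \mathbf{u}) = \left( \arg\max_{u^{1}} Q_{1}\left(\tau^{1}, u^{1}\right), \ldots, \arg\max_{u^{N}} Q_{N}\left(\tau^{N}, u^{N}\right) \right).$$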
On the other hand, actor–critic style algorithms also shine in MARL. MADDPG [11], as a representative, extends DDPG [24] to the multi-agent setting and trains agents with a centralized value function. MAAC [12] enriches the information of the centralized function in MADDPG with a self-attention mechanism [25]. COMA [26] trains agents with a centralized critic and counterfactual advantages, which embodies credit assignment from the perspective of the expectation of a value function. FACMAC [27] combines the consistency constraint of QMIX with the actor–critic architecture to improve the training efficiency of agents. Furthermore, researchers introduced independent PPO [16] to the multi-agent domain to monotonically update the policies of agents; they also equip IPPO with global state information and propose MAPPO [16]. Yu [28] feeds agent-specific features to the centralized critic network of MAPPO and proposes an information-filtering method to mask the useless features, which significantly improves MAPPO's performance in some cooperative scenarios. Meanwhile, Kuba [29] theoretically analyzes the information redundancy of the centralized critic in MAPPO/MATRPO and indicates that parameter sharing between agents can mislead the update of policies and deteriorate performance.
Moreover, agents may also suffer from inefficient exploration in the MARL setting, even with enough state information. MAVEN [30] equips value-based methods with committed exploration to persist a joint exploratory policy over an entire episode. ROMA [31] introduces a role-concept-based regularizer to train agents more efficiently. Pan [32] proposes a softmax operator to update the Q-function under the CTDE paradigm, and [33] encourages agents to maintain a common goal while exploring. Moreover, traditional exploration, such as the $\epsilon$-greedy method, can be replaced by a noise-injection network [17,18]. All these noises are injected into the parameters of the policy network, since the designers expect to disturb the actions to explore other states and make the joint value function jump out of the local optimal region.
All the methods above aim to efficiently train policies with adequate information from both agents and the environment, or with other novel ways to increase exploration; however, they do not solve the information redundancy problem brought by centralizing the critics of agents. Inspired by the noisy network [17], we explore the effect of a specific pattern of Gaussian noise injected directly into the centralized advantage function of agents, which corrects the direction of the policy update and regularizes training.
3. Preliminaries
Dec-POMDP. We consider a cooperative task that can be described as a decentralized partially observable Markov decision process (Dec-POMDP) [34], formally defined as a tuple $\langle S, U, P, r, Z, O, N, \gamma \rangle$. $S$ represents the global state space, and $z^{i} = O(s, i) \in Z$ is the partial observation of each agent $i$ at state $s \in S$. $P(s' \mid s, \mathbf{u})$ is the state transition probability of the environment given the joint action $\mathbf{u} \in U^{N}$. Each agent shares the same reward function $r(s, \mathbf{u})$ and chooses sequential actions under partial observations. $N$ denotes the number of agents and $\gamma \in [0, 1)$ is the discount factor. The whole team of agents attempts to learn a joint policy $\boldsymbol{\pi}$ that maximizes the expected discounted return over a complete trajectory,
$$J(\boldsymbol{\pi}) = \mathbb{E}_{\boldsymbol{\pi}}\left[\sum_{t=0}^{\infty} \gamma^{t} r\left(s_{t}, \mathbf{u}_{t}\right)\right].$$
Multi-Agent Policy Gradient (MAPG). Policy gradient (PG) is the cornerstone of actor–critic RL algorithms; it moves the policy $\pi_{\theta}$ toward actions with larger advantage values by gradient ascent. Since PG extends naturally to the multi-agent setting, the policies of the $N$ agents are trained with a shared advantage function conditioned on the global state $s$ as
$$\nabla_{\theta_{i}} J\left(\theta_{i}\right) = \mathbb{E}_{\mathbf{u} \sim \boldsymbol{\pi}}\left[\nabla_{\theta_{i}} \log \pi_{\theta_{i}}\left(u^{i} \mid \tau^{i}\right) A(s, \mathbf{u})\right],$$
where $A(s, \mathbf{u})$ is estimated by a centralized value function with access to the global state $s$ during training, and $\gamma$ is the discount factor. The joint actions conform to the distribution of the joint policy, and the objective function determines the gradient update direction of all agents.
Multi-Agent Proximal Policy Optimization (MAPPO). Though it is easy to directly apply PPO to each agent in cooperative scenarios, independent PPO [16] may still encounter non-stationarity since the policies of all agents are updated simultaneously. MAPPO extends IPPO's independent critics to a centralized value function with additional global information, and the learning target is derived as
$$L^{i}(\theta) = \mathbb{E}\left[\min\left(r^{i}(\theta)\, A(s, \mathbf{u}),\ \operatorname{clip}\left(r^{i}(\theta), 1-\epsilon, 1+\epsilon\right) A(s, \mathbf{u})\right)\right],$$
where $r^{i}(\theta) = \frac{\pi_{\theta}\left(u^{i} \mid \tau^{i}\right)}{\pi_{\theta_{old}}\left(u^{i} \mid \tau^{i}\right)}$ is the importance sampling weight of each agent, $\boldsymbol{\pi}^{-i}$ denotes all the policies except that of agent $i$, and $A(s, \mathbf{u})$ is the shared centralized advantage. $\operatorname{clip}(\cdot)$ is a range-limiting function, which limits the ratio $r^{i}(\theta)$ to the interval $[1-\epsilon, 1+\epsilon]$.
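The clipped objective above can be sketched in a few lines of PyTorch; this is a simplified illustration under assumed tensor shapes, not the exact MAPPO implementation:

```python
import torch

def mappo_clip_loss(log_probs, old_log_probs, advantages, eps=0.2):
    """Per-agent clipped surrogate with a shared centralized advantage.

    log_probs, old_log_probs: (batch, n_agents) log-probabilities of the actions taken
    advantages:               (batch,) centralized advantage A(s, u)
    eps:                      clip range epsilon (value assumed for illustration)
    """
    ratio = torch.exp(log_probs - old_log_probs)             # importance weights r_i(theta)
    adv = advantages.detach().unsqueeze(-1)                   # broadcast to all agents
    unclipped = ratio * adv
    clipped = torch.clamp(ratio, 1.0 - eps, 1.0 + eps) * adv
    # Maximize the surrogate objective -> minimize its negative.
    return -torch.min(unclipped, clipped).mean()
```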
Noise-Injection Methods for Exploration. Alongside the $\epsilon$-greedy mechanism, parametric noise in the weights of the policy network can also aid efficient exploration. These noise parameters are learned with gradient descent along with the remaining network weights [17], inducing stochasticity in the agents' policies for exploration. The corresponding noisy linear layer is defined as
$$y = \left(\mu^{w} + \sigma^{w} \odot \varepsilon^{w}\right) x + \left(\mu^{b} + \sigma^{b} \odot \varepsilon^{b}\right),$$
where the learnable parameters $\mu^{w} + \sigma^{w} \odot \varepsilon^{w}$ and $\mu^{b} + \sigma^{b} \odot \varepsilon^{b}$ replace $w$ and $b$ in the original layer $y = wx + b$, respectively. Meanwhile, a noise-injection method with non-learnable parameters has also been proposed [18]. It is worth noting that these methods aim to directly intervene in the selection of actions at execution, forcing agents to explore other states. Both methods affect the policy rather than the critic network, which differs from our goal of alleviating the information redundancy of the joint value function.
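For reference, a minimal PyTorch sketch of such a noisy linear layer with independent Gaussian noise and learnable $\mu$ and $\sigma$, in the spirit of [17]; the initialization constants are assumptions rather than the values used in the original work:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class NoisyLinear(nn.Module):
    """Linear layer y = (mu_w + sigma_w * eps_w) x + (mu_b + sigma_b * eps_b)."""

    def __init__(self, in_features, out_features, sigma_init=0.017):
        super().__init__()
        self.mu_w = nn.Parameter(torch.empty(out_features, in_features).uniform_(-0.1, 0.1))
        self.sigma_w = nn.Parameter(torch.full((out_features, in_features), sigma_init))
        self.mu_b = nn.Parameter(torch.zeros(out_features))
        self.sigma_b = nn.Parameter(torch.full((out_features,), sigma_init))

    def forward(self, x):
        eps_w = torch.randn_like(self.sigma_w)   # freshly sampled noise epsilon^w
        eps_b = torch.randn_like(self.sigma_b)   # freshly sampled noise epsilon^b
        weight = self.mu_w + self.sigma_w * eps_w
        bias = self.mu_b + self.sigma_b * eps_b
        return F.linear(x, weight, bias)
```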
5. Experiments
In this section, we evaluate the performance of the baseline algorithms IPPO and MAPPO [16] and their noisy centralized critic variants on the non-monotonic Matrix Game and SMAC. We also compare the performance of NV-MAPPO and NA-MAPPO with the results of the agent-specific-features-enhanced MAPPO [28] to show the superiority of our noisy critic network. All the variant algorithms are implemented in the PyMARL framework [4], and all the hyperparameters are kept the same as in the baseline algorithms for the sake of fairness. We plot the median results for all experiments over 5 independent runs with random seeds and shade the 25–75% quartile range.
Specifically, we give an outline of the two test environments and then list the necessary parameters of the algorithms in the following part of this section; the evaluation results and ablation studies are presented at the end.
5.1. Testbeds
5.1.1. Non-Monotonic Matrix Game
The non-monotonic Matrix Game is a simple environment to test the cooperative ability of just two agents with three actions each; the goal of the cooperative agents is to take the optimal joint action and capture the highest reward. The symmetric matrix game has a single optimal joint action, all the agents share the same state information, and the payoff matrix is shown in Table 1.
5.1.2. SMAC
It is a common and popular practice to test the training of multiple agents in a game environment. StarCraft II, as a typical real-time strategy (RTS) game, offers a great opportunity to tackle different cooperative challenges in the multi-agent domain. SMAC [4] makes use of Blizzard's StarCraft II machine learning API and DeepMind's PySC2 as an interface for the autonomous agents to interact with the game environment. Each of our training agents controls an individual army unit in the testing scenarios, which is described as the decentralized micromanagement problem in StarCraft II. As in Figure 3, all the agents are trained to battle an opposing army controlled by the game's built-in scripted AI, which can be set to different difficulty levels. Each agent can move with four discrete actions and take attack actions to cause damage to enemies within its shooting range.
5.2. Experimental Setup
We implemented all the algorithms in the PyMARL framework [35] and used the same network architecture and hyperparameters for the contrasted methods. Since we selected MAPPO as a representative of MARL actor–critic style algorithms, the hyperparameters are heavily based on the recent paper [28], which fine-tunes MAPPO (MAPPO: https://github.com/marlbenchmark/on-policy (accessed on 5 March 2021)). Yu [28] proposes several agent-specific feature-screening methods to enhance the performance of vanilla MAPPO. It is worth noting that we strip off these artificial features of MAPPO in [28] when testing the baseline. Each agent is equipped with a DRQN [36] with a recurrent layer that has a 64-dimensional hidden state. We set the discount factor $\gamma$ and the other common hyperparameters to the same values for all testing scenarios (there is a noticeable difference between different versions of SMAC; we use the SC2.4.10 version of SMAC through all the testing scenarios).
In order to speed up the convergence of policies, we adopted 8 parallel processes to collect training trajectories. The clip coefficient and the scaling parameter in the GAE module are set to 0.95. Still, we add an entropy term to the objective function, with an entropy coefficient of 0.01. We paused training every 10,000 time steps and ran 32 test episodes to evaluate the cooperative ability. The game difficulty was set to the Very Difficult level by default, and we plot the median results for all experiments over 5 independent runs with random seeds and shade the 25–75% quartile range.
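As a minimal sketch of the advantage estimation mentioned above (not the released training code), GAE with a scaling parameter of 0.95 can be computed as follows; the discount factor default and the tensor layout are assumptions for illustration:

```python
import torch

def gae_advantages(rewards, values, dones, gamma=0.99, lam=0.95):
    """Generalized Advantage Estimation over a trajectory of length T.

    rewards, dones: (T,) tensors; values: (T + 1,) including the bootstrap value.
    gamma is an assumed discount factor; lam = 0.95 is the scaling parameter above.
    """
    T = rewards.shape[0]
    advantages = torch.zeros(T)
    gae = 0.0
    for t in reversed(range(T)):
        not_done = 1.0 - dones[t]
        delta = rewards[t] + gamma * values[t + 1] * not_done - values[t]
        gae = delta + gamma * lam * not_done * gae
        advantages[t] = gae
    return advantages

# The entropy term mentioned above enters the objective roughly as:
# total_loss = policy_loss + value_loss - 0.01 * policy_entropy
```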
We list the common hyperparameters of NV-MAPPO and NA-MAPPO in Table 2. For the noise shuffle interval, we considered several candidate values and chose 100; for the noise scales of the two injection patterns, we likewise considered several candidates and selected a proper value for each scenario. Given the positive correlation between the variance of the noise and the amount of exploration, we encourage researchers to select a larger noise scale in other kinds of cooperative scenarios to strengthen the exploration of agents. Other training hyperparameters related to the test scenarios in SMAC can be found in Table A1, which is summarized in Appendix C, together with the scenario-specific noise hyperparameters.
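To make the noise shuffle interval concrete, the following is a minimal sketch of per-agent Gaussian noise added to the centralized value output and reshuffled every fixed number of updates; the class name, the noise scale, and the exact place where the noise enters the value function are illustrative assumptions, not the released implementation:

```python
import torch

class NoisyValueWrapper:
    """Per-agent Gaussian noise on centralized values, reshuffled periodically."""

    def __init__(self, n_agents, sigma=0.5, shuffle_interval=100):
        self.sigma = sigma                                  # noise scale (assumed value)
        self.shuffle_interval = shuffle_interval            # e.g., 100 updates, as chosen above
        self.updates = 0
        self.noise = torch.randn(n_agents) * sigma          # one fixed noise value per agent

    def step(self):
        """Call once per training update to reshuffle the noise periodically."""
        self.updates += 1
        if self.updates % self.shuffle_interval == 0:
            self.noise = torch.randn_like(self.noise) * self.sigma

    def __call__(self, values):
        # values: (batch, n_agents) centralized value predictions
        return values + self.noise                          # broadcast over the batch
```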
5.3. Results
First, we evaluated NV-MAPPO on the non-monotonic Matrix Game to verify the expressive capacity of the centralized value function. The two different payoff matrices in Table 1 are both used to test the agents' ability to deal with the Relative Overgeneralization problem. The results in Figure 4a,b show the superiority of NV-MAPPO, since it finds the optimal joint action faster even than the fine-tuned version of QMIX (QMIX: https://github.com/hijkzzz/pymarl2 (accessed on 8 August 2021)) [37] (denoted QMIX+ in the figures below), which also achieves state-of-the-art results in SMAC and other cooperative tasks.
Next, we compared the performance of MAPG and MAPPO with their noisy variants. As shown in Figure 5, both NV-MAPG and NV-MAPPO obtain superb performance in most scenarios, which indicates the dominant advantage of the proposed noise-injection method. Furthermore, we evaluated NV-MAPPO and NA-MAPPO on three hard-exploration scenarios, and both methods surpass the baseline by a large margin, as shown in Figure 6. We also found that the noisy advantage method can cause some instability in some scenarios, i.e., the results of the noisy advantage method have large variances during training. We speculate that directly injecting the noise into the advantage function of each agent is too aggressive and may introduce additional bias in training. Still, the performance of NA-MAPPO is comparable to that of NV-MAPPO in most scenarios of SMAC.
Moreover, we found that the performance of MAPPO degrades considerably when the agent-specific features in [28] are stripped off. We therefore examined whether the performance of NV-MAPPO could reach the state-of-the-art SMAC results reported in [28]. We tested our methods on most of the Hard and Super-Hard scenarios and list the median results in Table 3. The results demonstrate that the performance of NV-MAPPO significantly exceeds that of MAPPO on most Hard and Super-Hard scenarios, such as 5m_vs_6m (+65%), 3s5z_vs_3s6z (+31%), 6h_vs_8z (+87%), and corridor (+97%). Even without a shared advantage critic, NV-IPPO still achieves extraordinary results in the Super-Hard scenarios 3s5z_vs_3s6z (96%) and 6h_vs_8z (94%). We think that the noise injected into the independent critics of agents also disturbs the gradient direction and prevents IPPO from overfitting due to non-stationarity. The average performance of NV-MAPPO over the Hard and Super-Hard scenarios still surpasses that of MAPPO with agent-specific features.
5.4. Ablations
All the results above indicate that the noisy advantage function distinctly improves the performance of actor–critic style MARL algorithms. Since the proposed noise is relied upon to disturb the policy gradient direction, it generally increases the fluctuation of the centralized values. We therefore calculate the standard deviation of the centralized value function on each agent dimension during training. As shown in the box plot in Figure 7a, the fluctuations of the centralized critic are more dramatic in scenarios that need more exploration and careful cooperation. Combined with the results in Table 3, we find a positive correlation between the performance improvement and this standard deviation, i.e., a larger fluctuation of the centralized value function brings a greater performance improvement. This reveals that the significantly improved results of NV-MAPPO indeed come from the noise we inject into the centralized critic.
Since the proposed noise is not injected into the policy network as in [17,18], it does not directly affect the actions that the agents take. Therefore, we explore why these noises can still regularize the policies of agents under the CTDE paradigm. We calculate the average entropy of the agents' policies in the scenario 3s5z_vs_3s6z, which has the largest variance across all the testing environments. As shown in Figure 7b, the entropy of MAPPO's policies continuously drops as training goes on; for NV-MAPPO, however, the entropy drops at the beginning of training as usual and then keeps fluctuating within a small range. This keeps the entropy of the agents' policies higher than that of MAPPO and hence gives rise to more exploration. We believe that the fluctuation of the centralized critic induced by the injected noise indirectly regularizes the policies through gradient backpropagation in the actor–critic architecture. This could explain why the proposed noise-injection method greatly improves the performance of MAPPO in Super-Hard scenarios, where the optimal joint actions are hard to explore.
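For reference, the average policy entropy plotted in Figure 7b can be computed along the following lines; this is a sketch under assumed tensor shapes rather than the exact logging code:

```python
import torch
from torch.distributions import Categorical

def mean_policy_entropy(action_logits):
    """Average entropy over the batch and all agents.

    action_logits: (batch, n_agents, n_actions) outputs of the (shared) policy network.
    """
    dist = Categorical(logits=action_logits)
    return dist.entropy().mean()        # scalar: mean over batch and agent dimensions
```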
Finally, we mentioned before that the fixed sampled noise shows some instability in some Hard or Super-Hard cooperative scenarios. Here, we compare the performance of the periodically shuffled Gaussian noise with the fixed sampled noise of NV-MAPPO on 5m_vs_6m and the Super-Hard corridor and 6h_vs_8z. As shown in Figure 8, the performance of these two noise-sampling methods is comparable, but the shuffled Gaussian noise is more stable during training. We therefore used the shuffled sampling noise across all the cooperative scenarios in this paper.