#### *3.1. Multi-Agent Reinforcement Learning (MARL)*

The process of MARL is regarded as a decentralized partially observable Markov decision process (Dec-POMDP) [33]. In MARL, each agent *i* observes the environment state *s* and obtains a local observation *oi*. Then, it selects an action according to its policy *πi*. The environment executes the joint actions *a* = (*a*1, ··· , *aN*) and transitions from *s* to the next state *s'*. After execution, each agent acquires a reward *ri* = R*i*(*s*, *a*) and a next observation *o'<sub>i</sub>* from the environment. Each agent aims to optimize its policy to maximize its total expected return *Ri* = ∑<sub>*t*=0</sub><sup>*T*</sup> *γ<sup>t</sup>ri*(*t*), where *T* is the final timeslot and *γ* ∈ [0, 1] is the discount factor.

Q-learning [19] and policy gradient [34] are two popular RL methods. The idea of Q-learning is to estimate a state-action value function *Q*(*s*, *a*) = E[*R*] and select the action that maximizes *Q*(·). Deep Q-network (DQN) [35], a Q-learning-based algorithm, uses a DNN as a function approximator and trains it by minimizing the loss:

$$\mathcal{L}(\theta) = \mathbb{E}\_{\mathbf{s}, a, r, s'} [(y - Q(\mathbf{s}, a|\theta))^2] \tag{1}$$

where *θ* denotes the parameters of the DNN. The target value *y* is defined as *y* = *r* + *γ* max<sub>*a'*</sub> *Q'*(*s'*, *a'*) [35], where *Q'* is the target network, whose parameters are periodically updated from *θ*. DQN also applies a replay buffer to stabilize learning.
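For concreteness, the following is a minimal PyTorch sketch of the DQN loss in (1) with a target network; the module and function names (`QNet`, `dqn_loss`) and the batch layout are illustrative assumptions rather than parts of the original algorithm.

```python
# Minimal sketch of the DQN loss in Eq. (1); QNet, dqn_loss, and the batch layout
# are illustrative assumptions, not details prescribed by the paper.
import torch
import torch.nn as nn

class QNet(nn.Module):
    def __init__(self, obs_dim: int, n_actions: int, hidden: int = 256):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(obs_dim, hidden), nn.ReLU(),
            nn.Linear(hidden, n_actions),
        )

    def forward(self, s):
        return self.net(s)  # Q(s, .) for every discrete action

def dqn_loss(q_net, target_net, batch, gamma=0.95):
    s, a, r, s_next = batch  # tensors sampled from the replay buffer; a is a LongTensor
    q_sa = q_net(s).gather(1, a.unsqueeze(1)).squeeze(1)       # Q(s, a | theta)
    with torch.no_grad():
        y = r + gamma * target_net(s_next).max(dim=1).values   # y = r + gamma * max_a' Q'(s', a')
    return ((y - q_sa) ** 2).mean()
```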

Policy gradient methods directly optimize the policy *π* to maximize J(*θ<sup>π</sup>*) = E[*R*] and update the parameters along the gradient [34]:

$$\nabla\_{\theta^{\pi}} \mathcal{J}(\theta^{\pi}) = \mathbb{E}\_{\mathbf{s} \sim p^{\pi}, a \sim \pi} [\nabla\_{\theta^{\pi}} \log \pi(a|\mathbf{s}, \theta^{\pi}) Q(\mathbf{s}, a)] \tag{2}$$

where *p<sup>π</sup>* is the state distribution under *π*. *Q*(*s*, *a*) can be estimated from samples [36] or by a function approximator, such as DQN, which leads to the actor–critic algorithm [37].
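As a companion sketch, the update implied by (2) can be written as a surrogate loss whose gradient matches the policy gradient; the actor/critic interfaces below are assumptions for illustration.

```python
# Hedged sketch of the policy-gradient update in Eq. (2), with a critic supplying Q(s, a).
# The actor and critic are assumed to output per-action logits and Q-values, respectively.
import torch

def policy_gradient_loss(actor, critic, s, a):
    log_pi = torch.log_softmax(actor(s), dim=-1)                # log pi(a | s, theta_pi)
    log_pi_a = log_pi.gather(1, a.unsqueeze(1)).squeeze(1)
    with torch.no_grad():
        q_sa = critic(s).gather(1, a.unsqueeze(1)).squeeze(1)   # Q(s, a) from the critic
    # Minimizing this surrogate performs gradient ascent on J(theta_pi).
    return -(log_pi_a * q_sa).mean()
```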

#### *3.2. Hierarchical Graph Attention Network (HGAT)*

HGAT is an effective method for processing hierarchically structured data represented as a graph, and it has been introduced into MARL to extract the relationships among agents. By stacking multiple GATs hierarchically, HGAT first aggregates the embedding vectors *e<sup>l</sup><sub>ij</sub>* of the neighboring agents in each group *l* into *e<sup>l</sup><sub>i</sub>*, and subsequently aggregates the *e<sup>l</sup><sub>i</sub>* from all groups into *e'<sub>i</sub>*. The aggregated embedding vector *e'<sub>i</sub>* represents the hierarchical relationships among different groups of neighbors.

#### **4. System Model and Problem Statement**

In this section, we describe the settings of a multi-agent cooperative scenario, *UAV recon*, and a competitive scenario, *predator-prey*.

#### *4.1. UAV Recon*

As shown in Figure 1a, we deploy *N* UAVs into a hot-spot area to scout *n* points of interest (PoIs) for *T* timeslots, where the PoIs are randomly distributed. Since all UAVs fly at the same altitude, the mission area is two-dimensional. Each UAV has a circular recon area whose radius is its recon range. If the Euclidean distance between a UAV and a PoI is less than the recon range, we consider the PoI to be covered.

At the beginning, each UAV is deployed at a random position. At each timeslot *t*, each UAV *i* determines its acceleration *acci* ∈ {(*acc*, 0), (−*acc*, 0), (0, *acc*), (0, −*acc*), (0, 0)} as its action, so the action space of *i* is discrete. The energy consumption of *i* is defined as:

$$E\_i = E\_h + \frac{\upsilon\_i}{\upsilon\_{\max}} E\_m \tag{3}$$

where *vi* is the velocity of *i* and *vmax* is the maximum velocity of UAVs. *Eh* and *Em* are the energy consumption for hovering and movement, respectively.
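A direct reading of (3) as a helper function is sketched below; *Eh*, *Em*, and *vmax* are scenario constants.

```python
# Illustrative implementation of the energy model in Eq. (3).
def energy_consumption(v_i: float, v_max: float, E_h: float, E_m: float) -> float:
    """Hovering cost plus a movement cost proportional to the normalized speed."""
    return E_h + (v_i / v_max) * E_m
```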

**Figure 1.** Illustrations of (**a**) *UAV Recon* and (**b**) *Predator-Prey*.

In this scenario, the goal of the UAVs is to cover more PoIs more fairly with less energy consumption. To evaluate the quality of the task, we consider three metrics: coverage *C*, fairness *F*, and energy consumption *E*. The score *C* denotes the proportion of covered PoIs, which is defined as:

$$C = \frac{n\_C(t)}{n} \tag{4}$$

where *nC*(*t*) is the number of covered PoIs at timeslot *t*.

The fairness score denotes how fairly all PoIs are covered. Here, we use Jain's fairness index [38] to define the score *F* as:

$$F = \frac{(\sum\_{j=1}^{n} c\_j)^2}{n \sum\_{j=1}^{n} c\_j^2} \tag{5}$$

where *cj* is the coverage time of PoI *j*.

Finally, the UAVs need to limit their energy consumption during the task. We define the score *E* as:

$$E = \frac{1}{N} \sum\_{i=1}^{N} E\_i \tag{6}$$
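The three metrics in (4)–(6) can be computed directly from the per-PoI coverage statistics and the per-UAV energy; the NumPy helpers below are an illustrative sketch.

```python
# Sketch of the task metrics in Eqs. (4)-(6); inputs are assumed NumPy arrays.
import numpy as np

def coverage(covered_mask: np.ndarray) -> float:
    """C: fraction of the n PoIs covered at the current timeslot."""
    return float(covered_mask.mean())

def jain_fairness(coverage_time: np.ndarray) -> float:
    """F: Jain's fairness index over the per-PoI coverage times c_j."""
    denom = len(coverage_time) * np.square(coverage_time).sum()
    return float(coverage_time.sum() ** 2 / denom) if denom > 0 else 0.0

def mean_energy(per_uav_energy: np.ndarray) -> float:
    """E: average energy consumption over the N UAVs."""
    return float(per_uav_energy.mean())
```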

When executing recon missions, each UAV needs to observe the local states of other UAVs and PoIs to determine its action. The local state of UAV *i* is defined as *si* = (*Pi*, *vi*), where *Pi* and *vi* are the position and velocity of *i*, respectively. Each PoI *j*'s local state is *sj* = (*Pj*). If a PoI is in UAV *i*'s observation range, we consider the PoI to be observed by *i*. If another UAV *j* is in *i*'s communication range, we consider that *i* can communicate with *j*. To train the UAVs' policies, we define a heuristic reward *ri* as:

$$r\_i = \frac{\eta\_1 \times r\_{indv} + \eta\_2 \times r\_{shared}}{E\_i} - p\_i \tag{7}$$

where *pi* is a penalty factor: when UAV *i* flies across the border, it is penalized by *pi*. *rindv* = −1 if no PoI is covered by *i* individually; otherwise *rindv* = *nindv*, where *nindv* is the number of PoIs covered only by *i*. *rshared* = 0 if *i* does not share any covered PoI with others; otherwise *rshared* = *nshared*/*Nshare*, where *nshared* denotes the number of PoIs that are covered jointly by *Nshare* neighboring UAVs. *η*1 and *η*2 are the importance factors of *rindv* and *rshared*, respectively. We empirically set *η*1 > *η*2 to encourage the UAVs to cover more PoIs while avoiding overlap with others.
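The reward in (7) can be assembled from the per-UAV coverage counters as sketched below; the argument names and default weights are placeholders that follow the textual definitions rather than the exact values used in our experiments.

```python
# Hedged sketch of the heuristic reward in Eq. (7); eta1, eta2, and the counters are placeholders.
def uav_reward(n_indv: int, n_shared: int, n_share_uavs: int,
               E_i: float, crossed_border: bool,
               eta1: float = 1.0, eta2: float = 0.5, penalty: float = 1.0) -> float:
    r_indv = n_indv if n_indv > 0 else -1                        # -1 when no PoI is covered individually
    r_shared = n_shared / n_share_uavs if n_shared > 0 else 0.0  # shared PoIs split among N_share UAVs
    p_i = penalty if crossed_border else 0.0
    return (eta1 * r_indv + eta2 * r_shared) / E_i - p_i
```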

#### *4.2. Predator-Prey*

As shown in Figure 1b, we deploy *Npredator* predators to eliminate *Nprey* prey.

Both sides are controlled by DRL-based methods. If the distance between a predator and a prey is less than the predators' attack range, we consider the prey to be eliminated. The goal of the predators is to eliminate all prey, while the goal of the prey is to escape from the predators. The predators move more slowly than the prey, so they need to cooperate with each other when chasing prey.

The action space of the predators and the prey is the same as that of the UAVs in the UAV recon scenario. The local state of each predator or prey is defined as *si* = (*Pi*, *vi*), where *Pi* and *vi* are its position and velocity, respectively. Each predator and prey can observe the local states of adversaries inside its observation range, while it can communicate with companions inside its communication range. Eliminated prey can neither be observed nor communicate with others. To evaluate the performance of the predators and the prey, we define the score as:

$$S = \frac{T - T\_{\text{eliminate}}}{T} \tag{8}$$

where *T* is the total number of timeslots in an episode, and *Teliminate* is the timeslot at which all prey are eliminated.

When predator *i* eliminates prey *j*, *i* will obtain a positive reward, while *j* will obtain a negative reward. When all prey are eliminated, the predators will get an additional reward.

#### **5. HGAT-Based Multi-Agent Coordination Control Method**

To achieve the goals of the two scenarios described in Section 4, we present a multi-agent coordination control method based on HGAT for mixed cooperative–competitive environments. In our method, the global state of the environment is regarded as a graph containing the local states of the agents and the relationships among them. Each agent summarizes information from the environment with HGAT and subsequently computes its Q-value and action in a value-based or actor–critic framework.

#### *5.1. HGAT-Based Observation Aggregation and Inter-Agent Communication*

In the multi-agent system, the environment involves multiple kinds of entities, including agents, PoIs, etc. As these entities are heterogeneous, agents need to treat their local states and model their relationships separately. Thus, we categorize all entities into different groups as the first step of execution in the cooperative or competitive scenarios. As shown in Figure 2, *M* entities (containing *N* agents) are clustered into *K* groups, and the environment's state is represented as graphs. The agents construct an observation graph *G*O and a communication graph *G*C based on their observation ranges O1, ··· , O*N* and communication ranges C1, ··· , C*N*, respectively. The edges of *G*O represent that the corresponding entities can be observed by the agents, while the edges of *G*C represent that two agents can communicate with each other. The adjacency matrices of *G*O and *G*C are **A**O and **A**C, respectively. Agent *i*'s observation is defined as *oi* = {*sj* | *j* ∈ O*i*}, and its received messages from the others are *mi* = {*mji* | *j* ∈ C*i*}, where *sj* is agent *j*'s local state and *mji* is the message that *j* sends to *i*.
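A minimal sketch of how **A**O and **A**C can be built from pairwise distances is given below; the array shapes and function name are assumptions for illustration.

```python
# Sketch of constructing the observation and communication adjacency matrices A_O and A_C.
import numpy as np

def build_graphs(entity_pos: np.ndarray, agent_pos: np.ndarray,
                 obs_range: float, comm_range: float):
    """entity_pos: (M, 2) positions of all entities; agent_pos: (N, 2) positions of the agents."""
    # A_O[i, j] = 1 if agent i can observe entity j.
    d_obs = np.linalg.norm(agent_pos[:, None, :] - entity_pos[None, :, :], axis=-1)
    A_O = (d_obs <= obs_range).astype(np.float32)
    # A_C[i, j] = 1 if agents i and j are within communication range (no self-loops).
    d_comm = np.linalg.norm(agent_pos[:, None, :] - agent_pos[None, :, :], axis=-1)
    A_C = ((d_comm <= comm_range) & ~np.eye(len(agent_pos), dtype=bool)).astype(np.float32)
    return A_O, A_C
```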

At each timeslot, the agents use the network shown in Figure 3 to determine their actions according to *s*, **A**O, and **A**C received from the environment, where *s* = (*s*1, ··· , *sM*). The parameters of the network are shared among the agents in the same group. The network contains three components: a set of encoders, two stacked HGAT layers, and a recurrent unit, which consists of a gated recurrent unit (GRU) layer and a fully connected layer. GRU is a variant of the recurrent neural network (RNN). To summarize the information in each agent *i*'s observation *oi*, the first HGAT layer processes *oi* into a high-dimensional aggregated embedding vector *e'<sub>i</sub>*, as shown in Figure 4. First, the encoder for each group *l*, which consists of a fully connected layer, transforms the local states of that group into embedding vectors *ej* = *f<sup>l</sup><sub>e</sub>*(*sj*), where *f<sup>l</sup><sub>e</sub>* denotes the encoder for group *l* and *ej* is the embedding vector of entity *j* in group *l*. Then, the layer aggregates the *ej* as *e<sup>l</sup><sub>i</sub>* = ∑<sub>*j*</sub> *α<sub>ij</sub>***W**<sup>*l*</sup><sub>*v*</sub>*ej* [32], where **W**<sup>*l*</sup><sub>*v*</sub> is a matrix that transforms *ej* into a "value". The attention weight *α<sub>ij</sub>* represents the importance of the embedding vector *ej* of entity *j* to agent *i*; it is calculated by a softmax as *α<sub>ij</sub>* ∝ exp(*ej*<sup>T</sup>**W**<sup>*l*</sup><sub>*k*</sub><sup>T</sup>**W**<sub>*q*</sub>*ei*) [32] if the entry *a*<sup>O</sup><sub>*i*,*j*</sub> of **A**O is 1, and *α<sub>ij</sub>* = 0 otherwise. **W**<sub>*k*</sub> and **W**<sub>*q*</sub> transform an embedding vector into a "key" and a "query", respectively. **A**O is used as a mask so that only the local states from O*i* are summarized. To improve performance, we use multiple attention heads here. Finally, the *e<sup>l</sup><sub>i</sub>* from all groups are aggregated into *e'<sub>i</sub>* by a fully connected layer *fG*, as:

$$e'\_i = f\_G(\Vert\_{l=1}^{K} e\_i^l) \tag{9}$$

where ‖ represents the concatenation operation. Unlike HAMA, we do not apply another GAT for this aggregation, which gives our approach less computing overhead.
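To make the aggregation concrete, a single-head sketch of one HGAT layer is given below; the paper uses multiple attention heads, and the scaling of the attention logits is an implementation assumption.

```python
# Minimal single-head sketch of one HGAT layer (masked attention per group, then Eq. (9)).
import torch
import torch.nn as nn
import torch.nn.functional as F

class HGATLayer(nn.Module):
    def __init__(self, n_groups: int, emb_dim: int, out_dim: int):
        super().__init__()
        self.W_q = nn.Linear(emb_dim, emb_dim, bias=False)
        self.W_k = nn.ModuleList([nn.Linear(emb_dim, emb_dim, bias=False) for _ in range(n_groups)])
        self.W_v = nn.ModuleList([nn.Linear(emb_dim, emb_dim, bias=False) for _ in range(n_groups)])
        self.f_G = nn.Linear(n_groups * emb_dim, out_dim)  # aggregates the per-group vectors

    def forward(self, e_agents, group_embs, group_masks):
        # e_agents: (N, emb_dim) embeddings of the querying agents.
        # group_embs[l]: (M_l, emb_dim) embeddings of the entities in group l.
        # group_masks[l]: (N, M_l) rows of the adjacency matrix (A_O or A_C) restricted to group l.
        q = self.W_q(e_agents)                                   # "queries"
        per_group = []
        for l, (e_l, mask) in enumerate(zip(group_embs, group_masks)):
            k, v = self.W_k[l](e_l), self.W_v[l](e_l)            # "keys" and "values"
            logits = (q @ k.t()) / k.shape[-1] ** 0.5
            logits = logits.masked_fill(mask == 0, float('-inf'))
            alpha = torch.nan_to_num(F.softmax(logits, dim=-1))  # alpha_ij, zero for masked entities
            per_group.append(alpha @ v)                          # e_i^l = sum_j alpha_ij W_v^l e_j
        return self.f_G(torch.cat(per_group, dim=-1))            # e'_i = f_G(concat over groups)
```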

**Figure 2.** The clustering of agents and their topology.

**Figure 3.** The overall structure of the network. *s<sup>l</sup>* and *e<sup>l</sup>* represent the local states and embedding vectors of agents in group *l*. **A***<sup>i</sup>* denotes the *i*th row of **A**.

After calculating *e'<sub>i</sub>*, agent *i* sends it as a message *mij* to each neighboring agent *j* in C*i*. Inter-agent communication helps agents share their observations with neighbors, which leads to better coordination. To summarize each agent *i*'s received messages *mi*, the second HGAT layer processes *mi* and aggregates it into another embedding vector *e''<sub>i</sub>* in the same manner as shown in Figure 4, except that the adjacency matrix used here is **A**C instead of **A**O. Our method is thus capable of both intra-group and inter-group communication and can easily be extended to multi-hop communication by stacking additional HGAT layers.

**Figure 4.** The architecture of an HGAT layer.

#### *5.2. Implementation in a Value-Based Framework*

This implementation is based on DQN. Each agent *i* maintains a hidden state *hi* for the recurrent unit and calculates its Q-values with a Q-network, as shown in Figure 3. Similar to DQN, our method also employs a target network with the same structure.

We introduce a skip-connection strategy by concatenating *e'<sub>i</sub>* and *e''<sub>i</sub>* as the input of the recurrent unit when computing the Q-value, so that agents can use the information from both their own observations and those of others. The Q-value is calculated as:

$$Q^l(o\_i, m\_i, a\_i, h\_i) \approx f^l\_R(e'\_i, e''\_i, h\_i) \tag{10}$$

where *Q<sup>l</sup>* represents the Q-network of the group *l* to which *i* belongs, *f<sup>l</sup><sub>R</sub>* denotes the recurrent unit in *Q<sup>l</sup>*, and *ai* is the action determined by *i* according to the Q-values. We apply an *ε*-greedy policy [35] to balance exploitation and exploration as:

$$a\_i = \begin{cases} \arg\max\_{a \in \mathcal{A}\_i} Q^l(o\_i, m\_i, a, h\_i), & \text{with probability } 1 - \varepsilon\\ \text{random}(\mathcal{A}\_i), & \text{with probability } \varepsilon \end{cases} \tag{11}$$

where A*<sub>i</sub>* is the action space of *i*.
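A compact sketch of the *ε*-greedy rule in (11) for a single agent is shown below; *ε* = 0.3 follows the simulation setting in Section 6.

```python
# Sketch of the epsilon-greedy action selection in Eq. (11) for one agent.
import random
import torch

def epsilon_greedy(q_values: torch.Tensor, epsilon: float = 0.3) -> int:
    if random.random() < epsilon:
        return random.randrange(q_values.shape[-1])  # explore: uniform random action from A_i
    return int(torch.argmax(q_values).item())        # exploit: arg max_a Q^l(o_i, m_i, a, h_i)
```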

After executing the joint actions *a* = (*a*1, ··· , *aN*), the environment transitions from the current state to the next and sends the next local states *s'*, the next adjacency matrices **A**'<sub>O</sub> and **A**'<sub>C</sub>, and the reward *ri* to each agent *i*. The experience (*s*, **A**O, **A**C, *a*, *r*, *s'*, **A**'<sub>O</sub>, **A**'<sub>C</sub>, *h*, *h'*) is stored in a shared replay buffer *B*, where *r* = (*r*1, ··· , *rN*), *h* = (*h*1, ··· , *hN*), and *h'* = (*h'*<sub>1</sub>, ··· , *h'<sub>N</sub>*). *h'<sub>i</sub>* is the next hidden state output by the Q-network when agent *i* calculates its Q-values. *hi* is initialized to zero at the beginning of an episode.

To train the Q-network of each group, we sample *H* experiences from *B* as a minibatch and minimize the loss:

$$\mathcal{L}(\theta\_l^{Q}) = \frac{1}{N\_l} \sum\_{i=1}^{N\_l} \mathbb{E}[(y\_i - Q^l(o\_i, m\_i, a\_i, h\_i | \theta\_l^{Q}))^2] \tag{12}$$

where *Nl* denotes the number of agents in group *l* and *θ<sup>Q</sup><sub>l</sub>* denotes the parameters of *Q<sup>l</sup>*. *yi* is the target value calculated by the target network *Q'<sup>l</sup>*, as:

$$y\_i = r\_i + \gamma \max\_{a' \in \mathcal{A}\_i} Q^{\prime l}(o'\_i, m'\_i, a', h'\_i | \theta\_l^{Q'}) \tag{13}$$

where *o'<sub>i</sub>* and *m'<sub>i</sub>* are *i*'s next observation and next received messages, respectively. *θ<sup>Q'</sup><sub>l</sub>* denotes the parameters of *Q'<sup>l</sup>*, which are periodically updated from *θ<sup>Q</sup><sub>l</sub>*.
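One training step of a group's Q-network, following (12) and (13), can be sketched as below; the `q_net`/`target_net` interfaces (returning per-action Q-values and the next hidden state) and the batch layout are assumptions.

```python
# Hedged sketch of one training step for a group's Q-network (Eqs. (12)-(13)).
import torch

def train_group_q(q_net, target_net, optimizer, batch, gamma=0.95):
    # Each element is batched over the sampled experiences of the agents in group l.
    o, m, a, r, h, o_next, m_next, h_next = batch
    q_all, _ = q_net(o, m, h)                            # Q^l(o_i, m_i, ., h_i | theta_l^Q)
    q_sa = q_all.gather(1, a.unsqueeze(1)).squeeze(1)
    with torch.no_grad():
        q_next, _ = target_net(o_next, m_next, h_next)   # target network Q'^l
        y = r + gamma * q_next.max(dim=1).values         # Eq. (13)
    loss = ((y - q_sa) ** 2).mean()                      # Eq. (12)
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()
```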

#### *5.3. Implementation in an Actor–Critic Framework*

Our method can also be implemented in an actor–critic framework. In this implementation, each agent *i* has an actor network and a critic network, maintaining hidden states *h<sup>π</sup><sub>i</sub>* and *h<sup>Q</sup><sub>i</sub>*. After obtaining *s*, **A**O, and **A**C, agent *i* in group *l* computes the probability of each action as:

$$\pi^l(o\_i, m\_i, h\_i^{\pi}) \approx f\_R^{\pi^l}(e\_i^{\prime \pi}, e\_i^{\prime\prime \pi}, h\_i^{\pi}) \tag{14}$$

where *π<sup>l</sup>* represents the actor network of group *l* and *f<sup>π<sup>l</sup></sup><sub>R</sub>* represents the recurrent unit in *π<sup>l</sup>*. We employ an *ε*-categorical policy here: agent *i* determines an action based on *π<sup>l</sup>*(*oi*, *mi*, *h<sup>π</sup><sub>i</sub>*) with probability 1 − *ε* and makes a random choice with probability *ε*. The critic network *Q<sup>l</sup>* subsequently calculates the Q-values, as in the value-based framework. The hidden states *h<sup>π</sup><sub>i</sub>* and *h<sup>Q</sup><sub>i</sub>* and the next hidden states *h'<sup>π</sup><sub>i</sub>* and *h'<sup>Q</sup><sub>i</sub>* are stored in the replay buffer, where *h'<sup>π</sup><sub>i</sub>* and *h'<sup>Q</sup><sub>i</sub>* are the outputs of *π<sup>l</sup>* and *Q<sup>l</sup>*, respectively.

The critic network of each group is trained by minimizing the loss L(*θ<sup>Q</sup><sub>l</sub>*) in (12). As the actor–critic framework selects actions according to *π<sup>l</sup>*(*oi*, *mi*, *h<sup>π</sup><sub>i</sub>*) rather than the maximum Q-value, we use the expectation of the next state's Q-value to calculate the target value *yi* as:

$$y\_i = r\_i + \gamma \sum\_{a' \in \mathcal{A}\_i} \pi^{\prime l}(a' | o'\_i, m'\_i, h\_i^{\prime \pi}, \theta\_l^{\pi'}) Q^{\prime l}(o'\_i, m'\_i, a', h\_i^{\prime Q} | \theta\_l^{Q'}) \tag{15}$$

where *θ<sup>π'</sup><sub>l</sub>* and *θ<sup>Q'</sup><sub>l</sub>* are the parameters of the target networks *π'<sup>l</sup>* and *Q'<sup>l</sup>*, respectively.

The actor network of each group is trained according to the gradient:

$$\nabla\_{\theta\_l^{\pi}} \mathcal{J}(\theta\_l^{\pi}) = \frac{1}{N\_l} \sum\_{i=1}^{N\_l} \mathbb{E}\left[\nabla\_{\theta\_l^{\pi}} \log \pi^l(a\_i | o\_i, m\_i, h\_i^{\pi}, \theta\_l^{\pi}) \left(Q^l(o\_i, m\_i, a\_i, h\_i^{Q} | \theta\_l^{Q}) - b\_i\right)\right] \tag{16}$$

where the baseline *bi* is designed to reduce the variance and stabilize training [39]; it is defined as:

$$b\_i = \sum\_{a \in \mathcal{A}\_i} \pi^l(a | o\_i, m\_i, h\_i^{\pi}, \theta\_l^{\pi}) Q^l(o\_i, m\_i, a, h\_i^{Q} | \theta\_l^{Q}) \tag{17}$$

After training, *θ<sup>π'</sup><sub>l</sub>* and *θ<sup>Q'</sup><sub>l</sub>* are updated as *θ<sup>π'</sup><sub>l</sub>* ← *τθ<sup>π</sup><sub>l</sub>* + (1 − *τ*)*θ<sup>π'</sup><sub>l</sub>* and *θ<sup>Q'</sup><sub>l</sub>* ← *τθ<sup>Q</sup><sub>l</sub>* + (1 − *τ*)*θ<sup>Q'</sup><sub>l</sub>*, respectively [40].
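The actor–critic updates in (15)–(17) and the soft target update can be sketched as follows; the per-action probability and Q-value tensors are assumed to be produced by the corresponding networks.

```python
# Hedged sketch of the actor-critic targets (Eqs. (15)-(17)) and the soft target update.
import torch

def critic_target(r, pi_target_probs, q_target_all, gamma=0.95):
    # y_i = r_i + gamma * sum_a' pi'(a' | .) Q'(., a')   (Eq. (15))
    return r + gamma * (pi_target_probs * q_target_all).sum(dim=-1)

def actor_loss_with_baseline(log_pi_a, pi_probs, q_all, a):
    b = (pi_probs * q_all).sum(dim=-1)                   # baseline b_i   (Eq. (17))
    q_a = q_all.gather(1, a.unsqueeze(1)).squeeze(1)
    # Minimizing this surrogate ascends the gradient in Eq. (16).
    return -(log_pi_a * (q_a - b).detach()).mean()

def soft_update(target_net, net, tau=0.01):
    # theta' <- tau * theta + (1 - tau) * theta'
    for p_t, p in zip(target_net.parameters(), net.parameters()):
        p_t.data.mul_(1.0 - tau).add_(tau * p.data)
```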

Our method can be extended to continuous action spaces by estimating the expectation in *bi* with Monte Carlo samples or with a learnable state value function *V*(*oi*, *mi*) [23].

#### **6. Simulation**

#### *6.1. Setup*

To evaluate the performance of our method, we conduct a series of simulations on an Ubuntu 18.04 server with 2 NVIDIA RTX 3080 GPUs. We implement a value-based (VB) version and an actor–critic (AC) version of our method in PyTorch. Each fully connected layer and GRU layer contains 256 units. The activation function in the encoders and HGAT layers is ReLU [41]. The number of attention heads is 4. Empirically, we set the learning rate of the optimizer to 0.001 and the discount factor *γ* to 0.95. The replay buffer size is 50 K and the minibatch size is 128. *ε* is set to 0.3. For the value-based version, the target networks are updated every five training steps. For the actor–critic version, we set *τ* to 0.01. The networks are trained every 100 timeslots, and their parameters are updated four times per training step.

We compare our method with four MARL baselines: DGN, DQN, HAMA, and MADDPG. For the non-HGAT-based approaches, each agent concatenates all local states in its observation into a vector, padding zeros for unobserved entities. The parameters of the networks are shared among agents in all baselines except MADDPG. We use the Gumbel-Softmax reparameterization trick [42] in HAMA and MADDPG to make them trainable in discrete action spaces. DGN is based on our proposed algorithm [31], which applies a GAT layer for inter-agent communication. We train our method and each baseline for 100 K episodes and test them for 10 K episodes.

#### *6.2. UAV Recon*

As summarized in Table 1, we deploy several UAVs in a 200 × 200 area where 120 PoIs are distributed. The penalty factor *p* in (7) is set to 1. We evaluate the performance of our method in the test stage under different numbers of UAVs and compare it with the baselines.

**Table 1.** Experiment parameters of UAV recon.


Figure 5 shows the performance of each method in terms of coverage, fairness, and energy consumption under different numbers of UAVs. Note that both versions of our method are trained with 20 UAVs and transferred to UAV swarms of different scales. From Figure 5a,b, we observe that our method outperforms all baselines in terms of coverage and fairness. Compared with DGN and DQN, our method employs HGAT to extract features from observations, which is more effective than processing raw observation vectors directly. Therefore, our method helps the UAVs to search for PoIs and better optimize their flight trajectories. Although HAMA also applies HGAT, its UAVs cannot cooperate as effectively as with our method, owing to the lack of communication. In our method, the UAVs communicate with others and process the received messages with another HGAT layer. Furthermore, the recurrent unit helps the UAVs to learn from the hidden states, which induces better performance. In MADDPG, each UAV trains an individual network and concatenates the observations and actions of all agents into a high-dimensional vector as the input of the critic. As the networks in MADDPG grow with the scale of the agents, they are hard to train effectively and efficiently in large-scale multi-agent systems. As a consequence, MADDPG consumes more training time but obtains the lowest score.

Figure 5c indicates that our method consumes less energy than DGN and DQN. As the learned flight trajectories are better, the UAVs can cover more PoIs fairly while consuming less energy. The energy consumption of HAMA is comparable to that of our method in small-scale environments but increases when the number of UAVs reaches 40. MADDPG fails to improve coverage and fairness, so it tends to save energy to maximize its reward.

To test the capability of transfer learning, we compare the transferred policies with policies trained under the same settings used for testing. As shown in Figure 6, the performance does not deteriorate when the policy is transferred to execution with 10, 30, or 40 UAVs, which indicates that our method is highly transferable across various numbers of UAVs.

**Figure 5.** Simulation results of all methods on coverage, fairness, and energy consumption under different numbers of UAVs.

**Figure 6.** Simulation results of transfer learning on coverage, fairness, and energy consumption under different numbers of UAVs.

#### *6.3. Predator-Prey*

As summarized in Table 2, we deploy five predators in a 100 × 100 area to eliminate five prey. We set the attack reward of the predators and the prey to 10 and −10, respectively. The additional reward is set as *radditional* = 10 × *S*. We train the policies with the value-based version of our method and test them by competing against policies trained by the other methods.


**Table 2.** Experiment parameters of predator-prey.

Table 3 indicates that our method shows its superiority over all baselines in both the predator and prey roles. By introducing GRU and inter-agent communication, the predators obtain more information from hidden states and neighbors to decide which prey to capture, and it becomes more flexible for them to decide whether to chase prey individually or cooperatively. Similarly, GRU and inter-agent communication also bring more information to the prey, so they can choose among various strategies to survive. For example, the prey can escape from the predators using their faster speed or sacrifice one of their number to distract the predators.


**Table 3.** The mean and standard deviation of scores in predator-prey.
