*Article* **A Multi-Population Mean-Field Game Approach for Large-Scale Agents Cooperative Attack-Defense Evolution in High-Dimensional Environments †**

**Guofang Wang 1,2,3, Ziming Li 1,2, Wang Yao 2,3,4,\* and Sikai Xia 1,2**


**Abstract:** As one of the important issues of multi-agent collaboration, the large-scale agents' cooperative attack–defense evolution requires a large number of agents to make stress-effective strategies to achieve their goals in complex environments. Multi-agent attack and defense in high-dimensional environments (3D obstacle scenarios) present the challenge of accurately controlling high-dimensional state quantities. Moreover, the large scale makes the dynamic interactions in the attack and defense problems increase dramatically, which can cause a dimensional explosion when traditional optimal control techniques are used. How to model and solve the cooperative attack–defense evolution problem of large-scale agents in high-dimensional environments has therefore become a challenge. We jointly considered energy consumption, inter-group attack and defense, intra-group collision avoidance, and obstacle avoidance in the agents' cost functions. Meanwhile, high-dimensional state dynamics were used to describe the motion of agents under environmental interference. Then, we formulated the cooperative attack–defense evolution of large-scale agents in high-dimensional environments as a multi-population high-dimensional stochastic mean-field game (MPHD-MFG), which significantly reduces the communication frequency and computational complexity. We tractably solved the MPHD-MFG with a generative-adversarial-network (GAN)-based method using the MFGs' underlying variational primal–dual structure. Based on our approach, we carried out an integrative experiment in which we analytically showed the fast convergence of our cooperative attack–defense evolution algorithm via the convergence of the Hamilton–Jacobi–Bellman equation's residual errors. The experiment also showed that a large number of drones can avoid obstacles and smoothly evolve their attack and defense behaviors while minimizing their energy consumption. In addition, the comparison with the baseline methods showed that our approach compares favorably.

**Keywords:** large-scale agents; attack and defense; multi-population mean-field game; high-dimensional solution space; neural networks

**MSC:** 91A16; 93-10; 49N80

#### **1. Introduction**

Cooperative control among multiple agents has always been an important research topic in swarm intelligence [1]. Game theory, as a branch of modern mathematics, studies optimal decision-making problems under conflicting adversarial conditions and can provide a

**Citation:** Wang, G.; Li, Z.; Yao, W.; Xia, S. A Multi-Population Mean-Field Game Approach for Large-Scale Agents Cooperative Attack-Defense Evolution in High-Dimensional Environments. *Mathematics* **2022**, *10*, 4075. https:// doi.org/10.3390/math10214075

Academic Editor: Ioannis G. Tsoulos

Received: 28 September 2022 Accepted: 31 October 2022 Published: 2 November 2022


**Copyright:** © 2022 by the authors. Licensee MDPI, Basel, Switzerland. This article is an open access article distributed under the terms and conditions of the Creative Commons Attribution (CC BY) license (https:// creativecommons.org/licenses/by/ 4.0/).

theoretical basis for multi-agent cooperative decision-making problems. This paper is oriented toward the large-scale agents' cooperative attack–defense evolution, one of the important issues of multi-agent collaboration, which requires a large number of agents to make stress-effective strategies to achieve their goals in complex environments [2]. The multi-player continuous attack–defense game, as the mainstream research method for multi-agent attack–defense evolution, is a differential game between two adversarial teams of cooperative players playing in an area with targets. The attacker attempts to reach the set destination. The goal of the defender is to delay or stop the attacker by catching it [3]. The attack–defense game problem can be described as an optimal decision-making problem under complex multi-constraint conditions [4].

The classic attack–defense games have long been studied in multi-agent cooperative control. The multiple-pursuer-one-evader problem has been well documented. In [5], the Voronoi diagram construct was used for the capture of an evader within a bounded domain. Based on differential game theory and optimal control theory, differential game models have been established to solve optimal strategies by setting some rules and assumptions. In [6], based on the geometric relationship between two pursuers and an evader, a differential game model was established through coordinate transformation to solve the optimal cooperative strategy. Paper [7] studied the optimal guidance law of two missiles intercepting a single target based on the hypothesis of the missile hit sequence. Reference [8] proposed an online decision-making technique based on deep reinforcement learning (DRL) for solving the environmental sensing and decision-making problems in the chase and escape games. For the case of N-pursuers and M-evaders, more complex interactions need to be analyzed. Paper [9] studied a multiplayer border-defense problem and extended classical differential game theory to simultaneously address weapon assignments and multiplayer pursuit–evasion scenarios. Papers [10,11] proposed some methods based on linear programming, and applied deep reinforcement learning methods to deal with a simplified version of the RoboFlag competition [4]. An approach to the task allocation of many agents was proposed in [12], where Bakolas and Tsiotras used the Voronoi diagram construct to control the system. To solve general attack–defense games, the Hamilton–Jacobi–Isaacs (HJI) approach is ideal when the game is low-dimensional. However, because its complexity increases exponentially with the number of agents, the HJI approach is only tractable for the two-player game [13]. 
However, the above methods cannot be directly applied to the large-scale attack–defense game we are concerned with, because these traditional optimization and control technologies deal with the dynamic interactions between individuals separately. Moreover, to conduct more accurate real-time control of agents, the state variables used to characterize their kinematics are usually high-dimensional. Thus, as the number of agents increases, the modeling process of cooperative attack–defense problems becomes complex, and the difficulty of solving for the optimal strategy increases significantly.

To solve the communication and calculation difficulties caused by agents' interactions on a large scale, mean-field games (MFGs) were proposed by Lasry and Lions [14–16] and Huang, Malhame, and Caines [17–19] independently. MFGs have been widely used in industrial engineering [20–22], crowd motion [23–26], swarm robotics [27,28], epidemic modeling [29,30], and data science [31–33]. In the MFG, each agent can obtain the evolution of the global or macroscopic information (mean field) by solving the Fokker–Planck–Kolmogorov (FPK) equation. The optimal strategy of each agent is found by solving the Hamilton–Jacobi–Bellman (HJB) equation [34]. Recently, a machine-learning-based method, named APAC-net, has been proposed to solve high-dimensional stochastic MFGs [35]. Intuitively, large-scale attack–defense problems are more consistent with the multi-population model. The multi-population mean-field game is a critical subclass of MFGs. It is a theoretically feasible multi-agent model for simulating and analyzing the game between multiple heterogeneous populations of massive numbers of interacting agents. We proposed a numerical solution method (CA-Net) for multi-population high-dimensional stochastic MFGs in [36], and studied the large-scale attack–defense problem in a 3D blank scenario based on CA-Net

in [37]. In this paper, we focus on a 3D obstacle scene. The presence of multiple obstacles in the 3D space brings qualitative changes to the attack and defense decisions of large-scale agents, i.e., a "diversion" phenomenon. How to model and solve the cooperative attack–defense evolution problem of large-scale agents in 3D obstacle scenarios has become a challenge.

Inspired by the above-mentioned cutting-edge works, in this paper, we jointly considered energy consumption, inter-group attack and defense, intra-group collision avoidance, and obstacle avoidance in their cost functions. Meanwhile, the high-dimensional state dynamics were used to describe the motion of agents under environmental interference. Then, we made the following main contributions:


- Our approach accomplishes a breakthrough from few-to-few to mass-to-mass in terms of attack–defense game theory in 3D obstacle scenarios.

In Section 2, we model the cooperative attack–defense evolution of large-scale agents in a 3D obstacle scene and formulate it as an MPHD-MFG. In Section 3, we propose ECA-Net to tractably solve the MPHD-MFG. Section 4 shows the performance of our algorithm with numerical results, and in Section 5, we draw our conclusions.

#### **2. Modeling and Formulating**

The system model consists of the objective function and the kinematics equation, which describe how large-scale agents evolve paths through attack–defense relationships. We considered a 3D obstacle attack–defense scenario with *N* agents, as shown in Figure 1. The blue side $\mathcal{N}\_1$, as the attacker with $N\_1$ agents, hopes to break through the red side's interception and successfully reach the destination; the red side $\mathcal{N}\_2$, as the defender with $N\_2$ agents, hopes to complete the interception against the blue side in the given area to prevent the blue side from penetrating. The state of agent *i* is denoted by $\mathbf{x}\_i^p(t) \in \mathbb{R}^n$, where $p = 1$ represents the blue side, $p = 2$ represents the red side, $i \in \mathcal{N}\_1$ or $\mathcal{N}\_2$, and $N\_1 + N\_2 = N$.

**Figure 1.** Vertical view of large-scale agents attacking and defending in 3D obstacle scene.

#### *2.1. Kinematics Equation*

Each agent $i \in \mathcal{N} = \mathcal{N}\_1 \cup \mathcal{N}\_2$ belongs to a population $p(i) \in \mathcal{Q} = \{1, 2\}$ and is characterized by a state $\mathbf{x}\_i^p(t) \in \mathbb{R}^n$ at time $t \in [0, T]$. We considered that the continuous-time state $\mathbf{x}\_i^p(t)$ of player *i* of population *p* has the dynamic–kinematic equation of the following form

$$\mathbf{dx}\_i^p = \mathbf{h}\_i^p(\mathbf{x}\_i^p, \mathbf{x}\_i^{p^-}, \mathbf{x}\_i^{-p^-}, \mathbf{u}\_i^p)\mathbf{d}t,\tag{1}$$

where $\mathbf{u}\_i^p : [0, T] \to \mathcal{U}\_p \subseteq \mathbb{R}^m$ is the control input (strategy) implemented by agent *i* in view of the state $\mathbf{x}\_i^p$ at time $t$, $\mathbf{x}\_i^{p^-} : \mathbb{R}^n \times [0, T] \to \mathcal{P}\_2(\mathbb{R}^n)$ denotes the states of all other intra-group agents, $\mathbf{x}\_i^{-p^-}$ denotes the states of all other inter-group agents, and $\mathbf{h}\_i^p : \mathbb{R}^n \times \mathcal{P}\_2(\mathbb{R}^n)^2 \times \mathcal{U}\_p \to \mathbb{R}^n$ is the nonlinear evolution function, $p = 1, 2$.
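Concretely, once $\mathbf{h}\_i^p$ is specified, the dynamics (1) can be integrated numerically. Below is a minimal sketch (not from the paper) using forward Euler, assuming the simplest single-integrator case $\mathbf{h}\_i^p = \mathbf{u}\_i^p$ and a hypothetical constant control:

```python
import numpy as np

# Illustrative sketch: forward-Euler rollout of the deterministic dynamics
# dx_i^p = h_i^p(...) dt for the single-integrator special case h_i^p = u_i^p.
# The constant control below is a hypothetical placeholder.
def euler_rollout(x0, u, dt=0.01, steps=100):
    """Integrate dx = u dt from x0 with a constant control u."""
    x = np.array(x0, dtype=float)
    traj = [x.copy()]
    for _ in range(steps):
        x = x + u * dt          # one Euler step of dx = h(x, u) dt with h = u
        traj.append(x.copy())
    return np.stack(traj)

traj = euler_rollout(x0=[0.0, 0.0, 0.0], u=np.array([1.0, 0.0, 0.5]))
print(traj[-1])  # ≈ x0 + u*T = [1.0, 0.0, 0.5]
```

In the paper's setting, $\mathbf{h}\_i^p$ additionally depends on the other agents' states, so the right-hand side would be re-evaluated from the whole swarm at each step.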

#### *2.2. Objective Function*

For player *i* of population *p*, *p* = 1, 2, we considered the cost function of the following form:

$$\begin{split} \mathbf{J}\_{i}^{p}(\mathbf{x}\_{i}^{p},\mathbf{x}\_{i}^{p^{-}},\mathbf{x}\_{i}^{-p^{-}},\mathbf{u}\_{i}^{p}) &= \int\_{0}^{T} L\_{i}^{p}(\mathbf{x}\_{i}^{p}(t),\mathbf{u}\_{i}^{p}(t)) + F\_{i}^{p}(\mathbf{x}\_{i}^{p}(t),\mathbf{x}\_{i}^{p^{-}}(\mathbf{x}\_{i}^{p}(t),t),\mathbf{x}\_{i}^{-p^{-}}(\mathbf{x}\_{i}^{-p}(t),t)) dt \\ &+ G\_{i}^{p}(\mathbf{x}\_{i}^{p}(T),\mathbf{x}\_{i}^{p^{-}}(\mathbf{x}\_{i}^{p}(T),T),\mathbf{x}\_{i}^{-p^{-}}(\mathbf{x}\_{i}^{-p}(T),T)), \end{split} \tag{2}$$

where $L\_i^p : \mathbb{R}^n \times \mathcal{U}\_p \to \mathbb{R}$ is the running cost incurred by agent *i* based solely on its actions, $F\_i^p : \mathbb{R}^n \times \mathcal{P}\_2(\mathbb{R}^n)^2 \to \mathbb{R}$ is the running cost incurred by agent *i* based on its interactions with the rest of the same population and the other population, and $G\_i^p : \mathbb{R}^n \times \mathcal{P}\_2(\mathbb{R}^n)^2 \to \mathbb{R}$ is a terminal cost incurred by agent *i* based on its final state and the final states of the related populations.

#### 2.2.1. Blue-Side Control Problem

In the 3D obstacle space, the blue side's agents traveling from a common source hope to break through the red side's interception while avoiding obstacles and successfully reach the destination, where a straight line segment in the area above the red side's initial distribution is set as the destination. At time $t \geq 0$, the *i*-th agent controls its strategy $\mathbf{u}\_i^1$ to minimize its: (1) motion energy, (2) red side capture, (3) blue side inter-agent collision, (4) collision of the blue side with the obstacles, and (5) travel time during the remaining travel to the destination. The running cost $L\_i^1$ is given by

$$L\_i^1 = \underbrace{c\_1 \|\mathbf{u}\_i^1(t)\|^2}\_{\text{(1) motion energy minimization}}, \tag{3}$$

where $c\_1$ is a constant. The running cost $L\_i^1$ denotes the control effort implemented by agent *i*. The interaction cost $F\_i^1$ is given by

$$F\_i^1 = \underbrace{c\_2 \left( -\frac{1}{N\_2} \sum\_{k=1}^{N\_2} \left\| \mathbf{p}\_i^1(t) - \left(\mathbf{p}\_k^2(t) + e\_r \mathbf{e}\right) \right\| \right)}\_{\text{(2) red side capture minimization}} + \underbrace{c\_3 \left( \frac{1}{N\_1} \sum\_{j \neq i}^{N\_1} \mathbf{1}\_{\|\mathbf{p}\_i^1(t) - \mathbf{p}\_j^1(t)\| \leq e\_0} \right)}\_{\text{(3) blue side inter-agent collision avoidance}} + \underbrace{c\_4 \sum\_{a=1}^{A} \gamma\_{obs,a} \frac{1}{Q\_a(\mathbf{p}\_i, t)} \mathbf{1}\_{\mathbf{p}\_i \in \Omega\_{obs,trn}}}\_{\text{(4) blue side obstacle collision avoidance}}, \tag{4}$$

$$
\Omega\_{obs,trn} = \bigcup\_{a=1}^{A} \Omega\_{obs,trn,a} \tag{5}
$$

$$\Omega\_{obs,trn,a}: \quad Q\_a(x, y, z) = \frac{1}{3v\_{1,a}^2} (x - x\_{0,a})^2 + \frac{1}{3v\_{2,a}^2} (y - y\_{0,a})^2 + \frac{1}{3v\_{3,a}^2} (z - z\_{0,a})^2 \le 1.1, \quad a = 1, \dots, A,\tag{6}$$

$$\Omega\_{obs,a} = \{ (x, y, z) \,|\, |x - x\_{0,a}| \le v\_{1,a}, |y - y\_{0,a}| \le v\_{2,a}, |z - z\_{0,a}| \le v\_{3,a} \}, \quad a = 1, \dots, A, \tag{7}$$

where $\mathbf{p} = (x, y, z)$ is the agent's usual Euclidean spatial coordinate position, $\mathbf{e}$ is the unit vector of the spherical coordinate system, $e\_r$ is the capture radius of the red side agent, $e\_0$ is the safe distance between agents, $c\_2, c\_3, c\_4$ are constants, $\gamma\_{obs,a}$, $a = 1, \dots, A$, is the repulsive force gain coefficient between agent *i* and obstacle *a*, $(x\_{0,a}, y\_{0,a}, z\_{0,a})$ is the center of the corresponding obstacle, and $(\pm v\_{1,a}, \pm v\_{2,a}, \pm v\_{3,a})$ are the offsets of its vertices along the *x*, *y*, *z* axes, respectively. The interaction cost $F\_i^1$ is the sum of agent *i*'s trajectory collision losses with the intra-group agents, the inter-group agents, and the obstacles.

**Remark 1.** *In our 3D obstacle scene, obstacles are rectangular solids* $\Omega\_{obs}$. *For training, the cuboid obstacle repulsion is formulated as a differentiable unit ellipsoid repulsion function* $\frac{1}{Q}$ *[38]. To better reduce collisions between agents and obstacles, the radius of* $\Omega\_{obs,trn}$ *is 10% larger than that of the circumscribed ellipsoid of* $\Omega\_{obs}$. *Our algorithm was trained using ellipsoidal repulsion, which produces smooth gradient information inside obstacles, stimulating the model to learn obstacle-avoiding trajectories [39].*
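To make the cost terms concrete, here is a minimal NumPy sketch of the blue-side interaction cost in Equation (4). All constants ($c\_2$, $c\_3$, $c\_4$, $e\_r$, $e\_0$, $\gamma\_{obs,a}$), the obstacle data, and the choice of the unit vector $\mathbf{e}$ are hypothetical placeholders, and the possible time dependence of $Q\_a$ is dropped:

```python
import numpy as np

def Q(p, center, v):
    """Unit-ellipsoid function of Eq. (6) for one cuboid obstacle (static)."""
    return np.sum((p - center) ** 2 / (3.0 * v ** 2))

# Sketch of F_i^1 for blue agent at p_i; P_blue holds the OTHER blue agents
# (j != i), P_red the red agents; obstacles is a list of (center, v, gamma).
def F1(p_i, P_blue, P_red, obstacles,
       c2=1.0, c3=1.0, c4=1.0, e_r=0.1, e0=0.05):
    e = p_i - P_red.mean(axis=0)
    e = e / (np.linalg.norm(e) + 1e-12)           # stand-in for the unit vector e
    # (2) red-side capture minimization: reward distance to (shifted) red agents
    capture = -np.mean([np.linalg.norm(p_i - (p_k + e_r * e)) for p_k in P_red])
    # (3) intra-group collision: fraction of blue teammates closer than e0
    collision = np.mean([np.linalg.norm(p_i - p_j) <= e0 for p_j in P_blue])
    # (4) obstacle repulsion: 1/Q only inside the enlarged training region
    obs = sum(g / Q(p_i, c, v) for (c, v, g) in obstacles if Q(p_i, c, v) <= 1.1)
    return c2 * capture + c3 * collision + c4 * obs
```

Inside an obstacle's training ellipsoid, $Q\_a$ is small, so the $1/Q\_a$ term grows sharply, which is the smooth repulsion the remark describes.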

The terminal cost $G\_i^1$ is given by

$$G\_i^1 = \underbrace{c\_5 \|\mathbf{x}\_i^1(T) - \mathbf{x}\_0\|\_2}\_{\text{(5) travel time minimization}}, \tag{8}$$

where $c\_5$ is a constant, $\mathbf{x}\_i^1(T)$ is the final state of agent *i*, and $\mathbf{x}\_0$ is the target state in which we want the agents to reach the objective. The terminal cost $G\_i^1$ is the distance between agent *i*'s terminal state and the desired state.

In summary, by defining the cost function as (2), the optimal control problem faced by agent *i* of the blue side is given by

$$\begin{aligned} \inf\_{\mathbf{u}\_i^1 \in \mathcal{U}\_1} \quad & J\_i^1(\mathbf{x}\_i^1, \mathbf{x}\_i^{1-}, \mathbf{x}\_i^{2-}, \mathbf{u}\_i^1) \\ \text{s.t.} \quad & \mathbf{d}\mathbf{x}\_i^1 = \mathbf{h}\_i^1(\mathbf{x}\_i^1, \mathbf{x}\_i^{1-}, \mathbf{x}\_i^{2-}, \mathbf{u}\_i^1) \mathrm{d}t. \end{aligned} \tag{9}$$

#### 2.2.2. Red Side Control Problem

The red side's agents traveling from a common source avoid obstacles and hope to complete the interception against the blue side to prevent the blue side from penetrating. At time $t \geq 0$, the *i*-th agent controls its strategy $\mathbf{u}\_i^2$ to minimize its: (1) motion energy, (2) distance to the blue side, (3) red side inter-agent collision, and (4) collision of the red side with the obstacles during the remaining travel time. The running cost $L\_i^2$ is given by

$$L\_i^2 = \underbrace{l\_1 \|\mathbf{u}\_i^2(t)\|^2}\_{\text{(1) motion energy minimization}}, \tag{10}$$

where $l\_1$ is a constant. The running cost $L\_i^2$ denotes the control effort implemented by agent *i*. The interaction cost $F\_i^2$ is given by

$$F\_i^2 = \underbrace{l\_2 \left( \frac{1}{N\_1} \sum\_{k=1}^{N\_1} \left\| \mathbf{p}\_i^2(t) - \mathbf{p}\_k^1(t) \right\| \right)}\_{\text{(2) distance to blue side minimization}} + \underbrace{l\_3 \left( \frac{1}{N\_2} \sum\_{j \neq i}^{N\_2} \mathbf{1}\_{\|\mathbf{p}\_i^2(t) - \mathbf{p}\_j^2(t)\| \leq e\_0} \right)}\_{\text{(3) red side inter-agent collision avoidance}} + \underbrace{l\_4 \sum\_{a=1}^{A} \gamma\_{obs,a} \frac{1}{Q\_a(\mathbf{p}\_i, t)} \mathbf{1}\_{\mathbf{p}\_i \in \Omega\_{obs,trn}}}\_{\text{(4) red side obstacle collision avoidance}}, \tag{11}$$

where $l\_2, l\_3, l\_4$ are constants. The interaction cost $F\_i^2$ is the sum of agent *i*'s trajectory collision losses with the intra-group agents, the inter-group agents, and the obstacles.

In summary, by defining the cost function as (2), the optimal control problem faced by agent *i* of the red side is given by

$$\begin{aligned} \inf\_{\mathbf{u}\_i^2 \in \mathcal{U}\_2} \quad & J\_i^2(\mathbf{x}\_i^2, \mathbf{x}\_i^{2-}, \mathbf{x}\_i^{1-}, \mathbf{u}\_i^2) \\ \text{s.t.} \quad & \mathbf{d}\mathbf{x}\_i^2 = \mathbf{h}\_i^2(\mathbf{x}\_i^2, \mathbf{x}\_i^{2-}, \mathbf{x}\_i^{1-}, \mathbf{u}\_i^2) \mathrm{d}t. \end{aligned} \tag{12}$$

To this end, the cooperative attack–defense evolution for large-scale agents in the 3D obstacle scene is formulated as a non-cooperative differential game; thus, an *N*-player non-cooperative game is formed. Its natural solution concept is the Nash equilibrium, that is, no agent can unilaterally reduce its cost under this control decision [16].

#### *2.3. Multi-Population High-Dimensional Mean-Field Game*

With the increase of the number *N* of game participants, the complexity of solving differential games increases significantly. For the current agent in the cooperative attack–defense problem (9) and (12), the neighborhood interactions of intra-group collision avoidance, the inter-group interactions of attack and defense, and the environmental interactions of obstacle avoidance require a large amount of communication and computing resources. Traditional optimization and control methods usually deal with the increasing interactions separately, leading to the dimension explosion problem. The mean-field game is able to overcome the communication and computational difficulties associated with a large scale; its core technique is to recast a swarm intelligence problem with a large number of interacting agents as a coupled set of partial differential equations (PDEs). Therefore, we propose the multi-population high-dimensional stochastic MFG (MPHD-MFG) reformulation of the cooperative attack–defense problem. Under the framework of the MPHD-MFG, a generic player only reacts to the collective behavior (mean field) of all players instead of the behavior of each player, which greatly reduces the amount of communication and computation. Here, "mean field" means the states' probability distribution. We can now drop the index *i*, since players are indistinguishable within each population of the MPHD-MFG. Let $\rho^p(\mathbf{x}^p, t) : \mathbb{R}^n \times [0, T] \to \mathcal{P}\_2(\mathbb{R}^n)$ denote the probability density function of state $\mathbf{x}^p$ at time *t*; then, the cost function (2) is transformed into

$$\begin{split} J^p(\mathbf{x}^p, \rho^p, \rho^{-p}, \mathbf{u}^p) &= \mathbb{E}\left[\int\_0^T L^p(\mathbf{x}^p(t), \mathbf{u}^p(t)) + F^p(\mathbf{x}^p(t), \rho^p(\mathbf{x}^p(t), t), \rho^{-p}(\mathbf{x}^{-p}(t), t)) \mathrm{d}t + G^p(\mathbf{x}^p(T), \rho^p(\mathbf{x}^p(T), T), \rho^{-p}(\mathbf{x}^{-p}(T), T))\right] \\ &= \int\_0^T \left\{ \int\_{\Omega} L^p(\mathbf{x}^p(t), \mathbf{u}^p(t)) \rho^p(\mathbf{x}^p, t) \mathrm{d}\mathbf{x} \right\} \mathrm{d}t + \int\_0^T \left\{ \int\_{\Omega} F^p(\mathbf{x}^p(t), \rho^p(\mathbf{x}^p(t), t), \rho^{-p}(\mathbf{x}^{-p}(t), t)) \rho^p(\mathbf{x}^p, t) \mathrm{d}\mathbf{x} \right\} \mathrm{d}t \\ &\quad + \int\_{\Omega} G^p(\mathbf{x}^p(T), \rho^p(\mathbf{x}^p(T), T), \rho^{-p}(\mathbf{x}^{-p}(T), T)) \rho^p(\mathbf{x}^p, T) \mathrm{d}\mathbf{x} \\ &= \int\_0^T \left\{ \int\_{\Omega} L^p(\mathbf{x}^p(t), \mathbf{u}^p(t)) \rho^p(\mathbf{x}^p, t) \mathrm{d}\mathbf{x} \right\} \mathrm{d}t + \int\_0^T \mathcal{F}^p(\rho^p(\cdot, t), \rho^{-p}(\cdot, t)) \mathrm{d}t + \mathcal{G}^p(\rho^p(\cdot, T), \rho^{-p}(\cdot, T)), \end{split} \tag{13}$$

where $\Omega \subseteq \mathbb{R}^n$ is the state space containing all possible states of the generic agent, and the variational derivatives of the functionals $\mathcal{F}^p$, $\mathcal{G}^p$ with respect to $\rho$ are the interaction and terminal costs $F^p$ and $G^p$, respectively. Meanwhile, the state dynamic–kinematic equation in (1) is transformed into

$$\mathbf{d}\mathbf{x}^{p} = \mathbf{h}^{p}(\mathbf{x}^{p}, \rho^{p}, \rho^{-p}, \mathbf{u}^{p})\mathbf{d}t + \sigma^{p}\mathbf{d}\mathbf{W}\_{t}^{p} \tag{14}$$

where $\sigma^p \in \mathbb{R}^{n \times m}$ denotes a fixed coefficient matrix of a population-dependent volatility term and $\mathbf{W}^p$ is an *m*-dimensional Wiener process arising from the environment, in which each component $\mathbf{W}\_k^p$ is independent of $\mathbf{W}\_l^p$ for all $k \neq l$, $p = 1, 2$. According to Itô's lemma [40], (14) can be expressed in terms of the mean field $\rho^p(\mathbf{x}^p, t)$ and is then equivalent to the Fokker–Planck–Kolmogorov (FPK) equation given by

$$
\partial\_t \rho^p - \frac{\sigma^{p^2}}{2} \Delta \rho^p + \nabla \cdot (\rho^p \mathbf{h}^p) = 0. \tag{15}
$$
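The relation between the SDE (14) and this mean-field description can be checked by Monte Carlo: simulate many independent agents with Euler–Maruyama and inspect the empirical moments of the resulting population. Below is a sketch under the hypothetical single-integrator case $\mathbf{h}^p = \mathbf{u}^p$ with a constant control and scalar volatility (all coefficients are placeholders):

```python
import numpy as np

# Illustrative Euler-Maruyama simulation of the controlled SDE (14) for the
# special case dx = u dt + sigma dW with constant (hypothetical) control u.
# For this linear case, x(T) ~ N(x0 + u*T, sigma^2 * T) in each coordinate.
def simulate(x0, u, sigma, T=1.0, steps=200, n_agents=20000, seed=0):
    rng = np.random.default_rng(seed)
    dt = T / steps
    x = np.full((n_agents, len(x0)), x0, dtype=float)
    for _ in range(steps):
        dW = rng.normal(0.0, np.sqrt(dt), size=x.shape)  # Wiener increments
        x += u * dt + sigma * dW                          # dx = h dt + sigma dW
    return x

X = simulate(x0=[0.0, 0.0], u=np.array([1.0, -0.5]), sigma=0.3)
print(X.mean(axis=0))  # ≈ x0 + u*T = [1.0, -0.5]
print(X.var(axis=0))   # ≈ sigma^2 * T = 0.09
```

The empirical density of the simulated swarm is the particle approximation of $\rho^p$, whose evolution the FPK Equation (15) describes exactly.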

With cost function (13) and FPK Equation (15), the multi-population high-dimensional MFG, which describes the cooperative attack–defense evolution of a large number of agents, is now summarized as

$$\begin{aligned} \inf\_{\rho^p, \mathbf{u}^p} \; & J^p(\mathbf{x}^p, \rho^p, \rho^{-p}, \mathbf{u}^p) \\ \text{s.t. } & \partial\_t \rho^p - \frac{\sigma^{p^2}}{2} \Delta \rho^p + \nabla \cdot (\rho^p \mathbf{h}^p) = 0, \quad \rho^p(\mathbf{x}^p, 0) = \rho\_0^p(\mathbf{x}^p) \\ & \partial\_t \rho^{-p} - \frac{\sigma^{-p^2}}{2} \Delta \rho^{-p} + \nabla \cdot (\rho^{-p} \mathbf{h}^{-p}) = 0, \quad \rho^{-p}(\mathbf{x}^{-p}, 0) = \rho\_0^{-p}(\mathbf{x}^{-p}) \\ & p = 1, 2, \end{aligned} \tag{16}$$

where $\rho\_0^p$ is the initial probability distribution of population *p*'s agents. To this end, the cooperative attack–defense evolution for large-scale agents in the 3D obstacle scene is formulated as a multi-population high-dimensional MFG. For every population $p = 1, 2$, each agent *i* of population $p(i)$ forecasts a distribution $\{\rho^p(\cdot, t)\}\_{t=0}^T$ and aims at minimizing its cost, eventually reaching the Nash equilibrium, where no agent can decrease its individual cost by changing its control strategy unilaterally. Formally, for every $\mathbf{x}^p \in \mathbb{R}^n$:

$$J^p(\mathbf{x}^p, \rho^p, \rho^{-p}, \hat{\mathbf{u}}^p) \le J^p(\mathbf{x}^p, \rho^p, \rho^{-p}, \mathbf{u}^p), \quad \forall \mathbf{u}^p : [0, T] \to \mathcal{U}\_p, \tag{17}$$

where $\hat{\mathbf{u}}^p$ is the agent's equilibrium strategy at state $\mathbf{x}^p$. Here, we assumed that each agent is small and that its unilateral actions do not change the density $\rho^p$. Reference [41] provides a sufficient condition for the solution to the multi-population MFG PDEs, and Reference [42] provides the necessary one.

**Remark 2.** *Under appropriate assumptions, the MFG's solution offers an approximate Nash equilibrium ($\varepsilon$-NE) for the corresponding game with a large, but finite number of agents [41].*

#### **3. GAN-Based Approach for MPHD-MFG**

In this section, we put forward a generative-adversarial-network (GAN)-based method for solving the multi-population high-dimensional MFG in (16). Inspired by Wasserstein GANs [43], APAC-Net [35], and CA-Net [36,37], we formulated (16) as a convex-concave saddle-point problem using MFG's variational primal–dual structure. Then, we propose an extended coupled alternating neural network (ECA-Net) algorithm to solve the MPHD-MFG in (16).

#### *3.1. Variational Primal–Dual Structure of MPHD-MFG*

We reveal the underlying primal–dual structure of the MPHD-MFG, then deduce the convex–concave saddle-point problem equivalent to (16). Denote by $\Phi^p(\mathbf{x}^p, t) = \inf\_{\mathbf{u}^p} J^p(\mathbf{x}^p, \rho^p, \rho^{-p}, \mathbf{u}^p)$ the Lagrange multiplier; we can add the differential constraint (15) (the FPK equation) into the cost function (13) to obtain the extended cost function:

$$\begin{split} \sup\_{\Phi^p} \inf\_{\rho^p, \mathbf{u}^p} & \left\{ \int\_0^T \int\_{\Omega} L^p(\mathbf{x}^p(t), \mathbf{u}^p(t)) \rho^p(\mathbf{x}^p, t) \mathrm{d}\mathbf{x} \mathrm{d}t + \int\_0^T \mathcal{F}^p(\rho^p(\cdot, t), \rho^{-p}(\cdot, t)) \mathrm{d}t + \mathcal{G}^p(\rho^p(\cdot, T), \rho^{-p}(\cdot, T)) \right. \\ & \quad - \int\_0^T \int\_{\Omega} \Phi^p(\mathbf{x}^p, t) \left( \partial\_t \rho^p - \frac{\sigma^{p^2}}{2} \Delta \rho^p + \nabla \cdot (\rho^p(\mathbf{x}^p, t) \mathbf{h}^p(\mathbf{x}^p, \rho^p, \rho^{-p}, \mathbf{u}^p)) \right) \mathrm{d}\mathbf{x}^p \mathrm{d}t \\ & \quad \left. - \int\_0^T \int\_{\Omega} \Phi^{-p}(\mathbf{x}^{-p}, t) \left( \partial\_t \rho^{-p} - \frac{\sigma^{-p^2}}{2} \Delta \rho^{-p} + \nabla \cdot (\rho^{-p}(\mathbf{x}^{-p}, t) \mathbf{h}^{-p}(\mathbf{x}^{-p}, \rho^{-p}, \rho^p, \mathbf{u}^{-p})) \right) \mathrm{d}\mathbf{x}^{-p} \mathrm{d}t \right\}. \end{split} \tag{18}$$

The Hamiltonian $H\_p : \mathbb{R}^n \times \mathcal{P}\_2(\mathbb{R}^n)^2 \times \mathbb{R}^n \to \mathbb{R}$, $p = 1, 2$, is defined as

$$H\_p(\mathbf{x}^p, \boldsymbol{\rho}^p, \boldsymbol{\rho}^{-p}, z^p) = \sup\_{\mathbf{u}^p} \left\{-L\_p(\mathbf{x}^p, \mathbf{u}^p) - z^{p^\top} \mathbf{h}^p(\mathbf{x}^p, \boldsymbol{\rho}^p, \boldsymbol{\rho}^{-p}, \mathbf{u}^p) \right\}. \tag{19}$$
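For intuition, in the special case of a quadratic running cost $L\_p = c\|\mathbf{u}^p\|^2$ (as in (3) and (10)) and single-integrator dynamics $\mathbf{h}^p = \mathbf{u}^p$, the supremum in (19) has the closed form $H\_p(z^p) = \|z^p\|^2 / (4c)$, attained at $\mathbf{u}^\* = -z^p/(2c)$ (complete the square: $-c\|\mathbf{u} + z/(2c)\|^2 + \|z\|^2/(4c)$). A quick numeric sanity check of this derivation, with arbitrary illustrative values:

```python
import numpy as np

# Special case of the Hamiltonian (19): L = c||u||^2, h = u.
# Then H(z) = sup_u { -c||u||^2 - z.u } = ||z||^2 / (4c), at u* = -z/(2c).
def H_closed_form(z, c=0.5):
    return np.dot(z, z) / (4.0 * c)

def objective(u, z, c=0.5):
    """The expression inside the sup of Eq. (19) for this special case."""
    return -c * np.dot(u, u) - np.dot(z, u)

z = np.array([1.0, -2.0, 0.5])
c = 0.5
u_star = -z / (2 * c)
print(objective(u_star, z, c))   # = ||z||^2/(4c) = 5.25/2 = 2.625
```

Sampling random controls confirms that no $\mathbf{u}$ exceeds the closed-form value, as the test below checks.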

Utilizing (19) and integrating by parts, we can rewrite (18) as

$$\begin{split} \inf\_{\rho^p} \sup\_{\Phi^p} & \left\{ \int\_0^T \int\_{\Omega} \left( \partial\_t \Phi^p + \frac{\sigma^{p^2}}{2} \Delta \Phi^p - H\_p(\mathbf{x}^p, \nabla \Phi^p) \right) \rho^p(\mathbf{x}^p, t) \mathrm{d}\mathbf{x}^p \mathrm{d}t + \int\_0^T \mathcal{F}^p(\rho^p(\cdot, t), \rho^{-p}(\cdot, t)) \mathrm{d}t + \mathcal{G}^p(\rho^p(\cdot, T), \rho^{-p}(\cdot, T)) \right. \\ & \quad + \int\_{\Omega} \Phi^p(\mathbf{x}^p, 0) \rho\_0^p(\mathbf{x}^p) \mathrm{d}\mathbf{x}^p - \int\_{\Omega} \Phi^p(\mathbf{x}^p, T) \rho^p(\mathbf{x}^p, T) \mathrm{d}\mathbf{x}^p \\ & \quad + \int\_0^T \int\_{\Omega} \left( \partial\_t \Phi^{-p} + \frac{\sigma^{-p^2}}{2} \Delta \Phi^{-p} + \nabla \Phi^{-p} \cdot \mathbf{h}^{-p} \right) \rho^{-p}(\mathbf{x}^{-p}, t) \mathrm{d}\mathbf{x}^{-p} \mathrm{d}t \\ & \quad \left. + \int\_{\Omega} \Phi^{-p}(\mathbf{x}^{-p}, 0) \rho\_0^{-p}(\mathbf{x}^{-p}) \mathrm{d}\mathbf{x}^{-p} - \int\_{\Omega} \Phi^{-p}(\mathbf{x}^{-p}, T) \rho^{-p}(\mathbf{x}^{-p}, T) \mathrm{d}\mathbf{x}^{-p} \right\}. \end{split} \tag{20}$$

Here, our derivation path follows that of [44–46]. The formulation (20) is the cornerstone of our approach.

#### *3.2. ECA-Net for Cooperative Attack–Defense Evolution*

We solved (20) by training a GAN-based neural network. The solving network of the multi-population MFG in the 3D obstacle scene is an extended nonlinear coupled alternating neural network formed by multiple generators and multiple discriminators, named ECA-Net. We coupled the obstacle avoidance loss term in the design of the loss function of the ECA-Net algorithm. The structure and training process of ECA-Net are shown in Figure 2.

**Figure 2.** Visualization of the structure and training process of ECA-Net. Its training process is divided into two pairs of coupled alternating training parts—generators and discriminators.

First, we initialized two pairs of neural networks $N\_{\omega^p}^p(\mathbf{x}^p, t)$ and $N\_{\theta^p}^p(\mathbf{y}^p, t)$, $p = 1, 2$. Let

$$\Phi^p_{\omega^p}(\mathbf{x}^p, t) = (1-t)\,N^p_{\omega^p}(\mathbf{x}^p, t) + t\,G^p(\mathbf{x}^p), \qquad \mathbf{G}^p_{\theta^p}(\mathbf{y}^p, t) = (1-t)\,\mathbf{y}^p + t\,N^p_{\theta^p}(\mathbf{y}^p, t), \tag{21}$$

where $\mathbf{y}^p \sim \rho^p_0$ is a sample drawn from the initial distribution. The formulation of $\Phi^p_{\omega^p}$ (the value function for the generic agent of population $p$) and $\mathbf{G}^p_{\theta^p}$ (the generator producing the density distribution of population $p$) in (21) automatically encodes the terminal condition $G^p(\cdot, T)$ and the initial distribution $\rho^p_0(\cdot)$, respectively. Moreover, ECA-Net encodes the underlying structure of the MPHD-MFG through (20) and (21), exempting the neural network from learning the entire game solution from scratch.
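To make the boundary-condition encoding of (21) concrete, here is a minimal numerical sketch. The toy functions standing in for the networks $N^p_{\omega^p}$, $N^p_{\theta^p}$ and the terminal cost $G^p$ are our own illustrative assumptions, and the horizon is normalized to $T = 1$ (as the $(1-t)$/$t$ weights in (21) implicitly do):

```python
import numpy as np

def N_omega(x, t):                 # hypothetical stand-in for the value network N^p_omega
    return np.sin(x).sum() + t

def G_terminal(x):                 # hypothetical stand-in for the terminal cost G^p
    return (x ** 2).sum()

def Phi(x, t):                     # Phi^p = (1 - t) N + t G^p, so Phi(x, 1) = G^p(x)
    return (1 - t) * N_omega(x, t) + t * G_terminal(x)

def N_theta(y, t):                 # hypothetical stand-in for the generator network N^p_theta
    return y + 0.1 * t

def G_gen(y, t):                   # G^p_theta = (1 - t) y + t N_theta, so G(y, 0) = y
    return (1 - t) * y + t * N_theta(y, t)
```

Regardless of the network weights, `G_gen(y, 0)` returns `y` unchanged and `Phi(x, 1)` equals the terminal cost, which is exactly the hard-constraint encoding described in the text.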

Our approach for training this neural network consists of the parallel, alternating training of the two pairs $\mathbf{G}^p_{\theta^p}$ and $\Phi^p_{\omega^p}$, $p = 1, 2$. Intuitively, to reach the equilibrium of the MPHD-MFG of the cooperative attack–defense problem, we trained an extended coupled alternating neural network (ECA-Net) over the multi-group distributions and agent controls. Specifically, we trained $\Phi^p_{\omega^p}$ by first sampling a batch $\{\mathbf{y}^p_b\}_{b=1}^{B}$ from the given initial distribution $\rho^p_0$, another batch $\{\mathbf{y}^{-p}_b\}_{b=1}^{B}$ from the given initial density $\rho^{-p}_0$, and $\{t_b\}_{b=1}^{B}$ uniformly from $[0, T]$. Then, we computed the push-forward states $\mathbf{x}^p_b = \mathbf{G}^p_{\theta^p}(\mathbf{y}^p_b, t_b)$ and $\mathbf{x}^{-p}_b = \mathbf{G}^{-p}_{\theta^{-p}}(\mathbf{y}^{-p}_b, t_b)$ for $b = 1, \ldots, B$. The main loss term for training the discriminator $\Phi^p_{\omega^p}$ is given by

$$\text{loss}_{\Phi^p} = \frac{1}{B}\sum_{b=1}^{B} \Phi^p_{\omega^p}(\mathbf{x}^p_b, 0) + \frac{1}{B}\sum_{b=1}^{B} \left\{ \partial_t \Phi^p_{\omega^p}(\mathbf{x}^p_b, t_b) + \frac{\sigma^{p2}}{2}\,\Delta\Phi^p_{\omega^p}(\mathbf{x}^p_b, t_b) - H_p\big(\nabla_{\mathbf{x}^p}\Phi^p_{\omega^p}(\mathbf{x}^p_b, t_b)\big) \right\}. \tag{22}$$

We then add a dual term for the distribution of the neighboring population:

$$\text{penalty}_{\text{neighbours}} = \eta \left\{ \frac{1}{B} \sum_{b=1}^{B} \left[ \partial_t \Phi^{-p}_{\omega^{-p}}(\mathbf{x}^{-p}_b, t_b) + \frac{\sigma^{-p2}}{2}\,\Delta\Phi^{-p}_{\omega^{-p}}(\mathbf{x}^{-p}_b, t_b) + \nabla_{\mathbf{x}^{-p}}\Phi^{-p}_{\omega^{-p}}(\mathbf{x}^{-p}_b, t_b) \cdot \mathbf{h}^{-p}(\mathbf{x}^{-p}_b, \mathbf{x}^p_b, t_b) \right] \right\} \tag{23}$$

to correct the density update of the other population (which should obey the FPK equation) [36]. We can optionally add a regularization term:

$$\text{penalty}_{\text{HJB}} = \lambda \left\{ \frac{1}{B} \sum_{b=1}^{B} \left\| \partial_t \Phi^p_{\omega^p}(\mathbf{x}^p_b, t_b) + \frac{\sigma^{p2}}{2}\,\Delta\Phi^p_{\omega^p}(\mathbf{x}^p_b, t_b) - H_p\big(\nabla_{\mathbf{x}^p}\Phi^p_{\omega^p}(\mathbf{x}^p_b, t_b)\big) + F^p(\mathbf{x}^p_b, \mathbf{x}^{-p}_b, t_b) \right\| \right\} \tag{24}$$

to penalize deviations from the HJB equations. Finally, we backpropagated the total loss $\ell_{\text{total}}$ to update the weights of the discriminator $\Phi^p_{\omega^p}$.
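As a self-contained illustration of how the Monte Carlo loss (22) is assembled, the sketch below replaces the network $\Phi^p_{\omega^p}$ with a hypothetical closed-form function and uses finite differences in place of the automatic differentiation that ECA-Net would use; the quadratic Hamiltonian is likewise an assumption for illustration:

```python
import numpy as np

rng = np.random.default_rng(0)

def Phi(x, t):                       # hypothetical closed-form stand-in for Phi^p_omega
    return 0.5 * (x ** 2).sum() + t

def hamiltonian(p):                  # illustrative quadratic Hamiltonian H_p
    return 0.5 * (p ** 2).sum()

def hjb_operator(x, t, sigma=0.2, h=1e-4):
    # dPhi/dt + (sigma^2 / 2) * Laplacian(Phi) - H(grad Phi), via finite differences
    d = x.size
    dphi_dt = (Phi(x, t + h) - Phi(x, t - h)) / (2 * h)
    grad = np.array([(Phi(x + h * np.eye(d)[i], t)
                      - Phi(x - h * np.eye(d)[i], t)) / (2 * h) for i in range(d)])
    lap = sum((Phi(x + h * np.eye(d)[i], t) - 2 * Phi(x, t)
               + Phi(x - h * np.eye(d)[i], t)) / h ** 2 for i in range(d))
    return dphi_dt + 0.5 * sigma ** 2 * lap - hamiltonian(grad)

# loss_Phi in (22): mean of Phi(x_b, 0) plus the mean HJB operator over the batch
B = 8
xs = rng.normal(size=(B, 3))
ts = rng.uniform(0.0, 1.0, size=B)
loss_phi = np.mean([Phi(x, 0.0) for x in xs]) \
         + np.mean([hjb_operator(x, t) for x, t in zip(xs, ts)])
```

In ECA-Net itself, the time derivative, gradient, and Laplacian would come from automatic differentiation of the network rather than finite differences.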

To train the generator, we again sampled $\{\mathbf{y}^p_b\}_{b=1}^{B}$, $\{\mathbf{y}^{-p}_b\}_{b=1}^{B}$, and $\{t_b\}_{b=1}^{B}$, $p = 1, 2$, as before and computed

$$\begin{aligned}
\text{loss}_{\mathbf{G}^p} = \frac{1}{B}\sum_{b=1}^{B} \Big\{ & \partial_t \Phi^p_{\omega^p}\big(\mathbf{G}^p_{\theta^p}(\mathbf{y}^p_b, t_b), t_b\big) + \frac{\sigma^{p2}}{2}\,\Delta\Phi^p_{\omega^p}\big(\mathbf{G}^p_{\theta^p}(\mathbf{y}^p_b, t_b), t_b\big) - H_p\big(\nabla_{\mathbf{x}^p}\Phi^p_{\omega^p}(\mathbf{G}^p_{\theta^p}(\mathbf{y}^p_b, t_b), t_b)\big) \\
& + F^p\big(\mathbf{G}^p_{\theta^p}(\mathbf{y}^p_b, t_b), \mathbf{G}^{-p}_{\theta^{-p}}(\mathbf{y}^{-p}_b, t_b), t_b\big) \Big\}.
\end{aligned} \tag{25}$$

Finally, we backpropagated the total loss $\zeta_{\text{total}}$ to update the weights of the generator $\mathbf{G}^p_{\theta^p}$.

In conclusion, in each time slot $t \in [0, T]$, the ECA-Net is trained: the generator $\mathbf{G}^p_{\theta^p}$ generates the state distribution of population $p$ at time $t$, and the discriminator $\Phi^p_{\omega^p}$ produces the value function of population $p$ at time $t$, $p = 1, 2$. Please refer to Algorithm 1 for the detailed operation flow.

**Algorithm 1** ECA-Net for cooperative attack–defense evolution.

**Require:** *σ<sup>p</sup>* diffusion parameter, *G<sup>p</sup>* terminal cost, *Hp* Hamiltonian, *F<sup>p</sup>* interaction term, *p* = 1, 2.

**Require:** Initialize neural networks $N^p_{\omega^p}$ and $N^p_{\theta^p}$; batch size $B$.

**Require:** Set $\Phi^p_{\omega^p}$ and $\mathbf{G}^p_{\theta^p}$ as in (21).

**While** not converged, **do**

**train** $\Phi^p_{\omega^p}$:

Sample batches $\{(\mathbf{y}^p_b, t_b)\}_{b=1}^{B}$ and $\{(\mathbf{y}^{-p}_b, t_b)\}_{b=1}^{B}$, where $\mathbf{y}^p_b \sim \rho^p_0$, $\mathbf{y}^{-p}_b \sim \rho^{-p}_0$, and $t_b \sim \mathrm{Unif}(0, T)$.

$\mathbf{x}^p_b \leftarrow \mathbf{G}^p_{\theta^p}(\mathbf{y}^p_b, t_b)$, $\mathbf{x}^{-p}_b \leftarrow \mathbf{G}^{-p}_{\theta^{-p}}(\mathbf{y}^{-p}_b, t_b)$ for $b = 1, \ldots, B$.

$\ell^p_0 \leftarrow \frac{1}{B}\sum_{b=1}^{B} \Phi^p_{\omega^p}(\mathbf{x}^p_b, 0)$

$\ell^p_t \leftarrow \frac{1}{B}\sum_{b=1}^{B} \big[\partial_t \Phi^p_{\omega^p}(\mathbf{x}^p_b, t_b) + \frac{\sigma^{p2}}{2}\,\Delta\Phi^p_{\omega^p}(\mathbf{x}^p_b, t_b) - H_p\big(\nabla_{\mathbf{x}^p}\Phi^p_{\omega^p}(\mathbf{x}^p_b, t_b)\big)\big]$

$\ell^p_n \leftarrow \eta\,\frac{1}{B}\sum_{b=1}^{B} \big[\partial_t \Phi^{-p}_{\omega^{-p}}(\mathbf{x}^{-p}_b, t_b) + \frac{\sigma^{-p2}}{2}\,\Delta\Phi^{-p}_{\omega^{-p}}(\mathbf{x}^{-p}_b, t_b) + \nabla_{\mathbf{x}^{-p}}\Phi^{-p}_{\omega^{-p}}(\mathbf{x}^{-p}_b, t_b) \cdot \mathbf{h}^{-p}(\mathbf{x}^{-p}_b, \mathbf{x}^p_b, t_b)\big]$

$\ell^p_{\text{HJB}} \leftarrow \lambda\,\frac{1}{B}\sum_{b=1}^{B} \big\|\partial_t \Phi^p_{\omega^p}(\mathbf{x}^p_b, t_b) + \frac{\sigma^{p2}}{2}\,\Delta\Phi^p_{\omega^p}(\mathbf{x}^p_b, t_b) - H_p\big(\nabla_{\mathbf{x}^p}\Phi^p_{\omega^p}(\mathbf{x}^p_b, t_b)\big) + F^p(\mathbf{x}^p_b, \mathbf{x}^{-p}_b, t_b)\big\|$

$\ell^p_{\text{total}} \leftarrow \ell^p_0 + \ell^p_t + \ell^p_n + \ell^p_{\text{HJB}}$

Backpropagate the total loss $\ell_{\text{total}} = \sum_p \ell^p_{\text{total}}$ to the weights $\omega = (\omega^p)_{p=1,2}$.

**train** $\mathbf{G}^p_{\theta^p}$:

Sample batches $\{(\mathbf{y}^p_b, t_b)\}_{b=1}^{B}$ and $\{(\mathbf{y}^{-p}_b, t_b)\}_{b=1}^{B}$, where $\mathbf{y}^p_b \sim \rho^p_0$, $\mathbf{y}^{-p}_b \sim \rho^{-p}_0$, and $t_b \sim \mathrm{Unif}(0, T)$.

$\zeta^p_t \leftarrow \frac{1}{B}\sum_{b=1}^{B} \big[\partial_t \Phi^p_{\omega^p}\big(\mathbf{G}^p_{\theta^p}(\mathbf{y}^p_b, t_b), t_b\big) + \frac{\sigma^{p2}}{2}\,\Delta\Phi^p_{\omega^p}\big(\mathbf{G}^p_{\theta^p}(\mathbf{y}^p_b, t_b), t_b\big) - H_p\big(\nabla_{\mathbf{x}^p}\Phi^p_{\omega^p}(\mathbf{G}^p_{\theta^p}(\mathbf{y}^p_b, t_b), t_b)\big) + F^p\big(\mathbf{G}^p_{\theta^p}(\mathbf{y}^p_b, t_b), \mathbf{G}^{-p}_{\theta^{-p}}(\mathbf{y}^{-p}_b, t_b), t_b\big)\big]$

Backpropagate the total loss $\zeta_{\text{total}} = \sum_p \zeta^p_t$ to the weights $\theta = (\theta^p)_{p=1,2}$.

**end while**
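The alternating "train $\Phi$ / train $\mathbf{G}$" loop in Algorithm 1 is, at its core, a saddle-point iteration. The toy sketch below uses a hypothetical convex–concave objective (not the MFG loss itself) to show the pattern: ascent steps on the discriminator variable alternate with descent steps on the generator variable until both settle at the equilibrium.

```python
# Toy saddle objective L(theta, omega) = theta**2/2 - omega**2/2 + theta*omega,
# a hypothetical stand-in for the coupled generator/discriminator losses.

def grad_theta(theta, omega):      # gradient of L with respect to theta
    return theta + omega

def grad_omega(theta, omega):      # gradient of L with respect to omega
    return theta - omega

theta, omega, lr = 1.0, -1.0, 0.1
for _ in range(500):
    omega += lr * grad_omega(theta, omega)   # "train Phi": gradient ascent step
    theta -= lr * grad_theta(theta, omega)   # "train G": gradient descent step
# both variables spiral in to the unique saddle point at the origin
```

For this objective, the alternating update is a linear map with spectral radius below one, so the iterates contract to the saddle point; in ECA-Net the same alternation is performed with ADAM steps on the network weights.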

#### **4. Simulation Results**

In this section, we carry out an integrative experiment based on Algorithm 1 and demonstrate the feasibility and effectiveness of our approach via the following numerical simulation results. To demonstrate the advantages of our approach, we finally compare its performance with that of baseline methods.

#### *4.1. Experimental Setup*

For the attack–defense obstacle scenario in Figure 3, the initial densities of the two populations are discrete uniform distributions $\rho^p_0$, $p = 1, 2$. The initial spatial coordinates $(x, y, z)$ of the blue and red sides are located in the cuboid regions $((-5, 5), (-9, -7), (-9, -7))$ and $((-5, 5), (7, 9), (-9, -7))$, respectively. All other initial state components, i.e., the initial angular positions, velocities, and angular velocities, were set to zero. The coordinates of the terminal line were set to $(\cdot, 8, 8)$. Two obstacles were placed between the attacking and defending groups. The obstacles were represented by cuboids, specified by the coordinate ranges of their vertices: the first obstacle was placed at $((-3, -1), (-2, 2), (-10, 10))$, and the second at $((1, 3), (-2, 2), (-10, 10))$.
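The initial-condition setup above can be sketched in a few lines of numpy (the function names are our own illustration, not the paper's code):

```python
import numpy as np

rng = np.random.default_rng(42)

def sample_cuboid(n, xr, yr, zr, rng):
    # Uniform samples of (x, y, z) inside an axis-aligned cuboid
    return np.column_stack([rng.uniform(*xr, n),
                            rng.uniform(*yr, n),
                            rng.uniform(*zr, n)])

blue0 = sample_cuboid(100, (-5, 5), (-9, -7), (-9, -7), rng)   # blue initial region
red0  = sample_cuboid(100, (-5, 5), (7, 9), (-9, -7), rng)     # red initial region

def in_obstacle(p):
    # The two cuboid obstacles from the experimental setup
    obs = [((-3, -1), (-2, 2), (-10, 10)), ((1, 3), (-2, 2), (-10, 10))]
    return any(all(lo <= c <= hi for c, (lo, hi) in zip(p, box)) for box in obs)
```

A collision-avoidance cost can then simply query `in_obstacle` (or a smoothed distance to the boxes) along sampled trajectories.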

**Figure 3.** The 3D attack and defense experimental scenario.

We experimented in this high-dimensional scene, where the dynamics were those of quadrotor craft. The dynamic–kinematic equation of the generic quadrotor craft $i$ of population $p$, $p = 1, 2$, is given by

$$\begin{cases}
\dot{x}_p = v_{x_p}, \quad \dot{y}_p = v_{y_p}, \quad \dot{z}_p = v_{z_p}, \\
\dot{\psi}_p = v_{\psi_p}, \quad \dot{\theta}_p = v_{\theta_p}, \quad \dot{\phi}_p = v_{\phi_p}, \\
\dot{v}_{x_p} = \dfrac{u_p}{m_p}\big(\sin(\phi_p)\sin(\psi_p) + \cos(\phi_p)\cos(\psi_p)\sin(\theta_p)\big), \\
\dot{v}_{y_p} = \dfrac{u_p}{m_p}\big(-\cos(\psi_p)\sin(\phi_p) + \cos(\phi_p)\sin(\theta_p)\sin(\psi_p)\big), \\
\dot{v}_{z_p} = \dfrac{u_p}{m_p}\cos(\theta_p)\cos(\phi_p) - g, \\
\dot{v}_{\psi_p} = \tilde{\tau}_{\psi_p}, \quad \dot{v}_{\theta_p} = \tilde{\tau}_{\theta_p}, \quad \dot{v}_{\phi_p} = \tilde{\tau}_{\phi_p},
\end{cases} \tag{26}$$

which we compactly denote as $\dot{\mathbf{x}}^p = \mathbf{h}^p(\mathbf{x}^p, \mathbf{u}^p)$, where $\mathbf{h}^p$ is the 12-dimensional vector function on the right-hand side of (26), $\mathbf{x}^p = [x_p, y_p, z_p, \psi_p, \theta_p, \phi_p, v_{x_p}, v_{y_p}, v_{z_p}, v_{\psi_p}, v_{\theta_p}, v_{\phi_p}] \in \mathbb{R}^{12}$ is the state with velocities $\mathbf{v}^p = [v_{x_p}, v_{y_p}, v_{z_p}, v_{\psi_p}, v_{\theta_p}, v_{\phi_p}] \in \mathbb{R}^{6}$, and $\mathbf{u}^p = [u_p, \tilde{\tau}_{\psi_p}, \tilde{\tau}_{\theta_p}, \tilde{\tau}_{\phi_p}] \in \mathbb{R}^{4}$ is the control. In the stochastic case, we added a noise term to the dynamics: $\mathrm{d}\mathbf{x}^p = \mathbf{h}^p(\mathbf{x}^p, \mathbf{u}^p)\,\mathrm{d}t + \sigma^p\,\mathrm{d}\mathbf{W}^p_t$, where $\mathbf{W}^p_t$ is a standard Brownian motion, meaning the quadcopter suffers from noisy measurements. The cost functions of the blue side and the red side are given in Section 2.2.
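A minimal sketch of the drift $\mathbf{h}^p$ in (26) and one Euler–Maruyama step of the noisy dynamics follows; the unit mass, the time step, and the function names are illustrative assumptions:

```python
import numpy as np

def h_p(x, u, m=1.0, g=9.81):
    """12-dimensional quadrotor drift as in (26).
    x = [pos (3), angles psi/theta/phi (3), linear vel (3), angular vel (3)],
    u = [thrust u_p, torques tau_psi, tau_theta, tau_phi]."""
    psi, theta, phi = x[3], x[4], x[5]
    v = x[6:12]                              # time derivatives of positions and angles
    thrust, tau = u[0], u[1:4]
    acc = np.array([
        thrust / m * (np.sin(phi) * np.sin(psi)
                      + np.cos(phi) * np.cos(psi) * np.sin(theta)),
        thrust / m * (-np.cos(psi) * np.sin(phi)
                      + np.cos(phi) * np.sin(theta) * np.sin(psi)),
        thrust / m * (np.cos(theta) * np.cos(phi)) - g,
    ])
    return np.concatenate([v, acc, tau])

def em_step(x, u, dt, sigma, rng):
    # Euler-Maruyama step for dx = h_p(x, u) dt + sigma dW
    return x + h_p(x, u) * dt + sigma * np.sqrt(dt) * rng.normal(size=x.size)
```

As a sanity check, at level attitude a thrust of $m_p g$ exactly cancels gravity, so the drift vanishes (hover).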

For the model hyperparameters, we set $c_1 = 0.5$ (in (3)), $c_2 = 1$, $c_3 = 20$, $c_4 = 5$ (in (4)), $c_5 = 5$ (in (8)), $l_1 = 0.5$ (in (10)), $l_2 = 1$, $l_3 = 20$, and $l_4 = 5$ (in (11)). For ECA-Net, both networks have three linear hidden layers with 100 hidden units in each layer. Residual neural networks (ResNets) were used for both networks, with a skip-connection weight of 0.5. The tanh activation function was used in $\Phi^p_{\omega^p}$, while the ReLU activation function was used in $\mathbf{G}^p_{\theta^p}$. For training, we used ADAM with $\beta = (0.5, 0.9)$, a learning rate of $2 \times 10^{-4}$ for $\Phi^p_{\omega^p}$, a learning rate of $5 \times 10^{-5}$ for $\mathbf{G}^p_{\theta^p}$, a weight decay of $1 \times 10^{-4}$ for both networks, a batch size of 50, $\lambda = 2$ (the HJB penalty coefficient), and $\eta = 2.5 \times 10^{-3}$ (the neighbours' penalty coefficient) in Algorithm 1. As is standard in machine learning, all the plots in Section 4.3 and Appendices A.1 and A.2 were generated using validation data (data not used in training) to verify the generalization ability of ECA-Net.
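The network architecture just described can be sketched structurally as follows; the scalar read-out head and the 13-dimensional input (12 state dimensions plus time) are our assumptions, not details from the paper:

```python
import numpy as np

rng = np.random.default_rng(1)

def make_weights(d_in, width=100, depth=3, rng=rng):
    # One lifting layer plus `depth` residual layers of the given width
    shapes = [(d_in, width)] + [(width, width)] * depth
    return [rng.normal(scale=0.1, size=s) for s in shapes]

def resnet_scalar(x, weights, skip=0.5, act=np.tanh):
    """Structural sketch of the Phi networks: three hidden layers of width
    100, tanh activations, residual connections with skip weight 0.5."""
    h = act(x @ weights[0])                  # lift the input to the hidden width
    for W in weights[1:]:
        h = skip * h + act(h @ W)            # residual update with skip weight 0.5
    return h.sum(axis=-1)                    # toy scalar output per sample

weights = make_weights(13)                   # assumed input: 12 state dims + time
out = resnet_scalar(np.zeros((5, 13)), weights)
```

The generator networks would use the same skeleton with ReLU activations and a 12-dimensional output head in place of the scalar read-out.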

#### *4.2. Convergence Analysis*

The convergence of the MPHD-MFG method and the ECA-Net algorithm can be observed by checking the convergence of the HJB residual errors, together with the convergence of the total loss. In Figure 4a, we plot the HJB residual errors of the red and blue sides, i.e., $\ell^p_{\text{HJB}}$ in Algorithm 1, which measure the deviation from the objective function (13) and show the convergence of the theoretical model, the MPHD-MFG. Without an efficient control strategy, the HJB residuals under different stochasticity parameters ($\nu = \frac{\sigma^2}{2} = 0, 0.04, 0.08$) were relatively high. The HJB residuals dropped quickly after we applied a series of controls. After around $2 \times 10^5$ iterations, the error curves became bounded and stable, at which point we had obtained the optimal control for the drones. In Figure 4b, we plot the total loss of the red and blue sides, i.e., $\ell^p_{\text{total}}$ in Algorithm 1. After around $2 \times 10^5$ iterations, the total loss curves under the different stochasticity parameters also became bounded and stable, which means that the weight updates of the neural network were complete, demonstrating the good convergence of the solution algorithm, ECA-Net.

**Figure 4.** Convergence analysis. (**a**) Convergence of HJB residual; (**b**) Convergence of total loss.

#### *4.3. Performance Analysis*

We set the same number of agents for the blue and red sides, $N_1 = N_2 = 100$. We then used the trained model to predict the trajectories of the red and blue sides. Assuming that the blue side is not aggressive and the red side is aggressive, if a blue UAV is surrounded by two or more red UAVs within the capture radius $e_r$, the blue individual is captured. We show the evolutionary trajectories of the red and blue sides under different stochasticity parameters ($\nu = \frac{\sigma^2}{2} = 0, 0.04, 0.08$) and different capture radii ($e_r = 0.7, 0.9, 1.1$) in Figure 5. With the stochasticity parameter $\nu$ fixed, observing Figure 5a–c over time $t$, the UAVs successfully avoided obstacles while minimizing their energy consumption, and the smaller the capture radius of the red side, the higher the survival rate of the blue side. At the same capture radius $e_r$ and the same moment $t$, the larger the stochasticity parameter, the more scattered the distribution of the UAVs, which increases the difficulty for the red side of completely capturing the blue side. For example, in the case of $e_r = 1.1$, comparing the first row of Figure 5a–c, the larger $\nu$ is, the higher the survival rate of the blue side. In particular, Figure 5c shows the diversion phenomenon: at $\nu = 0.08$, at the third moment, the UAVs fly through the obstacles and, to avoid colliding with them, choose three different paths, i.e., a diversion occurs, which demonstrates the adaptive nature of the UAVs. In addition, the corresponding 3D run diagram of Figure 5 is given in Appendix A.1. Appendix A.2 shows and analyzes the offensive and defensive effects of the UAV swarms in the asymmetric case.
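The capture rule used here (a blue UAV is captured when two or more red UAVs lie within $e_r$) can be written as a short vectorized check; this is our illustrative sketch, not the paper's implementation:

```python
import numpy as np

def captured(blue_pos, red_pos, er):
    # Pairwise blue-red distances, shape (N1, N2); a blue UAV is captured
    # when at least two red UAVs are within the capture radius er.
    d = np.linalg.norm(blue_pos[:, None, :] - red_pos[None, :, :], axis=-1)
    return (d <= er).sum(axis=1) >= 2

blue = np.array([[0.0, 0.0, 0.0], [5.0, 5.0, 5.0]])
red = np.array([[0.5, 0.0, 0.0], [0.0, 0.5, 0.0], [9.0, 9.0, 9.0]])
flags = captured(blue, red, er=1.1)   # only the first blue UAV is encircled
```

Evaluating this predicate along the predicted trajectories yields the survival-rate comparison discussed above.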


**Figure 5.** Vertical view of the large-scale UAV attack and defense behaviors under different stochasticity parameters and different capture radii. (**a**) Snapshots of the distributions of the UAVs' locations under different capture radii when *ν* = 0; (**b**) Snapshots of the distributions of the UAVs' locations under different capture radii when *ν* = 0.04; (**c**) Snapshots of the distributions of the UAVs' locations under different capture radii when *ν* = 0.08.

#### *4.4. Comparison with Baselines*

We verified the advancement of our approach by comparing its performance with that of several typical baseline methods for attack–defense games in Table 1. From Table 1, it can be seen that our approach handled the most complex application scene while simplifying large-scale communication. For more details on the baseline methods, please refer to References [7–9,37], etc.


**Table 1.** Comparison with baseline methods for attack–defense games.

<sup>1</sup> Scene complexity was measured with a scoring system: [2D, 3D] = [1, 2]; [blank scene, obstacle scene] = [1, 2]; [small scale, large scale] = [1, 2]. We accumulated the scores of each method's experimental scene item by item and then normalized them. <sup>2</sup> $O(\cdot)$ denotes an infinitesimal of the same order.

#### **5. Conclusions**

In this paper, we formulated the cooperative attack–defense evolution of large-scale agents in high-dimensional environments as a multi-population high-dimensional stochastic mean-field game (MPHD-MFG), which significantly reduces the communication frequency and computational complexity of the swarm intelligence system. We then tractably solved the MPHD-MFG with a generative-adversarial-network (GAN)-based method using the MFGs' underlying variational primal–dual structure. Based on our approach, we conducted a comprehensive experiment. The good convergence of the MPHD-MFG method and the ECA-Net algorithm was corroborated by the bounded, stable convergence of the HJB residual errors and the total loss. Through simulations, we saw that a large number of UAVs can avoid obstacles (even exhibiting diversions) and smoothly evolve their attack and defense behaviors while minimizing their energy consumption. The comparison with the baseline methods showed the advantages of our approach. In the future, we will consider in-depth research on the asymmetric case of 3D obstacle scenarios, for example, the evolution of cooperative attack and defense among multiple (three or more) large-scale swarms, and the evolution of cooperative attack and defense under individual performance differences (speed, acceleration, turning range) between attackers and defenders.

**Author Contributions:** Conceptualization, G.W. and Z.L.; Formal analysis, G.W.; Investigation, Z.L.; Methodology, G.W. and W.Y.; Software, G.W.; Supervision, W.Y. and S.X.; Validation, W.Y. and S.X.; Visualization, G.W.; Writing—original draft, G.W. and Z.L.; Writing—review and editing, W.Y. and S.X. All authors have read and agreed to the published version of the manuscript.

**Funding:** This work was supported by the Science and Technology Innovation 2030-Key Project under Grant 2020AAA0108200.

**Institutional Review Board Statement:** Not applicable.

**Informed Consent Statement:** Not applicable.

**Data Availability Statement:** Not applicable.

**Acknowledgments:** We thank Zhiming Zheng for his fruitful discussions, valuable opinions, and guidance.

**Conflicts of Interest:** The authors declare no conflict of interest.

#### **Abbreviations**

The following abbreviations are used in this manuscript:


#### **Appendix A. The 3D Renderings of Numerical Results and More Experiments**

*Appendix A.1. The 3D Run Diagram Figure A1 about Figure 5*

Here, we give the 3D experimental run diagram, Figure A1, corresponding to the vertical view in Figure 5, for reference. In the following diagrams, time is represented by color: purple represents the starting time, red represents the final time, and intermediate colors represent intermediate times.


**Figure A1.** The 3D run diagram of large-scale UAV attack and defense behaviors under different stochasticity parameters and different capture radii.

#### *Appendix A.2. Asymmetric Case Study*

Here, we set different numbers for the blue and red sides: $N_1 = 100$, $N_2 = 60$ or $N_1 = 60$, $N_2 = 100$. Let $\nu = 0.08$ and $e_r = 1.1$, with the capture conditions consistent with Section 4.3. We then predicted the trajectories, as shown in Figures A2 and A3. With the stochasticity parameter and the capture radius fixed at these values, this section shows the deduced diversion, obstacle-avoidance, and attack–defense behaviors of the red and blue sides with asymmetric numbers.

**Figure A2.** Vertical view of the asymmetric case of the large-scale UAV attack and defense behaviors under $\nu = 0.08$, $e_r = 1.1$.

**Figure A3.** The 3D run diagram of the asymmetric case of the large-scale UAV attack and defense behaviors under $\nu = 0.08$, $e_r = 1.1$. (**a**) Blue side vs. red side: 100 vs. 60; (**b**) Blue side vs. red side: 60 vs. 100.

#### **References**

