1. Introduction
In recent years, the pursuit-evasion (PE) problem has attracted great attention because of its wide application in competitive games, optimization of IoT resources, and military attacks [1,2,3,4]. However, because the pursuing and evading sides confront each other in real time, traditional unilateral control theory cannot solve the problem accurately [5]. Although existing algorithms can solve the differential game problem in many scenarios, an offline algorithm cannot respond in real time to the agents’ information in a PE game with strong real-time requirements. Thus, this paper focuses on the online PE game problem and solves for the agents’ policies using the concept of adaptive dynamic programming.
The core of solving the PE game problem is to obtain the control policy of each agent on both sides of the game. Isaacs [6] introduced modern control theory into game theory and established differential game theory. Thereafter, as a branch of the differential game, the PE game of agents has attracted much attention. With the continuous development of aerospace technology and the launch of man-made satellites, game problems involving continuous confrontation between two sides, and even among multiple players, urgently need to be solved [7,8]. Friedman [9] proved the existence of saddle points in differential games, which makes it possible to optimize the strategies of all agents in the PE problem. For the control problem in a linear differential game system, [10,11] discussed the control method for a cost function of quadratic form. Meanwhile, [12] discussed the uniqueness of the Nash equilibrium point, so that the analytical solution can be obtained for the classical differential game problem.
However, a general system may be more complex, and its analytical solution can be difficult to obtain. Therefore, scholars usually prefer numerical methods to analytic ones for problems with more complex agents [13], such as the pursuit-evasion problem of aircraft. Qiuhua et al. [14] and Pontani and Conway [15] studied optimal control strategies and solution methods for two-spacecraft pursuit-evasion problems via a multiple shooting method. Xu and Cai [16] used a genetic algorithm to find the Nash equilibrium of a game problem and obtained the controls of two aircraft. Recent research has also addressed multi-agent pursuit-evasion systems [17,18,19,20,21]. Thus, offline methods for the PE game problem are becoming increasingly sophisticated. However, a policy obtained offline cannot deal with online emergencies, such as a temporary change in the agents’ goals.
Solving PE game problems online is increasingly becoming a focus. Werbos et al. [22,23] designed actor-critic structures for implementing algorithms in real time, where the learning mechanisms of the structures are composed of policy evaluation and policy improvement. Bertsekas and Tsitsiklis [24] introduced reinforcement learning (RL) methods of different forms and compared the policy iteration (PI) and value iteration (VI) methods for discrete-time (DT) dynamic systems, which was an early application of the idea of RL to control systems. Werbos [25,26] developed an RL approach based on VI for feedback control of DT dynamic systems using value function approximation (VFA). The VFA method was shown to be suitable for finding the optimal control online for DT control problems. However, in actual scenarios, the pursuit-evasion problem is mostly formulated as the control and game of a continuous system. Vrabie [27] presented an adaptive dynamic programming (ADP) method that is useful to circumvent differential games and establish PI algorithms for continuous-time (CT) control problems. Noting that the system information may be incomplete, Vrabie and Lewis [28] considered different forms of systems and obtained online learning methods for optimal control with incomplete system information. The concept of adaptive dynamic programming was further extended to the field of differential games by Vrabie [29], and the synchronous tuning algorithm was used to achieve the Nash equilibrium. However, this approach requires complete system information about both sides of the game. Kartal et al. [30] used the synchronous tuning algorithm in the pursuit-evasion game of a first-order system to obtain the capture conditions of the agents and reach the Nash equilibrium. Zhang et al. [31] and Li et al. [32] established the scheme’s feasibility in distributed systems. However, in general differential game problems, the states of the agents are usually not used as direct control variables, so the system becomes more complex. Furthermore, solving the pursuit-evasion game of actual scenarios in real time without using the full information of the game systems has become an active research field.
This paper proposes a novel ADP method for solving the Nash equilibrium policies of two-player pursuit-evasion differential games online. The min-max principle is adopted to confirm the Nash equilibrium of the game. As the agents in the game can form an Internet of Things (IoT) system, the real-time control law of each agent is obtained by taking a linear-quadratic cost function in adaptive dynamic programming. To consider the scenario in which capture happens, we introduce a Lyapunov function. Since most actual systems are continuous, we use the policy iteration algorithm to make the real-time policy converge to the analytical solution of the Nash equilibrium. Moreover, we employ the value function approximation method to calculate the neural network parameters without solving the Hamilton–Jacobi–Isaacs (HJI) equation directly. The feasibility of the proposed method is demonstrated through simulation results from different scenarios of the pursuit-evasion game. This paper is inspired by recent research in various fields, such as motion coordination in wafer scanners [33], soil-structure interaction [34,35], driving fatigue feature detection [36], H∞ consensus for multiagent-based supply chain systems [37], and reliable and secure communications in wireless-powered NOMA systems [38]. These studies have contributed significantly to advancing real-time control and optimization methods in various applications.
The contributions of the paper are as follows:
- The min-max principle is used to find the analytical solution of Nash equilibrium, and the method’s stability is proven by establishing a Lyapunov function for obtaining the capture conditions of the game. 
- By constructing a form of adaptive dynamic programming, the policies of agents in each cycle are obtained through the PI method, and we prove that it converges to the Nash equilibrium. 
- To avoid the inconvenience of solving the HJI equation, we establish a set of functions to approximate the value function. As the neural network parameters converge, the agent’s solution in policy iteration is obtained. 
The rest of the paper is organized as follows. The dynamic model of the PE game is established in Section 2. Section 3 discusses the features of the Nash equilibrium and the capture conditions of the agents under different parameter settings. Section 4 presents the adaptive dynamic programming method, which consists of the PI method and the VFA algorithm; the agents’ policies are obtained without directly solving the Riccati equations of the PE game. Section 5 presents simulations of some practical problems. Section 6 concludes the paper and discusses the limitations of the research.
  2. Formulation of the Game
Consider a system of two agents that form a pursuer-evader pair. The pursuer tries to capture the evader, while the evader tries to escape capture.
The real-time pursuit-evasion game is a typical differential game problem. Here, the motion of each participant can be expressed as a pair of differential equations defined in a fixed coordinate system. The game with one pursuer and one evader is a typical zero-sum differential game, as the benefits of the two sides are mutually exclusive.
      
      where 
, 
, 
, and 
 are the state variables and control variables of the two players. The state variables contain the state information of the players and, depending on the game system, may include various physical quantities that describe the players’ motion. To facilitate the subsequent derivations, the state variables here must contain the position of each agent in every dimension. The control variables contain the inputs used to control the agents in each dimension.
In the PE game problem, the relative motion state of agents is very important. So, we let 
 be the difference in the states between the two agents:
The pursuer tries to reduce the distance of two agents, which is embedded in 
, while the evader tries to enlarge it. Substituting Equations (1) and (2) into Equation (3) and calculating its derivative with respect to time, we have:
To formulate a zero-sum pursuer-evader (PE) game, we construct a performance function in integral form:
      where 
 is a non-negative definite coefficient matrix. 
 and 
 are both positive definite matrices. In the integral function, 
 is the term that measures the relative state of the system (4) and is used to give limits to the distance between agents. 
 and 
 stand for the scales in consumption corresponding to the two agents, which are used for realizing the limitations of the controls.
The value function is given as follows when the agents execute certain policies:
If both pursuer and evader employ their optimal policies along the optimal paths, then the optimal value of the game can be obtained as:
In this paper, the goal is to find the control policy of each agent. The difficulty lies in finding the numerical solution of each agent’s policy, for which the steps of policy iteration and the selection of an appropriate value function approximation are very important. In reinforcement learning, the policy is obtained through iterative steps. For a continuous system, we adopt adaptive dynamic programming to solve for the agents’ policies. This makes the value function of this paper different from the terminal-value performance index constructed by Jagat and Sinclair [2], because the terminal-value performance index does not allow the optimal strategy to be solved iteratively. Moreover, the performance index of this paper has a quadratic structure, which pays more attention to the intermediate process of the game, improves the real-time competitiveness of both sides, and facilitates the development of the policy iteration algorithm stated in Section 4. The distance between the two agents is regarded as the tracking error, which means that both pursuer and evader optimize their policies throughout the game. This is not only in line with the actual situation but also convenient for solving the problem.
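For readers who want a concrete instance of the formulation in this section, the following is a minimal linear-quadratic sketch. The symbols $x_p$, $x_e$, $u_p$, $u_e$, $\delta$, $A$, $B$, $Q$, $R_p$, $R_e$ and the sign conventions are assumptions chosen for illustration. They follow a common convention for zero-sum pursuit-evasion games and are not copied from Equations (1)–(7) of this paper.

```latex
\begin{align}
\dot{x}_p &= A x_p + B u_p, & \dot{x}_e &= A x_e + B u_e, \\
\delta &= x_p - x_e, & \dot{\delta} &= A\delta + B\,(u_p - u_e), \\
J &= \int_0^{\infty} \left( \delta^{\top} Q \delta + u_p^{\top} R_p u_p - u_e^{\top} R_e u_e \right) \mathrm{d}t, \\
V^{*}(\delta_0) &= \min_{u_p} \max_{u_e} J .
\end{align}
```

Under these assumptions, the pursuer’s control effort enters the cost with a positive sign and the evader’s with a negative sign, so the pursuer minimizes and the evader maximizes the same functional, which is what makes the game zero-sum.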
  3. Solution of the Pursuit-Evasion Game
In this section, we apply the minimax principle to the dynamic model of the PE game and obtain the analytic Nash equilibrium of the game. The cases in which capture occurs are studied and proven using the Lyapunov function approach.
The PE game of agents is regarded as a kind of differential game, which is solved based on bilateral optimal control theory. The optimal policies of the agents are obtained using the min-max principle. A differential game is a continuous game between players in continuous-time systems. Each agent tries to achieve its goal and maximize its benefit. The game ends with every participant adopting its Nash equilibrium policy. Using the minimax theorem, we can ensure that the agents’ policies are their corresponding optimal policies. When each agent adopts its optimal policy, the Nash equilibrium is achieved. The condition in which the optimal policies are adopted is called the saddle point.
In a two-player PE game problem, the optimal policy of the pursuer minimizes the Hamilton function, whereas that of the evader maximizes it. Therefore, there exists a pair of policies . When the pursuer adopts  and the evader adopts , the game reaches the Nash equilibrium. We call  the saddle point of the game.
The expressions in Equation (6) are the same as the Bellman equation of a zero-sum game. From Equations (1) and (2) and Leibniz’s formula, we have:
      where 
 is the Hamiltonian, 
 and 
 are admissible control policies of the pursuer and evader, respectively. 
 denotes 
.
We can obtain the optimal control of each agent according to the stationary condition:
.
Additionally, the second derivatives of the Hamiltonian with respect to 
 and 
 should satisfy:
The optimal controls of the agents are obtained as:
As the system is time-invariant over an infinite horizon, the solution of the problem is defined by Equations (13) and (14), in which the value 
 solves the following equation analytically:
When both agents adopt their optimal policies, known as the game-theoretic saddle point policies, the pursuit-evasion behavior between the two agents is a zero-sum game, and the game reaches the Nash equilibrium.
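As a numerical illustration of how saddle-point controls follow from the solution of the HJI equation, the sketch below treats an assumed linear-quadratic instance with relative dynamics `d(delta)/dt = A delta + B (u_p - u_e)` and cost weights `Q`, `R_p`, `R_e`. These symbols, the function names `solve_game_are` and `saddle_point_gains`, and the double-integrator example are assumptions made for illustration; they are not the paper’s Equations (13)–(15). In this assumed setting the HJI equation reduces to an algebraic Riccati equation, which is solved here through the stable invariant subspace of the associated Hamiltonian matrix.

```python
import numpy as np

def solve_game_are(A, B, Q, R_p, R_e):
    """Stabilizing P of A'P + PA + Q - P B (R_p^-1 - R_e^-1) B' P = 0 (assumed form)."""
    n = A.shape[0]
    S = B @ (np.linalg.inv(R_p) - np.linalg.inv(R_e)) @ B.T
    # Hamiltonian matrix associated with the Riccati equation
    H = np.block([[A, -S], [-Q, -A.T]])
    eigvals, eigvecs = np.linalg.eig(H)
    # Eigenvectors of the strictly stable eigenvalues span the stable subspace
    stable = eigvecs[:, eigvals.real < -1e-9]
    if stable.shape[1] != n:
        raise ValueError("no stabilizing solution found; check R_p^-1 - R_e^-1 > 0")
    X1, X2 = stable[:n, :], stable[n:, :]
    P = np.real(X2 @ np.linalg.inv(X1))
    return 0.5 * (P + P.T)  # remove rounding asymmetry

def saddle_point_gains(B, P, R_p, R_e):
    """Feedback gains so that u_p = -K_p @ delta and u_e = -K_e @ delta."""
    K_p = np.linalg.inv(R_p) @ B.T @ P
    K_e = np.linalg.inv(R_e) @ B.T @ P
    return K_p, K_e

# Hypothetical one-axis double integrator: delta = [relative position, relative velocity]
A = np.array([[0.0, 1.0], [0.0, 0.0]])
B = np.array([[0.0], [1.0]])
Q = np.diag([1.0, 0.0])      # only the relative position is penalized
R_p = np.array([[0.5]])      # pursuer control penalty (illustrative value)
R_e = np.array([[1.0]])      # evader control penalty (illustrative value)

P = solve_game_are(A, B, Q, R_p, R_e)
K_p, K_e = saddle_point_gains(B, P, R_p, R_e)
closed_loop = A - B @ (K_p - K_e)
```

The eigenvalues of `closed_loop` can then be inspected to see whether the relative state decays under the two feedback laws.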
Before proving that the game achieves the Nash equilibrium under the policies in Equations (13) and (14), we need a property of the Hamiltonian of the system, which is given in Lemma 1.
 Lemma 1. Suppose  satisfies the HJI Equation (15), which makes the Hamiltonian  equal to 0. Then, (8) transforms to:
 Proof of Lemma 1. Suppose 
 satisfies the HJI Equation (15), which makes the Hamiltonian 
 equal to 0. Then, (8) transforms to:
      
If the value function 
 reaches the optimal value, we have:
According to the HJI Equation (15), the Hamiltonian equals 0 when the value function reaches its optimal value, which completes the proof. □
 The transformation of the Hamiltonian demonstrated in Lemma 1 supports the proof of the Nash equilibrium in the following theorem.
 Theorem 1. Consider the dynamics of the agents in Equations (1) and (2) with the value function (6). Define
           as a positive definite solution of the HJI Equation (15). Then,
           and
           in Equations (13) and (14) are the Nash equilibrium policies of agents, and
           is the optimal value of the PE game.
  Proof of Theorem 1. Suppose  satisfies the HJI Equation (15), which makes the Hamiltonian  equal to 0. Then, (8) transforms to:
To prove that 
 and 
 are the Nash equilibrium solution, we have to confirm that the value function is maximized when the evader executes 
 in (13). Conversely, the value function is minimized when the pursuer executes 
 in (14), which can be expressed as:
Moreover, let 
 be the value when the pursuer executes 
 and the evader executes 
. Then, Equations (19) and (20) can be written as the inequalities:
      where 
 is the solution of the Hamilton function (16). Let 
 be the initial state of the value function. Here, we assume that the capture will happen within the period 
. This indicates 
. To verify inequalities (21) and (22), we add this term to Equation (8) and obtain:
From Equation (23), obviously we have 
. Using Lemma 1, (23) becomes:
Let 
 be the integral in Equation (24). We just need to verify that 
 and 
 to prove (21) and (22). Using (24) we get
      
      which accomplishes the proof. □
  Remark 1. It can be seen from Theorem 1 that, once the Nash equilibrium is reached, the value function does not decrease further, regardless of how the pursuer unilaterally changes its policy. Similarly, no matter how the evader unilaterally changes its policy, the value function does not increase further. When  reaches the game theoretic saddle point, if one agent changes its policy unilaterally, which is contrary to its own benefit, the other agent reaps the benefit of the change. At the Nash equilibrium, if the pursuer unilaterally alters its strategy, the evader becomes harder to capture. Conversely, if the evader unilaterally changes its policy, it becomes easier for the pursuer to achieve capture.
 In the PE game problem, an important question is whether the pursuer can capture the evader. If so, the problem becomes a finite-time game. Such issues are common in the interception field. We now determine the conditions that lead to capture in the game.
The following theorem gives the necessary condition for the occurrence of the capture.
 Theorem 2. Let the pursuer and evader follow the dynamic models in Equations (1) and (2). Further, let Equations (13) and (14) be the controls of the agents in the game, in which  is the analytical solution of the HJI Equation (15). Then, capture happens only if dynamic (6) is asymptotically stable.
  Proof of Theorem 2. Because 
 solves the HJI Equation (15) analytically, it’s obvious that 
 is positive and 
. Select function 
 as a candidate of the Lyapunov function. The derivative of 
 is given by:
As we can see, the derivative of the value  can be negative under the condition of . That is, if system dynamics (4) is stabilizable and observable and  holds, then dynamic (6) is asymptotically stable, and capture occurs. On the other hand, if , which fails to meet the Lyapunov stability condition, the states of the PE game (4) are likely to diverge. The distance between the two agents then grows, which makes capture impossible, and the pursuer cannot capture the evader. □
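The stability argument above can be turned into a simple numerical check. The sketch below assumes that the condition takes the form commonly found in linear-quadratic zero-sum games, namely that `R_p^-1 - R_e^-1` is positive definite, where `R_p` and `R_e` are the pursuer’s and evader’s control penalty matrices. Both this inequality and the function name `capture_expected` are assumptions for illustration, since the exact condition in the proof is not reproduced in this text.

```python
import numpy as np

def capture_expected(R_p, R_e):
    """Return True if R_p^-1 - R_e^-1 is positive definite (assumed capture condition)."""
    D = np.linalg.inv(R_p) - np.linalg.inv(R_e)
    D = 0.5 * (D + D.T)  # symmetrize before the eigenvalue test
    return bool(np.all(np.linalg.eigvalsh(D) > 0.0))

# Illustrative penalties: a smaller penalty means cheaper, hence stronger, control.
agile_pursuer = capture_expected(0.5 * np.eye(2), 1.0 * np.eye(2))
agile_evader = capture_expected(1.0 * np.eye(2), 0.5 * np.eye(2))
```

Under this assumption, a pursuer with the smaller control penalty passes the check, while equal or larger penalties do not.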
  Remark 2. It can be predicted that when the dynamic of
           is stable, the distance between the two agents in the game will approach 0 as time . Conversely, if  is non-positive, the pursuer probably cannot capture the evader. If capture takes place, then, because the distance between the two agents is embedded in the state variables, different values of the positive matrix  change the capture time and pattern of the PE game. In value function (6),  and  stand for the summation of control energy consumption of the two players. For the pursuer and evader,  and  represent dynamic constraints on their control or performance [13], known as the control penalty. In this way, a larger  or a smaller  tends to facilitate capture.
  Remark 3. In [39,40], a non-quadratic Lyapunov function is proposed to verify the system convergence, and its tracking performance is better than that of the quadratic Lyapunov function. However, the model of the pursuit-evasion game in this paper focuses more on the physical meaning of the object. The quadratic form of the state variable can represent the relative error, including the relative distance and the relative speed difference. The quadratic form of the control can represent the power of the object after integration. For other systems with relative order 1, the Lyapunov function can be used in the form of  to improve convergence efficiency and make the tracking error approach 0 faster.
  5. Numerical Simulation
In this section, the pursuit-evasion game is numerically simulated. Based on the general motion model, we study the pursuit-evasion problem of a second-order system, in which the accelerations of both players along all dimensions are the controls. The positions and velocities of the agents are monitored online as the state variables.
Consider the PE game problem in a two-dimensional space whose dynamic model is:
      where 
, 
, 
, and 
 are the coordinates and velocities of the pursuer in 
 and 
 directions, respectively. Similarly, 
, 
, 
, and 
 are the coordinates and velocities of the evader in 
 and 
 directions, respectively. As for the controls, 
, and 
 are the acceleration pairs of the two agents, which stand for the policies of the two agents, respectively.
Here, we subtract model (41) from (42), and obtain the system of difference (43), whose state variables are 
. Among them, 
 and 
 stand for the distance in 
 and 
 directions, respectively. The complete model of the difference system is:
The distance between the two agents can be regarded as the capture condition of the PE game problem, which is given as follows:
To determine whether the pursuer can catch up with the evader, set  as the capture radius. When the distance between the two agents is less than , we call it an effective capture, which terminates the PE game.
In this process, the velocities of the agents are unconstrained and have no effect on the agents’ benefits, so matrix  in value function (6) can be .
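The capture test described above can be sketched with a small forward-Euler simulation of two planar double integrators. Everything specific here is an assumption for illustration: the per-axis feedback gains come from the closed-form solution of an assumed scalar Riccati equation with position weight `q` and control penalties `r_p` and `r_e`, and the function names, initial states, penalties, and capture radius are hypothetical rather than the values used in the paper’s simulation.

```python
import numpy as np

def axis_gains(r_p, r_e, q=1.0):
    """Per-axis gains for an assumed double-integrator LQ game with position weight q."""
    s = 1.0 / r_p - 1.0 / r_e
    if s <= 0.0:
        raise ValueError("assumed capture condition 1/r_p - 1/r_e > 0 is violated")
    p12 = np.sqrt(q / s)           # from q - s * p12**2 = 0
    p22 = np.sqrt(2.0 * p12 / s)   # from 2 * p12 - s * p22**2 = 0
    return (p12 / r_p, p22 / r_p), (p12 / r_e, p22 / r_e)

def simulate_capture(p_pos, p_vel, e_pos, e_vel, r_p, r_e,
                     capture_radius, dt=1e-3, t_max=30.0):
    """Return the first time the distance drops below capture_radius, else None."""
    (kp1, kp2), (ke1, ke2) = axis_gains(r_p, r_e)
    p_pos, p_vel, e_pos, e_vel = (np.asarray(v, dtype=float)
                                  for v in (p_pos, p_vel, e_pos, e_vel))
    t = 0.0
    while t < t_max:
        d_pos, d_vel = p_pos - e_pos, p_vel - e_vel
        if np.linalg.norm(d_pos) < capture_radius:
            return t
        a_p = -(kp1 * d_pos + kp2 * d_vel)   # pursuer acceleration (closes the gap)
        a_e = -(ke1 * d_pos + ke2 * d_vel)   # evader acceleration (moves away)
        p_pos, p_vel = p_pos + dt * p_vel, p_vel + dt * a_p
        e_pos, e_vel = e_pos + dt * e_vel, e_vel + dt * a_e
        t += dt
    return None

# Hypothetical scenario: both agents start at rest, 10 length units apart.
capture_time = simulate_capture([0.0, 0.0], [0.0, 0.0], [6.0, 8.0], [0.0, 0.0],
                                r_p=0.5, r_e=1.0, capture_radius=0.1)
```

A return value of `None` would indicate that the distance never fell below the capture radius within `t_max`.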
Generally, the basis functions in the VFA algorithm are formed from the Kronecker product of the quadratic polynomial terms 
. However, for game problems with more states, this definition will make the calculation inefficient. To improve the operation efficiency and obtain the policies of the agents, we construct a single-layer neural network as follows:
Then, the parameters  are updated online through the algorithm introduced in Section 4. The initial value of the parameter is selected as . As the input changes in real time, the residual error can be calculated according to Equation (35). Then, by minimizing the quadratic integral residual, we obtain the updated parameter vector  by Equation (38).
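The idea of fitting value-function weights by minimizing a residual can be sketched as a batch least-squares problem. The block below is not the paper’s Equation (35) or (38), which are not reproduced in this text. It assumes the generic quadratic basis (all distinct products of state components) and an integral-reinforcement style relation in which the weighted difference of basis vectors at the start and end of an interval matches the cost accumulated over that interval. The names `quadratic_basis` and `fit_value_weights` are hypothetical.

```python
import numpy as np

def quadratic_basis(delta):
    """All distinct products delta_i * delta_j with i <= j."""
    delta = np.asarray(delta, dtype=float)
    i, j = np.triu_indices(delta.size)
    return delta[i] * delta[j]

def fit_value_weights(starts, ends, integral_costs):
    """Least-squares W such that W . (phi(start) - phi(end)) matches each interval cost."""
    Phi = np.array([quadratic_basis(s) - quadratic_basis(e)
                    for s, e in zip(starts, ends)])
    y = np.asarray(integral_costs, dtype=float)
    W, *_ = np.linalg.lstsq(Phi, y, rcond=None)
    return W

# Synthetic check with a known quadratic value function V(delta) = delta' P delta.
rng = np.random.default_rng(0)
P_true = np.array([[2.0, 0.5], [0.5, 1.0]])
starts = rng.normal(size=(20, 2))
ends = rng.normal(size=(20, 2))
costs = [s @ P_true @ s - e @ P_true @ e for s, e in zip(starts, ends)]
W = fit_value_weights(starts, ends, costs)  # ordered as [d1*d1, d1*d2, d2*d2]
```

For a state of dimension n this basis has n(n+1)/2 terms, which illustrates why the text looks for a more compact network when the number of states grows.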
To keep the basis function vector persistently excited, the excitation function  is defined as . The excitation function  is selected so that Equation (40) holds.
Other initial states are set in the following simulation as , , , , .
Set the capture radius to 
 and simulate the PE game problem. The positions of the agents vary as shown in Figure 1.
In this problem, the policies of the agents are their accelerations. After the game starts, even if the pursuer and the evader initially move in different directions relative to each other, the pursuer can still quickly adjust its policy and accelerate toward the evader. Meanwhile, the evader can adjust its policy in time to avoid being caught. However, the evader is still captured because of its more stringent control-effort constraints. The capture occurs at  and the coordinates of the pursuer and evader are  and , respectively.
The distance between the two agents is shown in Figure 2.
As the iteration continues, the parameters of the neural network converge as the game reaches the Nash equilibrium, which indicates that the policies of both agents have converged to their optimal values. The distance between the two agents when capture happens is 
. In this simulation, the policies of both agents are updated at the end of each iteration cycle. The interval of the PI method is set as T = 0.05 s. It can be seen in Figure 2 that the policy obtained by the PI method almost converges to the analytical solution of the Nash equilibrium, and the capture is achieved close to the Nash equilibrium by using the PI method.
Compared with the online PSO algorithm, the policies obtained by the ADP algorithm are closer to the analytical Nash equilibrium, which reflects better convergence performance. This becomes more apparent as the number of iterations increases over time.
According to Algorithm 1, the initial values of the network parameters are given arbitrarily. During the whole process of the PE game, six complete cycles of the PI method are executed. The neural network parameters of the VFA algorithm converge gradually, stabilizing to a set of fixed values from the third cycle until termination. Moreover, to verify that the converged neural network parameters are not merely locally optimal, we compare them with the analytical values, as shown in Figure 3. Each parameter converges to its analytical solution, which reflects the stability of the algorithm in solving the problem.
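The convergence of policy iteration cycles toward an analytical solution can be reproduced offline with a model-based sketch. This is not Algorithm 1, which is data-driven and not reproduced in this text. The block assumes a linear-quadratic game with matrices `A`, `B`, `Q`, `R_p`, `R_e` and alternates policy evaluation (a Lyapunov equation, solved through a Kronecker-product linear system) with a simultaneous policy update for both players. All names and numerical values are hypothetical, and the initial matrix `P0` is assumed to yield a stable closed loop.

```python
import numpy as np

def solve_lyapunov(A_c, M):
    """Solve A_c' P + P A_c + M = 0 using a Kronecker-product linear system."""
    n = A_c.shape[0]
    I = np.eye(n)
    K = np.kron(A_c.T, I) + np.kron(I, A_c.T)
    return np.linalg.solve(K, -M.flatten()).reshape(n, n)

def policy_iteration(A, B, Q, R_p, R_e, P0, cycles=6):
    """Return the list [P0, P1, ..., P_cycles] of value matrices."""
    S = B @ (np.linalg.inv(R_p) - np.linalg.inv(R_e)) @ B.T
    P = np.asarray(P0, dtype=float)
    history = [P]
    for _ in range(cycles):
        A_c = A - S @ P          # closed loop under the current policies
        M = Q + P @ S @ P        # running cost under the current policies
        P = solve_lyapunov(A_c, M)   # policy evaluation
        P = 0.5 * (P + P.T)          # remove rounding asymmetry
        history.append(P)
    return history

# Hypothetical one-axis double integrator with illustrative penalties.
A = np.array([[0.0, 1.0], [0.0, 0.0]])
B = np.array([[0.0], [1.0]])
Q = np.diag([1.0, 0.0])
R_p = np.array([[0.5]])
R_e = np.array([[1.0]])
P0 = np.array([[2.0, 1.0], [1.0, 2.0]])   # assumed stabilizing initial guess

history = policy_iteration(A, B, Q, R_p, R_e, P0, cycles=6)
P_final = history[-1]
```

Comparing successive entries of `history` shows how quickly the iterates settle, in the same spirit as the parameter curves compared against the analytical values in the text.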
As mentioned above, matrix 
 is the soft constraint of the agents, which is determined by the actual structural parameters. Different values may lead to different outcomes of the PE game problem. We now change the value of 
 and repeat the simulation above with the initial states unchanged. The values of 
 are taken as 0.3, 0.5, and 0.8. Then, using Algorithm 1, the obtained distance between the two agents in the PE game is shown in Figure 4.
When  is less than  (that is, when the PE game system remains stable), the closer the two values, the longer the time required for the pursuer to catch the evader. Therefore, the  matrix can be regarded as the motion performance limitation index of the agents. The smaller the  value, the better the agent’s motion performance and the wider the range of its control.
Moreover, the interval 
 in the PI method at each step can influence the performance of the solution we obtain. The computing ability plays an important role in setting an appropriate 
. Here, we set different PI method intervals 
 and 0.025 s to recompute the game problem. Choosing 
 and keeping the other initial states and parameters unchanged, we recompute the PE game to obtain the distance between the two agents, shown in Figure 5. The parameters 
 are shown in Figure 6.
It can be seen from Figure 6 that a shorter interval of the PI method speeds up convergence to the Nash equilibrium. However, a shorter interval leads to more iteration cycles before the game terminates, so the computational cost grows as the number of iteration cycles increases. Figure 7 shows that the neural network parameters converge to the analytical solution corresponding to the Nash equilibrium for the intervals of 
 and 
, respectively. The parameter 
 occasionally deviates from the analytical solution in the simulation when 
, which means that too small an iteration interval may cause the parameter to diverge. It is beneficial to select a moderate iteration interval according to the conditions of the agents.
Now we consider the case in which the mobility of the pursuer is lower than that of the evader, i.e., there is no capture during the game. In this case, the solution to the game problem does not meet the condition for system stability. Therefore, the policies of both agents, i.e., the scale of control, may diverge. Let 
, and 
. We impose hard constraints on the controls of both sides, namely 
. The motion trajectories of both sides of the game are shown in Figure 8.
Figure 8 shows that the distance between the two agents keeps increasing over time. This is consistent with Theorem 2: when 
, capture may not occur. In this case, the state variables of both agents diverge, and the pursuer cannot catch the evader.