Article

Robust Tracking Control of Underactuated UAVs Based on Zero-Sum Differential Games

1
School of Automation, Northwestern Polytechnical University, Xi’an 710072, China
2
School of Marine Science and Technology, Northwestern Polytechnical University, Xi’an 710072, China
*
Author to whom correspondence should be addressed.
Drones 2025, 9(7), 477; https://doi.org/10.3390/drones9070477
Submission received: 28 May 2025 / Revised: 2 July 2025 / Accepted: 3 July 2025 / Published: 5 July 2025

Abstract

This paper investigates the robust tracking control of unmanned aerial vehicles (UAVs) against external time-varying disturbances. First, by introducing a virtual position controller, we innovatively decouple the UAV dynamics into independent position and attitude error subsystems, transforming the robust tracking problem into two zero-sum differential games. This approach contrasts with conventional methods by treating disturbances as strategic “players”, enabling a systematic framework to address both external disturbances and model uncertainties. Second, we develop an integral reinforcement learning (IRL) framework that approximates the optimal solution to the Hamilton–Jacobi–Isaacs (HJI) equations without relying on precise system models. This model-free strategy overcomes the limitation of traditional robust control methods that require known disturbance bounds or accurate dynamics, offering superior adaptability to complex environments. Third, the proposed recursive Ridge regression with a forgetting factor ( R 3 F 2 ) algorithm updates actor-critic-disturbance neural network (NN) weights in real time, ensuring both computational efficiency and convergence stability. Theoretical analyses rigorously prove the closed-loop system stability and algorithm convergence, which fills a gap in existing data-driven control studies lacking rigorous stability guarantees. Finally, numerical results validate that the method outperforms state-of-the-art model-based and model-free approaches in tracking accuracy and disturbance rejection, demonstrating its practical utility for engineering applications.

1. Introduction

Unmanned aerial vehicles (UAVs) have drawn increasing research interest and have been applied to many complex tasks, e.g., surveillance, firefighting, and environmental monitoring, mainly due to advantages such as simple mechanisms, flexible mobility, and the ability to hover and take off vertically [1,2,3,4,5,6]. However, designing a robust optimal tracking controller for UAVs in dynamic and complex environments remains challenging. In typical application scenarios, UAVs inherently exhibit a mismatch between control inputs and degrees of freedom (e.g., four control inputs regulating six degrees of freedom), leading to strong coupling between position and attitude control. Furthermore, external time-varying disturbances (e.g., airflow disturbances), model parameter perturbations, and unmodeled dynamics further exacerbate the control complexity. Various classic robust control methods and technologies have been applied to address these issues, e.g., $H_\infty$ control [7], backstepping control [8], adaptive control [9], and model predictive control [10]. Such underactuated systems impose dual requirements on control precision and robustness, which existing control methods based on perfect model assumptions find difficult to satisfy. Although the actor-critic algorithm addresses the policy-interaction problem, it does not handle physical system dynamics and uncertainties, and therefore cannot be directly applied to the robust tracking control of UAV systems.
To address the issues, a combination of zero-sum differential game and reinforcement learning (RL) has recently drawn increasing research interest [11,12,13]. Game theory provides an effective solution for such problems [11], which models the interaction between the system and uncertainty as a dynamic game process. The external disturbances, parameter perturbations, and other uncertainties are modeled as an “adversary”, while the control strategy is modeled as a “defender” countering them, constructing a dynamic adversarial relationship between optimal control and the worst-case disturbance. Robust control strategies [14,15] designed based on this theoretical framework can obtain optimal control strategies through game equilibrium solutions, considering extreme disturbance scenarios, and are naturally adapted to the dual requirements of underactuated systems for strong robustness and high-precision tracking. Reinforcement learning enables agents to select the optimal action or control strategy based on the observed system responses in order to maximize or minimize the cumulative reward. In recent years, RL and deep reinforcement learning (DRL) technologies have been widely used in UAV control problems. For instance, the authors of [12] proposed a deep neural network (DNN) called TrailNet, which was applied to the tracking control of micro air vehicles (MAVs) in an unstructured outdoor environment (e.g., mountainous areas, forests, jungles, etc.). A deep deterministic policy gradient (DDPG) algorithm for UAV trajectory tracking in a simulation environment was investigated in [16]. In [17], an RL method was applied to the fault-tolerant controller design problem for an unknown nonlinear UAV with actuator faults. In [18], an RL-based MPC approach was developed for UAV navigation control in unknown environments, where both the vehicle control and obstacle avoidance tasks were fulfilled using MPC, whereas the guidance of the MAV through complex environments was completed by RL. In [19], a model-based adaptive dynamic programming (ADP) was designed for solving the UAV robust tracking control problem under coupling uncertainties. Moreover, the DRL-based robust control problem for UAV helicopters was studied in [13].
Integral RL (IRL) [20] is an integral temporal difference-based RL methodology, which is provably effective in solving the optimal control and decision-making problems of nonlinear systems with partially known or completely unknown system dynamics. To relieve the demand for knowing the system model, the IRL scheme aims to design the optimal robust controller by solving the integral Bellman equations via the policy iteration (PI) technique. Among the existing literature, the proposed IRL-based robust tracking control methods typically use three NN-based architectures, as follows: critic neural networks (NNs), actor-critic NNs, and actor-critic-disturbance NNs. For instance, Wang et al. [21] proposed a neural identifier-based intelligent critic control for unknown nonlinear affine dynamics to achieve disturbance attenuation. In [22], nonlinear disturbance-observer-based critic NNs were used to solve the problem of cooperative control for nonlinear dynamical systems under mismatched disturbances. Motivated by [22], Mohammadi et al. [23] proposed a value-iteration-based algorithm for optimal tracking control of nonlinear dynamical systems subject to delays, disturbances, and constraints. In [24], Yang et al. developed a critic NN-based event-driven robust controller for continuous-time (CT) nonlinear input-affine systems. The requirement of input dynamics inevitably degrades the performance of the aforementioned critic NN-based IRL approaches to a certain extent. While the system dynamics model is completely unknown, IRL research still manages to exploit actor-critic and actor-critic-disturbance architectures. In [25], a robust control law was developed for unknown nonlinear systems by combining the actor-critic NN-based IRL algorithm with disturbance compensation schemes. In [26], the robust control problem was converted into a constrained optimal control problem, and then an actor-critic NN-based IRL framework was proposed to address the optimal control problem of constrained nonlinear dynamical systems. The model-free online IRL algorithms for nonlinear robust tracking control based on the actor-critic-disturbance structure were investigated in [27,28,29,30,31].
However, in practical implementations, traditional reinforcement learning algorithms for zero-sum differential games generally adopt gradient descent or least-squares methods to optimize the actor-critic-disturbance neural network weights. These two methods have significant inherent defects: (1) gradient descent depends heavily on manually configured learning rates and is prone to local minima in high-dimensional nonlinear problems, leading to slow convergence and large computational resource consumption; (2) the least-squares method relies on a linearization assumption to approximate nonlinear systems, making it difficult to adapt to changes in system characteristics in complex dynamic environments and causing significant model-mismatch errors. More critically, both methods implicitly assume that the system's statistical characteristics remain stationary and rely on a large number of data samples for parameter estimation. In practical underactuated systems, the time-varying nature of external disturbances, the nonlinearity of the dynamics, and unmodeled characteristics make these assumptions difficult to hold, so the convergence and stability of the algorithm cannot be effectively guaranteed. Moreover, in high-dimensional state spaces and multi-objective optimization scenarios, traditional methods struggle even more to cope with the combined challenges of disturbances and uncertainty, and innovative algorithms are urgently needed to break through the existing technical bottlenecks.
Based on the above discussion, a data-driven reinforcement learning method based on a zero-sum differential game framework is proposed. By introducing a virtual position controller, the tracking error dynamics model of the underactuated quadrotor is decoupled into a position error subsystem and an attitude error subsystem. Based on this decoupling, the robust optimal tracking control problem is transformed into two parallel zero-sum differential games. To address the unknown system model problem, an IRL algorithm without prior dynamic knowledge is proposed. By replacing the traditional Bellman equation with the integral Bellman equation, the instantaneous reward is extended to the cumulative reward, enhancing robustness to dynamic uncertainties. The main contributions are summarized in three aspects, as follows:
(1)
Compared with the method in [19], which requires a known prior disturbance bound, our approach considers the coupling of nonlinear uncertainty and time-varying external disturbance as adversarial players by formulating the UAV robust optimal tracking control problem as two zero-sum differential games.
(2)
A data-driven IRL algorithm is presented to approximately achieve the Nash equilibrium strategies and the value functions without using the knowledge of the UAV dynamics. We develop an improved actor-critic-disturbance NN weights update law by using the recursive Ridge regression with the forgetting factor ( R 3 F 2 ) estimator, which is more computationally efficient than the least-squares (LS) method adopted in [11,27,28,29,30,31]. The results also show that the weight parameter estimation converges exponentially to a small neighborhood of the true value.
(3)
In order to generate the learning dataset for the off-policy IRL algorithm, we introduce a novel behavior policy that is different from the conventional one used in [19,23,27,28]. Instead of using an initially admissible exploratory control law, the selected behavior policy exploits a stabilizing linear state feedback controller for the nominal position and attitude tracking error subsystems, which is generally easier to design and implement than the conventional policy.
The rest of this paper is organized as follows. Section 2 introduces preliminaries on the UAV dynamics and the problem statement. The zero-sum differential games, as well as the associated HJI equations, are developed in Section 3, where the Nash equilibria of the games are discussed. In Section 4, the R 3 F 2 -based off-policy IRL algorithm is provided along with the convergence and stability analysis. A numerical example is presented to verify the theoretical results in Section 5. Finally, Section 6 concludes this paper.
Throughout this paper, $\mathbb{R}$, $\mathbb{R}_+$, $\mathbb{R}_{\ge 0}$, and $\mathbb{N}_0$ represent the sets of real numbers, positive real numbers, non-negative real numbers, and non-negative natural numbers, respectively. The superscript $\top$ indicates transposition. The notation $\otimes$ denotes the Kronecker product. For a vector $x \in \mathbb{R}^n$, $\|x\| := \sqrt{x^{\top}x}$ denotes the Euclidean norm. For a matrix $A \in \mathbb{R}^{m\times n}$, $\|A\|$ represents the spectral norm of $A$, and $\mathrm{vec}(A) = [a_1^{\top}, a_2^{\top}, \ldots, a_n^{\top}]^{\top} \in \mathbb{R}^{mn\times 1}$ denotes the vectorization of $A$, with $a_i \in \mathbb{R}^m$ being the $i$th column of $A$. $\lambda_{\max}(A)$ and $\lambda_{\min}(A)$ indicate the maximum and minimum eigenvalues of $A$, respectively. If $A \in \mathbb{R}^{n\times n}$, $A > 0$ ($A \ge 0$) indicates that $A$ is positive definite (positive semidefinite). Given $A > 0$, $\|x\|_A := \sqrt{x^{\top}Ax}$ represents the $A$-weighted norm of $x$. The notation $I$ indicates an identity matrix of appropriate dimensions, $O_{m\times n}$ indicates an $m\times n$ zero matrix, and $1_m \in \mathbb{R}^m$ denotes an $m$-dimensional column vector with all components equal to 1.

2. Preliminaries

This section presents the UAV dynamics and the problem description for UAV robust optimal tracking control with external disturbance.

2.1. The UAV Dynamics

In this paper, the schematic diagram of a UAV is displayed in Figure 1. Let E = { e x , e y , e z } be the right-handed earth-fixed inertial frame, and B = { b x , b y , b z } be the body-fixed frame attached to the UAV with the origin located at its center of gravity. The thrusts of four rotors are defined as f i = κ ω i 2 , i = 1 , 2 , 3 , 4 , where κ R is a parameter and ω i R represents the ith-rotor speed. Moreover, m R + indicates the mass of the UAV, and g is the acceleration of gravity.
The UAV dynamic modeling is based on the following assumptions to ensure theoretical tractability and practical applicability: (1) Rigid-body approximation. The UAV is modeled as a rigid body, ignoring elastic deformations and flexible components, and is validated for most fixed-wing and quadrotor systems (e.g., [32,33,34]). (2) Flat-Earth hypothesis. The Earth’s curvature and rotation are neglected; suitable for low-altitude operations (altitude < 1000 m) ([35]). (3) Lumped aerodynamic parameters. Air resistance is described by linearized drag coefficients, with high-order effects (e.g., stall, compressibility) omitted for simplification [36]. (4) Ideal actuator dynamics. Actuators are assumed to have instantaneous responses without time delays or saturation—a condition that can be relaxed via feedback linearization in practice [37].
Denote ζ = [ x , y , z ] R 3 and ϑ = ϕ , θ , ψ R 3 as the position and the attitude of the UAV in the E -frame, respectively. The dynamical model takes the following form, derived from the rigid-body dynamics theory [32,33]:
$$m\ddot{\zeta} = -K_{\zeta}\dot{\zeta} + R_B^E b_z F - m g b_z + d_{\zeta} \quad (\text{translational dynamics})$$
$$J_b\ddot{\vartheta} = -K_{\vartheta}\dot{\vartheta} + \Lambda\tau + d_{\vartheta} \quad (\text{rotational dynamics})$$
where F = i = 1 4 f i R denotes the total thrust produced by four rotors, τ = [ τ ϕ , τ θ , τ ψ ] R 3 is the rotational torque, b z = [ 0 , 0 , 1 ] R 3 , K ζ = diag ( k 1 , k 2 , k 3 ) R 3 × 3 and K ϑ = diag ( k 4 , k 5 , k 6 ) R 3 × 3 , with k i R + , i = 1 , , 6 , denoting the aerodynamic damping coefficient matrices, and J b = diag ( I x , I y , I z ) R 3 × 3 is the inertial matrix, with I x , I y , I z R + , and d ζ = [ d x , d y , d z ] R 3 and d ϑ = [ d ϕ , d θ , d ψ ] R 3 denote the external time-varying disturbance vectors. Moreover, we have Λ = diag ( l , l , c ) R 3 × 3 , where l R + represents the distance from the rotor to the center of gravity, and c indicates the anti-torque coefficient. In addition, the orientation matrix R B E S O ( 3 ) of the UAV from the B -frame to the E -frame is expressed via Euler angles as follows:
$$R_B^E = \begin{bmatrix} c_{\psi}c_{\theta} & s_{\phi}s_{\theta}c_{\psi} - c_{\phi}s_{\psi} & c_{\phi}s_{\theta}c_{\psi} + s_{\phi}s_{\psi} \\ c_{\theta}s_{\psi} & s_{\phi}s_{\theta}s_{\psi} + c_{\phi}c_{\psi} & c_{\phi}s_{\theta}s_{\psi} - s_{\phi}c_{\psi} \\ -s_{\theta} & s_{\phi}c_{\theta} & c_{\phi}c_{\theta} \end{bmatrix}$$
where $s_{*}$ and $c_{*}$ indicate $\sin(*)$ and $\cos(*)$, respectively. To avoid gimbal lock and ensure a unique attitude representation, the Euler angles are constrained within bounded ranges:
$$-\frac{\pi}{2} < \phi < \frac{\pi}{2}, \qquad -\frac{\pi}{2} < \theta < \frac{\pi}{2}, \qquad -\pi < \psi < \pi.$$
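To make the attitude parameterization concrete, the following is a minimal sketch (Python with NumPy; the function name and the explicit construction are our own illustrative choices, not code from the paper) of how the orientation matrix above can be evaluated from the Euler angles:

```python
import numpy as np

def rotation_body_to_earth(phi: float, theta: float, psi: float) -> np.ndarray:
    """Euler-angle (roll phi, pitch theta, yaw psi) rotation from the B-frame to the E-frame."""
    s, c = np.sin, np.cos
    return np.array([
        [c(psi)*c(theta), s(phi)*s(theta)*c(psi) - c(phi)*s(psi), c(phi)*s(theta)*c(psi) + s(phi)*s(psi)],
        [c(theta)*s(psi), s(phi)*s(theta)*s(psi) + c(phi)*c(psi), c(phi)*s(theta)*s(psi) - s(phi)*c(psi)],
        [-s(theta),       s(phi)*c(theta),                        c(phi)*c(theta)],
    ])

# Example: a level hover attitude gives the identity rotation.
assert np.allclose(rotation_body_to_earth(0.0, 0.0, 0.0), np.eye(3))
```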

2.2. Problem Formulation

Let ζ r = [ x r , y r , z r ] R 3 and ϑ r = [ ϕ r , θ r , ψ r ] R 3 be the reference states of the position and attitude subsystems, respectively. The goal is to design F and τ such that the UAV tracks the reference states ζ r and ϑ r despite the presence of external time-varying disturbances d ζ and d ϑ .
To quantify the above objective, we first define the position and attitude tracking error signals of the UAV as follows:
$$\xi_1 = \begin{bmatrix} \zeta - \zeta_r \\ \dot{\zeta} - \dot{\zeta}_r \end{bmatrix} \in \mathbb{R}^6, \qquad \xi_2 = \begin{bmatrix} \vartheta - \vartheta_r \\ \dot{\vartheta} - \dot{\vartheta}_r \end{bmatrix} \in \mathbb{R}^6.$$
Substituting the derivative of (4) into (1), we can obtain the position and attitude tracking error dynamical subsystems of the UAV as follows:
$$\dot{\xi}_1 = f_1(\xi_1) + g_1 F + h_1\Delta_{\zeta}, \qquad \dot{\xi}_2 = f_2(\xi_2) + g_2 u_2 + h_2\Delta_{\vartheta}$$
where $f_1 = \begin{bmatrix} O_3 & I_3 \\ O_3 & -\frac{1}{m}K_{\zeta} \end{bmatrix}\xi_1 \in \mathbb{R}^{6\times 1}$, $f_2 = \begin{bmatrix} O_3 & I_3 \\ O_3 & -J_b^{-1}K_{\vartheta} \end{bmatrix}\xi_2 \in \mathbb{R}^{6\times 1}$, $g_1 = [0_3; \frac{1}{m}R_B^E b_z] \in \mathbb{R}^{6\times 1}$, $g_2 = [O_3; J_b^{-1}] \in \mathbb{R}^{6\times 3}$, $h_1 = [O_3; I_3] \in \mathbb{R}^{6\times 3}$, $h_2 = [O_3; J_b^{-1}] \in \mathbb{R}^{6\times 3}$, $\Delta_{\zeta} = -\frac{1}{m}K_{\zeta}\dot{\zeta}_r - \ddot{\zeta}_r - g b_z + \frac{1}{m}d_{\zeta} \in \mathbb{R}^3$, $\Delta_{\vartheta} = -K_{\vartheta}\dot{\vartheta}_r - J_b\ddot{\vartheta}_r + d_{\vartheta} \in \mathbb{R}^3$, and $u_2 = \Lambda\tau \triangleq [u_{\phi}, u_{\theta}, u_{\psi}]^{\top} \in \mathbb{R}^3$.
Inspired by [32], we introduce a virtual position control input u 1 = [ u x , u y , u z ] R 3 described by
$$u_1 = -g b_z + \frac{1}{m}F R_r b_z$$
where R r R 3 × 3 is the rotational command matrix given by the following:
$$R_r = \begin{bmatrix} c_{\theta_r}c_{\psi_r} & s_{\phi_r}s_{\theta_r}c_{\psi_r} - c_{\phi_r}s_{\psi_r} & c_{\phi_r}s_{\theta_r}c_{\psi_r} + s_{\phi_r}s_{\psi_r} \\ c_{\theta_r}s_{\psi_r} & s_{\phi_r}s_{\theta_r}s_{\psi_r} + c_{\phi_r}c_{\psi_r} & c_{\phi_r}s_{\theta_r}s_{\psi_r} - s_{\phi_r}c_{\psi_r} \\ -s_{\theta_r} & s_{\phi_r}c_{\theta_r} & c_{\phi_r}c_{\theta_r} \end{bmatrix}$$
with ϕ r , θ r , and ψ r being the reference states of the attitude. By solving (6), we get
$$F = m\left\| u_1 + g b_z \right\|, \qquad \theta_r = \arctan\!\left(\frac{u_x c_{\psi_r} + u_y s_{\psi_r}}{u_z + g}\right), \qquad \phi_r = \arcsin\!\left(\frac{u_x s_{\psi_r} - u_y c_{\psi_r}}{\left\| u_1 + g b_z \right\|}\right)$$
where ψ r is the reference yaw angle.
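As a concrete illustration of the inversion in (7), the sketch below (Python/NumPy; the function and variable names are ours, and the numerical values of $m$ and $g$ are the simulation values listed later in Table 1) maps a virtual position command $u_1$ and the reference yaw $\psi_r$ to the total thrust and the reference roll and pitch angles:

```python
import numpy as np

def virtual_to_thrust_attitude(u1, psi_r, m=2.33, g=9.8):
    """Recover (F, theta_r, phi_r) from the virtual position input u1 = [ux, uy, uz]."""
    ux, uy, uz = u1
    vec = np.array([ux, uy, uz + g])     # u1 + g * b_z
    F = m * np.linalg.norm(vec)          # total thrust
    theta_r = np.arctan2(ux*np.cos(psi_r) + uy*np.sin(psi_r), uz + g)
    phi_r = np.arcsin((ux*np.sin(psi_r) - uy*np.cos(psi_r)) / np.linalg.norm(vec))
    return F, theta_r, phi_r

# Hover check: a zero virtual acceleration should demand F = m*g and a level attitude.
F, th, ph = virtual_to_thrust_attitude(np.zeros(3), psi_r=0.0)
print(F, th, ph)   # approximately 22.834, 0.0, 0.0
```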
By adding and subtracting u 1 to ξ ˙ 1 , we can rewrite (5) in the following uniform form:
$$\dot{\xi}_i = F_i\xi_i + G_i\big(u_i + v_i\big), \qquad i = 1, 2$$
where $u_1$ and $u_2$ are the controllers to be designed, $F_1 = \begin{bmatrix} O_3 & I_3 \\ O_3 & -\frac{1}{m}K_{\zeta} \end{bmatrix} \in \mathbb{R}^{6\times 6}$, $G_1 = [O_3; I_3] \in \mathbb{R}^{6\times 3}$, $F_2 = \begin{bmatrix} O_3 & I_3 \\ O_3 & -J_b^{-1}K_{\vartheta} \end{bmatrix} \in \mathbb{R}^{6\times 6}$, and $G_2 = [O_3; J_b^{-1}] \in \mathbb{R}^{6\times 3}$. Furthermore, $v_1 = -\frac{1}{m}K_{\zeta}\dot{\zeta}_r - \ddot{\zeta}_r + \frac{1}{m}d_{\zeta} + f_{\vartheta}(F) \in \mathbb{R}^3$ and $v_2 = -K_{\vartheta}\dot{\vartheta}_r - J_b\ddot{\vartheta}_r + d_{\vartheta} \in \mathbb{R}^3$ refer to the lumped disturbances acting on the position and attitude tracking error subsystems. The vector $f_{\vartheta}(F) = \frac{1}{m}F(R_B^E - R_r)b_z \in \mathbb{R}^3$ in $v_1$ is regarded as the coupling uncertainty that captures the effect of the rotational motion on the translational motion. Consequently, the control objective is to design $u_1$ and $u_2$ so that $\zeta(t)$ and $\psi(t)$ track the reference signals $\zeta_r = [x_r, y_r, z_r]^{\top}$ and $\psi_r$ under the time-varying disturbances $v_1$ and $v_2$.
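For reference, the following is a minimal sketch (Python/NumPy; the helper name and the sign convention of the damping blocks follow our reconstruction of (8), so treat them as assumptions) of the constant matrices $F_i$ and $G_i$ of the two error subsystems:

```python
import numpy as np

def error_subsystem_matrices(m, K_zeta, K_theta, J_b):
    """Build (F1, G1) for the position and (F2, G2) for the attitude error subsystems."""
    O3, I3 = np.zeros((3, 3)), np.eye(3)
    F1 = np.block([[O3, I3], [O3, -K_zeta / m]])                       # position error dynamics
    G1 = np.vstack([O3, I3])                                           # channel of u1 and v1
    F2 = np.block([[O3, I3], [O3, -np.linalg.inv(J_b) @ K_theta]])     # attitude error dynamics
    G2 = np.vstack([O3, np.linalg.inv(J_b)])                           # channel of u2 and v2
    return F1, G1, F2, G2

# Values taken from the simulation section (Table 1 and the damping coefficients).
F1, G1, F2, G2 = error_subsystem_matrices(
    m=2.33,
    K_zeta=np.diag([0.01, 0.01, 0.01]),
    K_theta=np.diag([0.015, 0.015, 0.015]),
    J_b=np.diag([0.16, 0.16, 0.32]),
)
```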
Then, the robust tracking control problem of a UAV subject to external time-varying disturbances can be stated as follows:
Problem 1.
The purpose of the UAV robust optimal tracking control is to find an optimal state feedback control law $u_i$ such that (1) the tracking error subsystems described in (8) are closed-loop stable when $v_i = 0$; and (2) the bounded $L_2$-gain condition holds when $v_i \neq 0$, that is, given $\eta_i > 0$,
$$\int_0^T \left( \|\xi_i(t)\|_{\hat{Q}_i}^2 + \|u_i(t)\|_{\hat{R}_i}^2 \right) dt \le \eta_i^2 \int_0^T \|v_i(t)\|^2\, dt + \chi(\xi_i(0))$$
is satisfied for all $T > 0$ and $v_i(t) \in L_2[0, \infty)$, where $\hat{Q}_i \in \mathbb{R}^{6\times 6}$, $\hat{R}_i \in \mathbb{R}^{3\times 3}$, $\hat{Q}_i = \hat{Q}_i^{\top} > 0$, $\hat{R}_i = \hat{R}_i^{\top} > 0$, and $\chi(\xi_i(t)) \in C^2$ with $\chi(0) = 0$.

3. Game-Based Robust Optimal Control Law Design and Stability Analysis

In this section, we formulate the robust control problem for the subsystems in (8) as two independent zero-sum differential games. Then, the Nash equilibrium is studied.

3.1. Zero-Sum Differential Games

In order to ensure robust performance under the lumped disturbances, according to inequality (9), we can define the performance index function as
$$J_i(\xi_i(0), u_i, v_i) = \int_0^{\infty} q_i(\xi_i(t), u_i(t), v_i(t))\, dt, \qquad i = 1, 2$$
where $q_i(\xi_i(t), u_i(t), v_i(t))$ refers to an instantaneous reward function defined as $q_i(\xi_i(t), u_i(t), v_i(t)) = \|\xi_i(t)\|_{\hat{Q}_i}^2 + \|u_i(t)\|_{\hat{R}_i}^2 - \eta_i^2\|v_i(t)\|^2$.
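A direct transcription of this instantaneous reward, as a small sketch (Python/NumPy; the function name is ours):

```python
import numpy as np

def instantaneous_reward(xi, u, v, Q_hat, R_hat, eta):
    """q_i(xi, u, v) = ||xi||^2_Q + ||u||^2_R - eta^2 * ||v||^2 from the performance index (10)."""
    return xi @ Q_hat @ xi + u @ R_hat @ u - eta**2 * (v @ v)
```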
In tracking error dynamics (8), the controller and the lumped disturbance can be regarded as the control player and disturbance player, respectively. Then, the robust tracking control problem can be reformulated as the following game problems with respect to (8):
Problem 2.
(Zero-sum differential games). The zero-sum differential games for the UAV robust tracking control can be represented as
G i : min u i max v i J i ( ξ i ( 0 ) , u i , v i ) s . t . ( 8 )
where G 1 and G 2 are the games for the position tracking control and attitude tracking control, respectively.
In game G i , i = 1 , 2 , the control player aims to seek the optimal control input u i * that minimizes J i ( ξ i ( 0 ) , u i , v i ) , while the disturbance player desires to find the worst-case disturbance v i * that maximizes it. The corresponding Nash equilibrium strategy is defined as follows:
Definition 1.
(Nash equilibrium strategy): The policy pair ( u i * , v i * ) is said to be a Nash equilibrium strategy for the zero-sum differential game G i , i = 1 , 2 , if the inequality
$$J_i(\xi_i(0), u_i^*, v_i) \le J_i(\xi_i(0), u_i^*, v_i^*) \le J_i(\xi_i(0), u_i, v_i^*)$$
holds for all u i and v i . The value J i ( ξ i ( 0 ) , u i * , v i * ) is the so-called Nash equilibrium.
In the following subsection, we will prove that the Nash equilibrium strategies are solutions to the zero-sum differential game problems G i , i = 1 , 2 .

3.2. Stability and Nash Equilibria for Zero-Sum Differential Games

According to the performance index defined in (10), we define the value functions for game G i , i = 1 , 2 , by
$$V_i(\xi_i(t)) = \int_t^{\infty} q_i\big(\xi_i(\tau), u_i(\tau), v_i(\tau)\big)\, d\tau.$$
For convenience in the derivation, i = 1 , 2 is omitted in the following discussion. Let the associated Hamiltonian function of (13) be defined as
$$H_i(\xi_i, u_i, v_i, \nabla V_i, t) = \nabla V_i^{\top}\,\dot{\xi}_i(t) + q_i\big(\xi_i(\tau), u_i(\tau), v_i(\tau)\big)$$
where $\nabla V_i = \partial V_i / \partial \xi_i$ denotes the gradient of the value function with respect to $\xi_i$.
Let V i * ( ξ i ( t ) ) be the optimal value of V i ( ξ i ( t ) ) , i.e.,
$$V_i^*(\xi_i(t)) = \min_{u_i}\max_{v_i} \int_t^{\infty} q_i\big(\xi_i(\tau), u_i(\tau), v_i(\tau)\big)\, d\tau.$$
It can be easily verified that V i * ( ξ i ( t ) ) satisfies the following HJI equation:
$$\min_{u_i}\max_{v_i} H_i(\xi_i, u_i, v_i, \nabla V_i, t) = \nabla V_i^{*\top}\Big[ F_i\xi_i(t) + G_i\big(u_i^*(t) + v_i^*(t)\big) \Big] + \|\xi_i(t)\|_{\hat{Q}_i}^2 + \|u_i^*(t)\|_{\hat{R}_i}^2 - \eta_i^2\|v_i^*(t)\|^2 = 0$$
where $\nabla V_i^* = \partial V_i^* / \partial \xi_i$. By employing the stationary conditions $\partial H_i / \partial u_i^* = 0$ and $\partial H_i / \partial v_i^* = 0$, we have
$$u_i^* = -\frac{1}{2}\hat{R}_i^{-1} G_i^{\top} \nabla V_i^*, \qquad v_i^* = \frac{1}{2\eta_i^2} G_i^{\top} \nabla V_i^*.$$
Substituting (16) into (15), one gets the HJI equation:
$$0 = H_i(\xi_i, u_i^*, v_i^*, \nabla V_i^*, t) = \|\xi_i(t)\|_{\hat{Q}_i}^2 - \frac{1}{4}\nabla V_i^{*\top} G_i \hat{R}_i^{-1} G_i^{\top} \nabla V_i^* + \nabla V_i^{*\top} F_i \xi_i(t) + \frac{1}{4\eta_i^2}\nabla V_i^{*\top} G_i G_i^{\top} \nabla V_i^*$$
with V i ( 0 ) = 0 .
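Given any differentiable value-function approximation, the saddle-point policies in (16) are simple linear maps of the value gradient. A minimal sketch (Python/NumPy; names are ours) under the assumption that the gradient $\nabla V_i^*$ is available:

```python
import numpy as np

def saddle_point_policies(grad_V, G, R_hat, eta):
    """Optimal control and worst-case disturbance of Eq. (16) for a given value gradient."""
    u_star = -0.5 * np.linalg.solve(R_hat, G.T @ grad_V)   # u* = -(1/2) R^{-1} G^T grad(V*)
    v_star = (0.5 / eta**2) * (G.T @ grad_V)               # v* = 1/(2 eta^2) G^T grad(V*)
    return u_star, v_star
```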
The following theorem illustrates that (16), obtained by solving the above HJI equation in (17), provides a solution for both Problems 1 and 2:
Theorem 1.
Suppose that 0 < V i * C 2 satisfies the HJI equation (17). Then, under the policy pair ( u i * , v i * ) given by (16) in terms of V i * , the subsystems in (8) are closed-loop stable and the bounded L 2 -gain condition (9) holds. Moreover, ( u i * , v i * ) is a Nash equilibrium strategy of G i , i = 1 , 2 .
Proof. 
Suppose that V i * satisfies HJI equation (17). Then,
$$H_i(\xi_i, u_i, v_i, \nabla V_i^*, t) - H_i(\xi_i, u_i^*, v_i^*, \nabla V_i^*, t) = \nabla V_i^{*\top} G_i\big(u_i(t) - u_i^*(t)\big) + \|u_i(t)\|_{\hat{R}_i}^2 - \|u_i^*(t)\|_{\hat{R}_i}^2 + \nabla V_i^{*\top} G_i\big(v_i(t) - v_i^*(t)\big) - \eta_i^2\big(\|v_i(t)\|^2 - \|v_i^*(t)\|^2\big).$$
According to the Hamiltonian function (14), the following identity holds:
$$\dot{V}_i^*(\xi_i(t)) = H_i(\xi_i, u_i^*, v_i^*, \nabla V_i^*, t) + \|u_i(t) - u_i^*(t)\|_{\hat{R}_i}^2 - q_i\big(\xi_i(t), u_i(t), v_i(t)\big) - \eta_i^2\|v_i(t) - v_i^*(t)\|^2.$$
By setting u i ( t ) = u i * ( t ) and v i ( t ) = v i * ( t ) (the optimal control and disturbance strategies) and substituting them into (19), we have the following:
$$\dot{V}_i^*(\xi_i(t)) = -q_i\big(\xi_i(t), u_i^*(t), v_i^*(t)\big) \le 0,$$
where the equality V ˙ i * ( ξ i ( t ) ) = 0 holds if and only if ξ i ( t ) = 0 . Thus, the subsystems described in (8) are asymptotically stable.
According to the Lyapunov stability theory, since V ˙ i * ( ξ i ( t ) ) is negative semidefinite and vanishes only at the origin, the subsystems described by (8) are proven to be asymptotically stable. The Lyapunov function V i * ( ξ i ) satisfies the standard stability criteria, ensuring that trajectories converge to the equilibrium point ξ i ( t ) = 0 as t .
Integrating (20) from 0 to T yields
$$V_i^*(\xi_i(T)) - V_i^*(\xi_i(0)) = -\int_0^T q_i\big(\xi_i(t), u_i^*(t), v_i^*(t)\big)\, dt \le 0$$
Since V i * > 0 , one has
$$\int_0^T \left( \|\xi_i(t)\|_{\hat{Q}_i}^2 + \|u_i^*(t)\|_{\hat{R}_i}^2 \right) dt \le V_i^*(\xi_i(0)) + \eta_i^2 \int_0^T \|v_i^*(t)\|^2\, dt$$
which shows that the bounded L 2 -gain condition holds. Furthermore, due to the fact that the tracking subsystem is asymptotically stable, it can be concluded that V i * ( ξ i ( ) ) = 0 . Thus, by (14), we have
$$J_i(\xi_i(0), u_i, v_i) = V_i^*(\xi_i(0)) + \int_0^{\infty} H_i(\xi_i, u_i^*, v_i^*, \nabla V_i^*, t)\, dt + \int_0^{\infty} \left( \|u_i(t) - u_i^*(t)\|_{\hat{R}_i}^2 - \eta_i^2\|v_i(t) - v_i^*(t)\|^2 \right) dt.$$
It follows that the condition (12) is satisfied. By setting u i = u i * and v i = v i * , one gets J i ( ξ i ( 0 ) , u i * , v i * ) = V i * ( ξ i ( 0 ) ) . Thus, u i * and v i * form a Nash equilibrium strategy and V i * ( ξ i ( 0 ) ) is the corresponding Nash equilibrium of the game G i .    □
However, as a nonlinear partial differential equation (PDE), the HJI equation given in (17) is difficult to solve directly and analytically [38]. To this end, a data-driven IRL approach is designed to obtain the solution to the HJI equation and the Nash equilibrium strategies.

4. R 3 F 2 -Based IRL Approach

In this section, an R 3 F 2 -based IRL algorithm is developed to solve the zero-sum differential games. First, the actor-critic-disturbance architecture is utilized to approximate the value function and the Nash equilibrium strategy. Based on this, a model-free integral Bellman equation is derived. Then, an R 3 F 2 -based IRL algorithm is provided to evaluate the value function and learn the Nash equilibrium strategy. Finally, the convergence and stability analysis are presented.

4.1. A Framework of R 3 F 2 -Based IRL for Zero-Sum Differential Games

Denote $V_i^j(\xi_i(t))$ as the updated value function in the $j$th iteration, and $\hat{u}_i^j$ and $\hat{v}_i^j$ as the updated control policy and disturbance in the $j$th iteration, respectively. For all $t > 0$ and $\Delta t > 0$, using the fundamental theorem of calculus and the definition of the value function (13), the increment of $V_i^j(\xi_i)$ over the interval $[t, t+\Delta t]$ is as follows:
$$V_i^j(\xi_i(t+\Delta t)) - V_i^j(\xi_i(t)) = \int_t^{t+\Delta t} \frac{d}{d\tau} V_i^j(\xi_i(\tau))\, d\tau = \int_t^{t+\Delta t} \nabla V_i^{j\top}(\xi_i(\tau))\,\dot{\xi}_i(\tau)\, d\tau,$$
where the second equality follows from the chain rule. Rearranging this expression yields the following:
$$V_i^j(\xi_i(\tau))\Big|_t^{t+\Delta t} = \int_t^{t+\Delta t} \nabla V_i^{j\top}\,\dot{\xi}_i(\tau)\, d\tau.$$
Define $\tilde{u}_i^j = u_i - \hat{u}_i^j$ and $\tilde{v}_i^j = v_i - \hat{v}_i^j$ as the estimation errors of the control and the disturbance, respectively. Substituting $u_i = \hat{u}_i^j + \tilde{u}_i^j$ and $v_i = \hat{v}_i^j + \tilde{v}_i^j$ into the system dynamics (8) gives the following:
$$\dot{\xi}_i = F_i\xi_i + G_i(u_i + v_i) = F_i\xi_i + G_i\big(\hat{u}_i^j + \hat{v}_i^j + \tilde{u}_i^j + \tilde{v}_i^j\big),$$
where $\hat{u}_i^j$ and $\hat{v}_i^j$ denote the control and disturbance policies at the $j$th iteration, and $\tilde{u}_i^j$ and $\tilde{v}_i^j$ represent the residual errors. This decomposition separates the nominal control/disturbance $(\hat{u}_i^j, \hat{v}_i^j)$ from their errors, facilitating the stability analysis.
From the Hamiltonian function (14) and the optimality conditions $\partial H_i / \partial u_i = 0$ and $\partial H_i / \partial v_i = 0$, we can obtain
$$\hat{u}_i^{j+1} = -\frac{1}{2}\hat{R}_i^{-1} G_i^{\top} \nabla V_i^j, \qquad \hat{v}_i^{j+1} = \frac{1}{2\eta_i^2} G_i^{\top} \nabla V_i^j,$$
where $\hat{R}_i \succ 0$ and $\eta_i > 0$ are design parameters. These policies satisfy the saddle-point conditions $\partial^2 H_i / \partial u_i^2 \succ 0$ and $\partial^2 H_i / \partial v_i^2 \prec 0$, ensuring local optimality.
Substituting (26) and (27) into (24) yields the following model-free integral Bellman equation:
$$\begin{aligned} V_i^j(\xi_i(t)) - V_i^j(\xi_i(t+\Delta t)) &= -\int_t^{t+\Delta t} \nabla V_i^{j\top}\Big[ F_i\xi_i + G_i\big(\hat{u}_i^j + \hat{v}_i^j + \tilde{u}_i^j + \tilde{v}_i^j\big) \Big] d\tau \\ &= -\int_t^{t+\Delta t} \nabla V_i^{j\top}\Big[ F_i\xi_i + G_i\big(\hat{u}_i^j + \hat{v}_i^j\big) \Big] d\tau - \int_t^{t+\Delta t} \Big[ \nabla V_i^{j\top} G_i\big(u_i - \hat{u}_i^j\big) + \nabla V_i^{j\top} G_i\big(v_i - \hat{v}_i^j\big) \Big] d\tau \\ &= \int_t^{t+\Delta t} \Big( \xi_i^{\top}\hat{Q}_i\xi_i + \hat{u}_i^{j\top}\hat{R}_i\hat{u}_i^j - \eta_i^2\hat{v}_i^{j\top}\hat{v}_i^j \Big) d\tau + \int_t^{t+\Delta t} \Big( 2\hat{u}_i^{(j+1)\top}\hat{R}_i\big(u_i - \hat{u}_i^j\big) - 2\eta_i^2\hat{v}_i^{(j+1)\top}\big(v_i - \hat{v}_i^j\big) \Big) d\tau \\ &= \int_t^{t+\Delta t} \Big( \|\xi_i(\tau)\|_{\hat{Q}_i}^2 + \|\hat{u}_i^j(\tau)\|_{\hat{R}_i}^2 - \eta_i^2\|\hat{v}_i^j(\tau)\|^2 \Big) d\tau + 2\int_t^{t+\Delta t} \Big( \hat{u}_i^{(j+1)\top}(\tau)\hat{R}_i\tilde{u}_i^j(\tau) - \eta_i^2\hat{v}_i^{(j+1)\top}(\tau)\tilde{v}_i^j(\tau) \Big) d\tau. \end{aligned}$$
This equation holds for any Δ t > 0 and is model-free because it does not explicitly depend on F i or G i , relying only on measurable signals ξ i , u ^ i , and v ^ i .
Starting from an initially stabilizing control policy, the proposed algorithm first computes V i j ( ξ i ( t ) ) by solving the partial differential Equation (14), followed by computing u i j + 1 and v i j + 1 by using (27). Repeat the above-mentioned processes until the difference between V i j ( ξ i ( t ) ) and V i j 1 ( ξ i ( t ) ) is sufficiently small.
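The iteration just described has the standard policy-iteration structure: evaluate the current value estimate, improve the policies via (27), and repeat. The following is a minimal sketch of that loop (Python; `evaluate_critic` and `improve_policies` are hypothetical placeholders, with `evaluate_critic` standing in for the data-driven $R^3F^2$ regression developed below, not code from the paper):

```python
import numpy as np

def irl_policy_iteration(evaluate_critic, improve_policies, W_c0, tol=1e-4, max_iters=50):
    """Generic PI skeleton: evaluate the critic, improve the actor/disturbance policies, repeat."""
    W_c = np.asarray(W_c0, dtype=float)
    for _ in range(max_iters):
        W_c_new = evaluate_critic(W_c)            # policy evaluation step (e.g., via the regression in (36))
        improve_policies(W_c_new)                 # policy improvement step via Eq. (27)
        converged = np.linalg.norm(W_c_new - W_c) < tol   # compare successive critic weights
        W_c = W_c_new
        if converged:
            break
    return W_c
```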
Based on the Stone–Weierstrass approximation theorem [39], the jth iterative approximate expressions of V i ( ξ i ) , u i , and v i can be represented by
$$\begin{bmatrix} V_i^j(\xi_i) \\ u_i^{j+1} \\ v_i^{j+1} \end{bmatrix} = \Phi_i^{\top}(\xi_i)\,W_i^j + \epsilon_i^j \in \mathbb{R}^7$$
where $\Phi_i(\xi_i) \in \mathbb{R}^{(n_i + 3m_i + 3l_i)\times 7}$ is a block matrix given by
$$\Phi_i(\xi_i) \triangleq \begin{bmatrix} \varphi_{ic}(\xi_i) & O_{n_i\times 3} & O_{n_i\times 3} \\ O_{3m_i\times 1} & I_3 \otimes \varphi_{ia}(\xi_i) & O_{3m_i\times 3} \\ O_{3l_i\times 1} & O_{3l_i\times 3} & I_3 \otimes \varphi_{id}(\xi_i) \end{bmatrix}$$
and $\epsilon_i^j \in \mathbb{R}^7$ is the approximation error vector. In $\Phi_i(\xi_i)$, the components $\varphi_{ic}(\xi_i) \in \mathbb{R}^{n_i}$, $\varphi_{ia}(\xi_i) \in \mathbb{R}^{m_i}$, and $\varphi_{id}(\xi_i) \in \mathbb{R}^{l_i}$ are activation functions, and $W_i^j \in \mathbb{R}^{n_i + 3m_i + 3l_i}$ represents the weight coefficient vector in the form of $W_i^j = [W_{ic}^{j\top}, \mathrm{vec}(W_{ia}^{j+1})^{\top}, \mathrm{vec}(W_{id}^{j+1})^{\top}]^{\top}$, with $W_{ic}^j \in \mathbb{R}^{n_i}$ being the weight vector of the critic NN, and $W_{ia}^{j+1} \in \mathbb{R}^{m_i\times 3}$ and $W_{id}^{j+1} \in \mathbb{R}^{l_i\times 3}$ being the weight matrices of the actor NN and the lumped disturbance NN, respectively. Note that $W_i^j$ is unknown and, thus, needs to be estimated. Denote $\hat{W}_{ic}^j$, $\hat{W}_{ia}^{j+1}$, and $\hat{W}_{id}^{j+1}$ as the estimates of $W_{ic}^j$, $W_{ia}^{j+1}$, and $W_{id}^{j+1}$, respectively. The corresponding approximations are then given by
$$\begin{bmatrix} \hat{V}_i^j(\xi_i) \\ \hat{u}_i^{j+1} \\ \hat{v}_i^{j+1} \end{bmatrix} = \Phi_i^{\top}(\xi_i)\,\hat{W}_i^j$$
where $\hat{W}_i^j = [\hat{W}_{ic}^{j\top}, \mathrm{vec}(\hat{W}_{ia}^{j+1})^{\top}, \mathrm{vec}(\hat{W}_{id}^{j+1})^{\top}]^{\top}$.
Assumption 1.
(1) 
The activation functions of the actor-critic-disturbance NNs and their gradients are bounded, i.e., $\|\varphi_{ia}\| \le b_{\varphi_{ia}}$, $\|\varphi_{ic}\| \le b_{\varphi_{ic}}$, $\|\varphi_{id}\| \le b_{\varphi_{id}}$, $\|\nabla\varphi_{ia}\| \le b_{d\varphi_{ia}}$, $\|\nabla\varphi_{ic}\| \le b_{d\varphi_{ic}}$, and $\|\nabla\varphi_{id}\| \le b_{d\varphi_{id}}$, where $b_{\varphi_{ia}}$, $b_{\varphi_{ic}}$, $b_{\varphi_{id}}$, $b_{d\varphi_{ia}}$, $b_{d\varphi_{ic}}$, and $b_{d\varphi_{id}}$ are positive constants.
(2) 
The ideal actor-critic-disturbance NN weights satisfy $\|W_{ic}^j\| \le W_{ic\max}^j$, $\|W_{ia}^{j+1}\| \le W_{ia\max}^{j+1}$, and $\|W_{id}^{j+1}\| \le W_{id\max}^{j+1}$, where $W_{ic\max}^j$, $W_{ia\max}^{j+1}$, and $W_{id\max}^{j+1}$ are positive constants.
(3) 
$\|\epsilon_i^j\| \le b_{\epsilon_i^j}$ and $\|\nabla\epsilon_i^j\| \le b_{d\epsilon_i^j}$, where $b_{\epsilon_i^j}$ and $b_{d\epsilon_i^j}$ are positive constants.
Let $\Delta\varphi_{ic} \triangleq \Delta\varphi_{ic}(\xi_i(t)) = \varphi_{ic}(\xi_i(t+\Delta t)) - \varphi_{ic}(\xi_i(t))$. According to (28) and (30), one has
$$\begin{aligned} 0 &= V_i^j(\xi_i(t+\Delta t)) - V_i^j(\xi_i(t)) + \int_t^{t+\Delta t} \Big( \|\xi_i(\tau)\|_{\hat{Q}_i}^2 + \|\hat{u}_i^j(\tau)\|_{\hat{R}_i}^2 - \eta_i^2\|\hat{v}_i^j(\tau)\|^2 \Big) d\tau + 2\int_t^{t+\Delta t} \Big( \hat{u}_i^{(j+1)\top}(\tau)\hat{R}_i\tilde{u}_i^j(\tau) - \eta_i^2\hat{v}_i^{(j+1)\top}(\tau)\tilde{v}_i^j(\tau) \Big) d\tau \\ &= \Delta\varphi_{ic}^{\top}(\xi_i(t))\,\hat{W}_{ic}^j + \int_t^{t+\Delta t} q_i\big(\xi_i(\tau), \hat{u}_i(\tau), \hat{v}_i(\tau)\big)\, d\tau + 2\int_t^{t+\Delta t} \Big( \hat{u}_i^{(j+1)\top}(\tau)\hat{R}_i\tilde{u}_i^j(\tau) - \eta_i^2\hat{v}_i^{(j+1)\top}(\tau)\tilde{v}_i^j(\tau) \Big) d\tau. \end{aligned}$$
By using (27) we can obtain the following:
$$\hat{u}_i^{(j+1)\top}(\tau)\hat{R}_i\tilde{u}_i^j(\tau) = \Big[ \big(\tilde{u}_i^{j\top}(\tau)\hat{R}_i\big) \otimes \varphi_{ia}^{\top}(\xi_i(\tau)) \Big]\, \mathrm{vec}(\hat{W}_{ia}^{j+1}), \qquad \hat{v}_i^{(j+1)\top}(\tau)\tilde{v}_i^j(\tau) = \Big[ \tilde{v}_i^{j\top}(\tau) \otimes \varphi_{id}^{\top}(\xi_i(\tau)) \Big]\, \mathrm{vec}(\hat{W}_{id}^{j+1}).$$
Substituting (32) into (31) yields
$$C_{i\xi}^j(t) = \Xi_i^{j\top}(t)\,\hat{W}_i^j$$
where $\hat{W}_i^j = [\hat{W}_{ic}^{j\top}, \mathrm{vec}(\hat{W}_{ia}^{j+1})^{\top}, \mathrm{vec}(\hat{W}_{id}^{j+1})^{\top}]^{\top} \in \mathbb{R}^{n_i+3m_i+3l_i}$, and $\Xi_i^j(t) \in \mathbb{R}^{n_i+3m_i+3l_i}$ and $C_{i\xi}^j(t) \in \mathbb{R}$ are respectively expressed by
$$\Xi_i^j(t) = \Big[ C_{ic}^j(t),\ C_{ia}^j(t),\ C_{id}^j(t) \Big]^{\top}, \qquad C_{i\xi}^j(t) = -\int_t^{t+\Delta t} q_i\big(\xi_i(\tau), \hat{u}_i(\tau), \hat{v}_i(\tau)\big)\, d\tau.$$
In (34), $C_{ic}^j(t) = \Delta\varphi_{ic}^{\top}(\xi_i(t)) \in \mathbb{R}^{1\times n_i}$, $C_{ia}^j(t) \in \mathbb{R}^{1\times 3m_i}$, and $C_{id}^j(t) \in \mathbb{R}^{1\times 3l_i}$ are respectively defined as
$$C_{ia}^j(t) = 2\int_t^{t+\Delta t} \big(\tilde{u}_i^{j\top}(\tau)\hat{R}_i\big) \otimes \varphi_{ia}^{\top}(\xi_i(\tau))\, d\tau, \qquad C_{id}^j(t) = -2\eta_i^2\int_t^{t+\Delta t} \tilde{v}_i^{j\top}(\tau) \otimes \varphi_{id}^{\top}(\xi_i(\tau))\, d\tau.$$
Note that (33) is a linear equation in which $C_{i\xi}^j(t)$ denotes the measurement and $\Xi_i^j(t)$ the regressor at time $t$, and $\hat{W}_i^j$ is the unknown vector to be estimated. The value of $C_{i\xi}^j(t)$ and the data in $\Xi_i^j(t)$ are obtained from a continuous-time process and are available at the sample times $k\Delta t$, where $\Delta t$ is the sample interval. To compute the estimate of $\hat{W}_i^j$, we first construct two datasets, $\mathcal{C}_i^j = \{ C_{i\xi|0}^j, C_{i\xi|1}^j, \ldots, C_{i\xi|k}^j \}$ and $\mathcal{E}_i^j = \{ \Xi_{i|0}^j, \Xi_{i|1}^j, \ldots, \Xi_{i|k}^j \}$, by utilizing the data produced by the behavior policy over the time window $\ell = 0, 1, \ldots, k$.
Then, we can use the R 3 F 2 scheme to find the optimal estimate W ^ i | k j of W ^ i j by solving the following optimization problem:
Problem 3.
$$\min_{\hat{W}_i^j}\ Y_{ik}(\hat{W}_i^j) = \frac{1}{2}\rho_i^k \big\| \hat{W}_i^j - \hat{W}_{i|0}^j \big\|_{\bar{P}_{i0}^j}^2 + \frac{1}{2}\sum_{\ell=0}^{k} \rho_i^{k-\ell}\big(\varsigma_{i|\ell}^j\big)^2 \qquad \text{s.t.} \quad \varsigma_{i|\ell}^j = C_{i\xi|\ell}^j - \Xi_{i|\ell}^{j\top}\hat{W}_i^j, \quad \ell = 0, 1, \ldots, k$$
where $\rho_i \in (0, 1]$ represents the forgetting factor, $\hat{W}_{i|0}^j \in \mathbb{R}^{n_i+3m_i+3l_i}$ indicates the initial estimate of $\hat{W}_i^j$, $P_{i0}^j \in \mathbb{R}^{(n_i+3m_i+3l_i)\times(n_i+3m_i+3l_i)}$ is positive definite, and $\bar{P}_{i0}^j$ denotes the inverse matrix, $(P_{i0}^j)^{-1}$, of $P_{i0}^j$ for brevity.
Remark 1.
In most of the existing works (e.g., [27,28,40]), the LS scheme is applied to approximate the solution to (33), in which random or sinusoidal function noises are usually added to the behavior policy to ensure that the matrix Ξ i | 0 j , Ξ i | 1 j , , Ξ i | k j has full column rank. However, the selection of rich input signals is usually experience-based. In Problem 3, the exponentially weighted mechanism is used to apply greater weights to the recent data, and the weighted vector norm regularization term 1 2 ρ i k W ^ i j W ^ i | 0 j P i 0 j 2 is designed to address the numerical instability of the matrix inversion and subsequently produce lower variance models.
We can obtain the optimal solution to Problem 3 as
$$\hat{W}_{i|k}^j = P_{i|k}^j\, S_{i|k}^j$$
where $P_{i|k}^j$ and $S_{i|k}^j$ are given by
$$P_{i|k}^j = \left[ \sum_{\ell=0}^{k} \rho_i^{k-\ell}\,\Xi_{i|\ell}^j\Xi_{i|\ell}^{j\top} + \rho_i^k\,\bar{P}_{i0}^j \right]^{-1}, \qquad S_{i|k}^j = \sum_{\ell=0}^{k} \rho_i^{k-\ell}\, C_{i\xi|\ell}^j\,\Xi_{i|\ell}^j + \rho_i^k\,\bar{P}_{i0}^j\,\hat{W}_{i|0}^j.$$
Then, for all $k \in \mathbb{N}_0$, we can obtain the update rule for $\hat{W}_{i|k}^j$ as follows:
$$\hat{W}_{i|k}^j = \hat{W}_{i|(k-1)}^j + P_{i|k}^j\,\Xi_{i|k}^j\,\bar{\varsigma}_{i|k}$$
where
$$\bar{\varsigma}_{i|k} = C_{i\xi|k}^j - \Xi_{i|k}^{j\top}\hat{W}_{i|(k-1)}^j, \qquad P_{i|k}^j = \rho_i^{-1}\left( P_{i|(k-1)}^j - \frac{P_{i|(k-1)}^j\,\Xi_{i|k}^j\,\Xi_{i|k}^{j\top}\,P_{i|(k-1)}^j}{\sigma_{i|k}^j} \right)$$
with $\sigma_{i|k}^j = \rho_i + \Xi_{i|k}^{j\top}\,P_{i|(k-1)}^j\,\Xi_{i|k}^j$.
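To make the recursion (36)-(37) concrete, the following is a minimal sketch (Python/NumPy; the class and variable names are ours) of a recursive Ridge-regression update with a forgetting factor. At each sample time it consumes the scalar measurement $C_{i\xi|k}^j$ and the regressor vector $\Xi_{i|k}^j$, and it never forms the full batch least-squares problem:

```python
import numpy as np

class RecursiveRidgeFF:
    """Recursive Ridge regression with a forgetting factor for the model C_k = Xi_k^T W."""

    def __init__(self, W0, P0, rho=0.98):
        self.W = np.asarray(W0, dtype=float)       # current weight estimate
        self.P = np.asarray(P0, dtype=float)       # covariance-like matrix, initialized positive definite
        self.rho = rho                              # forgetting factor in (0, 1]

    def update(self, Xi, C):
        """One step of (36)-(37): innovation, covariance update, and weight correction."""
        Xi = np.asarray(Xi, dtype=float)
        sigma = self.rho + Xi @ self.P @ Xi                         # scalar sigma_k
        innovation = C - Xi @ self.W                                # prediction error on the new sample
        PXi = self.P @ Xi
        self.P = (self.P - np.outer(PXi, PXi) / sigma) / self.rho   # rank-1 covariance update
        self.W = self.W + self.P @ Xi * innovation                  # weight correction
        return self.W

# Toy usage on synthetic noiseless data: the estimate converges to the true weights.
rng = np.random.default_rng(0)
W_true = rng.standard_normal(5)
est = RecursiveRidgeFF(W0=np.zeros(5), P0=100.0 * np.eye(5))
for _ in range(200):
    Xi = rng.standard_normal(5)
    est.update(Xi, Xi @ W_true)
print(np.linalg.norm(est.W - W_true))   # small residual
```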
Remark 2.
The computational requirements of R 3 F 2 are primarily determined by n i + 3 m i + 3 l i . Since P i | k j is of size ( n i + 3 m i + 3 l i ) × ( n i + 3 m i + 3 l i ) , the computational requirement for updating P i | k j given by (37) is of O ( ( n i + 3 m i + 3 l i ) 2 ) . Moreover, the σ i | k j in (37) is of O ( 1 ) , which is much less demanding than the matrix inverse operation required by the LS algorithm. In addition, the storage requirements of the recursive R 3 F 2 are of O ( ( n i + 3 m i + 3 l i ) 2 ) , which does not grow with k. Thus, the computational and memory requirements of the recursive R 3 F 2 are significantly less than those of LS.
The implementation procedure of the proposed R 3 F 2 -based IRL approach is summarized in Algorithm 1. For a fixed controller u i and the arbitrary uncertainty signals v i , the proposed algorithm can simultaneously learn estimates of the value function V i j ( ξ i ) , control policy u i j + 1 , and uncertainties v i j + 1 without applying the knowledge of the UAV dynamics.
Algorithm 1:  R 3 F 2 -based IRL for zero-sum differential games
    Drones 09 00477 i001

4.2. Convergence and Stability Analysis

The following results show that the NN weight tuning laws guarantee the proposed R 3 F 2 -based IRL algorithm converges to the Nash equilibrium strategy, while ensuring the stability of the closed-loop system.
Define $\tilde{\hat{W}}_{i|k}^j = \hat{W}_i^j - \hat{W}_{i|k}^j$ and $\tilde{W}_{i|k}^j = W_i^j - \hat{W}_{i|k}^j$.
Then, it follows from (36) that $\tilde{\hat{W}}_{i|k}^j$ and $\tilde{W}_{i|k}^j$ can be written as
$$\tilde{\hat{W}}_{i|k}^j = \Big( I - P_{i|k}^j\,\Xi_{i|k}^j\,\Xi_{i|k}^{j\top} \Big)\,\tilde{\hat{W}}_{i|(k-1)}^j,$$
$$\tilde{W}_{i|k}^j = \tilde{W}_{i|(k-1)}^j - P_{i|k}^j\,\Xi_{i|k}^j\,\Xi_{i|k}^{j\top}\,\tilde{\hat{W}}_{i|(k-1)}^j.$$
Definition 2.
The sequence $\{\Xi_{i|k}^j\}_{k=0}^{\infty}$ is said to be persistently exciting (PE) if there exist $N \in \mathbb{N}_0$ and $\beta_1, \beta_2 \in \mathbb{R}_+$ such that, for all $\ell \in \mathbb{N}_0$,
$$\beta_1 I \le \sum_{l=\ell}^{\ell+N} \Xi_{i|l}^j\,\Xi_{i|l}^{j\top} \le \beta_2 I < \infty.$$
Lemma 1.
Suppose that $\{\Xi_{i|k}^j\}_{k=0}^{\infty}$ is PE. Let $N$, $\beta_1$, and $\beta_2$ be given by Definition 2, $\rho_i \in (0, 1]$, and $P_{i|k}^j$ be given by (37). Then, for all $k \ge N+1$,
$$0 < \bar{\beta}_1 I \le \big(P_{i|k}^j\big)^{-1} \le \bar{\beta}_2 I + \big(P_{i|N}^j\big)^{-1} < \infty$$
with $\bar{\beta}_1 = \frac{\rho_i^N(1-\rho_i)}{1-\rho_i^{N+1}}\beta_1$ and $\bar{\beta}_2 = \frac{1}{1-\rho_i^{N+1}}\beta_2$.
Theorem 2
(Convergence of the $R^3F^2$-Based IRL Algorithm). Let (36) be the update rule of the actor-critic-disturbance NN weights and suppose that $\{\Xi_{i|k}^j\}_{k=0}^{\infty}$ is PE. Then, for any initial condition $\hat{W}_{i|0}^j$, the estimation error $\tilde{\hat{W}}_{i|k}^j$ converges exponentially to the origin, i.e., there exists $\gamma_0 \in \mathbb{R}_+$ such that, for all $k \in \mathbb{N}_0$,
$$\big\| \tilde{\hat{W}}_{i|k}^j \big\|^2 \le \gamma_0\,\rho_i^k\,\big\| \tilde{\hat{W}}_{i|0}^j \big\|^2$$
and $\tilde{W}_{i|k}^j$ is uniformly ultimately bounded (UUB).
Proof. 
Define the Lyapunov function candidate as
$$\hat{L}_{i|k}^j = \tilde{\hat{W}}_{i|k}^{j\top}\,\bar{P}_{i|k}^j\,\tilde{\hat{W}}_{i|k}^j, \qquad \bar{P}_{i|k}^j := \big(P_{i|k}^j\big)^{-1}.$$
Since
$$\bar{P}_{i|k}^j = \rho_i\,\bar{P}_{i|(k-1)}^j + \Xi_{i|k}^j\,\Xi_{i|k}^{j\top},$$
using (42) and (38a), one has
$$\Delta\hat{L}_{i|k}^j := \hat{L}_{i|k}^j - \hat{L}_{i|(k-1)}^j = -\tilde{\hat{W}}_{i|(k-1)}^{j\top}\,\Xi_{i|k}^j\Big[ 1 - \Xi_{i|k}^{j\top}P_{i|k}^j\Xi_{i|k}^j \Big]\Xi_{i|k}^{j\top}\,\tilde{\hat{W}}_{i|(k-1)}^j + (\rho_i - 1)\,\tilde{\hat{W}}_{i|(k-1)}^{j\top}\,\bar{P}_{i|(k-1)}^j\,\tilde{\hat{W}}_{i|(k-1)}^j.$$
Moreover, it follows from (42) and ρ i ( 0 , 1 ] that
$$\bar{P}_{i|k}^j \ge \rho_i\,\bar{P}_{i|(k-1)}^j \ge \rho_i^k\,\bar{P}_{i|0}^j > 0.$$
Thus, for all $k \in \mathbb{N}_0$, $\hat{L}_{i|k}^j \ge \lambda_{\min}(\bar{P}_{i|0}^j)\,\rho_i^k\,\big\|\tilde{\hat{W}}_{i|k}^j\big\|^2$.
Denote $a = \Xi_{i|k}^{j\top}\,P_{i|(k-1)}^j\,\Xi_{i|k}^j > 0$. Using (37), we can obtain
$$1 - \Xi_{i|k}^{j\top}P_{i|k}^j\Xi_{i|k}^j = \frac{\rho_i}{\rho_i + a} > 0.$$
Combining (45) with (43) leads to
$$\hat{L}_{i|k}^j - \hat{L}_{i|(k-1)}^j \le (\rho_i - 1)\,\hat{L}_{i|(k-1)}^j \le 0$$
which in turn gives
$$\hat{L}_{i|k}^j \le \rho_i\,\hat{L}_{i|(k-1)}^j \le \rho_i^k\,\hat{L}_{i|0}^j = \rho_i^k\,\tilde{\hat{W}}_{i|0}^{j\top}\,\bar{P}_{i|0}^j\,\tilde{\hat{W}}_{i|0}^j.$$
If $\{\Xi_{i|k}^j\}_{k=0}^{\infty}$ is PE, Lemma 1 holds. Therefore, combining (47) with (39) and (41) leads to, for all $k \ge N+1$,
$$\big\|\tilde{\hat{W}}_{i|k}^j\big\|^2 \le \frac{1}{\bar{\beta}_1}\cdot\frac{\lambda_{\max}(\bar{P}_{i|0}^j)}{\lambda_{\min}(\bar{P}_{i|0}^j)}\,\rho_i^k\,\big\|\tilde{\hat{W}}_{i|0}^j\big\|^2.$$
Denoting $\gamma_0 = \frac{1}{\bar{\beta}_1}\cdot\frac{\lambda_{\max}(\bar{P}_{i|0}^j)}{\lambda_{\min}(\bar{P}_{i|0}^j)}$, one can obtain (40).
Next, consider the Lyapunov function
$$L_{i|k}^j = \tilde{W}_{i|k}^{j\top}\,\bar{P}_{i|k}^j\,\tilde{W}_{i|k}^j.$$
Substituting (37) and (38b) into (49) yields
$$L_{i|k}^j - L_{i|(k-1)}^j = \tilde{W}_{i|(k-1)}^{j\top}\Big[ \bar{P}_{i|k}^j - \bar{P}_{i|(k-1)}^j \Big]\tilde{W}_{i|(k-1)}^j - 2\,\tilde{W}_{i|(k-1)}^{j\top}\,\Xi_{i|k}^j\Xi_{i|k}^{j\top}\,\tilde{\hat{W}}_{i|(k-1)}^j + \kappa_p\,\tilde{\hat{W}}_{i|(k-1)}^{j\top}\,\Xi_{i|k}^j\Xi_{i|k}^{j\top}\,\tilde{\hat{W}}_{i|(k-1)}^j$$
where $\kappa_p = \Xi_{i|k}^{j\top}P_{i|k}^j\Xi_{i|k}^j$. Let $b_{\tilde{\hat{W}}_{i|k}^j} = \big(\gamma_0\rho_i^k\big)^{\frac{1}{2}}\big\|\tilde{\hat{W}}_{i|0}^j\big\|$. Using (39) and (44), we can obtain
$$L_{i|k}^j - L_{i|(k-1)}^j \le -M_1\big\|\tilde{W}_{i|(k-1)}^j\big\|^2 + 2M_2\big\|\tilde{W}_{i|(k-1)}^j\big\| + C_0$$
where M 1 = ρ i k 1 λ min ( P i | 0 j ) β ¯ 2 λ min ( P N 1 ) , M 2 = b W ^ ˜ i | k j ( β ¯ 2 + λ max ( P N 1 ) + ρ i k λ max ( P i | 0 j ) ) , and C 0 = κ p ( β ¯ 2 + λ max ( P N 1 ) ρ i k λ min ( P i | 0 j ) ) b W ^ ˜ i | k j 2 .
Completing the squares, we get $L_{i|k}^j - L_{i|(k-1)}^j < 0$ if $\big\|\tilde{W}_{i|(k-1)}^j\big\| > \frac{M_2}{M_1} + \sqrt{\frac{M_2^2}{M_1^2} + \frac{C_0}{M_1}} \triangleq b_{\tilde{W}_{i|k}^j}$.
According to the standard Lyapunov extension theorem, the analysis above demonstrates that the weight estimation error W ˜ i | k j is UUB. □
Theorem 3
(Stability of the tracking error subsystems). Consider the tracking error subsystems described in (8). Let the approximations of the value function, the optimal control law, and the worst-case lumped disturbance be given by (30), and let the actor-critic-disturbance NN weight tuning law be given by (36). Then, for all $\xi_i(t_0) = \xi_{i0}$ and some $T > 0$, there exists $\varrho_i \in \mathbb{R}_{\ge 0}$ such that the tracking error state $\xi_i$ is UUB, i.e.,
$$\|\xi_i(t)\| \le \sqrt{\frac{\varrho_i}{\lambda_{\min}(\hat{Q}_i)}} \quad \text{for all } t \ge T.$$
Proof. 
We consider the Lyapunov function V i j ( ξ i ( t ) ) given in (29), which is the approximate solution to (17). Taking the time derivative of V i j ( ξ i ( t ) ) along the trajectory generated by u ^ i j ( t ) and v ^ i j ( t ) yields
$$\dot{V}_i^j = \nabla V_i^{j\top}\Big[ F_i\xi_i + G_i\big(\hat{u}_i^j + \hat{v}_i^j\big) \Big].$$
Since
$$\|\xi_i(t)\|_{\hat{Q}_i}^2 - \frac{1}{4}\nabla V_i^{j\top}G_i\hat{R}_i^{-1}G_i^{\top}\nabla V_i^j + \nabla V_i^{j\top}F_i\xi_i(t) + \frac{1}{4\eta_i^2}\nabla V_i^{j\top}G_iG_i^{\top}\nabla V_i^j = 0$$
subtracting (52) from (51) yields
$$\dot{V}_i^j = \nabla V_i^{j\top}G_i\big(\hat{u}_i^j + \hat{v}_i^j\big) - \|\xi_i(t)\|_{\hat{Q}_i}^2 + \frac{1}{4}\big\|G_i^{\top}\nabla V_i^j\big\|_{\hat{R}_i^{-1}}^2 - \frac{1}{4\eta_i^2}\big\|G_i^{\top}\nabla V_i^j\big\|^2.$$
From (16), it can be derived that
$$\nabla V_i^{j\top}G_i u_i^j = -\frac{1}{2}\big\|G_i^{\top}\nabla V_i^j\big\|_{\hat{R}_i^{-1}}^2.$$
Adding (54) to (53) yields
$$\dot{V}_i^j = -\|\xi_i(t)\|_{\hat{Q}_i}^2 + \Sigma_1 - \Sigma_2 - \Sigma_3$$
where $\Sigma_1 = \nabla V_i^{j\top}G_i\big(\hat{u}_i^j - u_i^j + \hat{v}_i^j\big)$, $\Sigma_2 = \frac{1}{4}\big\|G_i^{\top}\nabla V_i^j\big\|_{\hat{R}_i^{-1}}^2$, and $\Sigma_3 = \frac{1}{4\eta_i^2}\big\|G_i^{\top}\nabla V_i^j\big\|^2$.
Since Σ 2 > 0 and Σ 3 > 0 , one has
$$\dot{V}_i^j \le -\|\xi_i(t)\|_{\hat{Q}_i}^2 + \Sigma_1.$$
According to Assumption 1 and the expression of G i , one has
$$\Sigma_1 \le \big(b_{d\varphi_{ic}}W_{ic\max}^j + b_{d\epsilon_i^j}\big)\,\|G_i\|\,\big(b_{\epsilon_i^j} + \|\hat{v}_i^j\|\big).$$
As $\hat{W}_i^j = \tilde{\hat{W}}_{i|k}^j + W_i^j - \tilde{W}_{i|k}^j$, by taking norms, we get
$$\big\|\hat{W}_i^j\big\| \le \big\|\tilde{\hat{W}}_{i|k}^j\big\| + \big\|\tilde{W}_{i|k}^j\big\| + \big\|W_i^j\big\| \le \big(\gamma_0\rho_i^k\big)^{\frac{1}{2}}\big\|\tilde{\hat{W}}_{i|0}^j\big\| + b_{\tilde{W}_{i|k}^j} + b_{W_i^j} \triangleq b_{\hat{W}_i^j}$$
where $b_{W_i^j} \triangleq \max\{W_{ic\max}^j, W_{ia\max}^{j+1}, W_{id\max}^{j+1}\}$. Let $b_{\Phi_i} = \max\{b_{\varphi_{ia}}, b_{\varphi_{ic}}, b_{\varphi_{id}}\}$. Then, it follows from (30) that
$$\big\|\hat{v}_i^j\big\| \le \|\Phi_i(\xi_i)\|\,\big\|\hat{W}_i^j\big\| \le b_{\Phi_i}\,b_{\hat{W}_i^j} \triangleq b_{\hat{v}_i^j}.$$
Taking into account (56) and (57), (55) becomes
$$\dot{V}_i^j \le -\|\xi_i(t)\|^2\,\lambda_{\min}(\hat{Q}_i) + \varrho_i$$
where $\varrho_i = \big(b_{d\varphi_{ic}}W_{ic\max}^j + b_{d\epsilon_i^j}\big)\,\|G_i\|\,\big(b_{\epsilon_i^j} + b_{\hat{v}_i^j}\big)$. It follows that $\dot{V}_i^j < 0$ if $\|\xi_i\| > \sqrt{\varrho_i/\lambda_{\min}(\hat{Q}_i)}$. By the standard Lyapunov extension theorem, it is proven that $\xi_i$ is UUB. □
Theorem 4
(Nash Equilibrium of the Game). Suppose that the hypotheses in Theorems 2 and 3 hold. Then, H i ( ξ i , u ^ i j , v ^ i j , V ^ i j , t ) is UUB with V ^ i j , u ^ i j , and v ^ i j being given by (30). Moreover, ( u ^ i j , v ^ i j ) converges to the Nash equilibrium solution ( u i * , v i * ) of game G i , i = 1 , 2 .
Proof. 
First, the approximate coupled HJB equation is
$$H_i(\xi_i, \hat{u}_i^j, \hat{v}_i^j, \hat{V}_i^j, t) = \hat{W}_{ic}^{j\top}\nabla\varphi_{ic}(\xi_i)F_i\xi_i + \|\xi_i(t)\|_{\hat{Q}_i}^2 + \frac{1}{4\eta_i^2}\hat{W}_{ic}^{j\top}\nabla\varphi_{ic}(\xi_i)G_iG_i^{\top}\nabla\varphi_{ic}^{\top}(\xi_i)\hat{W}_{ic}^j - \frac{1}{4}\hat{W}_{ic}^{j\top}\nabla\varphi_{ic}(\xi_i)G_i\hat{R}_i^{-1}G_i^{\top}\nabla\varphi_{ic}^{\top}(\xi_i)\hat{W}_{ic}^j$$
Subtracting H i ( ξ i , u i j , v i j , V i j , t ) = 0 from the right-hand side of (59) yields
$$H_i(\xi_i, \hat{u}_i^j, \hat{v}_i^j, \hat{V}_i^j, t) \le \frac{1}{4}\big\|\nabla\varphi_{ic}(\xi_i)G_i\big\|_{\hat{R}_i^{-1}}^2\big\|W_{ic}^j\big\|^2 + \frac{1}{4\eta_i^2}\big\|\nabla\varphi_{ic}(\xi_i)G_i\big\|^2\big\|\hat{W}_{ic}^j\big\|^2 + \Big(\big\|\hat{W}_{ic}^j\big\| + \big\|W_{ic}^j\big\|\Big)\big\|\nabla\varphi_{ic}(\xi_i)F_i\xi_i\big\|.$$
All terms on the right-hand side of (60) are UUB. Hence, H i ( ξ i , u ^ i j , v ^ i j , V ^ i j , t ) is UUB and the convergence of the approximate HJI is achieved.
Furthermore, from Theorem 2, $\hat{u}_i - u_i^*$ and $\hat{v}_i - v_i^*$ are UUB due to the fact that
$$\big\|\hat{W}_{ia} - W_{ia}\big\| \le \big\|\hat{W}_{ia}\big\| + \big\|W_{ia}\big\| \le b_{\hat{W}_i^j} + W_{ia\max}^{j+1}, \qquad \big\|\hat{W}_{id} - W_{id}\big\| \le \big\|\hat{W}_{id}\big\| + \big\|W_{id}\big\| \le b_{\hat{W}_i^j} + W_{id\max}^{j+1}.$$
Therefore, the pair ( u ^ i , v ^ i ) gives the approximate Nash equilibrium solution to game G i . This completes the proof. □

5. Simulation Results

In this section, the performance of the proposed R 3 F 2 -based off-policy IRL scheme will be verified via several simulated examples, where the performance of the gradient descent RL-based controller is also given for the purposes of control performance comparison. To ensure the simulation setting closely follows real-world conditions, we conduct our tracking examples in a virtual experiment platform called the Robot Operating System (ROS) and the Gazebo robotics simulation environment. The virtual UAV tracking scenario is shown in Figure 2, which features a UAV and a bay land.
The physical parameters in UAV dynamics are listed in Table 1. The aerodynamic damping coefficients k i , i = 1 , , 6 are set as k 1 = 0.01 , k 2 = 0.01 , k 3 = 0.01 , k 4 = 0.015 , k 5 = 0.015 , and k 6 = 0.015 . The anti-torque coefficient is chosen as c = 0.05 m . The external disturbances acting on the position and the attitude subsystems are defined as
$$d_{\zeta} = d_{\vartheta} = \begin{cases} a(t), & t \in \Gamma \\ 0, & \text{otherwise} \end{cases}$$
where
$$a(t) = \big[0.2\sin(0.1t),\ 0.2\cos(0.1t),\ 0.1\big]^{\top}$$
and
$$\Gamma = [T_{sim}/8,\ T_{sim}/6] \cup [T_{sim}/4,\ T_{sim}/3] \cup [T_{sim}/2,\ 3T_{sim}/5] \cup [5T_{sim}/6,\ 7T_{sim}/8]$$
with T s i m = 150 s being the total simulation time.
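For reproducibility of the disturbance profile, a small sketch (Python/NumPy; names are ours) of the windowed signal defined above:

```python
import numpy as np

T_SIM = 150.0
WINDOWS = [(T_SIM/8, T_SIM/6), (T_SIM/4, T_SIM/3), (T_SIM/2, 3*T_SIM/5), (5*T_SIM/6, 7*T_SIM/8)]

def disturbance(t: float) -> np.ndarray:
    """External disturbance d_zeta(t) = d_theta(t): active only inside the union of windows Gamma."""
    if any(lo <= t <= hi for lo, hi in WINDOWS):
        return np.array([0.2*np.sin(0.1*t), 0.2*np.cos(0.1*t), 0.1])
    return np.zeros(3)
```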
The parameters in the value function are set as Q ^ 1 = Q ^ 2 = 1.2 I 6 , R ^ 1 = R ^ 2 = 1.4 I 3 , and η 1 2 = η 2 2 = 5 . The reference trajectories of the position and yaw angle are selected as ζ r = [ 10 sin ( t 10 ) , 10 sin ( t 20 ) , 8 ] and ψ r = cos ( 0.5 t ) . The initial states of the UAV are ζ ( 0 ) = [ 2.5 , 2.5 , 0 ] , ζ ˙ ( 0 ) = [ 0 , 0 , 0 ] , ϑ ( 0 ) = [ 1 , 1 , 1 ] , and ϑ ˙ ( 0 ) = [ 0 , 0 , 0 ] . The sampling time interval Δ t is set as 0.01 s . Furthermore, the activation functions for the critic NNs are chosen as the quadratic polynomials with the structure of 6-21-1. For both the actor and the lumped disturbance NNs, the activation functions are chosen as first-order polynomials with the structure of 6-6-3.
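The "6-21-1" critic structure quoted above corresponds to the 21 distinct quadratic monomials of the 6-dimensional error state (i.e., the upper-triangular entries of $\xi_i\xi_i^{\top}$). A minimal sketch of such a basis (Python/NumPy; this construction is our interpretation of "quadratic polynomial activation functions", not code from the paper):

```python
import numpy as np

def quadratic_critic_basis(xi: np.ndarray) -> np.ndarray:
    """All products xi[a]*xi[b] with a <= b; for a 6-dimensional state this yields 21 features."""
    n = xi.size
    return np.array([xi[a] * xi[b] for a in range(n) for b in range(a, n)])

phi = quadratic_critic_basis(np.arange(1.0, 7.0))
print(phi.size)   # 21
```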
Figure 3 shows the trajectories of the UAV in the x y plane while tracking a white-colored figure-eight trajectory over the overall simulation time. The instantaneous positions of the UAV can be found in the subfigures, illustrating successful tracking performance under external disturbance. In Figure 4, Figure 5 and Figure 6, we demonstrate the simulated results of the proposed approach. Specifically, Figure 4 and Figure 5 show the trajectory tracking performance under the proposed tracking control framework. It can be seen from Figure 5 that the positions and attitudes successfully track the given reference. Moreover, the convergence results of the actor-critic-disturbance NN weights are displayed in Figure 6. It can be seen that the convergence is achieved after the 10th iteration, which verifies the theoretical result established in Theorem 2.
To further show the superiority of the proposed approach, we also compare its performance with that of the gradient descent RL-based controller. Figure 7 depicts the tracking errors of the subsystems under the gradient descent RL-based controllers and the learned robust controllers. One can see that the tracking errors converge exponentially to a small neighborhood of the origin, and the learned robust controllers achieve better performance than the RL-based controllers. In addition, the performance index is taken as the cost averaged over the whole simulation time, i.e., $\bar{J}_i = \frac{1}{T_{sim}}\int_0^{T_{sim}} q_i\big(\xi_i(t), u_i(t), v_i(t)\big)\, dt$ for $i = 1, 2$. The quantitative comparison is shown in Table 2. It can be seen that the performance of our method is significantly better than that of the gradient descent RL method in the position control loop, while the performance is almost the same in the attitude control loop. This difference is likely due to the higher nonlinearity in the position control subsystem. Generally, the attitude subsystem behaves more like a linear system, which explains why our method achieves almost identical performance to the gradient descent RL-based one there. In summary, our method successfully handles this nonlinearity (and coupling) with improved control performance.

6. Conclusions

In this paper, a novel robust optimal tracking control framework for a nonlinear UAV with unknown disturbances is proposed. By introducing a virtual control law, the UAV robust optimal tracking control problem is formulated as two independent zero-sum differential games. An R 3 F 2 -based IRL algorithm is presented to obtain the optimal control law and the corresponding worst-case disturbance in a data-driven manner. It is also shown that the learned control law and worst-case disturbance converge to the Nash equilibrium strategies. Finally, numerical simulation results are provided to illustrate the effectiveness of the proposed framework. In future work, this methodology will be extended to multi-agent systems or higher-dimensional dynamics and will address scalability challenges for large-scale differential games.

Author Contributions

Conceptualization, Y.G. and Q.S.; methodology, Y.G.; validation, Q.S. and Q.P.; investigation, Y.G.; writing—original draft preparation, Y.G.; writing—review and editing, Q.S.; visualization, Q.S.; supervision, Q.P.; project administration, Q.S. All authors have read and agreed to the published version of the manuscript.

Funding

This research was funded by the National Natural Science Foundation of China under grant 52301389, the Aeronautical Science Foundation of China under grant 2023Z023053001, and the National Key Laboratory of Underwater Information and Control.

Data Availability Statement

The original contributions presented in this study are included in the article; further inquiries can be directed to the corresponding author.

Acknowledgments

We sincerely thank the editors and anonymous reviewers for their valuable time and effort dedicated to the review process of our manuscript.

Conflicts of Interest

The authors declare no conflicts of interest.

References

  1. Zhou, X.; Wen, X.; Wang, Z.; Gao, Y.; Li, H.; Wang, Q.; Yang, T.; Lu, H.; Cao, Y.; Xu, C.; et al. Swarm of micro flying robots in the wild. Sci. Robot. 2022, 7, eabm5954. [Google Scholar] [CrossRef]
  2. Hua, H.; Fang, Y. A novel reinforcement learning-based robust control strategy for a quadrotor. IEEE Trans. Ind. Electron. 2022, 70, 2812–2821. [Google Scholar] [CrossRef]
  3. Wang, Y.; Lu, Q.; Ren, B. Wind turbine crack inspection using a quadrotor with image motion blur avoided. IEEE Robot. Autom. Lett. 2023, 8, 1069–1076. [Google Scholar] [CrossRef]
  4. Ma, Q.; Jin, P.; Lewis, F.L. Guaranteed cost attitude tracking control for uncertain quadrotor unmanned aerial vehicle under safety constraints. IEEE/CAA J. Autom. Sin. 2024, 11, 1447–1457. [Google Scholar] [CrossRef]
  5. Wei, Q.; Yang, Z.; Su, H.; Wang, L. Online Adaptive Dynamic Programming for Optimal Self-Learning Control of VTOL Aircraft Systems With Disturbances. IEEE Trans. Automat. Sci. Eng. 2024, 21, 343–352. [Google Scholar] [CrossRef]
  6. Tonan, M.; Bottin, M.; Doria, A.; Rosati, G. Analysis and design of a 3-DOF spatial underactuated differentially flat robot. In Proceedings of the 2025 11th International Conference on Mechatronics and Robotics Engineering (ICMRE), Milan, Italy, 27–29 February 2024; IEEE: New York, NY, USA, 2025; pp. 202–207. [Google Scholar]
  7. Wang, L.; Su, J. Robust disturbance rejection control for attitude tracking of an aircraft. IEEE Trans. Control Syst. Technol. 2015, 23, 2361–2368. [Google Scholar] [CrossRef]
  8. Zhou, Y.; Chen, M.; Jiang, C. Robust tracking control of uncertain mimo nonlinear systems with application to UAVs. IEEE/CAA J. Autom. Sin. 2015, 2, 25–32. [Google Scholar] [CrossRef]
  9. Dydek, Z.T.; Annaswamy, A.M.; Lavretsky, E. Adaptive control of quadrotor UAVs: A design trade study with flight evaluations. IEEE Trans. Control Syst. Technol. 2012, 21, 1400–1406. [Google Scholar] [CrossRef]
  10. Sun, S.; Romero, A.; Foehn, P.; Kaufmann, E.; Scaramuzza, D. A comparative study of nonlinear MPC and differential-flatness-based control for quadrotor agile flight. IEEE Trans. Robot. 2022, 38, 3357–3373. [Google Scholar] [CrossRef]
  11. Jiao, Q.; Modares, H.; Xu, S.; Lewis, F.L.; Vamvoudakis, K.G. Multi-agent zero-sum differential graphical games for disturbance rejection in distributed control. Automatica 2016, 69, 24–34. [Google Scholar] [CrossRef]
  12. Smolyanskiy, N.; Kamenev, A.; Smith, J.; Birchfield, S. Toward low-flying autonomous mav trail navigation using deep neural networks for environmental awareness. In Proceedings of the 2017 IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS), Vancouver, BC, Canada, 24–28 September 2017; IEEE: New York, NY, USA, 2017; pp. 4241–4247. [Google Scholar]
  13. Wang, Y.; Sun, J.; He, H.; Sun, C. Deterministic policy gradient with integral compensator for robust quadrotor control. IEEE Trans. Syst. Man, Cybern. Syst. 2019, 50, 3713–3725. [Google Scholar] [CrossRef]
  14. Petrlík, M.; Báča, T.; Heřt, D.; Vrba, M.; Krajník, T.; Saska, M. A robust uav system for operations in a constrained environment. IEEE Robot. Autom. Lett. 2020, 5, 2169–2176. [Google Scholar] [CrossRef]
  15. Guo, Y.; Sun, Q.; Wang, Y.; Pan, Q. Differential graphical game-based multi-agent tracking control using integral reinforcement learning. IET Control Theory Appl. 2024, 18, 2766–2776. [Google Scholar] [CrossRef]
  16. Rubí, B.; Morcego, B.; Pérez, R. A deep reinforcement learning approach for path following on a quadrotor. In Proceedings of the 2020 European Control Conference (ECC), Saint Petersburg, Russia, 12–15 May 2020; IEEE: New York, NY, USA, 2020; pp. 1092–1098. [Google Scholar]
  17. Ma, H.-J.; Xu, L.-X.; Yang, G.-H. Multiple environment integral reinforcement learning-based fault-tolerant control for affine nonlinear systems. IEEE Trans. Cybern. 2021, 51, 1913–1928. [Google Scholar] [CrossRef]
  18. Greatwood, C.; Richards, A.G. Reinforcement learning and model predictive control for robust embedded quadrotor guidance and control. Auton. Robot. 2019, 43, 1681–1693. [Google Scholar] [CrossRef]
  19. Mu, C.; Zhang, Y. Learning-based robust tracking control of quadrotor with time-varying and coupling uncertainties. IEEE Trans. Neural Netw. Learn. Syst. 2019, 31, 259–273. [Google Scholar] [CrossRef]
  20. Vrabie, D.; Pastravanu, O.; Abu-Khalaf, M.; Lewis, F.L. Adaptive optimal control for continuous-time linear systems based on policy iteration. Automatica 2009, 45, 477–484. [Google Scholar] [CrossRef]
  21. Wang, D.; He, H.; Mu, C.; Liu, D. Intelligent critic control with disturbance attenuation for affine dynamics including an application to a microgrid system. IEEE Trans. Ind. Electron. 2017, 64, 4935–4944. [Google Scholar] [CrossRef]
  22. Zhao, B.; Shi, G.; Wang, D. Asymptotically stable critic designs for approximate optimal stabilization of nonlinear systems subject to mismatched external disturbances. Neurocomputing 2020, 396, 201–208. [Google Scholar] [CrossRef]
  23. Mohammadi, M.; Arefi, M.M.; Setoodeh, P.; Kaynak, O. Optimal tracking control based on reinforcement learning value iteration algorithm for time-delayed nonlinear systems with external disturbances and input constraints. Inform. Sci. 2021, 554, 84–98. [Google Scholar] [CrossRef]
  24. Yang, X.; Gao, Z.; Zhang, J. Event-driven H control with critic learning for nonlinear systems. Neural Netw. 2020, 132, 30–42. [Google Scholar] [CrossRef]
  25. Song, R.; Lewis, F.L.; Wei, Q.; Zhang, H. Off-policy actor-critic structure for optimal control of unknown systems with disturbances. IEEE Trans. Cybern. 2015, 46, 1041–1050. [Google Scholar] [CrossRef]
  26. Yang, X.; Liu, D.; Luo, B.; Li, C. Data-based robust adaptive control for a class of unknown nonlinear constrained-input systems via integral reinforcement learning. Inform. Sci. 2016, 369, 731–747. [Google Scholar] [CrossRef]
  27. Modares, H.; Lewis, F.L.; Jiang, Z. H tracking control of completely unknown continuous-time systems via off-policy reinforcement learning. IEEE Trans. Neural Netw. Learn. Syst. 2015, 26, 2550–2562. [Google Scholar] [CrossRef] [PubMed]
  28. Luo, B.; Wu, H.; Huang, T. Off-policy reinforcement learning for H control design. IEEE Trans. Cybern. 2015, 45, 65–76. [Google Scholar] [CrossRef] [PubMed]
  29. Cui, X.; Zhang, H.; Luo, Y.; Jiang, H. Adaptive dynamic programming for tracking design of uncertain nonlinear systems with disturbances and input constraints. Int. J. Adapt. Control 2017, 31, 1567–1583. [Google Scholar] [CrossRef]
  30. Xiao, G.; Zhang, H.; Zhang, K.; Wen, Y. Value iteration based integral reinforcement learning approach for H controller design of continuous-time nonlinear systems. Neurocomputing 2018, 285, 51–59. [Google Scholar] [CrossRef]
  31. Zhao, W.; Liu, H.; Lewis, F.L. Robust formation control for cooperative underactuated quadrotors via reinforcement learning. IEEE Trans. Neural Netw. Learn. Syst. 2021, 32, 4577–4587. [Google Scholar] [CrossRef]
  32. Zhao, B.; Xian, B.; Zhang, Y.; Zhang, X. Nonlinear robust adaptive tracking control of a quadrotor UAV via immersion and invariance methodology. IEEE Trans. Ind. Electron. 2015, 62, 2891–2902. [Google Scholar] [CrossRef]
  33. Lee, T.; Leok, M.; McClamroch, N.H. Geometric tracking control of a quadrotor UAV on SE (3). In Proceedings of the 49th IEEE Conference on Decision and Control (CDC), Atlanta, GA, USA, 15–17 December 2010; pp. 5420–5425. [Google Scholar] [CrossRef]
  34. Cooke, R.D. Unmanned Aerial Vehicles: Design, Development and Deployment; John Wiley & Sons: Hoboken, NJ, USA, 2014. [Google Scholar]
  35. Wong, J.Y. Theory of Ground Vehicles; John Wiley & Sons: Hoboken, NJ, USA, 2001. [Google Scholar]
  36. Spong, M.W.; Hutchinson, S.; Vidyasagar, M. Robot Modeling and Control; John Wiley & Sons: Hoboken, NJ, USA, 2006. [Google Scholar]
  37. Corke, P. Robotics, Vision and Control: Fundamental Algorithms in MATLAB; Springer: Berlin/Heidelberg, Germany, 2017. [Google Scholar]
  38. Song, R.; Lewis, F.L. Robust optimal control for a class of nonlinear systems with unknown disturbances based on disturbance observer and policy iteration. Neurocomputing 2020, 390, 185–195. [Google Scholar] [CrossRef]
  39. Stone, M.H. The generalized weierstrass approximation theorem. Math. Mag. 1948, 21, 167–184. [Google Scholar] [CrossRef]
  40. Gao, W.; Jiang, Z.P. Adaptive dynamic programming and adaptive optimal output regulation of linear systems. IEEE Trans. Autom. Control 2016, 61, 4164–4169. [Google Scholar] [CrossRef]
Figure 1. Diagram of a UAV.
Figure 2. The ROS-/Gazebo-based virtual UAV tracking experimental scenarios.
Figure 3. The virtual experimental results of using the proposed game-based robust tracking control method under an eight-shaped reference trajectory: (a) t = 0 s; (b) t = 50 s; (c) t = 100 s; (d) t = 150 s.
Figure 4. The UAV and reference trajectories in three dimensions, where the red circle denotes the starting position and the red diamond represents the end position.
Figure 5. The UAV trajectories under the learned robust tracking controller.
Figure 6. Convergence of the actor-critic-disturbance NN weights.
Figure 7. Evolution of the position and yaw angle tracking errors. The red dashed lines represent the tracking errors under the gradient descent RL-based controllers, while the blue dashed lines represent the tracking errors under the learned robust tracking controllers.
Table 1. Parameters of the UAV in the simulation.
| Parameter | m | l | g | I_x | I_y | I_z |
| Value | 2.33 | 0.4 | 9.8 | 0.16 | 0.16 | 0.32 |
| Unit | kg | m | m/s² | kg·m² | kg·m² | kg·m² |
Table 2. Comparison of the average costs of the gradient descent RL-based controller and the $R^3F^2$-IRL controller.
| Average Cost | Gradient Descent RL | $R^3F^2$-IRL |
| $\bar{J}_1$ | 10.20 | 7.6775 |
| $\bar{J}_2$ | 0.0576 | 0.0582 |