Article

Robust Tracking Control of Underactuated UAVs Based on Zero-Sum Differential Games

1
School of Automation, Northwestern Polytechnical University, Xi’an 710072, China
2
School of Marine Science and Technology, Northwestern Polytechnical University, Xi’an 710072, China
*
Author to whom correspondence should be addressed.
Drones 2025, 9(7), 477; https://doi.org/10.3390/drones9070477
Submission received: 28 May 2025 / Revised: 2 July 2025 / Accepted: 3 July 2025 / Published: 5 July 2025

Abstract

This paper investigates the robust tracking control of unmanned aerial vehicles (UAVs) against external time-varying disturbances. First, by introducing a virtual position controller, we innovatively decouple the UAV dynamics into independent position and attitude error subsystems, transforming the robust tracking problem into two zero-sum differential games. This approach contrasts with conventional methods by treating disturbances as strategic “players”, enabling a systematic framework to address both external disturbances and model uncertainties. Second, we develop an integral reinforcement learning (IRL) framework that approximates the optimal solution to the Hamilton–Jacobi–Isaacs (HJI) equations without relying on precise system models. This model-free strategy overcomes the limitation of traditional robust control methods that require known disturbance bounds or accurate dynamics, offering superior adaptability to complex environments. Third, the proposed recursive Ridge regression with a forgetting factor ( R 3 F 2 ) algorithm updates actor-critic-disturbance neural network (NN) weights in real time, ensuring both computational efficiency and convergence stability. Theoretical analyses rigorously prove the closed-loop system stability and algorithm convergence, which fills a gap in existing data-driven control studies lacking rigorous stability guarantees. Finally, numerical results validate that the method outperforms state-of-the-art model-based and model-free approaches in tracking accuracy and disturbance rejection, demonstrating its practical utility for engineering applications.

1. Introduction

Unmanned aerial vehicles (UAVs) have drawn increasing research interest and have been applied to many complex tasks, e.g., surveillance, firefighting, and environmental monitoring, mainly due to advantages such as simple mechanisms, flexible mobility, and the ability to hover and take off vertically [1,2,3,4,5,6]. However, designing a robust optimal tracking controller for UAVs in dynamic and complex environments remains challenging. In typical application scenarios, UAVs inherently exhibit a mismatch between control inputs and degrees of freedom (e.g., four control inputs regulating six degrees of freedom), leading to strong coupling between position and attitude control. Furthermore, external time-varying disturbances (e.g., airflow disturbances), model parameter perturbations, and unmodeled dynamics further exacerbate the control complexity. Various classic robust control methods and technologies have been applied to address these issues, e.g., $H_\infty$ control [7], backstepping control [8], adaptive control [9], and model predictive control [10]. Such underactuated systems impose dual requirements on control precision and robustness, which existing control methods based on perfect model assumptions find difficult to satisfy. Although the actor-critic algorithm addresses the policy-interaction problem, it does not handle physical system dynamics and uncertainties, and therefore cannot be directly applied to the robust tracking control of UAV systems.
To address the issues, a combination of zero-sum differential game and reinforcement learning (RL) has recently drawn increasing research interest [11,12,13]. Game theory provides an effective solution for such problems [11], which models the interaction between the system and uncertainty as a dynamic game process. The external disturbances, parameter perturbations, and other uncertainties are modeled as an “adversary”, while the control strategy is modeled as a “defender” countering them, constructing a dynamic adversarial relationship between optimal control and the worst-case disturbance. Robust control strategies [14,15] designed based on this theoretical framework can obtain optimal control strategies through game equilibrium solutions, considering extreme disturbance scenarios, and are naturally adapted to the dual requirements of underactuated systems for strong robustness and high-precision tracking. Reinforcement learning enables agents to select the optimal action or control strategy based on the observed system responses in order to maximize or minimize the cumulative reward. In recent years, RL and deep reinforcement learning (DRL) technologies have been widely used in UAV control problems. For instance, the authors of [12] proposed a deep neural network (DNN) called TrailNet, which was applied to the tracking control of micro air vehicles (MAVs) in an unstructured outdoor environment (e.g., mountainous areas, forests, jungles, etc.). A deep deterministic policy gradient (DDPG) algorithm for UAV trajectory tracking in a simulation environment was investigated in [16]. In [17], an RL method was applied to the fault-tolerant controller design problem for an unknown nonlinear UAV with actuator faults. In [18], an RL-based MPC approach was developed for UAV navigation control in unknown environments, where both the vehicle control and obstacle avoidance tasks were fulfilled using MPC, whereas the guidance of the MAV through complex environments was completed by RL. In [19], a model-based adaptive dynamic programming (ADP) was designed for solving the UAV robust tracking control problem under coupling uncertainties. Moreover, the DRL-based robust control problem for UAV helicopters was studied in [13].
Integral RL (IRL) [20] is an integral temporal difference-based RL methodology, which is provably effective in solving the optimal control and decision-making problems of nonlinear systems with partially known or completely unknown system dynamics. To relieve the demand for knowing the system model, the IRL scheme aims to design the optimal robust controller by solving the integral Bellman equations via the policy iteration (PI) technique. Among the existing literature, the proposed IRL-based robust tracking control methods typically use three NN-based architectures, as follows: critic neural networks (NNs), actor-critic NNs, and actor-critic-disturbance NNs. For instance, Wang et al. [21] proposed a neural identifier-based intelligent critic control for unknown nonlinear affine dynamics to achieve disturbance attenuation. In [22], nonlinear disturbance-observer-based critic NNs were used to solve the problem of cooperative control for nonlinear dynamical systems under mismatched disturbances. Motivated by [22], Mohammadi et al. [23] proposed a value-iteration-based algorithm for optimal tracking control of nonlinear dynamical systems subject to delays, disturbances, and constraints. In [24], Yang et al. developed a critic NN-based event-driven robust controller for continuous-time (CT) nonlinear input-affine systems. The requirement of input dynamics inevitably degrades the performance of the aforementioned critic NN-based IRL approaches to a certain extent. While the system dynamics model is completely unknown, IRL research still manages to exploit actor-critic and actor-critic-disturbance architectures. In [25], a robust control law was developed for unknown nonlinear systems by combining the actor-critic NN-based IRL algorithm with disturbance compensation schemes. In [26], the robust control problem was converted into a constrained optimal control problem, and then an actor-critic NN-based IRL framework was proposed to address the optimal control problem of constrained nonlinear dynamical systems. The model-free online IRL algorithms for nonlinear robust tracking control based on the actor-critic-disturbance structure were investigated in [27,28,29,30,31].
However, in practical implementations, traditional reinforcement learning algorithms for zero-sum differential games generally adopt gradient descent or least-squares methods to optimize the actor-critic-disturbance neural network weights. These two methods have significant inherent defects: (1) gradient descent depends heavily on manually configured learning rates and is prone to local minima in high-dimensional nonlinear problems, leading to slow convergence and large computational resource consumption; (2) the least-squares method relies on a linearization assumption to approximate nonlinear systems, making it difficult to adapt to changes in system characteristics in complex dynamic environments and causing significant model-mismatch errors. More critically, both methods implicitly assume that the system's statistical characteristics remain stationary and rely on a large number of data samples for parameter estimation. In practical underactuated systems, the time-varying nature of external disturbances, the nonlinearity of the dynamics, and unmodeled characteristics make these assumptions difficult to hold, so the convergence and stability of the algorithm cannot be effectively guaranteed. Moreover, in high-dimensional state spaces and multi-objective optimization scenarios, traditional methods struggle even more to cope with the combined challenges of disturbances and uncertainty, and innovative algorithms are urgently needed to break through the existing technical bottlenecks.
Based on the above discussion, a data-driven reinforcement learning method based on a zero-sum differential game framework is proposed. By introducing a virtual position controller, the tracking error dynamics model of the underactuated quadrotor is decoupled into a position error subsystem and an attitude error subsystem. Based on this decoupling, the robust optimal tracking control problem is transformed into two parallel zero-sum differential games. To address the unknown system model problem, an IRL algorithm without prior dynamic knowledge is proposed. By replacing the traditional Bellman equation with the integral Bellman equation, the instantaneous reward is extended to the cumulative reward, enhancing robustness to dynamic uncertainties. The main contributions are summarized in three aspects, as follows:
(1)
Compared with the method in [19], which requires a known prior disturbance bound, our approach considers the coupling of nonlinear uncertainty and time-varying external disturbance as adversarial players by formulating the UAV robust optimal tracking control problem as two zero-sum differential games.
(2)
A data-driven IRL algorithm is presented to approximately achieve the Nash equilibrium strategies and the value functions without using the knowledge of the UAV dynamics. We develop an improved actor-critic-disturbance NN weights update law by using the recursive Ridge regression with the forgetting factor ( R 3 F 2 ) estimator, which is more computationally efficient than the least-squares (LS) method adopted in [11,27,28,29,30,31]. The results also show that the weight parameter estimation converges exponentially to a small neighborhood of the true value.
(3)
In order to generate the learning dataset for the off-policy IRL algorithm, we introduce a novel behavior policy that is different from the conventional one used in [19,23,27,28]. Instead of using an initially admissible exploratory control law, the selected behavior policy exploits a stabilizing linear state feedback controller for the nominal position and attitude tracking error subsystems, which is generally easier to design and implement than the conventional policy.
The rest of this paper is organized as follows. Section 2 introduces preliminaries on the UAV dynamics and the problem statement. The zero-sum differential games, as well as the associated HJI equations, are developed in Section 3, where the Nash equilibria of the games are discussed. In Section 4, the R 3 F 2 -based off-policy IRL algorithm is provided along with the convergence and stability analysis. A numerical example is presented to verify the theoretical results in Section 5. Finally, Section 6 concludes this paper.
Throughout this paper, $\mathbb{R}$, $\mathbb{R}_+$, $\mathbb{R}_{\ge 0}$, and $\mathbb{N}_0$ represent the sets of real numbers, positive real numbers, non-negative real numbers, and non-negative natural numbers, respectively. The superscript $\top$ indicates transposition. The notation $\otimes$ denotes the Kronecker product. For a vector $x \in \mathbb{R}^n$, $\|x\| := \sqrt{x^{\top}x}$ denotes the Euclidean norm. For a matrix $A \in \mathbb{R}^{m\times n}$, $\|A\|$ represents the spectral norm of $A$, and $\mathrm{vec}(A) = [a_1^{\top}, a_2^{\top}, \ldots, a_n^{\top}]^{\top} \in \mathbb{R}^{mn\times 1}$ denotes the vectorization of $A$, with $a_i \in \mathbb{R}^m$ being the $i$th column of $A$. $\lambda_{\max}(A)$ and $\lambda_{\min}(A)$ indicate the maximum and minimum eigenvalues of $A$, respectively. If $A \in \mathbb{R}^{n\times n}$, $A > 0$ ($A \ge 0$) indicates that $A$ is positive definite (positive semidefinite). Given $A > 0$, $\|x\|_A := \sqrt{x^{\top}Ax}$ represents the $A$-weighted norm of $x$. The notation $I$ indicates an identity matrix of appropriate dimensions, $O_{m\times n}$ indicates an $m\times n$ zero matrix, and $1_m \in \mathbb{R}^m$ denotes an $m$-dimensional column vector with all components equal to 1.

2. Preliminaries

This section presents the UAV dynamics and the problem description for UAV robust optimal tracking control with external disturbance.

2.1. The UAV Dynamics

In this paper, the schematic diagram of a UAV is displayed in Figure 1. Let E = { e x , e y , e z } be the right-handed earth-fixed inertial frame, and B = { b x , b y , b z } be the body-fixed frame attached to the UAV with the origin located at its center of gravity. The thrusts of four rotors are defined as f i = κ ω i 2 , i = 1 , 2 , 3 , 4 , where κ R is a parameter and ω i R represents the ith-rotor speed. Moreover, m R + indicates the mass of the UAV, and g is the acceleration of gravity.
The UAV dynamic modeling is based on the following assumptions to ensure theoretical tractability and practical applicability: (1) Rigid-body approximation. The UAV is modeled as a rigid body, ignoring elastic deformations and flexible components, and is validated for most fixed-wing and quadrotor systems (e.g., [32,33,34]). (2) Flat-Earth hypothesis. The Earth’s curvature and rotation are neglected; suitable for low-altitude operations (altitude < 1000 m) ([35]). (3) Lumped aerodynamic parameters. Air resistance is described by linearized drag coefficients, with high-order effects (e.g., stall, compressibility) omitted for simplification [36]. (4) Ideal actuator dynamics. Actuators are assumed to have instantaneous responses without time delays or saturation—a condition that can be relaxed via feedback linearization in practice [37].
Denote ζ = [ x , y , z ] R 3 and ϑ = ϕ , θ , ψ R 3 as the position and the attitude of the UAV in the E -frame, respectively. The dynamical model takes the following form, derived from the rigid-body dynamics theory [32,33]:
$$m\ddot{\zeta} = -K_{\zeta}\dot{\zeta} + R_B^E b_z F - m g b_z + d_{\zeta} \quad (\text{translational dynamics})$$
$$J_b\ddot{\vartheta} = -K_{\vartheta}\dot{\vartheta} + \Lambda\tau + d_{\vartheta} \quad (\text{rotational dynamics})$$
where F = i = 1 4 f i R denotes the total thrust produced by four rotors, τ = [ τ ϕ , τ θ , τ ψ ] R 3 is the rotational torque, b z = [ 0 , 0 , 1 ] R 3 , K ζ = diag ( k 1 , k 2 , k 3 ) R 3 × 3 and K ϑ = diag ( k 4 , k 5 , k 6 ) R 3 × 3 , with k i R + , i = 1 , , 6 , denoting the aerodynamic damping coefficient matrices, and J b = diag ( I x , I y , I z ) R 3 × 3 is the inertial matrix, with I x , I y , I z R + , and d ζ = [ d x , d y , d z ] R 3 and d ϑ = [ d ϕ , d θ , d ψ ] R 3 denote the external time-varying disturbance vectors. Moreover, we have Λ = diag ( l , l , c ) R 3 × 3 , where l R + represents the distance from the rotor to the center of gravity, and c indicates the anti-torque coefficient. In addition, the orientation matrix R B E S O ( 3 ) of the UAV from the B -frame to the E -frame is expressed via Euler angles as follows:
$$R_B^E = \begin{bmatrix} c_{\psi}c_{\theta} & s_{\phi}s_{\theta}c_{\psi} - c_{\phi}s_{\psi} & c_{\phi}s_{\theta}c_{\psi} + s_{\phi}s_{\psi} \\ c_{\theta}s_{\psi} & s_{\phi}s_{\theta}s_{\psi} + c_{\phi}c_{\psi} & c_{\phi}s_{\theta}s_{\psi} - s_{\phi}c_{\psi} \\ -s_{\theta} & s_{\phi}c_{\theta} & c_{\phi}c_{\theta} \end{bmatrix}$$
where $s_{*}$ and $c_{*}$ indicate $\sin(*)$ and $\cos(*)$, respectively. To avoid gimbal lock and ensure a unique attitude representation, the Euler angles are constrained within bounded ranges:
$$-\frac{\pi}{2} < \phi < \frac{\pi}{2}, \qquad -\frac{\pi}{2} < \theta < \frac{\pi}{2}, \qquad -\pi < \psi < \pi.$$
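To make the attitude parameterization concrete, the following is a minimal sketch (Python with NumPy; the function name and the explicit construction are our own illustrative choices, not code from the paper) of how the orientation matrix above can be evaluated from the Euler angles:

```python
import numpy as np

def rotation_body_to_earth(phi: float, theta: float, psi: float) -> np.ndarray:
    """Euler-angle (roll phi, pitch theta, yaw psi) rotation from the B-frame to the E-frame."""
    s, c = np.sin, np.cos
    return np.array([
        [c(psi)*c(theta), s(phi)*s(theta)*c(psi) - c(phi)*s(psi), c(phi)*s(theta)*c(psi) + s(phi)*s(psi)],
        [c(theta)*s(psi), s(phi)*s(theta)*s(psi) + c(phi)*c(psi), c(phi)*s(theta)*s(psi) - s(phi)*c(psi)],
        [-s(theta),       s(phi)*c(theta),                        c(phi)*c(theta)],
    ])

# Example: a level hover attitude gives the identity rotation.
assert np.allclose(rotation_body_to_earth(0.0, 0.0, 0.0), np.eye(3))
```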

2.2. Problem Formulation

Let ζ r = [ x r , y r , z r ] R 3 and ϑ r = [ ϕ r , θ r , ψ r ] R 3 be the reference states of the position and attitude subsystems, respectively. The goal is to design F and τ such that the UAV tracks the reference states ζ r and ϑ r despite the presence of external time-varying disturbances d ζ and d ϑ .
To quantify the above objective, we first define the position and attitude tracking error signals of the UAV as follows:
$$\xi_1 = \begin{bmatrix} \zeta - \zeta_r \\ \dot{\zeta} - \dot{\zeta}_r \end{bmatrix} \in \mathbb{R}^6, \qquad \xi_2 = \begin{bmatrix} \vartheta - \vartheta_r \\ \dot{\vartheta} - \dot{\vartheta}_r \end{bmatrix} \in \mathbb{R}^6.$$
Substituting the derivative of (4) into (1), we can obtain the position and attitude tracking error dynamical subsystems of the UAV as follows:
$$\dot{\xi}_1 = f_1(\xi_1) + g_1 F + h_1\Delta_{\zeta}, \qquad \dot{\xi}_2 = f_2(\xi_2) + g_2 u_2 + h_2\Delta_{\vartheta}$$
where $f_1 = \begin{bmatrix} O_3 & I_3 \\ O_3 & -\frac{1}{m}K_{\zeta} \end{bmatrix}\xi_1 \in \mathbb{R}^{6\times 1}$, $f_2 = \begin{bmatrix} O_3 & I_3 \\ O_3 & -J_b^{-1}K_{\vartheta} \end{bmatrix}\xi_2 \in \mathbb{R}^{6\times 1}$, $g_1 = [0_3; \frac{1}{m}R_B^E b_z] \in \mathbb{R}^{6\times 1}$, $g_2 = [O_3; J_b^{-1}] \in \mathbb{R}^{6\times 3}$, $h_1 = [O_3; I_3] \in \mathbb{R}^{6\times 3}$, $h_2 = [O_3; J_b^{-1}] \in \mathbb{R}^{6\times 3}$, $\Delta_{\zeta} = -\frac{1}{m}K_{\zeta}\dot{\zeta}_r - \ddot{\zeta}_r - g b_z + \frac{1}{m}d_{\zeta} \in \mathbb{R}^3$, $\Delta_{\vartheta} = -K_{\vartheta}\dot{\vartheta}_r - J_b\ddot{\vartheta}_r + d_{\vartheta} \in \mathbb{R}^3$, and $u_2 = \Lambda\tau \triangleq [u_{\phi}, u_{\theta}, u_{\psi}]^{\top} \in \mathbb{R}^3$.
Inspired by [32], we introduce a virtual position control input u 1 = [ u x , u y , u z ] R 3 described by
$$u_1 = -g b_z + \frac{1}{m}F R_r b_z$$
where R r R 3 × 3 is the rotational command matrix given by the following:
$$R_r = \begin{bmatrix} c_{\theta_r}c_{\psi_r} & s_{\phi_r}s_{\theta_r}c_{\psi_r} - c_{\phi_r}s_{\psi_r} & c_{\phi_r}s_{\theta_r}c_{\psi_r} + s_{\phi_r}s_{\psi_r} \\ c_{\theta_r}s_{\psi_r} & s_{\phi_r}s_{\theta_r}s_{\psi_r} + c_{\phi_r}c_{\psi_r} & c_{\phi_r}s_{\theta_r}s_{\psi_r} - s_{\phi_r}c_{\psi_r} \\ -s_{\theta_r} & s_{\phi_r}c_{\theta_r} & c_{\phi_r}c_{\theta_r} \end{bmatrix}$$
with ϕ r , θ r , and ψ r being the reference states of the attitude. By solving (6), we get
$$F = m\left\| u_1 + g b_z \right\|, \qquad \theta_r = \arctan\!\left(\frac{u_x c_{\psi_r} + u_y s_{\psi_r}}{u_z + g}\right), \qquad \phi_r = \arcsin\!\left(\frac{u_x s_{\psi_r} - u_y c_{\psi_r}}{\left\| u_1 + g b_z \right\|}\right)$$
where ψ r is the reference yaw angle.
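As a concrete illustration of the inversion in (7), the sketch below (Python/NumPy; the function and variable names are ours, and the numerical values of $m$ and $g$ are the simulation values listed later in Table 1) maps a virtual position command $u_1$ and the reference yaw $\psi_r$ to the total thrust and the reference roll and pitch angles:

```python
import numpy as np

def virtual_to_thrust_attitude(u1, psi_r, m=2.33, g=9.8):
    """Recover (F, theta_r, phi_r) from the virtual position input u1 = [ux, uy, uz]."""
    ux, uy, uz = u1
    vec = np.array([ux, uy, uz + g])     # u1 + g * b_z
    F = m * np.linalg.norm(vec)          # total thrust
    theta_r = np.arctan2(ux*np.cos(psi_r) + uy*np.sin(psi_r), uz + g)
    phi_r = np.arcsin((ux*np.sin(psi_r) - uy*np.cos(psi_r)) / np.linalg.norm(vec))
    return F, theta_r, phi_r

# Hover check: a zero virtual acceleration should demand F = m*g and a level attitude.
F, th, ph = virtual_to_thrust_attitude(np.zeros(3), psi_r=0.0)
print(F, th, ph)   # approximately 22.834, 0.0, 0.0
```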
By adding and subtracting u 1 to ξ ˙ 1 , we can rewrite (5) in the following uniform form:
$$\dot{\xi}_i = F_i\xi_i + G_i\big(u_i + v_i\big), \qquad i = 1, 2$$
where $u_1$ and $u_2$ are the controllers to be designed, $F_1 = \begin{bmatrix} O_3 & I_3 \\ O_3 & -\frac{1}{m}K_{\zeta} \end{bmatrix} \in \mathbb{R}^{6\times 6}$, $G_1 = [O_3; I_3] \in \mathbb{R}^{6\times 3}$, $F_2 = \begin{bmatrix} O_3 & I_3 \\ O_3 & -J_b^{-1}K_{\vartheta} \end{bmatrix} \in \mathbb{R}^{6\times 6}$, and $G_2 = [O_3; J_b^{-1}] \in \mathbb{R}^{6\times 3}$. Furthermore, $v_1 = -\frac{1}{m}K_{\zeta}\dot{\zeta}_r - \ddot{\zeta}_r + \frac{1}{m}d_{\zeta} + f_{\vartheta}(F) \in \mathbb{R}^3$ and $v_2 = -K_{\vartheta}\dot{\vartheta}_r - J_b\ddot{\vartheta}_r + d_{\vartheta} \in \mathbb{R}^3$ refer to the lumped disturbances acting on the position and attitude tracking error subsystems. The vector $f_{\vartheta}(F) = \frac{1}{m}F(R_B^E - R_r)b_z \in \mathbb{R}^3$ in $v_1$ is regarded as the coupling uncertainty that captures the effect of the rotational motion on the translational motion. Consequently, the control objective is to design $u_1$ and $u_2$ so that $\zeta(t)$ and $\psi(t)$ track the reference signals $\zeta_r = [x_r, y_r, z_r]^{\top}$ and $\psi_r$ under the time-varying disturbances $v_1$ and $v_2$.
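For reference, the following is a minimal sketch (Python/NumPy; the helper name and the sign convention of the damping blocks follow our reconstruction of (8), so treat them as assumptions) of the constant matrices $F_i$ and $G_i$ of the two error subsystems:

```python
import numpy as np

def error_subsystem_matrices(m, K_zeta, K_theta, J_b):
    """Build (F1, G1) for the position and (F2, G2) for the attitude error subsystems."""
    O3, I3 = np.zeros((3, 3)), np.eye(3)
    F1 = np.block([[O3, I3], [O3, -K_zeta / m]])                       # position error dynamics
    G1 = np.vstack([O3, I3])                                           # channel of u1 and v1
    F2 = np.block([[O3, I3], [O3, -np.linalg.inv(J_b) @ K_theta]])     # attitude error dynamics
    G2 = np.vstack([O3, np.linalg.inv(J_b)])                           # channel of u2 and v2
    return F1, G1, F2, G2

# Values taken from the simulation section (Table 1 and the damping coefficients).
F1, G1, F2, G2 = error_subsystem_matrices(
    m=2.33,
    K_zeta=np.diag([0.01, 0.01, 0.01]),
    K_theta=np.diag([0.015, 0.015, 0.015]),
    J_b=np.diag([0.16, 0.16, 0.32]),
)
```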
Then, the robust tracking control problem of a UAV subject to external time-varying disturbances can be stated as follows:
Problem 1.
The purpose of the UAV robust optimal tracking control is to find an optimal state feedback control law $u_i$ such that (1) the tracking error subsystems described in (8) are closed-loop stable when $v_i = 0$; and (2) the bounded $L_2$-gain condition holds when $v_i \neq 0$, that is, given $\eta_i > 0$,
$$\int_0^T \left( \|\xi_i(t)\|_{\hat{Q}_i}^2 + \|u_i(t)\|_{\hat{R}_i}^2 \right) dt \le \eta_i^2 \int_0^T \|v_i(t)\|^2\, dt + \chi(\xi_i(0))$$
is satisfied for all $T > 0$ and $v_i(t) \in L_2[0, \infty)$, where $\hat{Q}_i \in \mathbb{R}^{6\times 6}$, $\hat{R}_i \in \mathbb{R}^{3\times 3}$, $\hat{Q}_i = \hat{Q}_i^{\top} > 0$, $\hat{R}_i = \hat{R}_i^{\top} > 0$, and $\chi(\xi_i(t)) \in C^2$ with $\chi(0) = 0$.

3. Game-Based Robust Optimal Control Law Design and Stability Analysis

In this section, we formulate the robust control problem for the subsystems in (8) as two independent zero-sum differential games. Then, the Nash equilibrium is studied.

3.1. Zero-Sum Differential Games

In order to ensure robust performance under the lumped disturbances, according to inequality (9), we can define the performance index function as
$$J_i(\xi_i(0), u_i, v_i) = \int_0^{\infty} q_i(\xi_i(t), u_i(t), v_i(t))\, dt, \qquad i = 1, 2$$
where $q_i(\xi_i(t), u_i(t), v_i(t))$ refers to an instantaneous reward function defined as $q_i(\xi_i(t), u_i(t), v_i(t)) = \|\xi_i(t)\|_{\hat{Q}_i}^2 + \|u_i(t)\|_{\hat{R}_i}^2 - \eta_i^2\|v_i(t)\|^2$.
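A direct transcription of this instantaneous reward, as a small sketch (Python/NumPy; the function name is ours):

```python
import numpy as np

def instantaneous_reward(xi, u, v, Q_hat, R_hat, eta):
    """q_i(xi, u, v) = ||xi||^2_Q + ||u||^2_R - eta^2 * ||v||^2 from the performance index (10)."""
    return xi @ Q_hat @ xi + u @ R_hat @ u - eta**2 * (v @ v)
```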
In tracking error dynamics (8), the controller and the lumped disturbance can be regarded as the control player and disturbance player, respectively. Then, the robust tracking control problem can be reformulated as the following game problems with respect to (8):
Problem 2.
(Zero-sum differential games). The zero-sum differential games for the UAV robust tracking control can be represented as
G i : min u i max v i J i ( ξ i ( 0 ) , u i , v i ) s . t . ( 8 )
where G 1 and G 2 are the games for the position tracking control and attitude tracking control, respectively.
In game G i , i = 1 , 2 , the control player aims to seek the optimal control input u i * that minimizes J i ( ξ i ( 0 ) , u i , v i ) , while the disturbance player desires to find the worst-case disturbance v i * that maximizes it. The corresponding Nash equilibrium strategy is defined as follows:
Definition 1.
(Nash equilibrium strategy): The policy pair ( u i * , v i * ) is said to be a Nash equilibrium strategy for the zero-sum differential game G i , i = 1 , 2 , if the inequality
$$J_i(\xi_i(0), u_i^*, v_i) \le J_i(\xi_i(0), u_i^*, v_i^*) \le J_i(\xi_i(0), u_i, v_i^*)$$
holds for all u i and v i . The value J i ( ξ i ( 0 ) , u i * , v i * ) is the so-called Nash equilibrium.
In the following subsection, we will prove that the Nash equilibrium strategies are solutions to the zero-sum differential game problems G i , i = 1 , 2 .

3.2. Stability and Nash Equilibria for Zero-Sum Differential Games

According to the performance index defined in (10), we define the value functions for game G i , i = 1 , 2 , by
$$V_i(\xi_i(t)) = \int_t^{\infty} q_i\big(\xi_i(\tau), u_i(\tau), v_i(\tau)\big)\, d\tau.$$
For convenience in the derivation, i = 1 , 2 is omitted in the following discussion. Let the associated Hamiltonian function of (13) be defined as
$$H_i(\xi_i, u_i, v_i, \nabla V_i, t) = \nabla V_i^{\top}\,\dot{\xi}_i(t) + q_i\big(\xi_i(\tau), u_i(\tau), v_i(\tau)\big)$$
where $\nabla V_i = \partial V_i / \partial \xi_i$ denotes the gradient of the value function with respect to $\xi_i$.
Let V i * ( ξ i ( t ) ) be the optimal value of V i ( ξ i ( t ) ) , i.e.,
$$V_i^*(\xi_i(t)) = \min_{u_i}\max_{v_i} \int_t^{\infty} q_i\big(\xi_i(\tau), u_i(\tau), v_i(\tau)\big)\, d\tau.$$
It can be easily verified that V i * ( ξ i ( t ) ) satisfies the following HJI equation:
$$\min_{u_i}\max_{v_i} H_i(\xi_i, u_i, v_i, \nabla V_i, t) = \nabla V_i^{*\top}\Big[ F_i\xi_i(t) + G_i\big(u_i^*(t) + v_i^*(t)\big) \Big] + \|\xi_i(t)\|_{\hat{Q}_i}^2 + \|u_i^*(t)\|_{\hat{R}_i}^2 - \eta_i^2\|v_i^*(t)\|^2 = 0$$
where $\nabla V_i^* = \partial V_i^* / \partial \xi_i$. By employing the stationary conditions $\partial H_i / \partial u_i^* = 0$ and $\partial H_i / \partial v_i^* = 0$, we have
$$u_i^* = -\frac{1}{2}\hat{R}_i^{-1} G_i^{\top} \nabla V_i^*, \qquad v_i^* = \frac{1}{2\eta_i^2} G_i^{\top} \nabla V_i^*.$$
Substituting (16) into (15), one gets the HJI equation:
$$0 = H_i(\xi_i, u_i^*, v_i^*, \nabla V_i^*, t) = \|\xi_i(t)\|_{\hat{Q}_i}^2 - \frac{1}{4}\nabla V_i^{*\top} G_i \hat{R}_i^{-1} G_i^{\top} \nabla V_i^* + \nabla V_i^{*\top} F_i \xi_i(t) + \frac{1}{4\eta_i^2}\nabla V_i^{*\top} G_i G_i^{\top} \nabla V_i^*$$
with V i ( 0 ) = 0 .
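Given any differentiable value-function approximation, the saddle-point policies in (16) are simple linear maps of the value gradient. A minimal sketch (Python/NumPy; names are ours) under the assumption that the gradient $\nabla V_i^*$ is available:

```python
import numpy as np

def saddle_point_policies(grad_V, G, R_hat, eta):
    """Optimal control and worst-case disturbance of Eq. (16) for a given value gradient."""
    u_star = -0.5 * np.linalg.solve(R_hat, G.T @ grad_V)   # u* = -(1/2) R^{-1} G^T grad(V*)
    v_star = (0.5 / eta**2) * (G.T @ grad_V)               # v* = 1/(2 eta^2) G^T grad(V*)
    return u_star, v_star
```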
The following theorem illustrates that (16), obtained by solving the above HJI equation in (17), provides a solution for both Problems 1 and 2:
Theorem 1.
Suppose that 0 < V i * C 2 satisfies the HJI equation (17). Then, under the policy pair ( u i * , v i * ) given by (16) in terms of V i * , the subsystems in (8) are closed-loop stable and the bounded L 2 -gain condition (9) holds. Moreover, ( u i * , v i * ) is a Nash equilibrium strategy of G i , i = 1 , 2 .
Proof. 
Suppose that V i * satisfies HJI equation (17). Then,
$$H_i(\xi_i, u_i, v_i, \nabla V_i^*, t) - H_i(\xi_i, u_i^*, v_i^*, \nabla V_i^*, t) = \nabla V_i^{*\top} G_i\big(u_i(t) - u_i^*(t)\big) + \|u_i(t)\|_{\hat{R}_i}^2 - \|u_i^*(t)\|_{\hat{R}_i}^2 + \nabla V_i^{*\top} G_i\big(v_i(t) - v_i^*(t)\big) - \eta_i^2\big(\|v_i(t)\|^2 - \|v_i^*(t)\|^2\big).$$
According to the Hamiltonian function (14), the following identity holds:
$$\dot{V}_i^*(\xi_i(t)) = H_i(\xi_i, u_i^*, v_i^*, \nabla V_i^*, t) + \|u_i(t) - u_i^*(t)\|_{\hat{R}_i}^2 - q_i\big(\xi_i(t), u_i(t), v_i(t)\big) - \eta_i^2\|v_i(t) - v_i^*(t)\|^2.$$
By setting u i ( t ) = u i * ( t ) and v i ( t ) = v i * ( t ) (the optimal control and disturbance strategies) and substituting them into (19), we have the following:
$$\dot{V}_i^*(\xi_i(t)) = -q_i\big(\xi_i(t), u_i^*(t), v_i^*(t)\big) \le 0,$$
where the equality V ˙ i * ( ξ i ( t ) ) = 0 holds if and only if ξ i ( t ) = 0 . Thus, the subsystems described in (8) are asymptotically stable.
According to the Lyapunov stability theory, since V ˙ i * ( ξ i ( t ) ) is negative semidefinite and vanishes only at the origin, the subsystems described by (8) are proven to be asymptotically stable. The Lyapunov function V i * ( ξ i ) satisfies the standard stability criteria, ensuring that trajectories converge to the equilibrium point ξ i ( t ) = 0 as t .
Integrating (20) from 0 to T yields
$$V_i^*(\xi_i(T)) - V_i^*(\xi_i(0)) = -\int_0^T q_i\big(\xi_i(t), u_i^*(t), v_i^*(t)\big)\, dt \le 0$$
Since V i * > 0 , one has
$$\int_0^T \left( \|\xi_i(t)\|_{\hat{Q}_i}^2 + \|u_i^*(t)\|_{\hat{R}_i}^2 \right) dt \le V_i^*(\xi_i(0)) + \eta_i^2 \int_0^T \|v_i^*(t)\|^2\, dt$$
which shows that the bounded L 2 -gain condition holds. Furthermore, due to the fact that the tracking subsystem is asymptotically stable, it can be concluded that V i * ( ξ i ( ) ) = 0 . Thus, by (14), we have
$$J_i(\xi_i(0), u_i, v_i) = V_i^*(\xi_i(0)) + \int_0^{\infty} H_i(\xi_i, u_i^*, v_i^*, \nabla V_i^*, t)\, dt + \int_0^{\infty} \left( \|u_i(t) - u_i^*(t)\|_{\hat{R}_i}^2 - \eta_i^2\|v_i(t) - v_i^*(t)\|^2 \right) dt.$$
It follows that the condition (12) is satisfied. By setting u i = u i * and v i = v i * , one gets J i ( ξ i ( 0 ) , u i * , v i * ) = V i * ( ξ i ( 0 ) ) . Thus, u i * and v i * form a Nash equilibrium strategy and V i * ( ξ i ( 0 ) ) is the corresponding Nash equilibrium of the game G i .    □
However, as a nonlinear partial differential equation (PDE), the HJI equation given in (17) is difficult to solve directly and analytically [38]. To this end, a data-driven IRL approach is designed to obtain the solution to the HJI equation and the Nash equilibrium strategies.

4. R 3 F 2 -Based IRL Approach

In this section, an R 3 F 2 -based IRL algorithm is developed to solve the zero-sum differential games. First, the actor-critic-disturbance architecture is utilized to approximate the value function and the Nash equilibrium strategy. Based on this, a model-free integral Bellman equation is derived. Then, an R 3 F 2 -based IRL algorithm is provided to evaluate the value function and learn the Nash equilibrium strategy. Finally, the convergence and stability analysis are presented.

4.1. A Framework of R 3 F 2 -Based IRL for Zero-Sum Differential Games

Denote $V_i^j(\xi_i(t))$ as the updated value function in the $j$th iteration, and $\hat{u}_i^j$ and $\hat{v}_i^j$ as the updated control policy and disturbance in the $j$th iteration, respectively. For all $t > 0$ and $\Delta t > 0$, using the fundamental theorem of calculus and the definition of the value function (13), the increment of $V_i^j(\xi_i)$ over the interval $[t, t+\Delta t]$ is as follows:
$$V_i^j(\xi_i(t+\Delta t)) - V_i^j(\xi_i(t)) = \int_t^{t+\Delta t} \frac{d}{d\tau} V_i^j(\xi_i(\tau))\, d\tau = \int_t^{t+\Delta t} \nabla V_i^{j\top}(\xi_i(\tau))\,\dot{\xi}_i(\tau)\, d\tau,$$
where the second equality follows from the chain rule. Rearranging this expression yields the following:
$$V_i^j(\xi_i(\tau))\Big|_t^{t+\Delta t} = \int_t^{t+\Delta t} \nabla V_i^{j\top}\,\dot{\xi}_i(\tau)\, d\tau.$$
Define $\tilde{u}_i^j = u_i - \hat{u}_i^j$ and $\tilde{v}_i^j = v_i - \hat{v}_i^j$ as the estimation errors of the control and the disturbance, respectively. Substituting $u_i = \hat{u}_i^j + \tilde{u}_i^j$ and $v_i = \hat{v}_i^j + \tilde{v}_i^j$ into the system dynamics (8) gives the following:
$$\dot{\xi}_i = F_i\xi_i + G_i(u_i + v_i) = F_i\xi_i + G_i\big(\hat{u}_i^j + \hat{v}_i^j + \tilde{u}_i^j + \tilde{v}_i^j\big),$$
where $\hat{u}_i^j$ and $\hat{v}_i^j$ denote the control and disturbance policies at the $j$th iteration, and $\tilde{u}_i^j$ and $\tilde{v}_i^j$ represent the residual errors. This decomposition separates the nominal control/disturbance $(\hat{u}_i^j, \hat{v}_i^j)$ from their errors, facilitating the stability analysis.
From the Hamiltonian function (14) and the optimality conditions $\partial H_i / \partial u_i = 0$ and $\partial H_i / \partial v_i = 0$, we can obtain
$$\hat{u}_i^{j+1} = -\frac{1}{2}\hat{R}_i^{-1} G_i^{\top} \nabla V_i^j, \qquad \hat{v}_i^{j+1} = \frac{1}{2\eta_i^2} G_i^{\top} \nabla V_i^j,$$
where $\hat{R}_i \succ 0$ and $\eta_i > 0$ are design parameters. These policies satisfy the saddle-point conditions $\partial^2 H_i / \partial u_i^2 \succ 0$ and $\partial^2 H_i / \partial v_i^2 \prec 0$, ensuring local optimality.
Substituting (26) and (27) into (24) yields the following model-free integral Bellman equation:
$$\begin{aligned} V_i^j(\xi_i(t)) - V_i^j(\xi_i(t+\Delta t)) &= -\int_t^{t+\Delta t} \nabla V_i^{j\top}\Big[ F_i\xi_i + G_i\big(\hat{u}_i^j + \hat{v}_i^j + \tilde{u}_i^j + \tilde{v}_i^j\big) \Big] d\tau \\ &= -\int_t^{t+\Delta t} \nabla V_i^{j\top}\Big[ F_i\xi_i + G_i\big(\hat{u}_i^j + \hat{v}_i^j\big) \Big] d\tau - \int_t^{t+\Delta t} \Big[ \nabla V_i^{j\top} G_i\big(u_i - \hat{u}_i^j\big) + \nabla V_i^{j\top} G_i\big(v_i - \hat{v}_i^j\big) \Big] d\tau \\ &= \int_t^{t+\Delta t} \Big( \xi_i^{\top}\hat{Q}_i\xi_i + \hat{u}_i^{j\top}\hat{R}_i\hat{u}_i^j - \eta_i^2\hat{v}_i^{j\top}\hat{v}_i^j \Big) d\tau + \int_t^{t+\Delta t} \Big( 2\hat{u}_i^{(j+1)\top}\hat{R}_i\big(u_i - \hat{u}_i^j\big) - 2\eta_i^2\hat{v}_i^{(j+1)\top}\big(v_i - \hat{v}_i^j\big) \Big) d\tau \\ &= \int_t^{t+\Delta t} \Big( \|\xi_i(\tau)\|_{\hat{Q}_i}^2 + \|\hat{u}_i^j(\tau)\|_{\hat{R}_i}^2 - \eta_i^2\|\hat{v}_i^j(\tau)\|^2 \Big) d\tau + 2\int_t^{t+\Delta t} \Big( \hat{u}_i^{(j+1)\top}(\tau)\hat{R}_i\tilde{u}_i^j(\tau) - \eta_i^2\hat{v}_i^{(j+1)\top}(\tau)\tilde{v}_i^j(\tau) \Big) d\tau. \end{aligned}$$
This equation holds for any Δ t > 0 and is model-free because it does not explicitly depend on F i or G i , relying only on measurable signals ξ i , u ^ i , and v ^ i .
Starting from an initially stabilizing control policy, the proposed algorithm first computes V i j ( ξ i ( t ) ) by solving the partial differential Equation (14), followed by computing u i j + 1 and v i j + 1 by using (27). Repeat the above-mentioned processes until the difference between V i j ( ξ i ( t ) ) and V i j 1 ( ξ i ( t ) ) is sufficiently small.
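The iteration just described has the standard policy-iteration structure: evaluate the current value estimate, improve the policies via (27), and repeat. The following is a minimal sketch of that loop (Python; `evaluate_critic` and `improve_policies` are hypothetical placeholders, with `evaluate_critic` standing in for the data-driven $R^3F^2$ regression developed below, not code from the paper):

```python
import numpy as np

def irl_policy_iteration(evaluate_critic, improve_policies, W_c0, tol=1e-4, max_iters=50):
    """Generic PI skeleton: evaluate the critic, improve the actor/disturbance policies, repeat."""
    W_c = np.asarray(W_c0, dtype=float)
    for _ in range(max_iters):
        W_c_new = evaluate_critic(W_c)            # policy evaluation step (e.g., via the regression in (36))
        improve_policies(W_c_new)                 # policy improvement step via Eq. (27)
        converged = np.linalg.norm(W_c_new - W_c) < tol   # compare successive critic weights
        W_c = W_c_new
        if converged:
            break
    return W_c
```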
Based on the Stone–Weierstrass approximation theorem [39], the jth iterative approximate expressions of V i ( ξ i ) , u i , and v i can be represented by
$$\begin{bmatrix} V_i^j(\xi_i) \\ u_i^{j+1} \\ v_i^{j+1} \end{bmatrix} = \Phi_i^{\top}(\xi_i)\,W_i^j + \epsilon_i^j \in \mathbb{R}^7$$
where $\Phi_i(\xi_i) \in \mathbb{R}^{(n_i + 3m_i + 3l_i)\times 7}$ is a block matrix given by
$$\Phi_i(\xi_i) \triangleq \begin{bmatrix} \varphi_{ic}(\xi_i) & O_{n_i\times 3} & O_{n_i\times 3} \\ O_{3m_i\times 1} & I_3 \otimes \varphi_{ia}(\xi_i) & O_{3m_i\times 3} \\ O_{3l_i\times 1} & O_{3l_i\times 3} & I_3 \otimes \varphi_{id}(\xi_i) \end{bmatrix}$$
and $\epsilon_i^j \in \mathbb{R}^7$ is the approximation error vector. In $\Phi_i(\xi_i)$, the components $\varphi_{ic}(\xi_i) \in \mathbb{R}^{n_i}$, $\varphi_{ia}(\xi_i) \in \mathbb{R}^{m_i}$, and $\varphi_{id}(\xi_i) \in \mathbb{R}^{l_i}$ are activation functions, and $W_i^j \in \mathbb{R}^{n_i + 3m_i + 3l_i}$ represents the weight coefficient vector in the form of $W_i^j = [W_{ic}^{j\top}, \mathrm{vec}(W_{ia}^{j+1})^{\top}, \mathrm{vec}(W_{id}^{j+1})^{\top}]^{\top}$, with $W_{ic}^j \in \mathbb{R}^{n_i}$ being the weight vector of the critic NN, and $W_{ia}^{j+1} \in \mathbb{R}^{m_i\times 3}$ and $W_{id}^{j+1} \in \mathbb{R}^{l_i\times 3}$ being the weight matrices of the actor NN and the lumped disturbance NN, respectively. Note that $W_i^j$ is unknown and, thus, needs to be estimated. Denote $\hat{W}_{ic}^j$, $\hat{W}_{ia}^{j+1}$, and $\hat{W}_{id}^{j+1}$ as the estimates of $W_{ic}^j$, $W_{ia}^{j+1}$, and $W_{id}^{j+1}$, respectively. The corresponding approximations are then given by
$$\begin{bmatrix} \hat{V}_i^j(\xi_i) \\ \hat{u}_i^{j+1} \\ \hat{v}_i^{j+1} \end{bmatrix} = \Phi_i^{\top}(\xi_i)\,\hat{W}_i^j$$
where $\hat{W}_i^j = [\hat{W}_{ic}^{j\top}, \mathrm{vec}(\hat{W}_{ia}^{j+1})^{\top}, \mathrm{vec}(\hat{W}_{id}^{j+1})^{\top}]^{\top}$.
Assumption 1.
(1) 
The activation functions of the actor-critic-disturbance NNs and their gradients are bounded, i.e., $\|\varphi_{ia}\| \le b_{\varphi_{ia}}$, $\|\varphi_{ic}\| \le b_{\varphi_{ic}}$, $\|\varphi_{id}\| \le b_{\varphi_{id}}$, $\|\nabla\varphi_{ia}\| \le b_{d\varphi_{ia}}$, $\|\nabla\varphi_{ic}\| \le b_{d\varphi_{ic}}$, and $\|\nabla\varphi_{id}\| \le b_{d\varphi_{id}}$, where $b_{\varphi_{ia}}$, $b_{\varphi_{ic}}$, $b_{\varphi_{id}}$, $b_{d\varphi_{ia}}$, $b_{d\varphi_{ic}}$, and $b_{d\varphi_{id}}$ are positive constants.
(2) 
The ideal actor-critic-disturbance NN weights satisfy $\|W_{ic}^j\| \le W_{ic\max}^j$, $\|W_{ia}^{j+1}\| \le W_{ia\max}^{j+1}$, and $\|W_{id}^{j+1}\| \le W_{id\max}^{j+1}$, where $W_{ic\max}^j$, $W_{ia\max}^{j+1}$, and $W_{id\max}^{j+1}$ are positive constants.
(3) 
$\|\epsilon_i^j\| \le b_{\epsilon_i^j}$ and $\|\nabla\epsilon_i^j\| \le b_{d\epsilon_i^j}$, where $b_{\epsilon_i^j}$ and $b_{d\epsilon_i^j}$ are positive constants.
Let $\Delta\varphi_{ic} \triangleq \Delta\varphi_{ic}(\xi_i(t)) = \varphi_{ic}(\xi_i(t+\Delta t)) - \varphi_{ic}(\xi_i(t))$. According to (28) and (30), one has
$$\begin{aligned} 0 &= V_i^j(\xi_i(t+\Delta t)) - V_i^j(\xi_i(t)) + \int_t^{t+\Delta t} \Big( \|\xi_i(\tau)\|_{\hat{Q}_i}^2 + \|\hat{u}_i^j(\tau)\|_{\hat{R}_i}^2 - \eta_i^2\|\hat{v}_i^j(\tau)\|^2 \Big) d\tau + 2\int_t^{t+\Delta t} \Big( \hat{u}_i^{(j+1)\top}(\tau)\hat{R}_i\tilde{u}_i^j(\tau) - \eta_i^2\hat{v}_i^{(j+1)\top}(\tau)\tilde{v}_i^j(\tau) \Big) d\tau \\ &= \Delta\varphi_{ic}^{\top}(\xi_i(t))\,\hat{W}_{ic}^j + \int_t^{t+\Delta t} q_i\big(\xi_i(\tau), \hat{u}_i(\tau), \hat{v}_i(\tau)\big)\, d\tau + 2\int_t^{t+\Delta t} \Big( \hat{u}_i^{(j+1)\top}(\tau)\hat{R}_i\tilde{u}_i^j(\tau) - \eta_i^2\hat{v}_i^{(j+1)\top}(\tau)\tilde{v}_i^j(\tau) \Big) d\tau. \end{aligned}$$
By using (27) we can obtain the following:
$$\hat{u}_i^{(j+1)\top}(\tau)\hat{R}_i\tilde{u}_i^j(\tau) = \Big[ \big(\tilde{u}_i^{j\top}(\tau)\hat{R}_i\big) \otimes \varphi_{ia}^{\top}(\xi_i(\tau)) \Big]\, \mathrm{vec}(\hat{W}_{ia}^{j+1}), \qquad \hat{v}_i^{(j+1)\top}(\tau)\tilde{v}_i^j(\tau) = \Big[ \tilde{v}_i^{j\top}(\tau) \otimes \varphi_{id}^{\top}(\xi_i(\tau)) \Big]\, \mathrm{vec}(\hat{W}_{id}^{j+1}).$$
Substituting (32) into (31) yields
$$C_{i\xi}^j(t) = \Xi_i^{j\top}(t)\,\hat{W}_i^j$$
where $\hat{W}_i^j = [\hat{W}_{ic}^{j\top}, \mathrm{vec}(\hat{W}_{ia}^{j+1})^{\top}, \mathrm{vec}(\hat{W}_{id}^{j+1})^{\top}]^{\top} \in \mathbb{R}^{n_i+3m_i+3l_i}$, and $\Xi_i^j(t) \in \mathbb{R}^{n_i+3m_i+3l_i}$ and $C_{i\xi}^j(t) \in \mathbb{R}$ are respectively expressed by
$$\Xi_i^j(t) = \Big[ C_{ic}^j(t),\ C_{ia}^j(t),\ C_{id}^j(t) \Big]^{\top}, \qquad C_{i\xi}^j(t) = -\int_t^{t+\Delta t} q_i\big(\xi_i(\tau), \hat{u}_i(\tau), \hat{v}_i(\tau)\big)\, d\tau.$$
In (34), $C_{ic}^j(t) = \Delta\varphi_{ic}^{\top}(\xi_i(t)) \in \mathbb{R}^{1\times n_i}$, $C_{ia}^j(t) \in \mathbb{R}^{1\times 3m_i}$, and $C_{id}^j(t) \in \mathbb{R}^{1\times 3l_i}$ are respectively defined as
$$C_{ia}^j(t) = 2\int_t^{t+\Delta t} \big(\tilde{u}_i^{j\top}(\tau)\hat{R}_i\big) \otimes \varphi_{ia}^{\top}(\xi_i(\tau))\, d\tau, \qquad C_{id}^j(t) = -2\eta_i^2\int_t^{t+\Delta t} \tilde{v}_i^{j\top}(\tau) \otimes \varphi_{id}^{\top}(\xi_i(\tau))\, d\tau.$$
Note that (33) is a linear equation in which $C_{i\xi}^j(t)$ denotes the measurement and $\Xi_i^j(t)$ the regressor at time $t$, and $\hat{W}_i^j$ is the unknown vector to be estimated. The value of $C_{i\xi}^j(t)$ and the data in $\Xi_i^j(t)$ are obtained from a continuous-time process and are available at the sample times $k\Delta t$, where $\Delta t$ is the sample interval. To compute the estimate of $\hat{W}_i^j$, we first construct two datasets, $\mathcal{C}_i^j = \{ C_{i\xi|0}^j, C_{i\xi|1}^j, \ldots, C_{i\xi|k}^j \}$ and $\mathcal{E}_i^j = \{ \Xi_{i|0}^j, \Xi_{i|1}^j, \ldots, \Xi_{i|k}^j \}$, by utilizing the data produced by the behavior policy over the time window $\ell = 0, 1, \ldots, k$.
Then, we can use the R 3 F 2 scheme to find the optimal estimate W ^ i | k j of W ^ i j by solving the following optimization problem:
Problem 3.
$$\min_{\hat{W}_i^j}\ Y_{ik}(\hat{W}_i^j) = \frac{1}{2}\rho_i^k \big\| \hat{W}_i^j - \hat{W}_{i|0}^j \big\|_{\bar{P}_{i0}^j}^2 + \frac{1}{2}\sum_{\ell=0}^{k} \rho_i^{k-\ell}\big(\varsigma_{i|\ell}^j\big)^2 \qquad \text{s.t.} \quad \varsigma_{i|\ell}^j = C_{i\xi|\ell}^j - \Xi_{i|\ell}^{j\top}\hat{W}_i^j, \quad \ell = 0, 1, \ldots, k$$
where $\rho_i \in (0, 1]$ represents the forgetting factor, $\hat{W}_{i|0}^j \in \mathbb{R}^{n_i+3m_i+3l_i}$ indicates the initial estimate of $\hat{W}_i^j$, $P_{i0}^j \in \mathbb{R}^{(n_i+3m_i+3l_i)\times(n_i+3m_i+3l_i)}$ is positive definite, and $\bar{P}_{i0}^j$ denotes the inverse matrix, $(P_{i0}^j)^{-1}$, of $P_{i0}^j$ for brevity.
Remark 1.
In most of the existing works (e.g., [27,28,40]), the LS scheme is applied to approximate the solution to (33), in which random or sinusoidal function noises are usually added to the behavior policy to ensure that the matrix Ξ i | 0 j , Ξ i | 1 j , , Ξ i | k j has full column rank. However, the selection of rich input signals is usually experience-based. In Problem 3, the exponentially weighted mechanism is used to apply greater weights to the recent data, and the weighted vector norm regularization term 1 2 ρ i k W ^ i j W ^ i | 0 j P i 0 j 2 is designed to address the numerical instability of the matrix inversion and subsequently produce lower variance models.
We can obtain the optimal solution to Problem 3 as
$$\hat{W}_{i|k}^j = P_{i|k}^j\, S_{i|k}^j$$
where $P_{i|k}^j$ and $S_{i|k}^j$ are given by
$$P_{i|k}^j = \left[ \sum_{\ell=0}^{k} \rho_i^{k-\ell}\,\Xi_{i|\ell}^j\Xi_{i|\ell}^{j\top} + \rho_i^k\,\bar{P}_{i0}^j \right]^{-1}, \qquad S_{i|k}^j = \sum_{\ell=0}^{k} \rho_i^{k-\ell}\, C_{i\xi|\ell}^j\,\Xi_{i|\ell}^j + \rho_i^k\,\bar{P}_{i0}^j\,\hat{W}_{i|0}^j.$$
Then, for all $k \in \mathbb{N}_0$, we can obtain the update rule for $\hat{W}_{i|k}^j$ as follows:
$$\hat{W}_{i|k}^j = \hat{W}_{i|(k-1)}^j + P_{i|k}^j\,\Xi_{i|k}^j\,\bar{\varsigma}_{i|k}$$
where
$$\bar{\varsigma}_{i|k} = C_{i\xi|k}^j - \Xi_{i|k}^{j\top}\hat{W}_{i|(k-1)}^j, \qquad P_{i|k}^j = \rho_i^{-1}\left( P_{i|(k-1)}^j - \frac{P_{i|(k-1)}^j\,\Xi_{i|k}^j\,\Xi_{i|k}^{j\top}\,P_{i|(k-1)}^j}{\sigma_{i|k}^j} \right)$$
with $\sigma_{i|k}^j = \rho_i + \Xi_{i|k}^{j\top}\,P_{i|(k-1)}^j\,\Xi_{i|k}^j$.
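To make the recursion (36)-(37) concrete, the following is a minimal sketch (Python/NumPy; the class and variable names are ours) of a recursive Ridge-regression update with a forgetting factor. At each sample time it consumes the scalar measurement $C_{i\xi|k}^j$ and the regressor vector $\Xi_{i|k}^j$, and it never forms the full batch least-squares problem:

```python
import numpy as np

class RecursiveRidgeFF:
    """Recursive Ridge regression with a forgetting factor for the model C_k = Xi_k^T W."""

    def __init__(self, W0, P0, rho=0.98):
        self.W = np.asarray(W0, dtype=float)       # current weight estimate
        self.P = np.asarray(P0, dtype=float)       # covariance-like matrix, initialized positive definite
        self.rho = rho                              # forgetting factor in (0, 1]

    def update(self, Xi, C):
        """One step of (36)-(37): innovation, covariance update, and weight correction."""
        Xi = np.asarray(Xi, dtype=float)
        sigma = self.rho + Xi @ self.P @ Xi                         # scalar sigma_k
        innovation = C - Xi @ self.W                                # prediction error on the new sample
        PXi = self.P @ Xi
        self.P = (self.P - np.outer(PXi, PXi) / sigma) / self.rho   # rank-1 covariance update
        self.W = self.W + self.P @ Xi * innovation                  # weight correction
        return self.W

# Toy usage on synthetic noiseless data: the estimate converges to the true weights.
rng = np.random.default_rng(0)
W_true = rng.standard_normal(5)
est = RecursiveRidgeFF(W0=np.zeros(5), P0=100.0 * np.eye(5))
for _ in range(200):
    Xi = rng.standard_normal(5)
    est.update(Xi, Xi @ W_true)
print(np.linalg.norm(est.W - W_true))   # small residual
```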
Remark 2.
The computational requirements of R 3 F 2 are primarily determined by n i + 3 m i + 3 l i . Since P i | k j is of size ( n i + 3 m i + 3 l i ) × ( n i + 3 m i + 3 l i ) , the computational requirement for updating P i | k j given by (37) is of O ( ( n i + 3 m i + 3 l i ) 2 ) . Moreover, the σ i | k j in (37) is of O ( 1 ) , which is much less demanding than the matrix inverse operation required by the LS algorithm. In addition, the storage requirements of the recursive R 3 F 2 are of O ( ( n i + 3 m i + 3 l i ) 2 ) , which does not grow with k. Thus, the computational and memory requirements of the recursive R 3 F 2 are significantly less than those of LS.
The implementation procedure of the proposed R 3 F 2 -based IRL approach is summarized in Algorithm 1. For a fixed controller u i and the arbitrary uncertainty signals v i , the proposed algorithm can simultaneously learn estimates of the value function V i j ( ξ i ) , control policy u i j + 1 , and uncertainties v i j + 1 without applying the knowledge of the UAV dynamics.
Algorithm 1:  R 3 F 2 -based IRL for zero-sum differential games
    Drones 09 00477 i001

4.2. Convergence and Stability Analysis

The following results show that the NN weight tuning laws guarantee the proposed R 3 F 2 -based IRL algorithm converges to the Nash equilibrium strategy, while ensuring the stability of the closed-loop system.
Define $\tilde{\hat{W}}_{i|k}^j = \hat{W}_i^j - \hat{W}_{i|k}^j$ and $\tilde{W}_{i|k}^j = W_i^j - \hat{W}_{i|k}^j$.
Then, it follows from (36) that $\tilde{\hat{W}}_{i|k}^j$ and $\tilde{W}_{i|k}^j$ can be written as
$$\tilde{\hat{W}}_{i|k}^j = \Big( I - P_{i|k}^j\,\Xi_{i|k}^j\,\Xi_{i|k}^{j\top} \Big)\,\tilde{\hat{W}}_{i|(k-1)}^j,$$
$$\tilde{W}_{i|k}^j = \tilde{W}_{i|(k-1)}^j - P_{i|k}^j\,\Xi_{i|k}^j\,\Xi_{i|k}^{j\top}\,\tilde{\hat{W}}_{i|(k-1)}^j.$$
Definition 2.
The sequence $\{\Xi_{i|k}^j\}_{k=0}^{\infty}$ is said to be persistently exciting (PE) if there exist $N \in \mathbb{N}_0$ and $\beta_1, \beta_2 \in \mathbb{R}_+$ such that, for all $\ell \in \mathbb{N}_0$,
$$\beta_1 I \le \sum_{l=\ell}^{\ell+N} \Xi_{i|l}^j\,\Xi_{i|l}^{j\top} \le \beta_2 I < \infty.$$
Lemma 1.
Suppose that $\{\Xi_{i|k}^j\}_{k=0}^{\infty}$ is PE. Let $N$, $\beta_1$, and $\beta_2$ be given by Definition 2, $\rho_i \in (0, 1]$, and $P_{i|k}^j$ be given by (37). Then, for all $k \ge N+1$,
$$0 < \bar{\beta}_1 I \le \big(P_{i|k}^j\big)^{-1} \le \bar{\beta}_2 I + \big(P_{i|N}^j\big)^{-1} < \infty$$
with $\bar{\beta}_1 = \frac{\rho_i^N(1-\rho_i)}{1-\rho_i^{N+1}}\beta_1$ and $\bar{\beta}_2 = \frac{1}{1-\rho_i^{N+1}}\beta_2$.
Theorem 2
(Convergence of the $R^3F^2$-Based IRL Algorithm). Let (36) be the update rule of the actor-critic-disturbance NN weights and suppose that $\{\Xi_{i|k}^j\}_{k=0}^{\infty}$ is PE. Then, for any initial condition $\hat{W}_{i|0}^j$, the estimation error $\tilde{\hat{W}}_{i|k}^j$ converges exponentially to the origin, i.e., there exists $\gamma_0 \in \mathbb{R}_+$ such that, for all $k \in \mathbb{N}_0$,
$$\big\| \tilde{\hat{W}}_{i|k}^j \big\|^2 \le \gamma_0\,\rho_i^k\,\big\| \tilde{\hat{W}}_{i|0}^j \big\|^2$$
and $\tilde{W}_{i|k}^j$ is uniformly ultimately bounded (UUB).
Proof. 
Define the Lyapunov function candidate as
$$\hat{L}_{i|k}^j = \tilde{\hat{W}}_{i|k}^{j\top}\,\bar{P}_{i|k}^j\,\tilde{\hat{W}}_{i|k}^j, \qquad \bar{P}_{i|k}^j := \big(P_{i|k}^j\big)^{-1}.$$
Since
$$\bar{P}_{i|k}^j = \rho_i\,\bar{P}_{i|(k-1)}^j + \Xi_{i|k}^j\,\Xi_{i|k}^{j\top},$$
using (42) and (38a), one has
$$\Delta\hat{L}_{i|k}^j := \hat{L}_{i|k}^j - \hat{L}_{i|(k-1)}^j = -\tilde{\hat{W}}_{i|(k-1)}^{j\top}\,\Xi_{i|k}^j\Big[ 1 - \Xi_{i|k}^{j\top}P_{i|k}^j\Xi_{i|k}^j \Big]\Xi_{i|k}^{j\top}\,\tilde{\hat{W}}_{i|(k-1)}^j + (\rho_i - 1)\,\tilde{\hat{W}}_{i|(k-1)}^{j\top}\,\bar{P}_{i|(k-1)}^j\,\tilde{\hat{W}}_{i|(k-1)}^j.$$
Moreover, it follows from (42) and ρ i ( 0 , 1 ] that
$$\bar{P}_{i|k}^j \ge \rho_i\,\bar{P}_{i|(k-1)}^j \ge \rho_i^k\,\bar{P}_{i|0}^j > 0.$$
Thus, for all $k \in \mathbb{N}_0$, $\hat{L}_{i|k}^j \ge \lambda_{\min}(\bar{P}_{i|0}^j)\,\rho_i^k\,\big\|\tilde{\hat{W}}_{i|k}^j\big\|^2$.
Denote $a = \Xi_{i|k}^{j\top}\,P_{i|(k-1)}^j\,\Xi_{i|k}^j > 0$. Using (37), we can obtain
$$1 - \Xi_{i|k}^{j\top}P_{i|k}^j\Xi_{i|k}^j = \frac{\rho_i}{\rho_i + a} > 0.$$
Combining (45) with (43) leads to
$$\hat{L}_{i|k}^j - \hat{L}_{i|(k-1)}^j \le (\rho_i - 1)\,\hat{L}_{i|(k-1)}^j \le 0$$
which in turn gives
$$\hat{L}_{i|k}^j \le \rho_i\,\hat{L}_{i|(k-1)}^j \le \rho_i^k\,\hat{L}_{i|0}^j = \rho_i^k\,\tilde{\hat{W}}_{i|0}^{j\top}\,\bar{P}_{i|0}^j\,\tilde{\hat{W}}_{i|0}^j.$$
If $\{\Xi_{i|k}^j\}_{k=0}^{\infty}$ is PE, Lemma 1 holds. Therefore, combining (47) with (39) and (41) leads to, for all $k \ge N+1$,
$$\big\|\tilde{\hat{W}}_{i|k}^j\big\|^2 \le \frac{1}{\bar{\beta}_1}\cdot\frac{\lambda_{\max}(\bar{P}_{i|0}^j)}{\lambda_{\min}(\bar{P}_{i|0}^j)}\,\rho_i^k\,\big\|\tilde{\hat{W}}_{i|0}^j\big\|^2.$$
Denoting $\gamma_0 = \frac{1}{\bar{\beta}_1}\cdot\frac{\lambda_{\max}(\bar{P}_{i|0}^j)}{\lambda_{\min}(\bar{P}_{i|0}^j)}$, one can obtain (40).
Next, consider the Lyapunov function
$$L_{i|k}^j = \tilde{W}_{i|k}^{j\top}\,\bar{P}_{i|k}^j\,\tilde{W}_{i|k}^j.$$
Substituting (37) and (38b) into (49) yields
$$L_{i|k}^j - L_{i|(k-1)}^j = \tilde{W}_{i|(k-1)}^{j\top}\Big[ \bar{P}_{i|k}^j - \bar{P}_{i|(k-1)}^j \Big]\tilde{W}_{i|(k-1)}^j - 2\,\tilde{W}_{i|(k-1)}^{j\top}\,\Xi_{i|k}^j\Xi_{i|k}^{j\top}\,\tilde{\hat{W}}_{i|(k-1)}^j + \kappa_p\,\tilde{\hat{W}}_{i|(k-1)}^{j\top}\,\Xi_{i|k}^j\Xi_{i|k}^{j\top}\,\tilde{\hat{W}}_{i|(k-1)}^j$$
where $\kappa_p = \Xi_{i|k}^{j\top}P_{i|k}^j\Xi_{i|k}^j$. Let $b_{\tilde{\hat{W}}_{i|k}^j} = \big(\gamma_0\rho_i^k\big)^{\frac{1}{2}}\big\|\tilde{\hat{W}}_{i|0}^j\big\|$. Using (39) and (44), we can obtain
$$L_{i|k}^j - L_{i|(k-1)}^j \le -M_1\big\|\tilde{W}_{i|(k-1)}^j\big\|^2 + 2M_2\big\|\tilde{W}_{i|(k-1)}^j\big\| + C_0$$
where M 1 = ρ i k 1 λ min ( P i | 0 j ) β ¯ 2 λ min ( P N 1 ) , M 2 = b W ^ ˜ i | k j ( β ¯ 2 + λ max ( P N 1 ) + ρ i k λ max ( P i | 0 j ) ) , and C 0 = κ p ( β ¯ 2 + λ max ( P N 1 ) ρ i k λ min ( P i | 0 j ) ) b W ^ ˜ i | k j 2 .
Completing the squares, we get $L_{i|k}^j - L_{i|(k-1)}^j < 0$ if $\big\|\tilde{W}_{i|(k-1)}^j\big\| > \frac{M_2}{M_1} + \sqrt{\frac{M_2^2}{M_1^2} + \frac{C_0}{M_1}} \triangleq b_{\tilde{W}_{i|k}^j}$.
According to the standard Lyapunov extension theorem, the analysis above demonstrates that the weight estimation error W ˜ i | k j is UUB. □
Theorem 3
(Stability of the tracking error subsystems). Consider the tracking error subsystems described in (8). Let the approximations of the value function, the optimal control law, and the worst-case lumped disturbance be given by (30), and let the actor-critic-disturbance NN weight tuning law be given by (36). Then, for all $\xi_i(t_0) = \xi_{i0}$ and some $T > 0$, there exists $\varrho_i \in \mathbb{R}_{\ge 0}$ such that the tracking error state $\xi_i$ is UUB, i.e.,
$$\|\xi_i(t)\| \le \sqrt{\frac{\varrho_i}{\lambda_{\min}(\hat{Q}_i)}} \quad \text{for all } t \ge T.$$
Proof. 
We consider the Lyapunov function V i j ( ξ i ( t ) ) given in (29), which is the approximate solution to (17). Taking the time derivative of V i j ( ξ i ( t ) ) along the trajectory generated by u ^ i j ( t ) and v ^ i j ( t ) yields
$$\dot{V}_i^j = \nabla V_i^{j\top}\Big[ F_i\xi_i + G_i\big(\hat{u}_i^j + \hat{v}_i^j\big) \Big].$$
Since
$$\|\xi_i(t)\|_{\hat{Q}_i}^2 - \frac{1}{4}\nabla V_i^{j\top}G_i\hat{R}_i^{-1}G_i^{\top}\nabla V_i^j + \nabla V_i^{j\top}F_i\xi_i(t) + \frac{1}{4\eta_i^2}\nabla V_i^{j\top}G_iG_i^{\top}\nabla V_i^j = 0$$
subtracting (52) from (51) yields
$$\dot{V}_i^j = \nabla V_i^{j\top}G_i\big(\hat{u}_i^j + \hat{v}_i^j\big) - \|\xi_i(t)\|_{\hat{Q}_i}^2 + \frac{1}{4}\big\|G_i^{\top}\nabla V_i^j\big\|_{\hat{R}_i^{-1}}^2 - \frac{1}{4\eta_i^2}\big\|G_i^{\top}\nabla V_i^j\big\|^2.$$
From (16), it can be derived that
$$\nabla V_i^{j\top}G_i u_i^j = -\frac{1}{2}\big\|G_i^{\top}\nabla V_i^j\big\|_{\hat{R}_i^{-1}}^2.$$
Adding (54) to (53) yields
$$\dot{V}_i^j = -\|\xi_i(t)\|_{\hat{Q}_i}^2 + \Sigma_1 - \Sigma_2 - \Sigma_3$$
where $\Sigma_1 = \nabla V_i^{j\top}G_i\big(\hat{u}_i^j - u_i^j + \hat{v}_i^j\big)$, $\Sigma_2 = \frac{1}{4}\big\|G_i^{\top}\nabla V_i^j\big\|_{\hat{R}_i^{-1}}^2$, and $\Sigma_3 = \frac{1}{4\eta_i^2}\big\|G_i^{\top}\nabla V_i^j\big\|^2$.
Since Σ 2 > 0 and Σ 3 > 0 , one has
$$\dot{V}_i^j \le -\|\xi_i(t)\|_{\hat{Q}_i}^2 + \Sigma_1.$$
According to Assumption 1 and the expression of G i , one has
$$\Sigma_1 \le \big(b_{d\varphi_{ic}}W_{ic\max}^j + b_{d\epsilon_i^j}\big)\,\|G_i\|\,\big(b_{\epsilon_i^j} + \|\hat{v}_i^j\|\big).$$
As $\hat{W}_i^j = \tilde{\hat{W}}_{i|k}^j + W_i^j - \tilde{W}_{i|k}^j$, by taking norms, we get
$$\big\|\hat{W}_i^j\big\| \le \big\|\tilde{\hat{W}}_{i|k}^j\big\| + \big\|\tilde{W}_{i|k}^j\big\| + \big\|W_i^j\big\| \le \big(\gamma_0\rho_i^k\big)^{\frac{1}{2}}\big\|\tilde{\hat{W}}_{i|0}^j\big\| + b_{\tilde{W}_{i|k}^j} + b_{W_i^j} \triangleq b_{\hat{W}_i^j}$$
where $b_{W_i^j} \triangleq \max\{W_{ic\max}^j, W_{ia\max}^{j+1}, W_{id\max}^{j+1}\}$. Let $b_{\Phi_i} = \max\{b_{\varphi_{ia}}, b_{\varphi_{ic}}, b_{\varphi_{id}}\}$. Then, it follows from (30) that
$$\big\|\hat{v}_i^j\big\| \le \|\Phi_i(\xi_i)\|\,\big\|\hat{W}_i^j\big\| \le b_{\Phi_i}\,b_{\hat{W}_i^j} \triangleq b_{\hat{v}_i^j}.$$
Taking into account (56) and (57), (55) becomes
$$\dot{V}_i^j \le -\|\xi_i(t)\|^2\,\lambda_{\min}(\hat{Q}_i) + \varrho_i$$
where $\varrho_i = \big(b_{d\varphi_{ic}}W_{ic\max}^j + b_{d\epsilon_i^j}\big)\,\|G_i\|\,\big(b_{\epsilon_i^j} + b_{\hat{v}_i^j}\big)$. It follows that $\dot{V}_i^j < 0$ if $\|\xi_i\| > \sqrt{\varrho_i/\lambda_{\min}(\hat{Q}_i)}$. By the standard Lyapunov extension theorem, it is proven that $\xi_i$ is UUB. □
Theorem 4
(Nash Equilibrium of the Game). Suppose that the hypotheses in Theorems 2 and 3 hold. Then, H i ( ξ i , u ^ i j , v ^ i j , V ^ i j , t ) is UUB with V ^ i j , u ^ i j , and v ^ i j being given by (30). Moreover, ( u ^ i j , v ^ i j ) converges to the Nash equilibrium solution ( u i * , v i * ) of game G i , i = 1 , 2 .
Proof. 
First, the approximate coupled HJB equation is
$$H_i(\xi_i, \hat{u}_i^j, \hat{v}_i^j, \hat{V}_i^j, t) = \hat{W}_{ic}^{j\top}\nabla\varphi_{ic}(\xi_i)F_i\xi_i + \|\xi_i(t)\|_{\hat{Q}_i}^2 + \frac{1}{4\eta_i^2}\hat{W}_{ic}^{j\top}\nabla\varphi_{ic}(\xi_i)G_iG_i^{\top}\nabla\varphi_{ic}^{\top}(\xi_i)\hat{W}_{ic}^j - \frac{1}{4}\hat{W}_{ic}^{j\top}\nabla\varphi_{ic}(\xi_i)G_i\hat{R}_i^{-1}G_i^{\top}\nabla\varphi_{ic}^{\top}(\xi_i)\hat{W}_{ic}^j$$
Subtracting H i ( ξ i , u i j , v i j , V i j , t ) = 0 from the right-hand side of (59) yields
$$H_i(\xi_i, \hat{u}_i^j, \hat{v}_i^j, \hat{V}_i^j, t) \le \frac{1}{4}\big\|\nabla\varphi_{ic}(\xi_i)G_i\big\|_{\hat{R}_i^{-1}}^2\big\|W_{ic}^j\big\|^2 + \frac{1}{4\eta_i^2}\big\|\nabla\varphi_{ic}(\xi_i)G_i\big\|^2\big\|\hat{W}_{ic}^j\big\|^2 + \Big(\big\|\hat{W}_{ic}^j\big\| + \big\|W_{ic}^j\big\|\Big)\big\|\nabla\varphi_{ic}(\xi_i)F_i\xi_i\big\|.$$
All terms on the right-hand side of (60) are UUB. Hence, H i ( ξ i , u ^ i j , v ^ i j , V ^ i j , t ) is UUB and the convergence of the approximate HJI is achieved.
Furthermore, from Theorem 2, $\hat{u}_i - u_i^*$ and $\hat{v}_i - v_i^*$ are UUB due to the fact that
$$\big\|\hat{W}_{ia} - W_{ia}\big\| \le \big\|\hat{W}_{ia}\big\| + \big\|W_{ia}\big\| \le b_{\hat{W}_i^j} + W_{ia\max}^{j+1}, \qquad \big\|\hat{W}_{id} - W_{id}\big\| \le \big\|\hat{W}_{id}\big\| + \big\|W_{id}\big\| \le b_{\hat{W}_i^j} + W_{id\max}^{j+1}.$$
Therefore, the pair ( u ^ i , v ^ i ) gives the approximate Nash equilibrium solution to game G i . This completes the proof. □

5. Simulation Results

In this section, the performance of the proposed R 3 F 2 -based off-policy IRL scheme will be verified via several simulated examples, where the performance of the gradient descent RL-based controller is also given for the purposes of control performance comparison. To ensure the simulation setting closely follows real-world conditions, we conduct our tracking examples in a virtual experiment platform called the Robot Operating System (ROS) and the Gazebo robotics simulation environment. The virtual UAV tracking scenario is shown in Figure 2, which features a UAV and a bay land.
The physical parameters in UAV dynamics are listed in Table 1. The aerodynamic damping coefficients k i , i = 1 , , 6 are set as k 1 = 0.01 , k 2 = 0.01 , k 3 = 0.01 , k 4 = 0.015 , k 5 = 0.015 , and k 6 = 0.015 . The anti-torque coefficient is chosen as c = 0.05 m . The external disturbances acting on the position and the attitude subsystems are defined as
$$d_{\zeta} = d_{\vartheta} = \begin{cases} a(t), & t \in \Gamma \\ 0, & \text{otherwise} \end{cases}$$
where
$$a(t) = \big[0.2\sin(0.1t),\ 0.2\cos(0.1t),\ 0.1\big]^{\top}$$
and
$$\Gamma = [T_{sim}/8,\ T_{sim}/6] \cup [T_{sim}/4,\ T_{sim}/3] \cup [T_{sim}/2,\ 3T_{sim}/5] \cup [5T_{sim}/6,\ 7T_{sim}/8]$$
with T s i m = 150 s being the total simulation time.
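For reproducibility of the disturbance profile, a small sketch (Python/NumPy; names are ours) of the windowed signal defined above:

```python
import numpy as np

T_SIM = 150.0
WINDOWS = [(T_SIM/8, T_SIM/6), (T_SIM/4, T_SIM/3), (T_SIM/2, 3*T_SIM/5), (5*T_SIM/6, 7*T_SIM/8)]

def disturbance(t: float) -> np.ndarray:
    """External disturbance d_zeta(t) = d_theta(t): active only inside the union of windows Gamma."""
    if any(lo <= t <= hi for lo, hi in WINDOWS):
        return np.array([0.2*np.sin(0.1*t), 0.2*np.cos(0.1*t), 0.1])
    return np.zeros(3)
```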
The parameters in the value function are set as Q ^ 1 = Q ^ 2 = 1.2 I 6 , R ^ 1 = R ^ 2 = 1.4 I 3 , and η 1 2 = η 2 2 = 5 . The reference trajectories of the position and yaw angle are selected as ζ r = [ 10 sin ( t 10 ) , 10 sin ( t 20 ) , 8 ] and ψ r = cos ( 0.5 t ) . The initial states of the UAV are ζ ( 0 ) = [ 2.5 , 2.5 , 0 ] , ζ ˙ ( 0 ) = [ 0 , 0 , 0 ] , ϑ ( 0 ) = [ 1 , 1 , 1 ] , and ϑ ˙ ( 0 ) = [ 0 , 0 , 0 ] . The sampling time interval Δ t is set as 0.01 s . Furthermore, the activation functions for the critic NNs are chosen as the quadratic polynomials with the structure of 6-21-1. For both the actor and the lumped disturbance NNs, the activation functions are chosen as first-order polynomials with the structure of 6-6-3.
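The "6-21-1" critic structure quoted above corresponds to the 21 distinct quadratic monomials of the 6-dimensional error state (i.e., the upper-triangular entries of $\xi_i\xi_i^{\top}$). A minimal sketch of such a basis (Python/NumPy; this construction is our interpretation of "quadratic polynomial activation functions", not code from the paper):

```python
import numpy as np

def quadratic_critic_basis(xi: np.ndarray) -> np.ndarray:
    """All products xi[a]*xi[b] with a <= b; for a 6-dimensional state this yields 21 features."""
    n = xi.size
    return np.array([xi[a] * xi[b] for a in range(n) for b in range(a, n)])

phi = quadratic_critic_basis(np.arange(1.0, 7.0))
print(phi.size)   # 21
```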
Figure 3 shows the trajectories of the UAV in the x y plane while tracking a white-colored figure-eight trajectory over the overall simulation time. The instantaneous positions of the UAV can be found in the subfigures, illustrating successful tracking performance under external disturbance. In Figure 4, Figure 5 and Figure 6, we demonstrate the simulated results of the proposed approach. Specifically, Figure 4 and Figure 5 show the trajectory tracking performance under the proposed tracking control framework. It can be seen from Figure 5 that the positions and attitudes successfully track the given reference. Moreover, the convergence results of the actor-critic-disturbance NN weights are displayed in Figure 6. It can be seen that the convergence is achieved after the 10th iteration, which verifies the theoretical result established in Theorem 2.
To further show the superiority of the proposed approach, we also compare its performance with that of the gradient descent RL-based controller. Figure 7 depicts the tracking errors of the subsystems under the gradient descent RL-based controllers and the learned robust controllers. One can see that the tracking errors converge exponentially to a small neighborhood of the origin, and the learned robust controllers achieve better performance than the RL-based controllers. In addition, the performance index is taken as the cost averaged over the whole simulation time, i.e., $\bar{J}_i = \frac{1}{T_{sim}}\int_0^{T_{sim}} q_i\big(\xi_i(t), u_i(t), v_i(t)\big)\, dt$ for $i = 1, 2$. The quantitative comparison is shown in Table 2. It can be seen that the performance of our method is significantly better than that of the gradient descent RL method in the position control loop, while the performance is almost the same in the attitude control loop. This difference is likely due to the higher nonlinearity in the position control subsystem. Generally, the attitude subsystem behaves more like a linear system, which explains why our method achieves almost identical performance to the gradient descent RL-based one there. In summary, our method successfully handles this nonlinearity (and coupling) with improved control performance.

6. Conclusions

In this paper, a novel robust optimal tracking control framework for a nonlinear UAV with unknown disturbances is proposed. By introducing a virtual control law, the UAV robust optimal tracking control problem is formulated as two independent zero-sum differential games. An R 3 F 2 -based IRL algorithm is presented to obtain the optimal control law and the corresponding worst-case disturbance in a data-driven manner. It is also shown that the learned control law and worst-case disturbance converge to the Nash equilibrium strategies. Finally, numerical simulation results are provided to illustrate the effectiveness of the proposed framework. In future work, this methodology will be extended to multi-agent systems or higher-dimensional dynamics and will address scalability challenges for large-scale differential games.

Author Contributions

Conceptualization, Y.G. and Q.S.; methodology, Y.G.; validation, Q.S. and Q.P.; investigation, Y.G.; writing—original draft preparation, Y.G.; writing—review and editing, Q.S.; visualization, Q.S.; supervision, Q.P.; project administration, Q.S. All authors have read and agreed to the published version of the manuscript.

Funding

This research was funded by the National Natural Science Foundation of China under grant 52301389, the Aeronautical Science Foundation of China under grant 2023Z023053001, and the National Key Laboratory of Underwater Information and Control.

Data Availability Statement

The original contributions presented in this study are included in the article; further inquiries can be directed to the corresponding author.

Acknowledgments

We sincerely thank the editors and anonymous reviewers for their valuable time and effort dedicated to the review process of our manuscript.

Conflicts of Interest

The authors declare no conflicts of interest.

References

  1. Zhou, X.; Wen, X.; Wang, Z.; Gao, Y.; Li, H.; Wang, Q.; Yang, T.; Lu, H.; Cao, Y.; Xu, C.; et al. Swarm of micro flying robots in the wild. Sci. Robot. 2022, 7, eabm5954. [Google Scholar] [CrossRef]
  2. Hua, H.; Fang, Y. A novel reinforcement learning-based robust control strategy for a quadrotor. IEEE Trans. Ind. Electron. 2022, 70, 2812–2821. [Google Scholar] [CrossRef]
  3. Wang, Y.; Lu, Q.; Ren, B. Wind turbine crack inspection using a quadrotor with image motion blur avoided. IEEE Robot. Autom. Lett. 2023, 8, 1069–1076. [Google Scholar] [CrossRef]
  4. Ma, Q.; Jin, P.; Lewis, F.L. Guaranteed cost attitude tracking control for uncertain quadrotor unmanned aerial vehicle under safety constraints. IEEE/CAA J. Autom. Sin. 2024, 11, 1447–1457. [Google Scholar] [CrossRef]
  5. Wei, Q.; Yang, Z.; Su, H.; Wang, L. Online Adaptive Dynamic Programming for Optimal Self-Learning Control of VTOL Aircraft Systems With Disturbances. IEEE Trans. Automat. Sci. Eng. 2024, 21, 343–352. [Google Scholar] [CrossRef]
  6. Tonan, M.; Bottin, M.; Doria, A.; Rosati, G. Analysis and design of a 3-DOF spatial underactuated differentially flat robot. In Proceedings of the 2025 11th International Conference on Mechatronics and Robotics Engineering (ICMRE), Milan, Italy, 27–29 February 2024; IEEE: New York, NY, USA, 2025; pp. 202–207. [Google Scholar]
  7. Wang, L.; Su, J. Robust disturbance rejection control for attitude tracking of an aircraft. IEEE Trans. Control Syst. Technol. 2015, 23, 2361–2368. [Google Scholar] [CrossRef]
  8. Zhou, Y.; Chen, M.; Jiang, C. Robust tracking control of uncertain mimo nonlinear systems with application to UAVs. IEEE/CAA J. Autom. Sin. 2015, 2, 25–32. [Google Scholar] [CrossRef]
  9. Dydek, Z.T.; Annaswamy, A.M.; Lavretsky, E. Adaptive control of quadrotor UAVs: A design trade study with flight evaluations. IEEE Trans. Control Syst. Technol. 2012, 21, 1400–1406. [Google Scholar] [CrossRef]
  10. Sun, S.; Romero, A.; Foehn, P.; Kaufmann, E.; Scaramuzza, D. A comparative study of nonlinear MPC and differential-flatness-based control for quadrotor agile flight. IEEE Trans. Robot. 2022, 38, 3357–3373. [Google Scholar] [CrossRef]
  11. Jiao, Q.; Modares, H.; Xu, S.; Lewis, F.L.; Vamvoudakis, K.G. Multi-agent zero-sum differential graphical games for disturbance rejection in distributed control. Automatica 2016, 69, 24–34. [Google Scholar] [CrossRef]
  12. Smolyanskiy, N.; Kamenev, A.; Smith, J.; Birchfield, S. Toward low-flying autonomous mav trail navigation using deep neural networks for environmental awareness. In Proceedings of the 2017 IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS), Vancouver, BC, Canada, 24–28 September 2017; IEEE: New York, NY, USA, 2017; pp. 4241–4247. [Google Scholar]
  13. Wang, Y.; Sun, J.; He, H.; Sun, C. Deterministic policy gradient with integral compensator for robust quadrotor control. IEEE Trans. Syst. Man, Cybern. Syst. 2019, 50, 3713–3725. [Google Scholar] [CrossRef]
  14. Petrlík, M.; Báča, T.; Heřt, D.; Vrba, M.; Krajník, T.; Saska, M. A robust uav system for operations in a constrained environment. IEEE Robot. Autom. Lett. 2020, 5, 2169–2176. [Google Scholar] [CrossRef]
  15. Guo, Y.; Sun, Q.; Wang, Y.; Pan, Q. Differential graphical game-based multi-agent tracking control using integral reinforcement learning. IET Control Theory Appl. 2024, 18, 2766–2776. [Google Scholar] [CrossRef]
  16. Rubí, B.; Morcego, B.; Pérez, R. A deep reinforcement learning approach for path following on a quadrotor. In Proceedings of the 2020 European Control Conference (ECC), Saint Petersburg, Russia, 12–15 May 2020; IEEE: New York, NY, USA, 2020; pp. 1092–1098. [Google Scholar]
  17. Ma, H.-J.; Xu, L.-X.; Yang, G.-H. Multiple environment integral reinforcement learning-based fault-tolerant control for affine nonlinear systems. IEEE Trans. Cybern. 2021, 51, 1913–1928. [Google Scholar] [CrossRef]
  18. Greatwood, C.; Richards, A.G. Reinforcement learning and model predictive control for robust embedded quadrotor guidance and control. Auton. Robot. 2019, 43, 1681–1693. [Google Scholar] [CrossRef]
  19. Mu, C.; Zhang, Y. Learning-based robust tracking control of quadrotor with time-varying and coupling uncertainties. IEEE Trans. Neural Netw. Learn. Syst. 2019, 31, 259–273. [Google Scholar] [CrossRef]
  20. Vrabie, D.; Pastravanu, O.; Abu-Khalaf, M.; Lewis, F.L. Adaptive optimal control for continuous-time linear systems based on policy iteration. Automatica 2009, 45, 477–484. [Google Scholar] [CrossRef]
  21. Wang, D.; He, H.; Mu, C.; Liu, D. Intelligent critic control with disturbance attenuation for affine dynamics including an application to a microgrid system. IEEE Trans. Ind. Electron. 2017, 64, 4935–4944. [Google Scholar] [CrossRef]
  22. Zhao, B.; Shi, G.; Wang, D. Asymptotically stable critic designs for approximate optimal stabilization of nonlinear systems subject to mismatched external disturbances. Neurocomputing 2020, 396, 201–208. [Google Scholar] [CrossRef]
  23. Mohammadi, M.; Arefi, M.M.; Setoodeh, P.; Kaynak, O. Optimal tracking control based on reinforcement learning value iteration algorithm for time-delayed nonlinear systems with external disturbances and input constraints. Inform. Sci. 2021, 554, 84–98. [Google Scholar] [CrossRef]
  24. Yang, X.; Gao, Z.; Zhang, J. Event-driven H control with critic learning for nonlinear systems. Neural Netw. 2020, 132, 30–42. [Google Scholar] [CrossRef]
  25. Song, R.; Lewis, F.L.; Wei, Q.; Zhang, H. Off-policy actor-critic structure for optimal control of unknown systems with disturbances. IEEE Trans. Cybern. 2015, 46, 1041–1050. [Google Scholar] [CrossRef]
  26. Yang, X.; Liu, D.; Luo, B.; Li, C. Data-based robust adaptive control for a class of unknown nonlinear constrained-input systems via integral reinforcement learning. Inform. Sci. 2016, 369, 731–747. [Google Scholar] [CrossRef]
  27. Modares, H.; Lewis, F.L.; Jiang, Z. H tracking control of completely unknown continuous-time systems via off-policy reinforcement learning. IEEE Trans. Neural Netw. Learn. Syst. 2015, 26, 2550–2562. [Google Scholar] [CrossRef] [PubMed]
  28. Luo, B.; Wu, H.; Huang, T. Off-policy reinforcement learning for H control design. IEEE Trans. Cybern. 2015, 45, 65–76. [Google Scholar] [CrossRef] [PubMed]
  29. Cui, X.; Zhang, H.; Luo, Y.; Jiang, H. Adaptive dynamic programming for tracking design of uncertain nonlinear systems with disturbances and input constraints. Int. J. Adapt. Control 2017, 31, 1567–1583. [Google Scholar] [CrossRef]
  30. Xiao, G.; Zhang, H.; Zhang, K.; Wen, Y. Value iteration based integral reinforcement learning approach for H controller design of continuous-time nonlinear systems. Neurocomputing 2018, 285, 51–59. [Google Scholar] [CrossRef]
  31. Zhao, W.; Liu, H.; Lewis, F.L. Robust formation control for cooperative underactuated quadrotors via reinforcement learning. IEEE Trans. Neural Netw. Learn. Syst. 2021, 32, 4577–4587. [Google Scholar] [CrossRef]
  32. Zhao, B.; Xian, B.; Zhang, Y.; Zhang, X. Nonlinear robust adaptive tracking control of a quadrotor UAV via immersion and invariance methodology. IEEE Trans. Ind. Electron. 2015, 62, 2891–2902. [Google Scholar] [CrossRef]
  33. Lee, T.; Leok, M.; McClamroch, N.H. Geometric tracking control of a quadrotor UAV on SE (3). In Proceedings of the 49th IEEE Conference on Decision and Control (CDC), Atlanta, GA, USA, 15–17 December 2010; pp. 5420–5425. [Google Scholar] [CrossRef]
  34. Cooke, R.D. Unmanned Aerial Vehicles: Design, Development and Deployment; John Wiley & Sons: Hoboken, NJ, USA, 2014. [Google Scholar]
  35. Wong, J.Y. Theory of Ground Vehicles; John Wiley & Sons: Hoboken, NJ, USA, 2001. [Google Scholar]
  36. Spong, M.W.; Hutchinson, S.; Vidyasagar, M. Robot Modeling and Control; John Wiley & Sons: Hoboken, NJ, USA, 2006. [Google Scholar]
  37. Corke, P. Robotics, Vision and Control: Fundamental Algorithms in MATLAB; Springer: Berlin/Heidelberg, Germany, 2017. [Google Scholar]
  38. Song, R.; Lewis, F.L. Robust optimal control for a class of nonlinear systems with unknown disturbances based on disturbance observer and policy iteration. Neurocomputing 2020, 390, 185–195. [Google Scholar] [CrossRef]
  39. Stone, M.H. The generalized weierstrass approximation theorem. Math. Mag. 1948, 21, 167–184. [Google Scholar] [CrossRef]
  40. Gao, W.; Jiang, Z.P. Adaptive dynamic programming and adaptive optimal output regulation of linear systems. IEEE Trans. Autom. Control 2016, 61, 4164–4169. [Google Scholar] [CrossRef]
Figure 1. Diagram of a UAV.
Figure 2. The ROS-/Gazebo-based virtual UAV tracking experimental scenarios.
Figure 3. The virtual experimental results of using the proposed game-based robust tracking control method under an eight-shaped reference trajectory: (a) t = 0 s; (b) t = 50 s; (c) t = 100 s; (d) t = 150 s.
Figure 4. The UAV and reference trajectories in three dimensions, where the red circle denotes the starting position and the red diamond represents the end position.
Figure 5. The UAV trajectories under the learned robust tracking controller.
Figure 6. Convergence of the actor-critic-disturbance NN weights.
Figure 7. Evolution of the position and yaw angle tracking errors. The red dashed lines represent the tracking errors under the gradient descent RL-based controllers, while the blue dashed lines represent the tracking errors under the learned robust tracking controllers.
Table 1. Parameters of the UAV in the simulation.
| Parameter | m | l | g | I_x | I_y | I_z |
| Value | 2.33 | 0.4 | 9.8 | 0.16 | 0.16 | 0.32 |
| Unit | kg | m | m/s² | kg·m² | kg·m² | kg·m² |
Table 2. Comparison of the average costs of the gradient descent RL-based controller and the $R^3F^2$-IRL controller.
| Average Cost | Gradient Descent RL | $R^3F^2$-IRL |
| $\bar{J}_1$ | 10.20 | 7.6775 |
| $\bar{J}_2$ | 0.0576 | 0.0582 |