Article

Integral Reinforcement Learning-Based Online Adaptive Dynamic Event-Triggered Control Design in Mixed Zero-Sum Games for Unknown Nonlinear Systems

1
School of Artificial Intelligence, Shenyang University of Technology, Shenyang 110870, China
2
School of Information Science and Engineering, Northeastern University, Shenyang 110819, China
3
School of Science, Liaoning University of Technology, Jinzhou 121000, China
*
Author to whom correspondence should be addressed.
Mathematics 2024, 12(24), 3916; https://doi.org/10.3390/math12243916
Submission received: 6 November 2024 / Revised: 3 December 2024 / Accepted: 4 December 2024 / Published: 12 December 2024

Abstract

Mixed zero-sum games (MZSGs) consider zero-sum and non-zero-sum differential game problems simultaneously. In this paper, multiplayer MZSGs are studied by means of an integral reinforcement learning (IRL) algorithm under a dynamic event-triggered control (DETC) mechanism for completely unknown nonlinear systems. Firstly, an adaptive dynamic programming (ADP)-based on-policy approach is proposed for solving the MZSG problem for nonlinear systems with multiple players. Secondly, to avoid requiring knowledge of the system dynamics, a model-free control strategy is developed by utilizing actor–critic neural networks (NNs) to address the MZSG problem for unknown systems. On this basis, to avoid wasting communication and computing resources, a dynamic event-triggered mechanism is integrated into the integral reinforcement learning algorithm, in which a dynamic triggering condition is designed to further reduce the number of triggering instants. With the help of the Lyapunov stability theorem, the system states and the NN weight estimation errors are proven to be uniformly ultimately bounded (UUB). Finally, two examples are provided to demonstrate the effectiveness and feasibility of the developed control method. Compared with the static event-triggering mode, the simulation results show that the number of actuator updates under the DETC mechanism is reduced by 55% and 69%, respectively.

1. Introduction

In recent years, differential game theory has been extensively applied in various fields such as economic and financial decision-making [1,2,3,4], automation control systems [5,6], and computer science [7]. In these studies, multiple participants engaged in continuous games and strove to optimize their own, often conflicting, objectives so that a Nash equilibrium could be reached. In two-player zero-sum games (ZSGs) [8,9], one player's gain inevitably resulted in the other player's loss, which precluded any possibility of cooperation. In multi-player non-zero-sum games (NZSGs) [7,10], however, the total sum of the benefits or losses of the parties involved was not zero, which allowed for win–win outcomes and subsequent cooperation. Building on the characteristics and advantages of these two types of games, Ref. [11] proposed a game formulation distinct from ZSGs and NZSGs, termed MZSGs. In such scenarios, determining the Nash equilibrium typically involves solving coupled Hamilton–Jacobi–Bellman (HJB) equations. When the system is linear, the solution can be obtained by reducing the HJ equations to algebraic Riccati equations. When the system is nonlinear, however, obtaining analytical solutions becomes challenging due to the coupling terms in the equations.
To address the lack of analytical solutions, researchers devised methods based on the ADP algorithm to approximate the optimal solutions [12,13,14,15]. This methodology has been applied in numerous control domains to tackle various problems, such as H∞ control [16], time-delay system control [17], optimal spin polarization control [18], and reinforcement learning (RL) control [19]. In the work of [18], a new ADP structure was proposed for three-dimensional spin polarization control. A novel ADP method, entirely independent of system information, was introduced for nonlinear Markov jump systems in order to solve the two-player Stackelberg differential game in [20]. In the work of [21], an iterative ADP algorithm was proposed to iteratively solve the parameterized Bellman equation for the global stabilization of continuous-time linear systems under control constraints.
The integral reinforcement learning (IRL) algorithm was proposed as a derivative of the RL algorithm in [22]. In policy iteration-based ADP algorithms, an approximate optimal solution to the HJB equation is required. In the IRL algorithm, however, a model-free implementation is achievable, which means that the HJB equation can be solved without any knowledge of the system dynamics [23]. Because system dynamics are often difficult to model or describe accurately in practical problems, the IRL algorithm has been widely applied to optimal control problems. In [24], it was demonstrated that data-based IRL algorithms are equivalent to model-based policy iteration algorithms. Researchers utilized the IRL algorithm to obtain iterative controls, thereby achieving policy evaluation and policy improvement [25]. Moreover, a novel guaranteed cost control was designed using the IRL algorithm to relax the required knowledge of the dynamics of stochastic systems modulated by stochastic time-varying parameters in [26]. In the research of [27], an adaptive optimal bipartite consensus scheme was developed using the IRL algorithm in order to contribute a critic-only controller structure. However, these control methods were all developed on the basis of a time-triggered mechanism, whose defining characteristic is that the data sampling and control updates are periodic. As a system gradually stabilizes or operates within permissible performance margins, it becomes desirable for the controller to stop updating [28].
In response to these requirements, researchers proposed static event-triggered control (SETC) for the sampling process, which can ensure stable system operation. Specifically, a triggering condition associated with a pre-designed threshold is defined, and the controller is only updated when this condition is met, significantly reducing the waste of communication resources [29,30,31,32]. Nevertheless, the triggering rules associated with static event triggers depend solely on the current state and error signals and disregard previous values. This limitation was addressed in Girard's work [33], where an internal dynamic variable was incorporated into the triggering mechanism. This enhancement made the triggering mechanism contingent not only on current values but also on historical data, which led to the development of DETC. The advantage of this new mechanism lies in its ability to extend the interval between triggers. In subsequent studies [34,35,36,37], researchers extensively explored the application of DETC in multi-agent systems and theoretically proved its feasibility. Ref. [38] investigated fault-tolerant control based on DETC, where the dynamic variable adopted a form more general than exponential functions. Ref. [39] investigated non-fragile dissipative filters for discrete-time interval type-2 fuzzy Markov jump systems.
In this paper, an online ADP-based algorithm is proposed to address the IRL online adaptive DETC problem for MZSGs with unknown nonlinear dynamics. The main contributions are given as follows:
1.
Unlike the studies on ZSG or NZSG problems in [8,19,24,30], the ZSG and NZSG problems are considered simultaneously, and an event-triggered ADP approach is used to solve them while reducing communication and computing resources.
2.
By establishing the actor NNs to approximate the optimal control strategies and auxiliary control inputs of each player, a novel event-triggered IRL algorithm is proposed to solve the mixed zero-sum games problem without using the information of system functions.
3.
In the developed off-policy DETC method, dynamic adaptive parameters are introduced into the triggering condition; compared with the static event-triggered mechanism, the triggering frequency is further reduced, thereby conserving more communication resources.
The rest of this paper is organized as follows. Section 2 describes the unknown nonlinear system and introduces the multiplayer MZSG formulation. Section 3 proposes ADP-based near-optimal control for MZSGs under the DETC mechanism. Section 4 develops a dynamic event-triggered IRL algorithm for unknown systems. Section 5 proves the UUB property of the closed-loop system states and the convergence of the critic NN weight estimation errors. Section 6 provides two examples to validate the effectiveness of the developed algorithms. Finally, Section 7 concludes this paper.

2. Problem Formulation

Consider the following multiplayer games system
$\dot{\vartheta} = f(\vartheta) + \sum_{j=1}^{N+1} h_j(\vartheta) v_j,$  (1)
where $\vartheta \in \mathbb{R}^n$ is the system state, $f(\vartheta) \in \mathbb{R}^n$ and $h_j(\vartheta) \in \mathbb{R}^{n \times m_j}$ are the unknown drift and input dynamics, $v_j \in \mathbb{R}^{m_j}$ are the control inputs, and $\vartheta(0) = \vartheta_0$ is the initial system state.
Assumption 1.
$f(0) = 0$ and $\|f(\vartheta)\| \le B_f \|\vartheta\|$, where $B_f$ is a positive constant. $f(\vartheta)$ and $h_i(\vartheta)$ are Lipschitz continuous on a compact set $\Omega \subseteq \mathbb{R}^n$ that contains the origin, and the system is stable.
For system (1), the $N+1$ participants exhibit different competitive relationships: the first $N$ participants engage in cooperative relationships akin to NZSGs, while the $N$-th and $(N+1)$-th participants engage in competitive interactions akin to ZSGs. We aim to identify equilibrium strategies that stabilize the system; the cost function of the $i$-th participant, based on its interactions with the others, is defined as
$J_i(\vartheta) = \int_0^{\infty} \Big( \vartheta^T B_i \vartheta + \sum_{j=1}^{N} v_j^T R_{i,j} v_j \Big) ds, \quad i \in \Upsilon,\ i \neq N,$  (2)
$J_N(\vartheta) = \int_0^{\infty} \Big( \vartheta^T B_N \vartheta + \sum_{j=1}^{N} v_j^T R_{N,j} v_j - \gamma^2 v_{N+1}^T v_{N+1} \Big) ds,$  (3)
where $\Upsilon = \{1, 2, \ldots, N\}$, $B_i \ge 0$, $R_{i,j} > 0$, and $\gamma$ is a positive constant.
Definition 1.
A set of control policies $v = \{v_1, \ldots, v_N, v_{N+1}\}$ that is continuous on $\Omega$ is said to be admissible, denoted by $v \in \Theta(\Omega)$, if it stabilizes system (1) on $\Omega$ and renders the cost functions (2) and (3) finite over $\Omega$.
The value functions corresponding to (2) and (3) are
$Q_i^{*}(\vartheta) = \min_{v_i \in \Theta(\Omega)} \int_t^{\infty} \Big( \vartheta^T B_i \vartheta + \sum_{j=1}^{N} v_j^T R_{i,j} v_j \Big) ds,$  (4)
$Q_N^{*}(\vartheta) = \min_{v_N \in \Theta(\Omega)} \max_{v_{N+1} \in \Theta(\Omega)} \int_t^{\infty} \Big( \vartheta^T B_N \vartheta + \sum_{j=1}^{N} v_j^T R_{N,j} v_j - \gamma^2 v_{N+1}^T v_{N+1} \Big) ds,$  (5)
where $i \in \Upsilon$, $i \neq N$. Let $r_i = \vartheta^T B_i \vartheta + \sum_{j=1}^{N} v_j^T R_{i,j} v_j$ for $i \neq N$ and $r_N = \vartheta^T B_N \vartheta + \sum_{j=1}^{N} v_j^T R_{N,j} v_j - \gamma^2 v_{N+1}^T v_{N+1}$. Our intention is to find the optimal control inputs $v^{*} = \{v_1^{*}, \ldots, v_N^{*}, v_{N+1}^{*}\}$.
If the optimal value functions (4) and (5) are differentiable on $\Omega$, the Hamiltonian functions are defined as
$H_i\big(\vartheta, \nabla Q_i, v_1, \ldots, v_N, v_{N+1}\big) = r_i(\vartheta, v_1, \ldots, v_N) + \nabla Q_i^{T} \Big( f(\vartheta) + \sum_{j=1}^{N+1} h_j(\vartheta) v_j \Big), \quad i \in \Upsilon,\ i \neq N,$  (6)
$H_N\big(\vartheta, \nabla Q_N, v_1, \ldots, v_N, v_{N+1}\big) = r_N(\vartheta, v_1, \ldots, v_N, v_{N+1}) + \nabla Q_N^{T} \Big( f(\vartheta) + \sum_{j=1}^{N+1} h_j(\vartheta) v_j \Big),$  (7)
where $\nabla Q_i = \partial Q_i / \partial \vartheta$, $i \in \Upsilon$. The HJ equations are determined by
$H_i\big(\vartheta, \nabla Q_i, v_1, \ldots, v_N, v_{N+1}\big) = 0, \quad i \in \Upsilon.$  (8)
At equilibrium, there exist $N+1$ stationarity conditions
$\frac{\partial H_i}{\partial v_i} = 0, \quad \frac{\partial H_N}{\partial v_{N+1}} = 0, \quad i \in \Upsilon.$  (9)
Next, using the stationarity conditions, we can obtain
$v_i^{*}(\vartheta) = -\frac{1}{2} R_{i,i}^{-1} h_i^{T} \nabla Q_i^{*}(\vartheta), \quad i \in \Upsilon,$  (10)
$v_{N+1}^{*}(\vartheta) = \frac{1}{2\gamma^2} h_{N+1}^{T} \nabla Q_N^{*}(\vartheta).$  (11)
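For readers who want the intermediate step, the stationarity conditions (9) yield (10) and (11) directly; using only the Hamiltonians (6) and (7) defined above, the short derivation reads
$$\frac{\partial H_i}{\partial v_i} = 2 R_{i,i} v_i + h_i^{T}(\vartheta)\nabla Q_i^{*} = 0 \;\Rightarrow\; v_i^{*} = -\frac{1}{2} R_{i,i}^{-1} h_i^{T}(\vartheta)\nabla Q_i^{*},$$
$$\frac{\partial H_N}{\partial v_{N+1}} = -2\gamma^2 v_{N+1} + h_{N+1}^{T}(\vartheta)\nabla Q_N^{*} = 0 \;\Rightarrow\; v_{N+1}^{*} = \frac{1}{2\gamma^2} h_{N+1}^{T}(\vartheta)\nabla Q_N^{*}.$$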
Remark 1.
From (10) and (11), it follows that if (8) has a unique solution, then the $N+1$ participants possess optimal controls. The scalar $\gamma$ is a parameter ensuring the uniqueness of the solution of the MZSGs. According to [39], when $\gamma$ exceeds a threshold value $\gamma^{*}$, the HJB equation admits a unique solution.
Substituting the optimal control strategies (10) and (11) into (8), the coupled HJ equations under time-triggered conditions can be obtained as
$\vartheta^T B_i \vartheta + \nabla Q_i^{*T} \Big( f(\vartheta) - \frac{1}{2} \sum_{j=1}^{N} h_j R_{j,j}^{-1} h_j^{T} \nabla Q_j^{*} + \frac{1}{2\gamma^2} h_{N+1} h_{N+1}^{T} \nabla Q_N^{*} \Big) + \frac{1}{4} \sum_{j=1}^{N} \nabla Q_j^{*T} h_j R_{j,j}^{-1} R_{i,j} R_{j,j}^{-1} h_j^{T} \nabla Q_j^{*} = 0, \quad i \in \Upsilon,\ i \neq N,$  (12)
$\vartheta^T B_N \vartheta + \frac{1}{4} \sum_{j=1}^{N} \nabla Q_j^{*T} h_j R_{j,j}^{-1} R_{N,j} R_{j,j}^{-1} h_j^{T} \nabla Q_j^{*} - \frac{1}{4\gamma^2} \nabla Q_N^{*T} h_{N+1} h_{N+1}^{T} \nabla Q_N^{*} + \nabla Q_N^{*T} \Big( f(\vartheta) - \frac{1}{2} \sum_{j=1}^{N} h_j R_{j,j}^{-1} h_j^{T} \nabla Q_j^{*} + \frac{1}{2\gamma^2} h_{N+1} h_{N+1}^{T} \nabla Q_N^{*} \Big) = 0.$  (13)

3. ADP-Based Near-Optimal Control for MZSGs Under DETC Mechanism

The dynamic event-triggered mechanism proposed in this paper uses $\{t_k\}$ to denote a strictly monotonically increasing sequence of triggering instants, which satisfies $0 = t_0 < t_1 < \cdots < t_k < \cdots$. Let $\bar{\vartheta}_k(t) = \vartheta(t_k)$ denote the state at the triggering instant $t_k$; the event-triggered errors are defined as
$e_{k,i} = \vartheta(t) - \bar{\vartheta}_k(t), \quad t_k \le t < t_{k+1}.$  (14)
Recalling (10) and (11), the event-triggered optimal control policies are obtained as
$\bar{v}_i^{*}(\bar{\vartheta}_k) = -\frac{1}{2} R_{i,i}^{-1} h_i^{T}(\bar{\vartheta}_k) \nabla \bar{Q}_i^{*}(\bar{\vartheta}_k), \quad i \in \Upsilon,$  (15)
$\bar{v}_{N+1}^{*}(\bar{\vartheta}_k) = \frac{1}{2\gamma^2} h_{N+1}^{T}(\bar{\vartheta}_k) \nabla \bar{Q}_N^{*}(\bar{\vartheta}_k),$  (16)
where $\nabla \bar{Q}_i^{*} = \nabla Q_i^{*}(\vartheta)|_{t = t_k}$ with $i \in \Upsilon$. By substituting (15) and (16), the event-triggered HJ equations are represented as
$H_i\big(\vartheta, \nabla Q_i^{*}, \bar{v}_1^{*}, \ldots, \bar{v}_N^{*}, \bar{v}_{N+1}^{*}\big) = \vartheta^T B_i \vartheta + \frac{1}{4} \sum_{j=1}^{N} \nabla \bar{Q}_j^{*}(\bar{\vartheta}_k)^{T} h_j(\bar{\vartheta}_k) R_{j,j}^{-1} R_{i,j} R_{j,j}^{-1} h_j^{T}(\bar{\vartheta}_k) \nabla \bar{Q}_j^{*}(\bar{\vartheta}_k) + \nabla Q_i^{*T} \Big( f(\vartheta) - \frac{1}{2} \sum_{j=1}^{N} h_j R_{j,j}^{-1} h_j^{T}(\bar{\vartheta}_k) \nabla \bar{Q}_j^{*}(\bar{\vartheta}_k) + \frac{1}{2\gamma^2} h_{N+1} h_{N+1}^{T}(\bar{\vartheta}_k) \nabla \bar{Q}_N^{*}(\bar{\vartheta}_k) \Big),$  (17)
$H_N\big(\vartheta, \nabla Q_N^{*}, \bar{v}_1^{*}, \ldots, \bar{v}_N^{*}, \bar{v}_{N+1}^{*}\big) = \vartheta^T B_N \vartheta + \frac{1}{4} \sum_{j=1}^{N} \nabla \bar{Q}_j^{*}(\bar{\vartheta}_k)^{T} h_j(\bar{\vartheta}_k) R_{j,j}^{-1} R_{N,j} R_{j,j}^{-1} h_j^{T}(\bar{\vartheta}_k) \nabla \bar{Q}_j^{*}(\bar{\vartheta}_k) - \frac{1}{4\gamma^2} \nabla \bar{Q}_N^{*}(\bar{\vartheta}_k)^{T} h_{N+1}(\bar{\vartheta}_k) h_{N+1}^{T}(\bar{\vartheta}_k) \nabla \bar{Q}_N^{*}(\bar{\vartheta}_k) + \nabla Q_N^{*T} \Big( f(\vartheta) - \frac{1}{2} \sum_{j=1}^{N} h_j R_{j,j}^{-1} h_j^{T}(\bar{\vartheta}_k) \nabla \bar{Q}_j^{*}(\bar{\vartheta}_k) + \frac{1}{2\gamma^2} h_{N+1} h_{N+1}^{T}(\bar{\vartheta}_k) \nabla \bar{Q}_N^{*}(\bar{\vartheta}_k) \Big).$  (18)
Assumption 2.
All optimal control policies $v_i^{*}$ are locally Lipschitz continuous with respect to the event-triggered error $e_{k,i}$. That is, for $i \in \Upsilon$, $k \in \mathbb{N}^{+}$, and $t_k \le t < t_{k+1}$, there always exists a positive constant $L_u > 0$ such that $\|v_i^{*} - \bar{v}_i^{*}\| \le L_u \|e_{k,i}\|$ holds.
Theorem 1.
Consider system (1), and assume that Assumptions 1 and 2 hold and that the cost performance values are finite. The feedback control policies (10) and (11) obtained from (8) constitute the equilibrium strategy, and a new dynamic triggering condition is applied as
$\|e_{k,i}\|^2 \le \frac{4 \sum_{i=1}^{N} \beta_i \omega_i + \sigma \lambda_{\min}(B) \|\vartheta\|^2 - 4\gamma^2 \|\bar{v}_{N,N+1}^{*}(\bar{\vartheta}_k)\|^2}{4 \lambda_{\max}(\Xi)(N+1) L_u^2},$  (19)
where $\beta_i > 0$ and the internal signal $\omega_i$ is governed by
$\dot{\omega}_i = -\beta_i \omega_i + \frac{1}{4N} \sigma \lambda_{\min}(B) \|\vartheta\|^2.$  (20)
When this triggering condition is violated, a new event is triggered; the optimal control policies (15) and (16) under the dynamic event-triggering pattern then render system (1) UUB stable. The tuning parameter $\sigma \in (0,1)$ can be chosen by the designer, and $\Xi$ and $B$ are defined in the following (23).
Proof. 
The Lyapunov function is chosen as $L = \nu_1 + \nu_2 = \sum_{i=1}^{N} Q_i^{*}(\vartheta(t)) + \sum_{i=1}^{N} \omega_i$, where $\nu_1 = \sum_{i=1}^{N} Q_i^{*}(\vartheta(t))$ and $\nu_2 = \sum_{i=1}^{N} \omega_i$.
  When implementing the optimal control policies (15) and (16) under the DETC method, the derivative of $\nu_1$ along the trajectory of system (1) is
$\dot{\nu}_1 = \sum_{i=1}^{N} \nabla Q_i^{*T} \Big( f(\vartheta) + \sum_{j=1}^{N+1} h_j \bar{v}_j^{*}(\bar{\vartheta}_k) \Big).$  (21)
According to the coupled HJ equations (12) and (13), we have
$\sum_{i=1}^{N} \nabla Q_i^{*T} f(\vartheta) = \sum_{i=1}^{N} \Big( -\vartheta^T B_i \vartheta - \sum_{j=1}^{N} v_j^{*T} R_{i,j} v_j^{*} - \nabla Q_i^{*T} \sum_{j=1}^{N+1} h_j(\vartheta) v_j^{*} \Big) + \gamma^2 v_{N,N+1}^{*T} v_{N,N+1}^{*}.$  (22)
Denote the augmented optimal control signal by $v^{*} = [v_1^{*T}, \ldots, v_N^{*T}]^{T}$ and the augmented control error by $\breve{v}^{*} = [(\bar{v}_1^{*} - v_1^{*})^{T}, \ldots, (\bar{v}_N^{*} - v_N^{*})^{T}]^{T}$. Substituting (22) into (21) yields
$\dot{\nu}_1 = -\sum_{i=1}^{N} \vartheta^T B_i \vartheta - \sum_{i=1}^{N} \sum_{j=1}^{N} v_j^{*T} R_{i,j} v_j^{*} - \sum_{i=1}^{N} \nabla Q_i^{*T} \sum_{j=1}^{N+1} h_j(\vartheta)\big( v_j^{*} - \bar{v}_j^{*} \big) + \gamma^2 v_{N,N+1}^{*T} v_{N,N+1}^{*} = -\vartheta^T B \vartheta - v^{*T} R v^{*} - 2 v^{*T} Y \bar{v}^{*} + \gamma^2 v_{N,N+1}^{*T} v_{N,N+1}^{*} \le -\vartheta^T B \vartheta - v^{*T} R v^{*} + v^{*T} R v^{*} + \breve{v}^{*T} Y^{T} R^{-1} Y \breve{v}^{*} + \gamma^2 \|\bar{v}_{N,N+1}^{*}(\bar{\vartheta}_k)\|^2 = -\vartheta^T B \vartheta + \breve{v}^{*T} \Xi \breve{v}^{*} + \gamma^2 \|\bar{v}_{N,N+1}^{*}(\bar{\vartheta}_k)\|^2,$  (23)
where $B = \sum_{i=1}^{N} B_i$, $R = \mathrm{diag}\big( \sum_{i=1}^{N} R_{i,1}, \ldots, \sum_{i=1}^{N} R_{i,N} \big)$, $\Xi = Y^{T} R^{-1} Y$, and
$Y = \begin{bmatrix} R_{1,1} h_1^{P} h_1 & R_{1,1} h_1^{P} h_2 & \cdots & R_{1,1} h_1^{P} h_{N+1} \\ \vdots & \vdots & \ddots & \vdots \\ R_{N,N} h_N^{P} h_1 & R_{N,N} h_N^{P} h_2 & \cdots & R_{N,N} h_N^{P} h_{N+1} \end{bmatrix},$
in which $h_i^{P}$ is the pseudo-inverse of $h_i$.
It can be observed that $B$, $R$, and $R^{-1}$ are all positive definite, and thus the eigenvalues of $\Xi$ are nonnegative. Because $Y$ is not a zero matrix, both the smallest eigenvalue of $B$ and the largest eigenvalue of $\Xi$ are greater than zero; then, we have
$\dot{\nu}_1 \le -\Big( 1 - \frac{1}{2}\sigma \Big) \vartheta^T B \vartheta - \frac{1}{2} \sigma \lambda_{\min}(B) \|\vartheta\|^2 + \lambda_{\max}(\Xi) \sum_{j=1}^{N+1} \|\bar{v}_j^{*} - v_j^{*}\|^2 + \gamma^2 \|\bar{v}_{N,N+1}^{*}(\bar{\vartheta}_k)\|^2 \le -\Big( 1 - \frac{1}{2}\sigma \Big) \vartheta^T B \vartheta - \frac{1}{2} \sigma \lambda_{\min}(B) \|\vartheta\|^2 + \lambda_{\max}(\Xi)(N+1) L_u^2 \|e_{k,i}\|^2 + \gamma^2 \|\bar{v}_{N,N+1}^{*}(\bar{\vartheta}_k)\|^2.$  (24)
On the other hand, from (20) we have
$\dot{\nu}_2 = -\sum_{i=1}^{N} \beta_i \omega_i + \frac{1}{4} \sigma \lambda_{\min}(B) \|\vartheta\|^2.$  (25)
Combining (24) and (25) yields
$\dot{L} \le -\Big( 1 - \frac{1}{2}\sigma \Big) \vartheta^T B \vartheta - \sum_{i=1}^{N} \beta_i \omega_i - \frac{1}{4} \sigma \lambda_{\min}(B) \|\vartheta\|^2 + \lambda_{\max}(\Xi)(N+1) L_u^2 \|e_{k,i}\|^2 + \gamma^2 \|\bar{v}_{N,N+1}^{*}(\bar{\vartheta}_k)\|^2.$  (26)
When the designed dynamic triggering condition (19) is satisfied, it follows that $\dot{L} \le -\big( 1 - \frac{1}{2}\sigma \big) \vartheta^T B \vartheta < 0$. According to Lyapunov's theorem, the state vector of system (1) gradually approaches zero.
This proof is completed.   □
Remark 2.
Using Young's inequality, the unknown term $v^{*T} Y \bar{v}^{*}$ can be eliminated. Through this operation, the triggering condition can be evaluated without any control input information while the stability of the system is still ensured, which further conserves the computational resources of the system.
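To make the dynamic triggering rule of Theorem 1 concrete, the following minimal Python sketch evaluates condition (19) together with one integration step of the internal variable (20). The numerical constants and the names (BETA, SIGMA, LAMBDA_MIN_B, LAMBDA_MAX_XI, L_U, v_bar_last) are illustrative placeholders standing in for the symbols defined above, not values taken from the paper; a single internal variable is used for simplicity.

```python
import numpy as np

# Illustrative stand-ins for the constants of Theorem 1 (assumed values).
BETA, SIGMA, GAMMA = 0.1, 0.5, 8.0
LAMBDA_MIN_B, LAMBDA_MAX_XI, L_U, N = 1.0, 2.0, 1.2, 3


def internal_signal_step(omega, x, dt):
    """One explicit Euler step of the internal dynamic variable (20)."""
    d_omega = -BETA * omega + SIGMA * LAMBDA_MIN_B * float(x @ x) / (4.0 * N)
    return omega + dt * d_omega


def event_triggered(x, x_bar, omega, v_bar_last):
    """Check condition (19): return True when it is violated, i.e. when a
    new sample/control update should be triggered."""
    e = x - x_bar                                   # event-triggered error (14)
    threshold = (4.0 * BETA * omega                 # internal-signal term
                 + SIGMA * LAMBDA_MIN_B * float(x @ x)
                 - 4.0 * GAMMA**2 * float(v_bar_last @ v_bar_last))
    threshold /= 4.0 * LAMBDA_MAX_XI * (N + 1) * L_U**2
    return float(e @ e) > threshold
```

In a closed loop, `internal_signal_step` would be integrated between events, and `event_triggered` would be checked at every sampling instant to decide whether the actuator is updated.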

4. Dynamic Event-Triggered IRL Algorithm for MZSG Problem

An on-policy iterative algorithm is first presented, which is valuable for solving the coupled HJ equations via the Bellman equation. The model-based ADP scheme outlined below as Algorithm 1 forms the basis of the subsequent online learning algorithm for unknown systems.
Algorithm 1 On-Policy ADP Algorithm
Step 1: Express the initial admissible control inputs as $v^{(0)} = (v_1^{(0)}, \ldots, v_N^{(0)}, v_{N+1}^{(0)})$ and let $s = 0$.
Step 2: Solve the following Bellman Equation (27) for $Q_i^{(s+1)}$.
Step 3: Update the control policies by (28) and (29).
Step 4: If $\max_i \|Q_i^{(s+1)} - Q_i^{(s)}\| \le h$ ($h$ is a preset positive number), stop; otherwise, return to Step 2.
Inspired by [38], we can establish the Bellman equations as follows:
$\nabla Q_i^{(s+1)T} \Big( f(\vartheta) + \sum_{j=1}^{N+1} h_j(\vartheta) v_j^{(s)} \Big) + r_i^{(s)} = 0,$  (27)
where $r_i^{(s)} = \vartheta^T B_i \vartheta + \sum_{j=1}^{N} v_j^{(s)T} R_{i,j} v_j^{(s)}$, $i \in \Upsilon$, $i \neq N$, and $r_N^{(s)} = \vartheta^T B_N \vartheta + \sum_{j=1}^{N} v_j^{(s)T} R_{N,j} v_j^{(s)} - \gamma^2 v_{N+1}^{(s)T} v_{N+1}^{(s)}$.
Based on the equilibrium condition, we have
$v_i^{(s+1)}(\vartheta) = -\frac{1}{2} R_{i,i}^{-1} h_i^{T} \nabla Q_i^{(s+1)}, \quad i \in \Upsilon,$  (28)
$v_{N+1}^{(s+1)}(\vartheta) = \frac{1}{2\gamma^2} h_{N+1}^{T} \nabla Q_N^{(s+1)}.$  (29)
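The structure of Algorithm 1 is a standard policy-iteration loop. The sketch below illustrates that loop in Python under the assumption that two hypothetical helpers are available: `solve_bellman`, which performs the policy-evaluation step (27) (realized in this paper with critic NNs), and `improve_policy`, which applies the updates (28) and (29).

```python
import numpy as np

def on_policy_adp(v0, solve_bellman, improve_policy, h=1e-3, max_iter=200):
    """Sketch of Algorithm 1: alternate policy evaluation (27) and improvement (28)-(29).

    v0             : tuple of admissible initial policies (v_1, ..., v_{N+1})
    solve_bellman  : hypothetical policy-evaluation routine returning one critic
                     parameter vector per player
    improve_policy : hypothetical routine mapping critic parameters to new policies
    h              : stopping tolerance of Step 4
    """
    v, Q_prev = v0, None
    for s in range(max_iter):
        Q = solve_bellman(v)              # Step 2: evaluate the current policies
        v = improve_policy(Q)             # Step 3: greedy policy update
        if Q_prev is not None and max(
                np.linalg.norm(q - qp) for q, qp in zip(Q, Q_prev)) <= h:
            break                         # Step 4: convergence reached
        Q_prev = Q
    return Q, v
```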
In this section, motivated by the iterative idea of Algorithm 1, a model-free IRL algorithm for the MZSG problem is given as Algorithm 2. The system dynamics are entirely unknown, i.e., $f(\vartheta)$ and $h_j(\vartheta)$ are unknown, so a series of exploration signals $e_j$ is introduced into the control inputs $v_j$ to relax the need for precise knowledge of $f(\vartheta)$ and $h_j(\vartheta)$; the original system (1) can then be reformulated as
$\dot{\vartheta} = f(\vartheta) + \sum_{j=1}^{N+1} h_j(\vartheta)\big( v_j + e_j \big).$  (30)
Algorithm 2 Model-Free IRL Algorithm for MZSGs
Step 1: Set the initial policies $v^{(0)}$.
Step 2: Solve the Bellman Equation (31) to acquire $(Q^{(s+1)}, v^{(s+1)})$.
Step 3: If $\max_i \|Q_i^{(s+1)} - Q_i^{(s)}\| \le h$ ($h$ is a preset positive number), stop; otherwise, return to Step 2.
Based on [23,38], the auxiliary exploration signals are considered, and the Bellman equation can be given in the following form:
$Q_i^{(s+1)}(\vartheta(t+T)) - Q_i^{(s+1)}(\vartheta(t)) = -2 \int_t^{t+T} \sum_{j=1}^{N} v_{i,j}^{(s+1)T}(\vartheta) R_{i,j} e_j \, ds + 2\gamma^2 \int_t^{t+T} v_{i,N+1}^{(s+1)T}(\vartheta) e_{N+1} \, ds - \int_t^{t+T} r_i^{(s)} \, ds.$  (31)
Considering (6) and (7), we have
$\dot{Q}_i^{(s+1)}(\vartheta) = \nabla Q_i^{(s+1)T}(\vartheta) \Big( f(\vartheta) + \sum_{j=1}^{N+1} h_j(\vartheta) v_j^{(s)} \Big) + \nabla Q_i^{(s+1)T}(\vartheta) \sum_{j=1}^{N+1} h_j(\vartheta) e_j = \nabla Q_i^{(s+1)T}(\vartheta) \sum_{j=1}^{N+1} h_j(\vartheta) e_j - r_i^{(s)}.$  (32)
Integrating both sides of Equation (32) over the interval from $t$ to $t+T$, where $T > 0$ is the integration interval, we have
$Q_i^{(s+1)}(\vartheta(t+T)) - Q_i^{(s+1)}(\vartheta(t)) = \int_t^{t+T} \nabla Q_i^{(s+1)T}(\vartheta) \sum_{j=1}^{N+1} h_j(\vartheta) e_j \, ds - \int_t^{t+T} r_i^{(s)} \, ds.$  (33)
To eliminate the requirement for the input dynamics, controls similar to (28) and (29) are defined as
$v_{i,j}^{(s+1)}(\vartheta) = -\frac{1}{2} R_{i,j}^{-1} h_j^{T} \nabla Q_i^{(s+1)}, \quad i, j \in \Upsilon,$  (34)
$v_{i,N+1}^{(s+1)}(\vartheta) = \frac{1}{2\gamma^2} h_{N+1}^{T} \nabla Q_i^{(s+1)}, \quad i \in \Upsilon,\ i \neq N.$  (35)
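The practical appeal of (31) is that both of its sides only involve measured data on the window $[t, t+T]$. The sketch below shows one way such a data window could be accumulated numerically; the array shapes, the simple Riemann sums, and the function names are assumptions for illustration, not the paper's implementation.

```python
import numpy as np

def irl_window_terms(x_traj, e_traj, v_hat, running_cost, R_list, dt):
    """Accumulate the three integral terms appearing in the IRL Bellman Equation (31).

    x_traj       : array of shape (K, n), sampled states on [t, t+T]
    e_traj       : list of N+1 arrays of shape (K, m_j), exploration signals e_j
    v_hat        : list of N+1 callables approximating the current policies v_{i,j}
    running_cost : callable returning r_i(x) at a state
    R_list       : list of N weighting matrices R_{i,j}
    """
    cross = 0.0
    for j in range(len(R_list)):                      # integrals of v_{i,j}^T R_{i,j} e_j
        cross += sum(float(v_hat[j](x) @ R_list[j] @ e)
                     for x, e in zip(x_traj, e_traj[j])) * dt
    dist = sum(float(v_hat[-1](x) @ e)                # integral of v_{i,N+1}^T e_{N+1}
               for x, e in zip(x_traj, e_traj[-1])) * dt
    cost = sum(running_cost(x) for x in x_traj) * dt  # integral of r_i
    return cross, dist, cost
```

The three returned quantities are then combined according to (31) (with factors $-2$, $2\gamma^2$, and $-1$, respectively) and matched against $Q_i^{(s+1)}(\vartheta(t+T)) - Q_i^{(s+1)}(\vartheta(t))$ to update the critic parameters.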
Theorem 2.
$(Q^{(s+1)}, v^{(s+1)})$ is the solution of Equations (27)–(29) if and only if it is the solution of Equation (31).
Proof. 
$(Q^{(s+1)}, v^{(s+1)})$ satisfies Equations (27)–(29) if and only if it is a solution of Equation (31). To establish the sufficient condition, it suffices to show that Equation (31) has a unique solution, which is proven by contradiction.
Assume that there exists another solution $(Q_{i,\alpha}, v_{i,\alpha})$ of Equation (31); it can easily be shown that
$\dot{Q}_{i,\alpha}(\vartheta(t)) = -2 \sum_{j=1}^{N} v_{i,j,\alpha}^{T}(\vartheta) R_{i,j} e_j + 2\gamma^2 v_{i,N+1,\alpha}^{T}(\vartheta) e_{N+1} - r_i^{(s)}.$  (36)
In addition, one has
$\frac{d}{dt}\big( Q_i^{(s+1)} - Q_{i,\alpha} \big) = -2 \sum_{j=1}^{N} \big( v_{i,j}^{(s+1)}(\vartheta) - v_{i,j,\alpha}(\vartheta) \big)^{T} R_{i,j} e_j + 2\gamma^2 \big( v_{i,N+1}^{(s+1)}(\vartheta) - v_{i,N+1,\alpha}(\vartheta) \big)^{T} e_{N+1}.$  (37)
Equation (37) holds for all $e_j$. Letting $e_j = 0$, it follows that $Q_i^{(s+1)} - Q_{i,\alpha} = a$ for some constant $a$. From $Q_i^{(s+1)}(0) - Q_{i,\alpha}(0) = 0$, the constant $a = 0$. Therefore, $Q_i^{(s+1)} = Q_{i,\alpha}$ for all $\vartheta \in \Omega$, which completes the proof. □
In this part, the following NNs are utilized to approximate the value functions and the control inputs over the compact set $\Omega$:
$Q_i(\vartheta) = W_i^{T} \phi_i(\vartheta) + \varepsilon_i(\vartheta),$  (38)
where $W_i \in \mathbb{R}^{L_1}$ are the ideal weight vectors, $\phi_i(\vartheta) \in \mathbb{R}^{L_1}$ are the activation function vectors of the critic NNs, and $\varepsilon_i(\vartheta)$ are the NN approximation errors.
The estimate $\hat{Q}_i(\vartheta)$ of $Q_i(\vartheta)$ can be expressed as
$\hat{Q}_i(\vartheta) = \hat{W}_i^{T} \phi_i(\vartheta),$  (39)
with W ^ i representing the estimated critic weight vector.
The target control policies are estimated by
$v_{i,j}(\vartheta) = M_{i,j}^{T} \varphi_{i,j}(\vartheta) + \epsilon_{i,j}(\vartheta), \quad j \in \{1, 2, \ldots, N+1\},$  (40)
where $M_{i,j} \in \mathbb{R}^{L_2 \times m_j}$ are the ideal weight vectors, $\varphi_{i,j}(\vartheta) \in \mathbb{R}^{L_2}$ are the activation functions of the actor NNs, and $\epsilon_{i,j}(\vartheta)$ are the NN approximation errors.
Denote the estimated values of $M_{i,j}$ as $\hat{M}_{i,j}$. The estimated value of $v_{i,j}(\vartheta)$ is given as follows:
$\hat{v}_{i,j}(\vartheta) = \hat{M}_{i,j}^{T} \varphi_{i,j}(\vartheta), \quad j \in \{1, 2, \ldots, N+1\}.$  (41)
Define the event-sampled reconstruction error as $\epsilon_{i,j}(\bar{\vartheta}_k) = M_{i,j}^{T} \big[ \varphi_{i,j}(\bar{\vartheta}_k + e_{k,i}) - \varphi_{i,j}(\bar{\vartheta}_k) \big] + \epsilon_{i,j}(\bar{\vartheta}_k + e_{k,i})$. When $t \in [t_k, t_{k+1})$, the optimal control signals are designed as
$v_{i,j}(\vartheta) = M_{i,j}^{T} \varphi_{i,j}(\bar{\vartheta}_k) + \epsilon_{i,j}(\bar{\vartheta}_k), \quad j \in \{1, 2, \ldots, N+1\}.$  (42)
Assumption 3.
(1) The ideal NN weight vectors $W_i$ and $M_{i,j}$ are bounded by $\|W_i\| \le W_{i,\alpha}$ and $\|M_{i,j}\| \le M_{i,\alpha}$. (2) The NN activation functions $\phi_i$ and $\varphi_{i,j}$ are bounded by $\|\phi_i\| \le \phi_{i,\alpha}$ and $\|\varphi_{i,j}\| \le \varphi_{i,\alpha}$. (3) The gradients of the NN activation functions and the NN approximation error are bounded by $\|\nabla\phi_i\| \le \phi_{i,\beta}$, $\|\nabla\varphi_{i,j}\| \le \varphi_{i,\beta}$, and $\|\varepsilon_i\| \le \varepsilon_{i,b}$.
Substituting (38) and (42) into (12) and (13), respectively, the HJI equations become
$\vartheta^T B_i \vartheta - 2 \sum_{j=1}^{N+1} \big( M_{i,j}^{T} \varphi_{i,j}(\bar{\vartheta}_k) \big)^{T} R_{i,j} \big( M_{i,j}^{T} \varphi_{i,j}(\bar{\vartheta}_k) \big) + W_i^{T} \nabla\phi_i(\vartheta) f(\vartheta) + \sum_{j=1}^{N} \big( M_{i,j}^{T} \varphi_{i,j}(\bar{\vartheta}_k) \big)^{T} R_{i,j} \big( M_{i,j}^{T} \varphi_{i,j}(\bar{\vartheta}_k) \big) + \varepsilon_{HJI1} = 0, \quad i \in \Upsilon,\ i \neq N,$  (43)
$\vartheta^T B_N \vartheta + \sum_{j=1}^{N} \big( M_{N,j}^{T} \varphi_{N,j}(\bar{\vartheta}_k) \big)^{T} R_{N,j} \big( M_{N,j}^{T} \varphi_{N,j}(\bar{\vartheta}_k) \big) + W_N^{T} \nabla\phi_N(\vartheta) f(\vartheta) - \gamma^2 \big( M_{N,N+1}^{T} \varphi_{N,N+1}(\bar{\vartheta}_k) \big)^{T} \big( M_{N,N+1}^{T} \varphi_{N,N+1}(\bar{\vartheta}_k) \big) - 2 \sum_{j=1}^{N+1} \big( M_{N,j}^{T} \varphi_{N,j}(\bar{\vartheta}_k) \big)^{T} R_{N,j} \big( M_{N,j}^{T} \varphi_{N,j}(\bar{\vartheta}_k) \big) + \varepsilon_{HJI2} = 0,$  (44)
where $\varepsilon_{HJI1} = 2 \sum_{j=1}^{N} \big( M_{i,j}^{T} \varphi_{i,j}(\bar{\vartheta}_k) \big)^{T} R_{i,j} \epsilon_{i,j}(\bar{\vartheta}_k) + \sum_{j=1}^{N} \epsilon_{i,j}^{T}(\bar{\vartheta}_k) R_{i,j} \epsilon_{i,j}(\bar{\vartheta}_k) - 4 \sum_{j=1}^{N+1} \big( M_{i,j}^{T} \varphi_{i,j}(\bar{\vartheta}_k) \big)^{T} R_{i,j} \epsilon_{i,j}(\bar{\vartheta}_k) - 2 \sum_{j=1}^{N+1} \epsilon_{i,j}^{T}(\bar{\vartheta}_k) R_{i,j} \epsilon_{i,j}(\bar{\vartheta}_k) + \nabla\varepsilon_i^{T}(\vartheta) f(\vartheta)$ and $\varepsilon_{HJI2} = 2 \sum_{j=1}^{N} \big( M_{N,j}^{T} \varphi_{N,j}(\bar{\vartheta}_k) \big)^{T} R_{N,j} \epsilon_{N,j}(\bar{\vartheta}_k) + \sum_{j=1}^{N} \epsilon_{N,j}^{T}(\bar{\vartheta}_k) R_{N,j} \epsilon_{N,j}(\bar{\vartheta}_k) - 2\gamma^2 \big( M_{N,N+1}^{T} \varphi_{N,N+1}(\bar{\vartheta}_k) \big)^{T} \epsilon_{N,N+1}(\bar{\vartheta}_k) - 4 \sum_{j=1}^{N+1} \big( M_{N,j}^{T} \varphi_{N,j}(\bar{\vartheta}_k) \big)^{T} R_{N,j} \epsilon_{N,j}(\bar{\vartheta}_k) - 2 \sum_{j=1}^{N+1} \epsilon_{N,j}^{T}(\bar{\vartheta}_k) R_{N,j} \epsilon_{N,j}(\bar{\vartheta}_k) + \nabla\varepsilon_N^{T}(\vartheta) f(\vartheta)$.
The estimated value functions are given in (39), and the estimated control signals can be obtained as
$\hat{v}_{i,j}(\vartheta) = \hat{M}_{i,j}^{T} \varphi_{i,j}(\bar{\vartheta}_k).$  (45)
Based on (31) and (39), one has the residual error
$\bar{e}_i = \hat{W}_i^{T} \phi_i(\vartheta(t+T)) - \hat{W}_i^{T} \phi_i(\vartheta(t)) + \int_t^{t+T} \hat{r}_i \, ds + \Big( \int_t^{t+T} 2 \sum_{j=1}^{N} e_j^{T} R_{i,j} \varphi_{i,j}(\bar{\vartheta}_k) \, ds \Big) \cdot \mathrm{vec}\big( \hat{M}_{i,j} \big) - \Big( \int_t^{t+T} \gamma^2 e_{N+1}^{T} \varphi_{N,N+1}(\bar{\vartheta}_k) \, ds \Big) \cdot \mathrm{vec}\big( \hat{M}_{N,N+1} \big),$  (46)
where $\hat{r}_i = \vartheta^T B_i \vartheta + \sum_{j=1}^{N} \hat{v}_{i,j}^{T} R_{i,j} \hat{v}_{i,j}$, $i \in \Upsilon$, $i \neq N$, and $\hat{r}_N = \vartheta^T B_N \vartheta + \sum_{j=1}^{N} \hat{v}_{N,j}^{T} R_{N,j} \hat{v}_{N,j} - \gamma^2 \hat{v}_{N,N+1}^{T} \hat{v}_{N,N+1}$.
Define the estimated augmented NN weight vector as
$\hat{\Gamma}_i = \big[ \hat{W}_i^{T},\ \mathrm{vec}(\hat{M}_{i,j})^{T},\ \mathrm{vec}(\hat{M}_{N,N+1})^{T} \big]^{T}.$  (47)
The corresponding ideal weight vector of $\hat{\Gamma}_i$ is denoted as
$\Gamma_i = \big[ W_i^{T},\ \mathrm{vec}(M_{i,j})^{T},\ \mathrm{vec}(M_{N,N+1})^{T} \big]^{T}.$  (48)
Define
$F_1 = \int_t^{t+T} \hat{r}_i \, ds, \quad F_2 = \phi_i(\vartheta(t+T)) - \phi_i(\vartheta(t)), \quad F_3 = \int_t^{t+T} 2 \sum_{j=1}^{N} e_j^{T} R_{i,j} \varphi_{i,j}(\bar{\vartheta}_k) \, ds, \quad F_4 = -\int_t^{t+T} \gamma^2 e_{N+1}^{T} \varphi_{i,N+1}(\bar{\vartheta}_k) \, ds.$
Then, the residual error $\bar{e}_i$ becomes
$\bar{e}_i = F_1 + \bar{F} \hat{\Gamma}_i,$  (49)
where $\bar{F} = [F_2, F_3, F_4]$.
Let $\eta_1 = \frac{1}{2} \bar{e}_i^{T} \bar{e}_i$; the augmented weight vectors are updated by the normalized gradient descent algorithm
$\dot{\hat{\Gamma}}_i = -\lambda_i \frac{\bar{F}^{T}}{\big( \bar{F}\bar{F}^{T} + 1 \big)^2} \bar{e}_i,$  (50)
where $\lambda_i > 0$ is the adaptive gain.
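Update law (50) is a normalized gradient descent on $\eta_1 = \frac{1}{2}\bar{e}_i^{T}\bar{e}_i$. A minimal discrete-time sketch of one such update step is given below; the default values mirror the simulation settings later in the paper ($\lambda_i = 0.75$, sampling time $0.005$ s), while the function name and calling convention are assumptions.

```python
import numpy as np

def weight_update_step(Gamma_hat, F_bar, e_bar, lam=0.75, dt=0.005):
    """One explicit Euler step of the normalized gradient-descent law (50).

    Gamma_hat : current augmented weight estimate (stacked critic/actor weights)
    F_bar     : regressor vector [F2, F3, F4] for the current data window
    e_bar     : scalar residual error from (49)
    """
    denom = (float(F_bar @ F_bar) + 1.0) ** 2   # (F_bar F_bar^T + 1)^2 normalization
    Gamma_dot = -lam * F_bar * e_bar / denom    # gradient of 0.5*e_bar^2 is F_bar^T * e_bar
    return Gamma_hat + dt * Gamma_dot
```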
Assumption 4.
Consider the system in (1) and let $\bar{\rho} = \bar{F}^{T} / \big( \bar{F}\bar{F}^{T} + 1 \big)$. There exist constants $m_1 > 0$, $m_2 > 0$, and $T > 0$ satisfying
$m_1 I \le \int_t^{t+T} \bar{\rho}(\tau) \bar{\rho}^{T}(\tau) \, d\tau \le m_2 I,$  (51)
which ensures that $\bar{\rho}$ remains persistently exciting over the interval $[t, t+T]$.
Remark 3.
The flowchart of the proposed method in Theorem 3 is given in Figure 1. In order to ensure thorough learning of the neural networks, the regressor $\bar{\rho}(t)$ should be persistently excited so that the estimated weights $\hat{\Gamma}_i$ can converge to the ideal weights $\Gamma_i$; the term $\big( \bar{F}\bar{F}^{T} + 1 \big)^2$ is incorporated into the NN weight tuning rule for normalization and to facilitate the stability analysis.
Considering the system dynamics (1), the augmented system state is set as $\S = [\vartheta^{T}, \bar{\vartheta}_k^{T}, \tilde{\Gamma}_i^{T}]^{T}$ with $\tilde{\Gamma}_i = \Gamma_i - \hat{\Gamma}_i$, and the combined control system can be derived as
$\dot{\S}(t) = \begin{bmatrix} f(\vartheta) + \sum_{j=1}^{N+1} h_j(\vartheta) \hat{v}_j(\bar{\vartheta}_k) \\ 0 \\ \lambda_i \dfrac{\bar{F}^{T}}{\big( \bar{F}\bar{F}^{T} + 1 \big)^2} \bar{e}_i \end{bmatrix}, \quad t \in [t_k, t_{k+1}).$  (52)
Additionally, at the triggering instants we have
$\S(t) = \S(t^{-}) + \begin{bmatrix} 0 \\ \vartheta - \bar{\vartheta}_k \\ 0 \end{bmatrix}, \quad t = t_{k+1},$  (53)
where $\S(t^{-}) = \lim_{n \to 0^{+}} \S(t - n)$ with $n \in (0, t_{k+1} - t_k)$.

5. Stability Analysis

Theorem 3.
Assume that Assumptions 1–4 hold and that the solution of the Bellman Equation (31) exists. For system (1), the DETC strategies are given by (39) and (41), and the augmented weight vector $\hat{\Gamma}_i$ is updated according to (50). Provided that the triggering condition is
$\|e_{k,i}\|^2 \le \frac{4\Big( \sum_{i=1}^{N} \beta_{1i} \mathcal{R}_i - G(\bar{\vartheta}_k) - \sum_{i=1}^{N} Z_i(\bar{\vartheta}_k) \Big) + \varpi^2 \lambda_{\min}(B) \|\vartheta\|^2}{4\Big( L_u^2 \sum_{i=1}^{N} \sum_{j=1}^{N} \|\bar{r}_{i,j}\|^2 - 2\gamma^2 N L_u^2 \Big)},$  (54)
where $G(\bar{\vartheta}_k)$ and $Z_i(\bar{\vartheta}_k)$ are given in (59) and (64), $\varpi \in (0,1)$ is a design parameter, and the internal dynamic signal $\mathcal{R}_i$ is governed by
$\dot{\mathcal{R}}_i = -\beta_{1i} \mathcal{R}_i + \frac{\varpi^2 \lambda_{\min}(B) \|\vartheta\|^2}{4N},$  (55)
then the system states and the NN weight estimation errors are UUB.
Proof. 
The Lyapunov function is chosen as
$\Lambda = \Lambda_1 + \Lambda_2 + \Lambda_3,$  (56)
where $\Lambda_1 = K_1 + K_2 = \sum_{i=1}^{N} Q_i^{*}(\vartheta) + \sum_{i=1}^{N} \mathcal{R}_i$ with $K_1 = \sum_{i=1}^{N} Q_i^{*}(\vartheta)$ and $K_2 = \sum_{i=1}^{N} \mathcal{R}_i$, $\Lambda_2 = \sum_{i=1}^{N} \frac{1}{2\lambda_i} \tilde{\Gamma}_i^{T} \tilde{\Gamma}_i$, and $\Lambda_3 = \sum_{i=1}^{N} Q_i^{*}(\bar{\vartheta}_k)$.
Case 1: When $t \in [t_k, t_{k+1})$, $k \in \mathbb{N}$, considering the system dynamics (1) and applying the DETC policies in (45), it follows that $\dot{\Lambda}_1(t) = \dot{K}_1 + \dot{K}_2$ with $\dot{K}_1 = \sum_{i=1}^{N} \nabla Q_i^{*}(\vartheta)^{T} \big( f(\vartheta) + \sum_{j=1}^{N+1} h_j \hat{v}_{i,j}(\bar{\vartheta}_k) \big)$ and $\dot{K}_2 = \sum_{i=1}^{N} \dot{\mathcal{R}}_i$, $\dot{\Lambda}_2(t) = \sum_{i=1}^{N} \frac{1}{\lambda_i} \tilde{\Gamma}_i^{T} \dot{\tilde{\Gamma}}_i$, and $\dot{\Lambda}_3(t) = 0$.
Considering (38), $\dot{K}_1(t)$ becomes
$\sum_{i=1}^{N} \dot{Q}_i(\vartheta) = \sum_{i=1}^{N} W_i^{T} \nabla\phi_i(\vartheta) \Big( f(\vartheta) + \sum_{j=1}^{N+1} h_j \hat{M}_{i,j}^{T} \varphi_{i,j}(\bar{\vartheta}_k) \Big) + \Delta_\varepsilon(\vartheta),$  (57)
where $\Delta_\varepsilon(\vartheta) = \sum_{i=1}^{N} \nabla\varepsilon_i(\vartheta) \big( f(\vartheta) + \sum_{j=1}^{N+1} h_j \hat{M}_{i,j}^{T} \varphi_{i,j}(\bar{\vartheta}_k) \big)$ denotes the lumped NN reconstruction residual.
Based on Equations (10)–(13), we have
$\sum_{i=1}^{N} \nabla Q_i^{*}(\vartheta)^{T} f(\vartheta) = -\sum_{i=1}^{N} \vartheta^T B_i \vartheta - \sum_{i=1}^{N} \nabla Q_i^{*}(\vartheta)^{T} h_{N+1} v_{i,N+1}^{*} + \sum_{i=1}^{N} \sum_{j=1}^{N} v_{i,j}^{*T} R_{i,j} v_{i,j}^{*} + \gamma^2 v_{N,N+1}^{*T} v_{N,N+1}^{*}.$  (58)
In light of (10), (11), and (58), and based on the fact $R_{i,j} = \bar{r}_{i,j}^{T} \bar{r}_{i,j}$ and Assumption 2, $\dot{K}_1$ becomes
$$\begin{aligned} \dot{K}_1 &= -\sum_{i=1}^{N} \vartheta^T B_i \vartheta - 2\gamma^2 \sum_{i=1}^{N} v_{i,N+1}^{*T}(\vartheta) v_{i,N+1}^{*}(\vartheta) + \sum_{i=1}^{N} \sum_{j=1}^{N} v_{i,j}^{*T}(\vartheta) R_{i,j} v_{i,j}^{*}(\vartheta) + \gamma^2 v_{N,N+1}^{*T}(\vartheta) v_{N,N+1}^{*}(\vartheta) \\ &\quad - 2 \sum_{i=1}^{N} \sum_{j=1}^{N} v_{i,j}^{*T}(\vartheta) R_{i,j} \hat{v}_{i,j}(\bar{\vartheta}_k) + 2\gamma^2 \sum_{i=1}^{N} v_{i,N+1}^{*T}(\vartheta) \hat{v}_{i,N+1}(\bar{\vartheta}_k) + \Delta_\varepsilon(\vartheta) \\ &\le -\vartheta^T B \vartheta + \sum_{i=1}^{N} \sum_{j=1}^{N} \|\bar{r}_{i,j}\|^2 \|\hat{v}_{i,j}(\bar{\vartheta}_k) - v_{i,j}^{*}(\vartheta)\|^2 + \sum_{i=1}^{N} \sum_{j=1}^{N} \|\bar{r}_{i,j}\|^2 \|\hat{v}_{i,j}(\bar{\vartheta}_k)\|^2 \\ &\quad - 2\gamma^2 \sum_{i=1}^{N} \|\hat{v}_{i,N+1}(\bar{\vartheta}_k) - v_{i,N+1}^{*}(\vartheta)\|^2 + 2\gamma^2 \sum_{i=1}^{N} \|\hat{v}_{i,N+1}(\bar{\vartheta}_k)\|^2 + \Delta_M \\ &\le -\vartheta^T B \vartheta + \Big( L_u^2 \sum_{i=1}^{N} \sum_{j=1}^{N} \|\bar{r}_{i,j}\|^2 - 2\gamma^2 N L_u^2 \Big) \|e_{k,i}\|^2 + G(\bar{\vartheta}_k) \\ &\le -\Big( 1 - \frac{\varpi^2}{2} \Big) \vartheta^T B \vartheta - \frac{\varpi^2}{2} \lambda_{\min}(B) \|\vartheta\|^2 + \Big( L_u^2 \sum_{i=1}^{N} \sum_{j=1}^{N} \|\bar{r}_{i,j}\|^2 - 2\gamma^2 N L_u^2 \Big) \|e_{k,i}\|^2 + G(\bar{\vartheta}_k), \end{aligned}$$  (59)
where $B$ is given in (23) and $G(\bar{\vartheta}_k) = \Delta_M + \sum_{i=1}^{N} \sum_{j=1}^{N} \|\bar{r}_{i,j}\|^2 \|\hat{v}_{i,j}(\bar{\vartheta}_k)\|^2 + 2\gamma^2 \sum_{i=1}^{N} \|\hat{v}_{i,N+1}(\bar{\vartheta}_k)\|^2$, in which, according to Assumption 3, $\Delta_M$ is a positive constant bounding $\Delta_\varepsilon(\vartheta)$. Based on (49), we obtain
$\bar{e}_i = \bar{F}\big( \Gamma_i - \tilde{\Gamma}_i \big) + \int_t^{t+T} \big( \vartheta^T B_i \vartheta + \hat{U}_i \big) ds,$  (60)
where $\hat{U}_i = \sum_{j=1}^{N} \hat{v}_{i,j}^{T}(\bar{\vartheta}_k) R_{i,j} \hat{v}_{i,j}(\bar{\vartheta}_k)$, $i \in \Upsilon$, $i \neq N$, and $\hat{U}_N = \sum_{j=1}^{N} \hat{v}_{N,j}^{T}(\bar{\vartheta}_k) R_{N,j} \hat{v}_{N,j}(\bar{\vartheta}_k) - \gamma^2 \hat{v}_{N,N+1}^{T}(\bar{\vartheta}_k) \hat{v}_{N,N+1}(\bar{\vartheta}_k)$.
Based on (43) and (44), we have
$\int_t^{t+T} \big( \vartheta^T B_i \vartheta + \hat{U}_i \big) ds = \int_t^{t+T} \big( -W_i^{T} \nabla\phi_i(\vartheta) f(\vartheta) - \varepsilon_{HJI1} + \hat{U}_i - \bar{U}_1 + \bar{U}_2 \big) ds, \quad i \in \Upsilon,\ i \neq N,$  (61)
$\int_t^{t+T} \big( \vartheta^T B_N \vartheta + \hat{U}_N \big) ds = \int_t^{t+T} \Big( -W_N^{T} \nabla\phi_N(\vartheta) f(\vartheta) - \varepsilon_{HJI2} + \hat{U}_N - \bar{U}_3 + \bar{U}_4 + \gamma^2 \big( M_{N,N+1}^{T} \varphi_{N,N+1}(\bar{\vartheta}_k) \big)^{T} \big( M_{N,N+1}^{T} \varphi_{N,N+1}(\bar{\vartheta}_k) \big) \Big) ds,$  (62)
where $\bar{U}_1 = \sum_{j=1}^{N} \big( M_{i,j}^{T} \varphi_{i,j}(\bar{\vartheta}_k) \big)^{T} R_{i,j} \big( M_{i,j}^{T} \varphi_{i,j}(\bar{\vartheta}_k) \big)$, $\bar{U}_2 = 2 \sum_{j=1}^{N+1} \big( M_{i,j}^{T} \varphi_{i,j}(\bar{\vartheta}_k) \big)^{T} R_{i,j} \big( M_{i,j}^{T} \varphi_{i,j}(\bar{\vartheta}_k) \big)$, $\|\varepsilon_{HJI1}\| \le \varepsilon_{hm1}$, $\bar{U}_3 = \sum_{j=1}^{N} \big( M_{N,j}^{T} \varphi_{N,j}(\bar{\vartheta}_k) \big)^{T} R_{N,j} \big( M_{N,j}^{T} \varphi_{N,j}(\bar{\vartheta}_k) \big)$, $\bar{U}_4 = 2 \sum_{j=1}^{N+1} \big( M_{N,j}^{T} \varphi_{N,j}(\bar{\vartheta}_k) \big)^{T} R_{N,j} \big( M_{N,j}^{T} \varphi_{N,j}(\bar{\vartheta}_k) \big)$, and $\|\varepsilon_{HJI2}\| \le \varepsilon_{hm2}$. Define $S_{1i} = \int_t^{t+T} \big( \bar{U}_2 - \bar{U}_1 - \varepsilon_{HJI1} \big) ds$, $S_{2i} = \int_t^{t+T} \hat{U}_i \, ds$, $i \in \Upsilon$, $i \neq N$, $S_{3i} = -\int_t^{t+T} W_i^{T} \nabla\phi_i(\vartheta) f(\vartheta) \, ds$, and $S_{1N} = \int_t^{t+T} \big( \bar{U}_4 - \bar{U}_3 - \varepsilon_{HJI2} + \gamma^2 ( M_{N,N+1}^{T} \varphi_{N,N+1}(\bar{\vartheta}_k) )^{T} ( M_{N,N+1}^{T} \varphi_{N,N+1}(\bar{\vartheta}_k) ) - \gamma^2 ( \hat{M}_{N,N+1}^{T} \varphi_{N,N+1}(\bar{\vartheta}_k) )^{T} ( \hat{M}_{N,N+1}^{T} \varphi_{N,N+1}(\bar{\vartheta}_k) ) \big) ds$, $S_{2N} = \int_t^{t+T} \sum_{j=1}^{N} ( \hat{M}_{N,j}^{T} \varphi_{N,j}(\bar{\vartheta}_k) )^{T} R_{N,j} ( \hat{M}_{N,j}^{T} \varphi_{N,j}(\bar{\vartheta}_k) ) \, ds$.
If the sampling interval $T$ is sufficiently small, the integral terms $S_{1i}$, $S_{2i}$, $S_{3i}$, $S_{1N}$, and $S_{2N}$ can be approximated as $S_{1i} \approx T\big( \bar{U}_2 - \bar{U}_1 - \varepsilon_{HJI1} \big)$, $S_{2i} \approx T \hat{U}_i$, $i \in \Upsilon$, $i \neq N$, and $S_{3i} \approx -T W_i^{T} \nabla\phi_i(\vartheta) f(\vartheta)$. When $i = N$, $S_{1N} \approx T\big( \bar{U}_4 - \bar{U}_3 - \varepsilon_{HJI2} + \gamma^2 ( M_{N,N+1}^{T} \varphi_{N,N+1}(\bar{\vartheta}_k) )^{T} ( M_{N,N+1}^{T} \varphi_{N,N+1}(\bar{\vartheta}_k) ) - \gamma^2 ( \hat{M}_{N,N+1}^{T} \varphi_{N,N+1}(\bar{\vartheta}_k) )^{T} ( \hat{M}_{N,N+1}^{T} \varphi_{N,N+1}(\bar{\vartheta}_k) ) \big)$ and $S_{2N} \approx T \sum_{j=1}^{N} ( \hat{M}_{N,j}^{T} \varphi_{N,j}(\bar{\vartheta}_k) )^{T} R_{N,j} ( \hat{M}_{N,j}^{T} \varphi_{N,j}(\bar{\vartheta}_k) )$.
By Assumption 3, there exists a constant $d_i$ such that $\|S_{1i}\| \le d_i$.
Then, $F_1 = S_{1i} + S_{2i} + S_{3i}$. Thus, we have
$\bar{e}_i = \bar{F}\big( \Gamma_i - \tilde{\Gamma}_i \big) + S_{1i} + S_{2i} + S_{3i}, \quad i \in \Upsilon.$  (63)
According to (50) and (60), one has
$$\begin{aligned} \dot{\Lambda}_2 &= \sum_{i=1}^{N} \frac{1}{\lambda_i} \tilde{\Gamma}_i^{T} \dot{\tilde{\Gamma}}_i = \sum_{i=1}^{N} \tilde{\Gamma}_i^{T} \frac{\bar{F}^{T}}{\big( \bar{F}\bar{F}^{T} + 1 \big)^2} \Big( -\bar{F}\tilde{\Gamma}_i + \bar{F}\Gamma_i + S_{1i} + S_{2i} + S_{3i} \Big) \\ &\le \sum_{i=1}^{N} \Big( -\tilde{\Gamma}_i^{T} \bar{\rho}\bar{\rho}^{T} \tilde{\Gamma}_i + \tilde{\Gamma}_i^{T} \bar{\rho}\bar{\rho}^{T} \Gamma_i + \frac{1}{2} d_i \|\tilde{\Gamma}_i\| + \frac{1}{4} \|\tilde{\Gamma}_i\|^2 + Z_i(\bar{\vartheta}_k) + \frac{1}{4} \|\tilde{\Gamma}_i\|^2 + \frac{1}{4} D_1^2 \|\vartheta\|^2 \Big) \\ &\le \sum_{i=1}^{N} \Big( -\Big( \frac{m_1}{T} - \frac{1}{2} \Big) \|\tilde{\Gamma}_i\|^2 + Z_i(\bar{\vartheta}_k) + \frac{1}{4} D_1^2 \|\vartheta\|^2 + \frac{m_2}{T} \big( W_{i,\alpha}^2 + M_{i,\alpha}^2 \big) + \frac{d_i}{2} \|\tilde{\Gamma}_i\| \Big), \quad i \in \Upsilon, \end{aligned}$$  (64)
where $Z_i(\bar{\vartheta}_k) = \frac{T^2}{4} \big\| \sum_{j=1}^{N} \hat{v}_{i,j}^{T}(\bar{\vartheta}_k) R_{i,j} \hat{v}_{i,j}(\bar{\vartheta}_k) \big\|^2$ and $D_1 = T W_{i,\alpha} \phi_{i,\beta} B_f$.
Using (55), (56), (59), and (64), $\dot{\Lambda}$ becomes
$$\begin{aligned} \dot{\Lambda} &\le -\sum_{i=1}^{N} \beta_{1i} \mathcal{R}_i - \Big( 1 - \frac{\varpi^2}{2} \Big) \vartheta^T B \vartheta - \frac{\varpi^2}{4} \lambda_{\min}(B) \|\vartheta\|^2 + \Big( L_u^2 \sum_{i=1}^{N} \sum_{j=1}^{N} \|\bar{r}_{i,j}\|^2 - 2\gamma^2 N L_u^2 \Big) \|e_{k,i}\|^2 + G(\bar{\vartheta}_k) \\ &\quad + \sum_{i=1}^{N} \Big( -\Big( \frac{m_1}{T} - \frac{1}{2} \Big) \|\tilde{\Gamma}_i\|^2 + Z_i(\bar{\vartheta}_k) + \frac{1}{4} D_1^2 \|\vartheta\|^2 + \frac{m_2}{T} \big( W_{i,\alpha}^2 + M_{i,\alpha}^2 \big) + \frac{d_i}{2} \|\tilde{\Gamma}_i\| \Big). \end{aligned}$$  (65)
If the triggering condition (54) is satisfied, we have
$$\begin{aligned} \dot{\Lambda} &\le -\Big( 1 - \frac{\varpi^2}{2} \Big) \vartheta^T B \vartheta + \sum_{i=1}^{N} \Big( -\Big( \frac{m_1}{T} - \frac{1}{2} \Big) \|\tilde{\Gamma}_i\|^2 + \frac{1}{4} D_1^2 \|\vartheta\|^2 + \frac{m_2}{T} \big( W_{i,\alpha}^2 + M_{i,\alpha}^2 \big) + \frac{d_i}{2} \|\tilde{\Gamma}_i\| \Big) \\ &\le -\Big( 1 - \frac{\varpi^2}{2} \Big) \vartheta^T B \vartheta - \sum_{i=1}^{N} \eth_1 \Big( \|\tilde{\Gamma}_i\| - \frac{m_2 \big( W_{i,\alpha}^2 + M_{i,\alpha}^2 \big) + T d_i / 2}{2 m_1 - T} \Big)^2 + \sum_{i=1}^{N} \big( m_3 + \eth_2 \|\vartheta\|^2 \big), \end{aligned}$$  (66)
where $\eth_1 = \frac{m_1}{T} - \frac{1}{2}$, $\eth_2 = \frac{1}{4} D_1^2$, and $m_3 = \frac{\big( 2 m_2 ( W_{i,\alpha}^2 + M_{i,\alpha}^2 ) + d_i T \big)^2}{8 T ( 2 m_1 - T )}$.
Let the parameters $\varpi$, $m_1$ and the interval $T$ be chosen such that $\varpi^2 \lambda_{\min}(B) - \frac{1}{4} D_1^2 > 0$ and $m_1 > \frac{T}{2}$. Then, $\dot{\Lambda}(t) < 0$ if
$\|\vartheta\| > \sqrt{\frac{m_3}{\eth_2}}, \quad \|\tilde{\Gamma}_i\| > \sqrt{\frac{m_3}{\eth_1}} + \frac{2 m_2 ( W_{i,\alpha}^2 + M_{i,\alpha}^2 ) + T d_i}{2 ( 2 m_1 - T )}.$  (67)
Therefore, the system states and the NN weight estimation errors are UUB.
Case 2: When $t = t_{k+1}$, the jump of the Lyapunov function can be expressed as
$\Delta\Lambda(t) = \frac{1}{2} \Big( \tilde{\Gamma}_i^{T}\big( \bar{\vartheta}(t_k^{+}) \big) \tilde{\Gamma}_i\big( \bar{\vartheta}(t_k^{+}) \big) - \tilde{\Gamma}_i^{T}\big( \bar{\vartheta}(t_k) \big) \tilde{\Gamma}_i\big( \bar{\vartheta}(t_k) \big) \Big) + Q_i^{*}(\bar{\vartheta}_{k+1}) - Q_i^{*}(\bar{\vartheta}_k) + Q_i^{*}\big( \vartheta(t_{k+1}) \big) - Q_i^{*}\big( \vartheta(t_{k+1}^{-}) \big).$  (68)
Based on Case 1, $Q(t)$ is monotonically decreasing on $[t_k, t_{k+1})$. Thus, one obtains $Q(t_k + \ell) < Q(t_k)$ for $\ell \in (0, t_{k+1} - t_k)$. Applying the limit to each side of the inequality, it follows that $\Lambda(t_{k+1}^{-}) \le \Lambda(t_k)$. Based on that fact, we have
$Q_i^{*}\big( \bar{\vartheta}(t_k^{+}) \big) + \frac{1}{2} \tilde{\Gamma}_i^{T}\big( \bar{\vartheta}(t_k^{+}) \big) \tilde{\Gamma}_i\big( \bar{\vartheta}(t_k^{+}) \big) \le Q_i^{*}\big( \bar{\vartheta}(t_k) \big) + \frac{1}{2} \tilde{\Gamma}_i^{T}\big( \bar{\vartheta}(t_k) \big) \tilde{\Gamma}_i\big( \bar{\vartheta}(t_k) \big).$  (69)
Moreover, $Q_i^{*}(\vartheta)$ is continuous at the triggering instant $t_k$ with $k \in \mathbb{N}$. According to the results of Case 1, one has $Q_i^{*}(\bar{\vartheta}_{k+1}) \le Q_i^{*}(\bar{\vartheta}_k)$. From the analysis presented above, it follows that $\Delta\Lambda(t) \le 0$. □

6. Simulation

In this section, two nonlinear system examples are provided to verify the effectiveness of the algorithm.

6.1. A Nonlinear Example

In this example, to assess the efficacy of our method, we consider four players as in [39]:
$\dot{\vartheta} = \begin{bmatrix} \vartheta_2 \\ -0.5\vartheta_1 - 0.5\vartheta_2\big( 1 + (\cos(2\vartheta_1) + 2)^2 \big) \end{bmatrix} + \begin{bmatrix} 0 \\ \sin\vartheta_1 \end{bmatrix} v_1(\vartheta) + \begin{bmatrix} 0 \\ 2\cos(2\vartheta_1) + 4 \end{bmatrix} v_2(\vartheta) + \begin{bmatrix} 0 \\ \sin\vartheta_2 \end{bmatrix} v_3(\vartheta) + \begin{bmatrix} 0 \\ 2\sin(2\vartheta_1) + 4 \end{bmatrix} d_1(\vartheta).$  (70)
Define the cost performances as
$Q_1 = \int_0^{\infty} \big( \vartheta^T B_1 \vartheta + v_1^T R_{11} v_1 + v_2^T R_{12} v_2 + v_3^T R_{13} v_3 \big) ds,$  (71)
$Q_2 = \int_0^{\infty} \big( \vartheta^T B_2 \vartheta + v_1^T R_{21} v_1 + v_2^T R_{22} v_2 + v_3^T R_{23} v_3 \big) ds,$  (72)
$Q_3 = \int_0^{\infty} \big( \vartheta^T B_3 \vartheta + v_1^T R_{31} v_1 + v_2^T R_{32} v_2 + v_3^T R_{33} v_3 - \gamma^2 d_1^T d_1 \big) ds.$  (73)
Set $\gamma = 8$, $R_{11} = R_{12} = R_{13} = R_{21} = R_{22} = R_{23} = 2I$, and $R_{31} = R_{32} = R_{33} = I$, where $I$ has appropriate dimensions, and let $B_1 = B_2 = \mathrm{diag}(9, 9)$ and $B_3 = \mathrm{diag}(5, 5)$. The values of the other quantities are given in Table 1.
In the design, assume that all the initial probing control inputs are $a = 37 e^{-0.061 t} \big( \sin^2(t)\cos(t) + \sin^2(2t)\cos(0.1 t) \big)$. The simulation parameters are shown in Table 1.
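For reproducibility, the probing (exploration) signal stated above is easy to generate; a small helper (the function name is ours) is:

```python
import numpy as np

def probing_signal(t):
    """Exploration input a(t) = 37 e^{-0.061 t} (sin^2(t) cos(t) + sin^2(2t) cos(0.1 t))."""
    return 37.0 * np.exp(-0.061 * t) * (np.sin(t) ** 2 * np.cos(t)
                                        + np.sin(2.0 * t) ** 2 * np.cos(0.1 * t))
```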
The weight values of the first critic NN converge to [0.0735, −0.0450, −0.0391, 0.0933, −0.0863, 0.0406, −0.0506, 0.0049].
The weight values of the second critic NN converge to [0.0630, 0.1011, −0.0737, 0.0608, −0.0660, 0.0269, −0.0443, −0.0504].
The weight values of the third critic NN converge to [−0.0478, 0.0633, −0.0894, 0.0563, −0.0596, 0.0573, −0.0492, −0.0533].
The weight values of the first actor NN converge to [−0.0841, 0.0986, −0.0840, −0.0064, 0.1016, −0.0182, −0.0609, −0.0122].
The weight values of the second actor NN converge to [0.0368, 0.0522, −0.0989, −0.0856, 0.0639, −0.0972, −0.0370, 0.0443].
The weight values of the third actor NN converge to [−0.0234, 0.0595, −0.0533, 0.0918, −0.0593, −0.0925, 0.1057, 0.0467].
The weight values of the fourth actor NN converge to [−0.0901, −0.0371, −0.0823, −0.0966, 0.0726, 0.0672, −0.0004, 0.0482].
Figure 2, Figure 3, Figure 4, Figure 5, Figure 6, Figure 7, Figure 8, Figure 9, Figure 10, Figure 11, Figure 12, Figure 13, Figure 14 and Figure 15 display the simulation results. Figure 2, Figure 3, Figure 4, Figure 5, Figure 6, Figure 7 and Figure 8 show the convergence of the weight vectors of the critic and actor NNs.
In Figure 12 and Figure 15, the horizontal axis of the symbol "*" represents the triggering instant, and the vertical axis denotes the difference between the current and the previous triggering instant. The evolutions of the system states and the control inputs are displayed in Figure 9 and Figure 10, respectively; both the system states and the control inputs rapidly converge to zero. Figure 11 and Figure 14, respectively, provide the curves of the sampling error norms and the triggering thresholds under the dynamic and static event-triggering mechanisms; in both cases, the sampling error norm converges to zero. The sampling instants triggered by dynamic and static events are shown in Figure 12 and Figure 15. Figure 13 shows the internal dynamic signals, which remain positive and converge to zero.
By comparison, it is evident that the dynamic event-triggering strategy containing the dynamic internal signals has a larger threshold and fewer triggering events. Compared with static event-triggering, the number of actuator updates is reduced by 55%. This indicates that, compared with other triggering mechanisms, the dynamic event-triggering mechanism proposed in this paper can significantly reduce the computational burden.

6.2. A Comparison Simulation

Consider the following system dynamics from [38]:
$\dot{\vartheta} = \begin{bmatrix} \vartheta_2 \\ -2\vartheta_1\Big( \frac{1}{2}\vartheta_1\vartheta_2 + \frac{1}{4}\vartheta_2\big( \cos(2\vartheta_1) + 2 \big)^2 + \frac{1}{4}\vartheta_2\big( \sin(4\vartheta_1^2) + 2 \big)^2 \Big) \end{bmatrix} + \begin{bmatrix} 0 \\ \cos(2\vartheta_1) + 2 \end{bmatrix} v_1(\vartheta) + \begin{bmatrix} 0 \\ \sin(4\vartheta_1^2) + 2 \end{bmatrix} v_2(\vartheta) + \begin{bmatrix} 0 \\ \sin(4\vartheta_1) + 2 \end{bmatrix} d_1(\vartheta).$  (74)
Define the cost performances as
$Q_1 = \int_0^{\infty} \big( \vartheta^T B_1 \vartheta + v_1^T R_{11} v_1 + v_2^T R_{12} v_2 \big) ds,$  (75)
$Q_2 = \int_0^{\infty} \big( \vartheta^T B_2 \vartheta + v_1^T R_{21} v_1 + v_2^T R_{22} v_2 - \gamma^2 d_1^T d_1 \big) ds.$  (76)
Set $\gamma = 8$, $R_{11} = R_{12} = 2I$, $R_{21} = R_{22} = I$, where $I$ has appropriate dimensions, $B_1 = \mathrm{diag}(2, 2)$, and $B_2 = \mathrm{diag}(1, 1)$. For the critic NNs, the initial weights are set within $[-0.1, 0.1]$, and the initial weight vectors of the actor NNs are randomly set in the interval $[-1, 1]$. Select the NN activation functions as $\phi_i(\vartheta) = \varphi_{i,j}(\vartheta) = [\vartheta_1^2, \vartheta_1\vartheta_2, \vartheta_2^2]^T$. The initial system state is $\vartheta_0 = [1, 1]^T$, and the sampling time is set as $T = 0.005$ s. Moreover, select the parameter $\lambda_i = 0.75$. The probing control inputs $a = 37 e^{-0.061 t} \big( \sin^2(t)\cos(t) + \sin^2(2t)\cos(0.1 t) \big)$ are added.
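With the quadratic basis $\phi_i(\vartheta) = [\vartheta_1^2, \vartheta_1\vartheta_2, \vartheta_2^2]^T$ chosen here, the critic estimate (39) and the gradient needed by the policy updates reduce to a few lines. The following sketch only illustrates this bookkeeping; the function names are ours.

```python
import numpy as np

def phi(x):
    """Quadratic activation vector [x1^2, x1*x2, x2^2] used in this example."""
    return np.array([x[0] ** 2, x[0] * x[1], x[1] ** 2])

def grad_phi(x):
    """Jacobian d(phi)/dx, a 3x2 matrix."""
    return np.array([[2.0 * x[0], 0.0],
                     [x[1],       x[0]],
                     [0.0,        2.0 * x[1]]])

def q_hat(w, x):
    """Critic estimate Q_hat(x) = w^T phi(x), as in (39)."""
    return float(w @ phi(x))

def grad_q_hat(w, x):
    """Gradient of the critic estimate: nabla Q_hat = (d phi/dx)^T w."""
    return grad_phi(x).T @ w
```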
The weight values of the first critic NN converge to [0.3315, 0.5437, 0.4726].
The weight values of the second critic NN converge to [0.0832, 0.5894, 0.1181].
The weight values of the first actor NN converge to [0.1354, 0.1913, 0.4494].
The weight values of the second actor NN converge to [0.4913, 1.054, 0.2434].
The weight values of the third actor NN converge to [0.8691, 0.8623, 0.5519].
Figure 16, Figure 17, Figure 18, Figure 19 and Figure 20 illustrate the convergence of the critic and actor NN weight vectors. The evolutions of the system states and the control inputs are displayed in Figure 21 and Figure 22, which show that the system states converge and the control signals rapidly converge to zero. Figure 23 and Figure 24 present the norm of the sampling error $\|e_j\|^2$, the norm of the trigger threshold $e_T$, and the sampling instants, respectively; the sampling error norm under dynamic event-triggering converges to zero. The internal dynamic signals $\mathcal{R}$ are depicted in Figure 25; they remain positive and converge to zero.
To highlight the benefits of the proposed dynamic event-triggering mechanism, Figure 26 and Figure 27 show the sampling error norm, the threshold norm, and the sampling intervals of the system under the static event-triggering condition, where the sampling error norm also converges to zero. In Figure 24 and Figure 27, the horizontal axis of the symbol "*" represents the triggering instant, and the vertical axis denotes the difference between the current and the previous triggering instant. By comparing Figure 23 and Figure 24 with Figure 26 and Figure 27, it is evident that the dynamic event-triggering strategy incorporating the positive dynamic internal signal $\mathcal{R}$ has a larger threshold and fewer triggering events; compared with static event-triggering, the number of actuator updates is reduced by 69%. The dynamic event-triggering mechanism introduced in this paper can therefore significantly reduce the computational burden.

7. Conclusions

In this paper, a DETC approach based on IRL has been designed for the MZSG problem of nonlinear systems with completely unknown dynamics. The algorithm relaxes the requirements on the system drift dynamics. In the adaptive control process, an actor–critic structure has been adopted to design synchronized tuning rules for the weight updates. The IRL algorithm designed in this framework is implemented online, providing a new approach that combines NNs and DETC to solve the MZSG problem. The proposed dynamic event-triggering condition reduces the communication resources required by the controlled object. Finally, the effectiveness of the proposed method has been validated through two examples. The simulation results show that, compared with the static event-triggering mode, the number of actuator updates under the DETC mechanism is reduced by 55% and 69%, respectively.

Author Contributions

Conceptualization, Y.L.; methodology, Y.L.; software, Z.S. and X.M.; validation, Z.S., H.S. and L.L.; formal analysis, H.S.; data curation, L.L.; writing—original draft preparation, Z.S. and Y.L.; writing—review and editing, Z.S.; visualization, X.M.; supervision, L.L. and H.S.; funding acquisition, L.L. All authors have read and agreed to the published version of the manuscript.

Funding

This work was supported by the National Key R&D Program Project (2022YFF0610400); Liaoning Province Applied Basic Research Program Project (2022JH2/101300246); Liaoning Provincial Science and Technology Plan Project (2023-MSLH-260); Liaoning Provincial Department of Education Youth Program for Higher Education Institutions (JYTQN2023287); and Project of Shenyang Key Laboratory for Innovative Application of Industrial Intelligent Chips and Network Systems.

Data Availability Statement

No new data were created or analyzed in this study. Data sharing is not applicable to this article.

Conflicts of Interest

The authors have no competing interests or conflicts of interest to declare that are relevant to the content of this article.

References

  1. Chen, L.; Yu, Z.Y. Maximum principle for nonzero-sum stochastic differential game with delays. IEEE Trans. Autom. Control 2014, 60, 1422–1426.
  2. Yasini, S.; Sitani, M.B.N.; Kirampor, A. Reinforcement learning and neural networks for multi-agent nonzero-sum games of nonlinear constrained-input systems. Int. J. Mach. Learn. Cybern. 2016, 7, 967–980.
  3. Rupnik Poklukar, D.; Žerovnik, J. Double Roman Domination: A Survey. Mathematics 2023, 11, 351.
  4. Leon, J.F.; Li, Y.; Peyman, M.; Calvet, L.; Juan, A.A. A Discrete-Event Simheuristic for Solving a Realistic Storage Location Assignment Problem. Mathematics 2023, 11, 1577.
  5. Tanwani, A.; Zhu, Q.Y.Z. Feedback Nash equilibrium for randomly switching differential-algebraic games. IEEE Trans. Autom. Control 2020, 65, 3286–3301.
  6. Peng, B.W.; Stancu, A.; Dang, S.P. Differential graphical games for constrained autonomous vehicles based on viability theory. IEEE Trans. Cybern. 2022, 52, 8897–8910.
  7. Wu, S. Linear-quadratic non-zero sum backward stochastic differential game with overlapping information. IEEE Trans. Autom. Control 2023, 60, 1800–1806.
  8. Dong, B.; Feng, Z.; Cui, Y.M. Event-triggered adaptive fuzzy optimal control of modular robot manipulators using zero-sum differential game through value iteration. Int. J. Adapt. Control Signal Process. 2023, 37, 2364–2379.
  9. Wang, D.; Zhao, M.M.; Ha, M.M. Stability and admissibility analysis for zero-sum games under general value iteration formulation. IEEE Trans. Neural Netw. Learn. Syst. 2023, 34, 8707–8718.
  10. Du, W.; Ding, S.F.; Zhang, C.L. Modified action decoder using Bayesian reasoning for multi-agent deep reinforcement learning. Int. J. Mach. Learn. Cybern. 2021, 12, 2947–2961.
  11. Lv, Y.F.; Ren, X.M. Approximate Nash solutions for multiplayer mixed-zero-sum game with reinforcement learning. IEEE Trans. Syst. Man Cybern. Syst. 2019, 49, 2739–2750.
  12. Qin, C.B.; Zhang, Z.W.; Shang, Z.Y. Adaptive optimal safety tracking control for multiplayer mixed zero-sum games of continuous-time systems. Appl. Intell. 2023, 53, 17460–17475.
  13. Ming, Z.Y.; Zhang, H.G.; Zhang, J.; Xie, X.P. A novel actor-critic-identifier architecture for nonlinear multiagent systems with gradient descent method. Automatica 2023, 155, 645–657.
  14. Ming, Z.Y.; Zhang, H.G.; Wang, Y.C.; Dai, J. Policy iteration Q-learning for linear Itô stochastic systems with Markovian jumps and its application to power systems. IEEE Trans. Cybern. 2024, 54, 7804–7813.
  15. Song, R.Z.; Liu, L.; Xia, L.; Frank, L. Online optimal event-triggered H∞ control for nonlinear systems with constrained state and input. IEEE Trans. Syst. Man Cybern. Syst. 2023, 53, 1–11.
  16. Liu, Y.; Zhang, H.G.; Yu, R. H∞ tracking control of discrete-time system with delays via data-based adaptive dynamic programming. IEEE Trans. Syst. Man Cybern. Syst. 2020, 50, 4078–4085.
  17. Zhang, H.G.; Liu, Y.; Xiao, G.Y. Data-based adaptive dynamic programming for a class of discrete-time systems with multiple delays. IEEE Trans. Syst. Man Cybern. Syst. 2020, 50, 432–441.
  18. Wang, R.G.; Wang, Z.; Liu, S.X. Optimal spin polarization control for the spin-exchange relaxation-free system using adaptive dynamic programming. IEEE Trans. Neural Netw. Learn. Syst. 2024, 35, 5835–5847.
  19. Wen, Y.; Si, J.; Brandt, A. Online reinforcement learning control for the personalization of a robotic knee prosthesis. IEEE Trans. Cybern. 2020, 50, 2346–2356.
  20. Yu, S.H.; Zhang, H.G.; Ming, Z.Y. Adaptive optimal control via continuous-time Q-learning for Stackelberg-Nash games of uncertain nonlinear systems. IEEE Trans. Syst. Man Cybern. Syst. 2023, 54, 2346–2356.
  21. Rizvi, S.A.A.; Lin, Z.L. Adaptive dynamic programming for model-free global stabilization of control constrained continuous-time systems. IEEE Trans. Cybern. 2022, 52, 1048–1060.
  22. Vrabie, D.; Pastravanu, O.; Abu-Khalaf, M. Adaptive optimal control for continuous-time linear systems based on policy iteration. Automatica 2009, 45, 477–484.
  23. Cui, X.H.; Zhang, H.G.; Luo, Y.H. Adaptive dynamic programming for tracking design of uncertain nonlinear systems with disturbances and input constraints. Int. J. Adapt. Control Signal Process. 2017, 31, 1567–1583.
  24. Zhang, Q.C.; Zhao, D.B. Data-based reinforcement learning for nonzero-sum games with unknown drift dynamics. IEEE Trans. Cybern. 2019, 48, 2874–2885.
  25. Song, R.Z.; Lewis, F.L.; Wei, Q.L. Off-policy integral reinforcement learning method to solve nonlinear continuous-time multiplayer nonzero-sum games. IEEE Trans. Neural Netw. Learn. Syst. 2017, 28, 704–713.
  26. Liang, T.L.; Zhang, H.G.; Zhang, J. Event-triggered guarantee cost control for partially unknown stochastic systems via explorized integral reinforcement learning strategy. IEEE Trans. Neural Netw. Learn. Syst. 2024, 35, 7830–7844.
  27. Ren, H.; Dai, J.; Zhang, H.G. Off-policy integral reinforcement learning algorithm in dealing with nonzero sum game for nonlinear distributed parameter systems. Trans. Inst. Meas. Control 2024, 42, 2919–2928.
  28. Yoo, J.; Johansson, K.H. Event-triggered model predictive control with a statistical learning. IEEE Trans. Syst. Man Cybern. Syst. 2021, 51, 2571–2581.
  29. Huang, Y.L.; Xiao, X.; Wang, Y.H. Event-triggered pinning synchronization and robust pinning synchronization of coupled neural networks with multiple weights. Int. J. Adapt. Control Signal Process. 2022, 37, 584–602.
  30. Sun, L.B.; Huang, X.C.; Song, Y.D. A novel dual-phase based approach for distributed event-triggered control of multiagent systems with guaranteed performance. IEEE Trans. Cybern. 2024, 54, 4229–4240.
  31. Li, Z.X.; Yan, J.; Yu, W.W. Adaptive event-triggered control for unknown second-order nonlinear multiagent systems. IEEE Trans. Cybern. 2021, 51, 6131–6140.
  32. Fan, Y.; Yang, Y.; Zhang, Y. Sampling-based event-triggered consensus for multi-agent systems. Neurocomputing 2016, 191, 141–147.
  33. Girard, A. Dynamic triggering mechanisms for event-triggered control. IEEE Trans. Autom. Control 2015, 60, 1992–1997.
  34. Zhang, J.; Yang, D.; Wang, Y.; Zhou, B. Dynamic event-based tracking control of boiler turbine systems with guaranteed performance. IEEE Trans. Autom. Sci. Eng. 2023, 21, 4272–4282.
  35. Xu, H.C.; Zhu, F.L.; Ling, X.F. Observer-based semi-global bipartite average tracking of saturated discrete-time multi-agent systems via dynamic event-triggered control. IEEE Trans. Circuits Syst. II Express Briefs 2024, 71, 3156–3160.
  36. Chen, J.J.; Chen, B.S.; Zeng, Z.G. Adaptive dynamic event-triggered fault-tolerant consensus for nonlinear multiagent systems with directed/undirected networks. IEEE Trans. Cybern. 2023, 53, 3901–3912.
  37. Hou, Q.H.; Dong, J.X. Cooperative fault-tolerant output regulation of linear heterogeneous multiagent systems via an adaptive dynamic event-triggered mechanism. IEEE Trans. Cybern. 2023, 53, 5299–5310.
  38. Song, R.Z.; Yang, G.F.; Lewis, F.L. Nearly optimal control for mixed zero-sum game based on off-policy integral reinforcement learning. IEEE Trans. Neural Netw. Learn. Syst. 2024, 35, 2793–2804.
  39. Han, L.H.; Wang, Y.C.; Ma, Y.C. Dynamic event-triggered non-fragile dissipative filtering for interval type-2 fuzzy Markov jump systems. Int. J. Mach. Learn. Cybern. 2024, 15, 4999–5013.
Figure 1. The flowchart of the proposed method in Theorem 3.
Figure 2. Weight history of the first critic NN.
Figure 3. Weight history of the second critic NN.
Figure 4. Weight history of the third critic NN.
Figure 5. Weight history of the first actor NN.
Figure 6. Weight history of the second actor NN.
Figure 7. Weight history of the third actor NN.
Figure 8. Weight history of the fourth actor NN.
Figure 9. The evolutions of the system state.
Figure 10. The control input.
Figure 11. Sampling error norm and triggering threshold norm under DETC.
Figure 12. Sampling period of the learning stage by DETC.
Figure 13. The internal dynamic signals.
Figure 14. Sampling error norm and triggering threshold using the static mechanism proposed in [29].
Figure 15. Sampling period of the learning stage using the static mechanism proposed in [29].
Figure 16. Weight history of the first critic NN.
Figure 17. Weight history of the second critic NN.
Figure 18. Weight history of the first actor NN.
Figure 19. Weight history of the second actor NN.
Figure 20. Weight history of the third actor NN.
Figure 21. The evolutions of the system state.
Figure 22. The control input.
Figure 23. Sampling error norm and triggering threshold norm under DETC.
Figure 24. Sampling period of the learning stage by DETC.
Figure 25. The internal dynamic signals.
Figure 26. Sampling error norm and triggering threshold using the static mechanism proposed in [31].
Figure 27. Sampling period of the learning stage using the static mechanism proposed in [31].
Table 1. Parameter settings.

               γ      λ_i     β_1     β_2     σ      L_u     ω
Simulation 1   8      0.75    0.1     0.1     0.1    1.2     0.5
Simulation 2   5      0.70    0.1     0.1     0.1    1.2     0.5
