Article

Online Safe Flight Control Method Based on Constraint Reinforcement Learning

1
Department of Automatic Control, Xi’an Research Institute of Hi-Tech, Xi’an 710025, China
2
Department of Automation, Tsinghua University, Beijing 100091, China
3
Beijing Aerospace Automatic Control Institute, Beijing 100854, China
*
Authors to whom correspondence should be addressed.
Drones 2024, 8(9), 429; https://doi.org/10.3390/drones8090429
Submission received: 13 June 2024 / Revised: 22 August 2024 / Accepted: 22 August 2024 / Published: 26 August 2024

Abstract

UAVs are increasingly prominent in the competition for space due to their multiple characteristics, such as strong maneuverability, long flight distance, and high survivability. A new online safe flight control method based on constrained reinforcement learning is proposed for the intelligent safety control of UAVs. This method adopts constrained policy optimization as the main reinforcement learning framework and develops a constrained policy optimization algorithm with an extra safety budget, which introduces Lyapunov stability requirements and limits rudder deflection loss to ensure flight safety and improve the robustness of the controller. By efficiently interacting with the constructed simulation environment, a control law model for UAVs is trained. Subsequently, a condition-triggered meta-learning online learning method is used to adjust the control law online, ensuring successful attitude angle tracking. Simulation results show that the aircraft attitude angle control task performed with the online control law achieves an overall score of 100 points. After introducing online learning, the adaptability of attitude control to comprehensive errors such as aerodynamic parameters and wind improved by 21% compared to offline learning. The control law can be adjusted online to adapt the control policy of UAVs, ensuring their safety and stability during flight.

1. Introduction

Against the backdrop of continuously increasing flight mission requirements, the flight environment is becoming increasingly complex. UAVs inherently possess characteristics such as strong nonlinearity, complex coupling effects, rapidly time-varying features, and significant uncertainties. Traditional control design methods require precise knowledge of, or identification of, the internal dynamic information of the aircraft, and the high-order derivatives of the output variables needed in these methods are difficult to measure in practical applications [1,2]. This necessitates the intelligent upgrade and transformation of the key components of the control system, enabling the aircraft to possess intelligent learning capabilities. Reinforcement learning is an intelligent method that achieves end-to-end flight control based on data, without the need for operations such as model decoupling and linearization.
Currently, reinforcement learning has received widespread attention and research in various decision-making tasks of artificial intelligence [3,4,5,6,7,8,9,10]. Reinforcement learning can typically be described as a Markov decision process (MDP), meaning that the state at the next moment depends only on the current state and the action taken by the agent, with rewards determined by two consecutive states and the action of the agent between them [11]. Introducing constraints into the MDP, that is, adding loss to describe the degree of violation of constraints by the agent’s behavior in the elements of reinforcement learning, forms the constrained Markov decision process (CMDP) [12]. In CMDP problems, the goal of the agent is to maximize cumulative rewards in a task while keeping the loss within certain constraints. Losses include cumulative losses and real-time losses; the former includes expectations and averages of long-term losses or the probability of exceeding a certain threshold, while the latter refers to explicit or implicit losses at each time step.
According to its objective, reinforcement learning essentially solves an optimization problem, and constrained reinforcement learning transforms the unconstrained optimization problem of classical reinforcement learning into an optimization problem with inequality constraints. Therefore, many methods from the field of optimization can be borrowed for constrained reinforcement learning. Researchers in China and abroad have carried out studies on the application of reinforcement learning to aircraft control [13,14,15,16,17,18]. Hao et al. [14] proposed a model reference output feedback reinforcement learning control algorithm with a broader application scope. Its learning process relies only on the output of the plant and can obtain an output feedback control policy that gives the closed-loop system the desired dynamic performance. The algorithm constructs a reward function based on the reference model, which can effectively describe the desired closed-loop dynamic performance of the system. Huang et al. [15] utilized the deep deterministic policy gradient (DDPG) algorithm, using state information from multiple data frames as the agent’s observation state and rudder angle and engine thrust commands as the agent’s output actions. After training, they obtained a generalized and robust intelligent flight controller. Wang et al. [19] proposed a deep deterministic policy gradient-based reference model for quadrotor UAV attitude controller design, which incorporates a reference model into the DDPG structure to circumvent the system overshoot caused by excessive control action. Rui et al. [20] proposed a PPO-based RL controller for attitude control during the transition process of tilt rotor unmanned aerial vehicles (TRUAVs). By learning control strategies through direct interaction with the environment, they designed and improved the reward function to adapt to the transition process. Burak et al. [21] proposed a design method for flight control systems based on reinforcement learning, aiming to improve the transient response performance of closed-loop reference model adaptive control systems. This method implements reinforcement learning in the feedback path gain matrix of the reference model to generate dynamic adjustment strategies, providing the possibility of learning multiple adaptation strategies and thereby improving the transient response performance of traditional model reference adaptive control system designs. Ma et al. [22] proposed an incremental reinforcement learning-based UAV tracking control algorithm for dynamic environments, using a policy relief approach to enable UAVs to explore appropriately in new environments and a significance weighting approach to increase the utilization of episodes with higher significance and richer information.
Traditional reinforcement learning methods lack theoretical guarantees in terms of safety and credibility, making the trained systems unable to meet practical application requirements. Several studies have combined Lyapunov theory with deep reinforcement learning to improve the robustness and stability of control systems. Chow et al. [23] proposed a safety policy optimization method based on Lyapunov functions that trains neural network policies via deep deterministic policy gradient and proximal policy optimization algorithms and ensures that the set of feasible solutions induced by the linearized Lyapunov constraints is satisfied at each policy update. Yu et al. [24] proposed an adaptive control method for mobile robots based on Lyapunov reward shaping that optimizes the control parameters through environmental feedback to achieve real-time stable control. Therefore, this article proposes a new framework for online safe flight control based on constrained reinforcement learning. To the best of our knowledge, this is the first time that a constrained reinforcement learning approach has been applied to aircraft control research. The main contributions of this article are as follows:
  • A new framework is proposed for online safe flight control. The core idea is to first design a constrained reinforcement learning algorithm based on an extra safety budget, which introduces Lyapunov stability requirements to ensure flight safety and improves the robustness of the controller, and then to use an online condition-triggered meta-learning method to adjust the control law online to complete the attitude angle tracking task.
  • A novel flight control simulation environment is built based on the Python Flight Mechanics Engine (PyFME) [25] for offline training and online learning.
  • This work demonstrates that the method not only ensures the safety and stability of the aircraft during flight but also adapts the control law to various environmental changes through online learning.
The rest of this article is structured as follows: Section 2 describes the aircraft model, controller model, and simulation environment. Section 3 introduces the proposed method. Section 4 presents the experimental results and analysis. Finally, Section 5 concludes this article.

2. Mathematical Model

To facilitate research on online flight control based on the characteristics of the aircraft, three reasonable assumptions are made:
  • The aircraft is a rigid body.
  • The ground is flat and stationary, ignoring the influence of the earth’s curvature and rotation.
  • The deformation of the landing gear is neglected.
We focus only on the aircraft model after takeoff; the taxiing phase during takeoff and the landing phase are not considered here. Because the attitude motion parameters of the aircraft vary over a large range, quaternions are used to describe the attitude in the simulation to avoid singularities. After numerical integration of the differential equations, the attitude angles $\theta$, $\psi$, and $\Phi$, which represent the pitch, yaw, and roll angles, are calculated using the conversion relationship between attitude angles and quaternions.
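As a concrete illustration of this conversion, the following Python sketch recovers attitude angles from a unit quaternion. It assumes the common aerospace Z–Y–X (yaw–pitch–roll) rotation sequence; the simulation's own convention, which follows the attitude matrix in Figure 1, may differ in axis ordering and signs.

```python
import numpy as np

def quat_to_euler(q0, q1, q2, q3):
    """Convert a unit attitude quaternion to (pitch, yaw, roll) in radians.

    Assumes the common aerospace Z-Y-X (yaw-pitch-roll) sequence; the paper's
    ground-frame convention may permute axes and signs.
    """
    # yaw (psi)
    psi = np.arctan2(2.0 * (q0 * q3 + q1 * q2), 1.0 - 2.0 * (q2**2 + q3**2))
    # pitch (theta), clipped to avoid NaN from numerical round-off
    theta = np.arcsin(np.clip(2.0 * (q0 * q2 - q3 * q1), -1.0, 1.0))
    # roll (phi)
    phi = np.arctan2(2.0 * (q0 * q1 + q2 * q3), 1.0 - 2.0 * (q1**2 + q2**2))
    return theta, psi, phi

# Example: the identity quaternion corresponds to zero attitude angles
print(quat_to_euler(1.0, 0.0, 0.0, 0.0))  # (0.0, 0.0, 0.0)
```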

2.1. Aircraft Model

The differential equations for the state of the aircraft [26,27] are as follows:
$$\begin{bmatrix} \dot X \\ \dot Y \\ \dot Z \end{bmatrix} = \begin{bmatrix} V_x \\ V_y \\ V_z \end{bmatrix}$$

$$\begin{bmatrix} \dot V_x \\ \dot V_y \\ \dot V_z \end{bmatrix} = [A]\begin{bmatrix} \dot W_{x1} \\ \dot W_{y1} \\ \dot W_{z1} \end{bmatrix} + \begin{bmatrix} 0 \\ -g \\ 0 \end{bmatrix}$$

$$\begin{bmatrix} \dot q_0 \\ \dot q_1 \\ \dot q_2 \\ \dot q_3 \end{bmatrix} = \frac{1}{2}\begin{bmatrix} -q_1 & -q_2 & -q_3 \\ q_0 & -q_3 & q_2 \\ q_3 & q_0 & -q_1 \\ -q_2 & q_1 & q_0 \end{bmatrix}\begin{bmatrix} \omega_{x1} \\ \omega_{y1} \\ \omega_{z1} \end{bmatrix}$$

$$\begin{bmatrix} \dot\omega_{x1} \\ \dot\omega_{y1} \\ \dot\omega_{z1} \end{bmatrix} = \begin{bmatrix} I_x & -I_{xy} & -I_{xz} \\ -I_{yx} & I_y & -I_{yz} \\ -I_{zx} & -I_{zy} & I_z \end{bmatrix}^{-1}\left( \begin{bmatrix} M_{x1} \\ M_{y1} \\ M_{z1} \end{bmatrix} - \begin{bmatrix} \omega_{x1} \\ \omega_{y1} \\ \omega_{z1} \end{bmatrix}\times \begin{bmatrix} I_x & -I_{xy} & -I_{xz} \\ -I_{yx} & I_y & -I_{yz} \\ -I_{zx} & -I_{zy} & I_z \end{bmatrix}\begin{bmatrix} \omega_{x1} \\ \omega_{y1} \\ \omega_{z1} \end{bmatrix} \right)$$
where $X$, $Y$, and $Z$ represent the positions along the three axes of the ground coordinate system, while $V_x$, $V_y$, and $V_z$ are the velocities along the three axes of the ground coordinate system. $q_0$, $q_1$, $q_2$, and $q_3$ are the attitude quaternion components, and $\omega_{x1}$, $\omega_{y1}$, and $\omega_{z1}$ are the angular velocities about the three axes of the body coordinate system. $\dot W_{x1}$, $\dot W_{y1}$, and $\dot W_{z1}$ are the apparent (non-gravitational) accelerations along the three axes of the body coordinate system. $M_{x1}$, $M_{y1}$, and $M_{z1}$ are the combined external moments about the three axes in the body coordinate system, composed of aerodynamic moments and inertial moments. $[A]$ is the transformation matrix from the body coordinate system to the ground coordinate system, and $g$ is the acceleration of gravity. The corresponding calculation diagram is shown in Figure 1.
In the body coordinate system, the aerodynamic forces and moments are as follows:
$$F_{x1}^{QD} = qS_{ref}C_{x1}, \quad F_{y1}^{QD} = qS_{ref}C_{y1}, \quad F_{z1}^{QD} = qS_{ref}C_{z1}$$
$$M_{x1}^{QD} = qS_{ref}L_{ref}C_{mx1}, \quad M_{y1}^{QD} = qS_{ref}L_{ref}C_{my1}, \quad M_{z1}^{QD} = qS_{ref}L_{ref}C_{mz1}$$
where q represents dynamic pressure; aerodynamic force coefficients include axial force coefficient C x 1 , normal force coefficient C y 1 , and lateral force coefficient C z 1 ; aerodynamic moments include rolling moment coefficient C m x 1 , yawing moment coefficient C m y 1 , and pitching moment coefficient C m z 1 ; Lref represents the aerodynamic reference length; and Sref represents the aerodynamic reference area.

2.2. Controller Model

To control the aircraft using reinforcement learning algorithms, the controller serves as the reinforcement learning action network, primarily responsible for executing the aircraft attitude angle control tasks. The inputs (state $s$) to the controller include observable quantities such as the current altitude $H$, climbing speed $dH$, flight speed $V$, pitch angle $\theta$, yaw angle $\psi$, roll angle $\Phi$, sideslip angle $\beta$, angle of attack $\alpha$, roll rate $p$, pitch rate $q$, yaw rate $r$, and the desired yaw angle $\psi_{des}$ and pitch angle $\theta_{des}$. The outputs (action $a$) are the elevator, aileron, and rudder deflections $\delta_e$, $\delta_a$, and $\delta_r$. In the implementation, a neural network takes all the above states as inputs and produces the three control-surface commands as outputs. The block diagram of the controller system is shown in Figure 2.
In the action space, the control commands for the control surfaces are continuous values. During the simulation process, a Gaussian action model is adopted to control the rotation of the control surfaces. The design of the action network is shown in Figure 3.
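The following PyTorch sketch shows one possible form of such a Gaussian action network; the hidden-layer sizes, the 13-dimensional state vector, and the tanh squashing are illustrative assumptions rather than the exact architecture used in the paper.

```python
import torch
import torch.nn as nn

class GaussianPolicy(nn.Module):
    """Gaussian action network: maps flight states to a distribution over
    the three control-surface commands (elevator, aileron, rudder).
    Hidden sizes and the state dimension are illustrative assumptions."""

    def __init__(self, state_dim=13, action_dim=3, hidden=64):
        super().__init__()
        self.mu_net = nn.Sequential(
            nn.Linear(state_dim, hidden), nn.Tanh(),
            nn.Linear(hidden, hidden), nn.Tanh(),
            nn.Linear(hidden, action_dim),
        )
        # State-independent log standard deviation, learned during training
        self.log_std = nn.Parameter(torch.full((action_dim,), -0.5))

    def forward(self, state):
        mu = self.mu_net(state)
        std = self.log_std.exp()
        return torch.distributions.Normal(mu, std)

# Example: sample bounded surface commands for a single state vector
policy = GaussianPolicy()
dist = policy(torch.zeros(1, 13))
action = torch.tanh(dist.sample())  # squash to [-1, 1] before scaling to deflection limits
```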

2.3. Flight Simulation Environment Model

Gazebo is less specialized and less detailed than PyFME in flight mechanics simulation, and it consumes more resources and has higher hardware requirements in complex simulations [28]. AirSim is not as focused on the depth and breadth of flight mechanics simulation as PyFME. PyFME provides more comprehensive flight mechanics models and related parameters, which makes it more suitable for highly specialized flight mechanics research [29]. FlightGear’s customization capabilities are not as flexible and powerful as those of PyFME, especially in research that requires highly customized simulation scenarios [30]. Compared to PyFME, which focuses on the depth and accuracy of flight mechanics simulation, X-Plane, as a commercial software program, has relatively limited customization capabilities, limiting users’ ability to deeply modify and extend it [31]. After evaluating the existing flight control simulation environments, we found that although they perform well in some respects, they are limited in terms of flexibility, user customization, and integration of specific algorithms. To overcome these limitations, we chose PyFME as our research tool.
We established a flight simulation environment based on PyFME, whose main idea is to model the physical scenarios involved in aircraft flight, including aircraft models, atmospheric models, dynamic models, aerodynamic models, and so on. It is capable of simulating the movement of aircraft in the air and, accordingly, replicating all the physical environments pertinent to flight. For further details, please refer to https://github.com/AeroPython/PyFME/wiki, accessed on 12 April 2024. Next, we introduce it from three perspectives: state quantity design, action quantity design, and the design of reward and cost functions.

2.3.1. State Quantity Design

The observables available to the aircraft include flight speed, altitude, climbing rate, pitch angle, yaw angle, roll angle, and the angular rates of these three angles. In reinforcement learning algorithms, the inputs to the agent’s action network include these nine state variables and the control variables. During the learning process, the agent needs to utilize the deflection of the control surfaces to calculate the loss function. Therefore, in the experiment, we add the deflection of the control surfaces to the state space, creating an inner feedback loop. Additionally, we utilize state stacking to allow the agent to capture more dynamic information, obtaining high-order and integral quantities from past states and thereby generating better control outputs. To facilitate model training, during data processing the mean and standard deviation of the data are dynamically calculated and updated based on the data in the state variables, rather than calculating the mean and standard deviation of all the data at once. This online computing method can better adapt to dynamic changes in the data and is especially suitable for situations that require real-time updating of statistical information.
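A minimal sketch of this online normalization is given below, using Welford's running-update formulas; the class name and the 13-dimensional state are assumptions for illustration.

```python
import numpy as np

class RunningNormalizer:
    """Online (Welford-style) running mean/std used to normalize state
    variables as data arrive, instead of computing statistics in one pass."""

    def __init__(self, dim, eps=1e-8):
        self.mean = np.zeros(dim)
        self.m2 = np.zeros(dim)   # running sum of squared deviations
        self.count = 0
        self.eps = eps

    def update(self, x):
        self.count += 1
        delta = x - self.mean
        self.mean += delta / self.count
        self.m2 += delta * (x - self.mean)

    def normalize(self, x):
        var = self.m2 / max(self.count - 1, 1)
        return (x - self.mean) / np.sqrt(var + self.eps)

# Example usage on a stream of 13-dimensional state vectors
norm = RunningNormalizer(13)
for _ in range(100):
    norm.update(np.random.randn(13))
normalized_state = norm.normalize(np.random.randn(13))
```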

2.3.2. Action Quantity Design

In the attitude control task, the actions of the aircraft involve elevator deflection, rudder deflection, and aileron deflection. The objective of attitude control is to enable the aircraft to follow the target yaw angle and target pitch angle, as shown in Figure 4. In the task design, the target yaw angle and target pitch angle vary sinusoidally with simulation time, allowing the aircraft to learn to track curved trajectories. The expressions for the target pitch angle $\theta_{des}$ and target yaw angle $\psi_{des}$ are as follows:
$$\theta_{des} = A_\theta \sin(\omega_\theta t), \qquad \psi_{des} = \psi_0 + A_\psi \sin(\omega_\psi t)$$
where A θ and A ψ are the amplitudes of change for the target pitch angle and target yaw angle, respectively, while ω θ and ω ψ represent the angular frequencies of change for the target pitch angle and target yaw angle, respectively. A θ randomly takes values between the positive and negative maximum pitch angles, while A ψ randomly takes values between −π/2 and π/2. ω θ and ω ψ randomly take values within a certain frequency range, enabling the aircraft to learn following and attitude control strategies for different frequencies of trajectory changes.
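The sketch below shows how one such random sinusoidal command profile could be sampled; the frequency range and the maximum pitch amplitude passed in are assumed values, since the paper only states that they are drawn from certain ranges.

```python
import numpy as np

def sample_target_profile(theta_max, psi_0, omega_range=(0.1, 1.0)):
    """Draw one random sinusoidal attitude-command profile.
    theta_max, psi_0 and omega_range are illustrative assumptions."""
    A_theta = np.random.uniform(-theta_max, theta_max)
    A_psi = np.random.uniform(-np.pi / 2, np.pi / 2)
    w_theta, w_psi = np.random.uniform(*omega_range, size=2)

    def targets(t):
        theta_des = A_theta * np.sin(w_theta * t)
        psi_des = psi_0 + A_psi * np.sin(w_psi * t)
        return theta_des, psi_des

    return targets

# Example: evaluate one sampled profile over a 20 s episode with 0.02 s steps
targets = sample_target_profile(theta_max=np.deg2rad(20.0), psi_0=0.0)
t_grid = np.arange(0.0, 20.0, 0.02)
theta_cmd, psi_cmd = zip(*(targets(t) for t in t_grid))
```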

2.3.3. Reward Function Design

We designed the reward function based on the incremental satisfaction of the target by the aircraft. Let the target state quantity in a certain task be $T_i$, the current corresponding state quantity be $S_i$, and the corresponding state quantity at the previous moment be $S_i'$. Then, the reward value corresponding to the current state $S_i$ is as follows:
$$R_i = k_{R_i}\left( \left| T_i - S_i' \right| - \left| T_i - S_i \right| \right)$$
where k R i is the proportionality coefficient, related to the dimensions.
After the aircraft reaches the target attitude, the value of the aforementioned reward is 0, which is not conducive to the agent learning a policy of maintaining the attitude. Therefore, when the aircraft reaches the target attitude, we allow the agent to obtain additional positive rewards. At this point, any action that causes the aircraft to deviate from the target attitude results in negative rewards, whereas maintaining the flight attitude at the target attitude enables the aircraft to accumulate positive rewards.
For the attitude angle control task, due to the high precision requirements, adopting a constant reward based on a threshold for the angle would make it difficult for the aircraft to meet the reward conditions, which is not conducive to learning control strategies for small errors. Therefore, we adopt a smoother reward function model similar to the Laplace distribution, which is expressed as follows:
$$R_{goal} = k_{R_g} \exp\left( -\frac{|e|}{b} \right)$$
where $k_{R_g}$ is the overall scaling factor for the target reward, used to balance the reward for approaching the target attitude against the reward for maintaining it; $e$ is the attitude angle error; and $b$ is the precision control factor, used to adjust the sharpness of the Laplace function.
To ensure the safe operation of the aircraft, the maximum angle of attack during flight is set to ±0.5 rad. When the angle of attack exceeds this limit, the episode ends and a reward of −10 is given. The minimum flight altitude of the aircraft is set to 2 m; if the flight altitude falls below 2 m, the aircraft is considered to have crashed, the episode ends, and a reward of −10 is given.
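The following Python sketch combines the three reward ingredients described above (incremental progress, the Laplace-shaped goal reward, and the termination penalties) for a single attitude channel; all coefficients are assumed values, and the way the terms are combined is a simplification of the paper's reward design.

```python
import numpy as np

def step_reward(target, s_curr, s_prev, alpha, altitude,
                k_r=1.0, k_rg=1.0, b=0.1):
    """Illustrative reward sketch: incremental progress toward the target
    attitude, a Laplace-shaped goal reward, and episode-ending penalties.
    All coefficients are assumed values, not those of the paper."""
    # Incremental term: positive when the current state is closer to the target
    r = k_r * (abs(target - s_prev) - abs(target - s_curr))

    # Smooth Laplace-like goal reward on the remaining attitude error e
    e = target - s_curr
    r += k_rg * np.exp(-abs(e) / b)

    # Safety terminations: excessive angle of attack or altitude below 2 m
    done = False
    if abs(alpha) > 0.5 or altitude < 2.0:
        r, done = -10.0, True   # in this sketch the penalty replaces the shaped reward
    return r, done

# Example: pitch angle moving from 6 deg to 8 deg toward a 10 deg target
print(step_reward(target=np.deg2rad(10.0), s_curr=np.deg2rad(8.0),
                  s_prev=np.deg2rad(6.0), alpha=0.1, altitude=300.0))
```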

2.3.4. Cost Function Design

In the aircraft control task, we aim for the deflection of the control surfaces to be as smooth as possible, ensuring that the aircraft’s attitude remains stable and energy consumption is reduced. The cost function, designed specifically based on the deflection of the control surfaces, is formulated as follows:
$$C(s, a, s') = k_c\,\left| a - a' \right|$$
where $k_c$ is the amplification factor, $s$ is the current state of the aircraft, $a$ is the action at the current moment, $s'$ is the state of the aircraft at the previous moment, and $a'$ is the action at the previous moment.
Therefore, the optimization problem for reinforcement learning in aircraft control becomes achieving the target attitude while ensuring that the control surface deflections are within a certain limit.

3. Methodology

The online safe flight control method based on constrained reinforcement learning employs a two-stage process, which is divided into offline virtual trial-and-error training and online reinforcement learning model calibration. In the offline stage, constrained reinforcement learning with an additional safety budget is used to train and optimize the policy function and value function. Finally, the trained policy function is used to drive the aircraft in the simulation to complete the flight mission. At the online stage, the control law model trained in the offline stage is first used to construct a task set through online flights in a virtual environment. Then, real-time interactive data are used to update the task set and continue fine-tuning the model using a conditional trigger-based meta-learning online reinforcement learning method. A schematic diagram of the method is shown in Figure 5.

3.1. Constrained Policy Optimization Algorithm with Extra Safety Budget

The constrained policy optimization with extra safety budget (ESB-CPO) [32] algorithm first samples from the environment, calculates the normalized safety state and the constraint equation gradients for each time step based on the sampled losses, and updates the factors $\alpha_i^\theta$ and $\beta_i^\theta(s_t)$ based on them. After that, it calculates the Lyapunov advantage estimation (LAE) corresponding to each time step. Finally, it solves the approximate constrained policy optimization (CPO) [33] problem, using a first-order approximation of the optimization objective function and constraint equation and a second-order approximation of the KL divergence, to obtain a new policy. The adaptive factors control the safety constraints, enabling the aircraft to initially ignore the constraints on unsafe states and quickly converge to a trajectory that completes the task; subsequently, it can gradually meet the requirements of a safe state, ultimately obtaining the optimal trajectory. The LAE $\hat{A}_\theta^{C_i}(s,a)$ is as follows:
$$\hat{A}_\theta^{C_i}(s,a) = \mathbb{E}_{s'\sim P(\cdot\mid s,a)}\left[ V_\theta^{C_i}(s') - V_\theta^{C_i}(s) + \alpha\left( V_\theta^{C_i}(s) - \beta\, V_\theta^{C_i}(s') \right) \right]$$
where $\alpha\in(0,1)$ and $\beta\in[0,1]$ are adaptive factors, $s$ and $a$ are the state and action at the current moment, $s'$ is the state at the next moment, and $V_\theta^{C_i}$ is the cost value function. Further, $P(\cdot\mid s,a)$ is the distribution of the next state given $(s,a)$.
Let $\pi_\theta$ denote the parameterized policy. The expected discounted cumulative return of the policy is as follows:
$$J_R(\theta) = \mathbb{E}_{\tau\sim\pi_\theta}\left[ \sum_{t=0}^{\infty}\gamma^t R(s_t, a_t, s_{t+1}) \right]$$
where $\tau \sim \pi_\theta$ denotes a trajectory sampled from $\pi_\theta$. The expected discounted cumulative cost of the policy is as follows:
$$J_{C_i}(\theta) = \mathbb{E}_{\tau\sim\pi_\theta}\left[ \sum_{t=0}^{\infty}\gamma^t C_i(s_t, a_t, s_{t+1}) \right]$$
The optimization objective of the safe RL algorithm is to find the optimal policy $\pi_{\theta^*}$ that maximizes $J_R$ and guarantees that $J_{C_i} \le d_i$, where $d_i$ is the upper cost limit of the $i$th constraint.
Thus, the optimization problem can be defined:
$$\max_\theta\ J_R(\theta) \quad \text{s.t.}\quad J_{C_i}(\theta) \le d_i$$
The commonly used advantage functions are as follows:
$$A_\theta^{R}(s,a) = Q_\theta(s,a) - V_\theta(s), \qquad A_\theta^{C_i}(s,a) = Q_\theta^{C_i}(s,a) - V_\theta^{C_i}(s)$$
where V θ is the value function, Q θ is the state-action value function, V θ C i is the cost value function, and Q θ C i is the state-action cost value function.
Using the LAE in the optimization problem, the policy can be updated as follows:
$$\begin{aligned} \theta' = \arg\max_{\tilde\theta}\ & \mathbb{E}_{s\sim\rho_\theta,\, a\sim\pi_\theta}\left[ \frac{\pi_{\tilde\theta}(a\mid s)}{\pi_\theta(a\mid s)}\, A_\theta^{R}(s,a) \right] \\ \text{s.t.}\quad & J_{C_i}(\theta) + \frac{1}{1-\gamma}\,\mathbb{E}_{s\sim\rho_\theta,\, a\sim\pi_\theta}\left[ \Delta_{\tilde\theta,\theta}(s,a)\,\frac{\hat{A}_\theta^{C_i}(s,a)}{1-\alpha_i^\theta} \right] \le d_i \\ & \mathbb{E}_{s\sim\rho_\theta}\left[ D_{KL}\!\left( \pi_{\tilde\theta}(\cdot\mid s)\,\|\,\pi_\theta(\cdot\mid s) \right) \right] \le \delta \end{aligned}$$
where $\alpha_i^\theta$ decreases from 1 to 0 with updating and $\Delta_{\tilde\theta,\theta}(s,a) = \frac{\pi_{\tilde\theta}(a\mid s)}{\pi_\theta(a\mid s)} - 1$. $\Delta_{\tilde\theta,\theta}(s,a)$ denotes the tendency of the policy to update from $\pi_\theta$ to $\pi_{\tilde\theta}$. If the new policy tries to avoid choosing an action $a$ under $s$, then $\Delta_{\tilde\theta,\theta}(s,a) < 0$; conversely, $\Delta_{\tilde\theta,\theta}(s,a) > 0$.
The relationship between the LAE $\hat{A}_\theta^{C_i}(s,a)$ and the cost advantage function $A_\theta^{C_i}(s,a)$ is as follows:
$$\begin{aligned} \frac{\hat{A}_\theta^{C_i}(s,a)}{1-\alpha_i^\theta} &= A_\theta^{C_i}(s,a) + B_{1\theta}^{i}(s,a) + B_{2\theta}^{i}(s') \\ B_{1\theta}^{i}(s,a) &= \mathbb{E}_{s'\sim P(\cdot\mid s,a)}\left[ (1-\gamma)\,V_\theta^{C_i}(s') - C_i(s,a,s') \right] \\ B_{2\theta}^{i}(s') &= \frac{\alpha_i^\theta\left( 1-\beta_i^\theta(s') \right)}{1-\alpha_i^\theta}\, V_\theta^{C_i}(s') \end{aligned}$$
Therefore, adding two gaps to the constraint function of (16) yields the following:
$$J_{C_i}(\theta) + \frac{1}{1-\gamma}\,\mathbb{E}_{s\sim\rho_\theta,\, a\sim\pi_\theta}\left[ \Delta_{\tilde\theta,\theta}(s,a)\,\frac{\hat{A}_\theta^{C_i}(s,a)}{1-\alpha_i^\theta} \right] \le d_i \;\Longleftrightarrow\; J_{C_i}(\theta) + \frac{1}{1-\gamma}\,\mathbb{E}_{s\sim\rho_\theta,\, a\sim\pi_\theta}\left[ \frac{\pi_{\tilde\theta}(a\mid s)}{\pi_\theta(a\mid s)}\, A_\theta^{C_i}(s,a) \right] + G_{1\theta}^{i} + G_{2\theta}^{i} \le d_i$$
where $G_{1\theta}^{i} = \frac{1}{1-\gamma}\,\mathbb{E}_{s\sim\rho_\theta,\, a\sim\pi_\theta}\left[ \Delta_{\tilde\theta,\theta}(s,a)\, B_{1\theta}^{i}(s,a) \right]$ and $G_{2\theta}^{i} = \frac{1}{1-\gamma}\,\mathbb{E}_{s\sim\rho_\theta,\, a\sim\pi_\theta}\left[ \Delta_{\tilde\theta,\theta}(s,a)\, B_{2\theta}^{i}(s') \right]$. If these gaps are negative, they relax the constraints; otherwise, they tighten them.
The normalized safe state z i θ ( s t ) is a sample-based internal state that directly shows the safety of the state at step t. z i θ ( s t ) is defined as follows:
$$z_i^\theta(s_t) = \frac{d_i - \sum_{l=0}^{t}\gamma^l\, C_i(s_l, a_l, s_{l+1})}{\gamma^t d_i}$$
where s l , a l , and s l + 1 are in the trajectory sampled by π θ . When the sum of costs is greater than the cost limit d i , z i θ ( s t ) is less than 0.
z i θ ( s t ) has an initial value of 1 before t = 0, and its update formula is as follows:
$$z_i^\theta(s_{t+1}) = \frac{z_i^\theta(s_t) - C_i(s_t, a_t, s_{t+1})/d_i}{\gamma}$$
Considering the range [0, 1], β i θ ( s t ) is calculated as follows:
$$\beta_i^\theta(s_t) = 1 + \min\left( \tanh\left( z_i^\theta(s_t) \right),\ 0 \right)$$
When z i θ ( s t ) is less than 0, β i θ ( s t ) decreases to 0.
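The sketch below computes $z_i^\theta(s_t)$ and the resulting $\beta_i^\theta(s_t)$ along one sampled trajectory directly from the definitions above; the cost sequence and the limit $d_i$ in the example are arbitrary.

```python
import numpy as np

def normalized_safe_state(costs, d_i, gamma=0.99):
    """Compute z_t along a sampled trajectory from its per-step costs,
    then the adaptive factor beta_t = 1 + min(tanh(z_t), 0)."""
    z, beta = [], []
    running = 0.0
    for t, c in enumerate(costs):
        running += (gamma ** t) * c                 # discounted cost accumulated so far
        z_t = (d_i - running) / ((gamma ** t) * d_i)
        z.append(z_t)
        beta.append(1.0 + min(np.tanh(z_t), 0.0))   # beta drops toward 0 once unsafe
    return np.array(z), np.array(beta)

# Example: a trajectory whose accumulated cost eventually exceeds the limit d_i = 8
z, beta = normalized_safe_state(costs=[0.5] * 30, d_i=8.0)
```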
The policy gradient directly reflects the effect of constraint on the policy, so the Lagrange multiplier λ i can be introduced to compute α i θ based on the policy gradient of the constraint function, constructing the following local optimization problem:
$$\min_{\lambda_i}\ \max_{\tilde\theta}\ \lambda_i\, P_i^\theta(\tilde\theta)$$
where $P_i^\theta(\tilde\theta) = \mathbb{E}_{s\sim\rho_\theta,\, a\sim\pi_\theta}\left[ \frac{\pi_{\tilde\theta}(a\mid s)}{\pi_\theta(a\mid s)}\, A_\theta^{C_i}(s,a) \right]$.
The dual problem of the above equation is as below:
$$\min_{\lambda_i}\ \max_{\tilde\theta}\ \lambda_i\, P_i^\theta(\tilde\theta) \qquad \text{s.t.}\ \ \lambda_i \ge 0$$
The update formula for λ i is as follows:
$$\lambda_{i,t+1} = \max\left( \lambda_{i,t} + \eta\, P_i^\theta(\tilde\theta),\ 0 \right)$$
where η is the step size.
Considering the range of values of α i θ , the formula can be defined:
$$\alpha_i^\theta = \tanh\left( k_i\, e^{\lambda_i} \right)$$
where k i is a hyperparameter that globally controls the rate of decrease in α i θ . As λ i decreases, α i θ changes from 1 to 0.
For small step sizes $\delta$, the optimization problem can be solved approximately using a first-order approximation of the objective and constraints and a second-order approximation of the KL divergence. Let the objective gradient be $g$, the constraint gradient be $b$, and the Hessian of the KL divergence be $H$. Define $c_i \doteq J_{C_i}(\theta) - d_i$; then, the approximation of Equation (15) is as follows:
$$\begin{aligned} \theta' = \arg\max_{\tilde\theta}\ & g^T(\tilde\theta - \theta) \\ \text{s.t.}\quad & c_i + b_i^T(\tilde\theta - \theta) \le 0 \\ & \tfrac{1}{2}(\tilde\theta - \theta)^T H (\tilde\theta - \theta) \le \delta \end{aligned}$$
The dual problem of the above equation is as below:
$$\max_{\mu_1 \ge 0,\ \mu_2 \ge 0}\ -\frac{1}{2\mu_1}\left( g^T H^{-1} g - 2\tau^T \mu_2 + \mu_2^T S \mu_2 \right) + \mu_2^T c - \frac{\mu_1 \delta}{2}$$
where $c = [c_0, c_1, \dots]^T$, $\tau \doteq g^T H^{-1} B$, $S \doteq B^T H^{-1} B$, and $B = [b_0, b_1, \dots]$.
Equation (25) can be solved by an approximate CPO update as follows:
If Equation (26) is feasible:
$$\hat\theta = \theta + \frac{1}{\mu_1^*} H^{-1}\left( g - \mu_2^* b \right)$$
else:
$$\hat\theta = \theta - \sqrt{\frac{2\delta}{b^T H^{-1} b}}\, H^{-1} b$$
where μ 1 * and μ 2 * are solutions of Equation (26).
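A schematic NumPy version of this update step is given below. It assumes the dual solutions $\mu_1^*$ and $\mu_2^*$ and the feasibility flag are already available, uses explicit linear solves in place of the conjugate-gradient products used in practice, and omits the backtracking line search.

```python
import numpy as np

def esb_cpo_step(theta, g, b, H, mu1, mu2, delta, feasible):
    """One approximate policy-parameter update: take the dual-informed step
    when the linearized problem is feasible, otherwise a recovery step that
    reduces constraint violation within the trust region. Explicit solves
    stand in for the conjugate-gradient products used in practice."""
    Hinv_g = np.linalg.solve(H, g)
    Hinv_b = np.linalg.solve(H, b)
    if feasible:
        return theta + (Hinv_g - mu2 * Hinv_b) / mu1
    # Infeasible case: move along -H^{-1} b, scaled to the trust-region radius
    return theta - np.sqrt(2.0 * delta / (b @ Hinv_b)) * Hinv_b
```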
In summary, the aircraft attitude control based on ESB-CPO is shown in Algorithm 1. The corresponding framework is shown in Figure 6.
Algorithm 1 Aircraft attitude control based on ESB-CPO
Initialize the policy network π θ , the value network V
Initialize the replay buffer B and step counter t = 0
for k in 0, 1, 2, … do
    Use policy π θ k to carry out the flight mission and collect a batch of samples
     D = { τ } = { ( s t , a t , r t e , s t + 1 ) }
    According to the FIFO principle, update replay buffer B with D
    Update step counter t = t + len( D )
    for  τ in D  do
          for s in τ  do
                $\beta^{\theta_k}(s) = 1 + \min\left( \tanh\left( z^{\theta_k}(s) \right),\ 0 \right)$
           end for
     end for
    Compute α θ k by solving the local dual problem
    Estimate g ^ , b ^ , c ^ and   H ^ using the sample constructed with D
    if approximate ESB-CPO is feasible then
                        $\hat\theta = \theta + \frac{1}{\mu_1^*}\hat{H}^{-1}\left( \hat{g} - \mu_2^* \hat{b} \right)$
    else
                      $\hat\theta = \theta - \sqrt{\frac{2\delta}{\hat{b}^T \hat{H}^{-1} \hat{b}}}\,\hat{H}^{-1}\hat{b}$
    end if
    Obtain θ k + 1 by backtracking line search to enforce satisfaction of constraint function in (15)
    Update V by TD-like critic learning
end for

3.2. Condition-Triggered Meta-Learning Online Learning Method

Meta-learning, also known as learning to learn, is characterized by the fact that the trained deep model’s structure is not designed to complete a specific task in a particular scenario, but rather to rapidly adapt and accomplish new tasks in different scenarios after only a few training samples and one or a few iterations. This fully embodies the idea of enabling machines to learn to learn.
Assuming the existence of a sample set related to the training task T i , also known as a task set, where each task set contains its own training data and testing data, in meta-learning, these are referred to as the Support Set and Query Set, respectively. We define the initial parameters of the network as ϕ , and the model parameters trained on the i-th test task as θ ^ i . Therefore, the overall loss function can be defined:
$$L(f_\phi) = \sum_{i=1}^{n} l_i\left( f_{\hat\theta_i} \right)$$
and gradient descent is used to update $\phi$.
As a result, an optimized model can be obtained. When a new task T i arrives, a small training sample set can be used to train the network, enabling the rapid acquisition of the corresponding network model parameters for that task.
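The following PyTorch sketch shows one condition-triggered meta-update of this kind (a MAML-style inner step on the support set followed by an outer step on the query set). The loss callbacks are assumed to evaluate the RL objective functionally from the parameter list passed to them; they are placeholders, not part of the paper's code.

```python
import torch

def meta_update(policy, support_loss_fn, query_loss_fn, alpha=1e-2, beta=1e-5):
    """One MAML-style meta-update sketch: adapt a copy of the parameters on
    the support set, then update the original parameters with the query-set
    loss evaluated at the adapted parameters. The loss functions must compute
    the loss functionally from the parameter list they are given."""
    params = list(policy.parameters())

    # Inner step on the support set: theta_i' = theta - alpha * grad L_support(theta)
    support_loss = support_loss_fn(params)
    grads = torch.autograd.grad(support_loss, params, create_graph=True)
    adapted = [p - alpha * g for p, g in zip(params, grads)]

    # Outer step on the query set: theta <- theta - beta * grad L_query(theta_i')
    query_loss = query_loss_fn(adapted)
    meta_grads = torch.autograd.grad(query_loss, params)
    with torch.no_grad():
        for p, g in zip(params, meta_grads):
            p -= beta * g
```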
We apply meta-learning to the safe flight control method based on constrained reinforcement learning, using real-time interactive data to update the task set based on conditional triggers and continue fine-tuning the model. The online flight control is shown in Algorithm 2.
Algorithm 2 Online flight control based on meta-learning
Loading offline model parameters
Initialize the replay buffer B , step counter t = 0, learning rates α, β, and the batch counter l = 0.
while not done do
    Use policy π θ l to carry out the flight mission and collect a batch of samples D = { ( s t , a t , r t e , s t + 1 ) }
    According to the FIFO principle, update playback buffer B with D .
    Update step counter t = t + len( D )
    if attitude angle error exceeds threshold, then
        l = l + 1
        Use the samples in B to construct a task set and divide it into a support set and a query set
        Utilize the support set to compute adaptive parameters
                      $\theta_i' = \theta - \alpha \nabla_\theta L\left( f_\theta \right)$
        Utilize the query set to update the policy network parameters
$\theta \leftarrow \theta - \beta \nabla_\theta L\left( f_{\theta_i'} \right)$
    end if
end

4. Results and Discussion

The experiments were conducted on a hardware platform equipped with an AMD Ryzen 9 7945HX CPU running Windows 11 (version 23H2) and were based on the open-source deep learning framework PyTorch (version 1.10.1). Our experiments are also compatible with the TensorFlow framework and can run stably on an Ubuntu system.
The limitation on the deflection of control surfaces has greater practical significance in attitude control tasks. In the experiments involving attitude control tasks at this stage, the loss limit is set to 8, and the training results of the controller are shown in Figure 7.
As can be seen from the figure above, TRPO [34] has a larger rudder deflection loss, reflecting its high-frequency rudder control policy. CPO and PPO cannot complete training to learn a policy that achieves a superior reward. SAC [35] has a greater rudder deflection loss than ESB-CPO, and its reward converges more slowly than that of TRPO and ESB-CPO. ESB-CPO’s performance is close to TRPO’s in the early stages of training, and its reward converges rapidly. Thereafter, the rudder deflection loss decreases gradually, which benefits the safe flight of the aircraft, and even with the amount of rudder deflection limited, the return remains close to that of TRPO. The training process of the controller illustrates how our method works. With rudder deflection constraints added to the attitude control task, the controller explores effectively under very loose constraints in the early stage, with both high rewards and high losses. In the later stage, the controller gradually tries to avoid unsafe situations, and the rudder deflection loss gradually decreases. The results show that this method allows the loss limit to be exceeded and constraints to be violated early in training, while the constrained policy ultimately returns to the safe region, ensuring smooth rudder deflection and improving flight safety.

4.1. Assessment Method

To assess the control quality of the control algorithm, it is first necessary to determine whether it is stable or not based on the simulation results. If it is stable, the time domain control index is calculated, and then the control quality score of the algorithm is calculated; if it is unstable, the control quality score of the algorithm is recorded as 0.
The control quality score is calculated based on the satisfaction of time-domain indicators. For each indicator, if it meets the set scoring criteria, it can obtain the corresponding score for that indicator; otherwise, no score will be awarded for that indicator. The cumulative score of each indicator is the control quality score q i of the algorithm, where i is the simulation case number. The scoring criteria and values for each indicator are shown in Table 1.
Under the disturbances of aerodynamic parameters, wind, and other parameters, a total of N simulation cases were run. Based on the stability, control quality of each case, and the distribution of the quality of a group of cases, the total score S of the algorithm was obtained according to the following method:
$$S = C_1 S_1 + C_2 S_2 + C_3 S_1 S_3$$
where $C_1$, $C_2$, and $C_3$ represent the full scores for the different individual items, as detailed in Table 2 below.
S 1 is the scoring coefficient for the stability proportion item, which represents the proportion of stable cases among all the cases:
$$S_1 = \frac{N_s}{N}$$
where N s is the number of stable cases and N is the total number of cases.
S 2 is the scoring coefficient for the control quality item, which represents the average score of the control quality of each case:
$$S_2 = \frac{\sum_{i=1}^{N} q_i}{N}$$
S 3 is the scoring coefficient for the indicator dispersion item, which is calculated based on the dispersion degree of the control quality scores of the stable cases:
$$S_3 = 1 - V$$
where $V$ is the dispersion coefficient of the control quality scores for the stable cases. Let $q_{s,i}$ represent the control quality score of each stable case, and let $q_s = [q_{s,1}, q_{s,2}, \dots]$. The dispersion coefficient $V$ is the ratio of the standard deviation to the mean of $q_s$:
$$V = \frac{\mathrm{std}(q_s)}{\mathrm{mean}(q_s)}$$
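For clarity, the complete assessment score can be computed as in the short sketch below, which follows the definitions of $S_1$, $S_2$, $S_3$, and $V$ above; the example inputs are made up.

```python
import numpy as np

def total_score(case_scores, stable_mask, C1=30.0, C2=40.0, C3=30.0):
    """Compute the overall assessment score S = C1*S1 + C2*S2 + C3*S1*S3
    from per-case control-quality scores and a per-case stability flag."""
    q = np.asarray(case_scores, dtype=float)
    stable = np.asarray(stable_mask, dtype=bool)
    N = len(q)

    S1 = stable.sum() / N                      # proportion of stable cases
    S2 = q.sum() / N                           # average control-quality score
    q_s = q[stable]                            # scores of the stable cases only
    V = q_s.std() / q_s.mean() if len(q_s) and q_s.mean() > 0 else 1.0
    S3 = 1.0 - V                               # indicator-dispersion coefficient
    return C1 * S1 + C2 * S2 + C3 * S1 * S3

# Example: 100 cases, 95 stable with identical quality scores of 0.9
scores = [0.9] * 95 + [0.0] * 5
print(total_score(scores, [True] * 95 + [False] * 5))
```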
Then, to evaluate the real-time performance of the algorithm, a total of N examples were run under the disturbance of aerodynamic parameters and wind. During online learning, when the pitch angle or yaw angle deviation exceeds the set threshold, the condition for triggering learning is met, and timing is started. When the next action ends, timing is stopped. The definition of the control algorithm’s single calculation time is as follows:
$$T_i = t_{off}^{i} - t_{on}^{i}$$
where $T_i$ is the single calculation time of the control algorithm when the $i$-th condition is triggered, $t_{on}^{i}$ is the time at which the $i$-th condition is triggered and the timer starts, and $t_{off}^{i}$ is the time at which timing stops, i.e., when the next action after the $i$-th trigger ends.
So, the average time consumption of the control algorithm is as follows:
$$\bar{T} = \frac{\sum_{i=1}^{N} T_i}{N}$$

4.2. Experimental Details

During the simulation process, we obtain the state quantities of the aircraft by numerical integration using the fourth-order Runge–Kutta method with a fixed step size of $\Delta t = 0.002$ s. The model parameters and initial state of the aircraft are shown in Table 3.
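A minimal sketch of the fixed-step fourth-order Runge–Kutta integration is shown below; the placeholder linear dynamics in the example merely stand in for the aircraft state equations of Section 2.1.

```python
import numpy as np

def rk4_step(f, x, u, dt=0.002):
    """One fixed-step fourth-order Runge-Kutta update of the aircraft state
    x under control input u, where f(x, u) returns the state derivative
    assembled from the translational, rotational and quaternion equations."""
    k1 = f(x, u)
    k2 = f(x + 0.5 * dt * k1, u)
    k3 = f(x + 0.5 * dt * k2, u)
    k4 = f(x + dt * k3, u)
    return x + dt / 6.0 * (k1 + 2.0 * k2 + 2.0 * k3 + k4)

# Example with a placeholder linear model standing in for the aircraft dynamics
A = -0.1 * np.eye(3)
x_next = rk4_step(lambda x, u: A @ x + u, np.ones(3), np.zeros(3))
```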
During the training phase, the target attitude angles were set as shown in Equation (6). The goal of training is to complete the transition process within 20 s, and the simulation step size is 0.02 s, i.e., 1000 steps. Specific network hyperparameters are shown in Table 4.
During the testing phase, the goal of testing is to complete the transition process within 30 s, and the simulation step size is 0.02 s, i.e., 1500 steps. Specific network hyperparameters are shown in Table 5. The target attitude angles, specifically the target pitch angle and the target yaw angle, were set to vary piecewise over time. The equations for these changes are as follows:
$$\theta_{cx} = \begin{cases} 10.0^\circ, & 0.0\ \text{s} \le t < 3.0\ \text{s} \\ 15.0^\circ, & 3.0\ \text{s} \le t < 10.0\ \text{s} \\ 12.0^\circ, & 10.0\ \text{s} \le t \le 30.0\ \text{s} \end{cases}$$
$$\psi_{cx} = \begin{cases} 0.0^\circ, & 0.0\ \text{s} \le t < 15.0\ \text{s} \\ 20.0^\circ, & 15.0\ \text{s} \le t < 22.0\ \text{s} \\ 0.0^\circ, & 22.0\ \text{s} \le t \le 30.0\ \text{s} \end{cases}$$
Assume that the components of the wind velocity vector $\omega$ in the ground coordinate system are $[V_{wx}, V_{wy}, V_{wz}]^T$ and that the aircraft is in a horizontal steady wind field, so the wind velocity vector lies within the horizontal plane. The wind direction angle $A_w$ is defined, following the right-hand rule, as the angle through which one rotates about the ground coordinate system's $O_gY_g$ axis to reach the wind velocity vector. The wind velocity in the ground coordinate system is then calculated using the following formula:
$$\begin{bmatrix} V_{wx} \\ V_{wy} \\ V_{wz} \end{bmatrix} = \begin{bmatrix} v_w \cos A_w \\ 0 \\ v_w \sin A_w \end{bmatrix}$$
We conducted 100 simulations, with the wind direction angle A w randomly selected from the interval [0°, 360°] and the wind speed v w randomly selected from the interval [5, 60] (in m/s). Meanwhile, we added random white noise with mean 0 and variance from 1 to 10 to the aerodynamic parameters. The total scores for both the offline flight control method based on ESB-CPO and the meta-learning online flight control method based on ESB-CPO are presented in Table 6.
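The disturbance setup for one simulation case can be sketched as below, following the ranges just described; the number of perturbed aerodynamic parameters is an assumption, and the sign convention of the wind components follows the formula above.

```python
import numpy as np

def sample_disturbance(n_aero_params):
    """Draw one simulation case: a random horizontal steady wind and white
    noise added to the aerodynamic coefficients, matching the stated ranges."""
    A_w = np.deg2rad(np.random.uniform(0.0, 360.0))    # wind direction angle
    v_w = np.random.uniform(5.0, 60.0)                 # wind speed, m/s
    wind_ground = np.array([v_w * np.cos(A_w), 0.0, v_w * np.sin(A_w)])

    sigma = np.sqrt(np.random.uniform(1.0, 10.0))      # variance drawn from [1, 10]
    aero_noise = np.random.normal(0.0, sigma, size=n_aero_params)
    return wind_ground, aero_noise

# Example: disturbances for one of the 100 Monte Carlo cases
wind, noise = sample_disturbance(n_aero_params=6)
```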
As shown in Table 6, the overall score for executing the aircraft attitude angle control task using the offline control law is 82.5 points, while the overall score for performing the same task using the online control law is 100 points. By introducing online learning, the adaptability of attitude control to comprehensive errors such as aerodynamic parameters and wind has improved by 21% compared to offline learning.
We measured the computation time of the control algorithm over 1000 runs; the average time per computation is 0.6 ms on the 2.5 GHz CPU. Normalized to a 1 GHz CPU, the single computation time of the online-stage control algorithm is about 1.5 ms, which meets the real-time requirements.
The attitude angle control test results for the offline flight control method based on ESB-CPO and the online flight control method based on meta-learning with ESB-CPO are shown in Figure 8 and Figure 9, respectively.
As can be seen from the above figures, the flight control method based on ESB-CPO can reduce the frequency of control surface movements by incorporating control surface losses to limit their rotation. When step commands arrive, the control surfaces deflect quickly to track the attitude angles; during other time periods, the control surface deflections remain stable. In the offline phase of the attitude control task, because of the large variation in the step commands for the attitude angles, the control law cannot quickly complete the attitude angle tracking task. In the online phase, however, the control law can learn online to adjust the aircraft control policy, complete the attitude angle control task, and reduce the rudder deflection, which ensures the safety and stability of the aircraft during flight.
As CPO and PPO cannot perform the attitude angle tracking task, only TRPO and SAC test results are shown in Figure 10 and Figure 11.
As can be seen from Figure 10 and Figure 11, the flight control method based on TRPO achieves attitude control accuracy by increasing the frequency and amplitude of the rudder deflection, which leads to greater perturbation of the aircraft during flight and poses a major potential hazard to flight safety. With the SAC-based flight control method, the elevator deflects significantly more frequently, leading to a decrease in its tracking stability.

5. Conclusions

To address the issues of inefficiency, safety, and stability in the intelligent flight process of UAVs, we propose a new architecture for online safe flight control based on constrained reinforcement learning. Firstly, a flight control simulation environment is established based on PyFME for offline training and online learning. Secondly, to avoid flight accidents or mission failures caused by online learning, the ESB-CPO algorithm is used for flight control, and a constrained optimization problem is constructed based on the trust region method, which ensures the safety and stability of the aircraft during flight by introducing the Lyapunov stability requirement and limiting the rudder deflection loss. Finally, meta-learning is combined with the ESB-CPO algorithm to perform attitude angle tracking tasks. Experimental results show that the overall score of the aircraft attitude angle control task is 100 points and that, after introducing online learning, the adaptability of attitude control to comprehensive errors, including aerodynamic parameters and wind, improves by 21% compared to offline learning, indicating that the control law can be adapted online according to the environment to ensure the safety and stability of the aircraft during flight. The current study is conducted only in a simulated environment, which may not fully capture the complexities and uncertainties present in the real world. In future work, experiments need to be conducted in more complex scenarios, such as communication interference or motor faults [36,37,38], to explore and improve control strategies. At the same time, we will test the method on various aircraft models in order to translate the research results into practical applications.

Supplementary Materials

The following supporting information can be downloaded at: https://gitee.com/w776538047/online-safe-flight-control-method, Video S1: A video of the attitude angle control test result.

Author Contributions

Conceptualization, J.Z.; methodology, H.X.; software, H.X.; validation, J.Z.; formal analysis, Z.W.; investigation, Z.W.; resources, Z.W.; data curation, Z.W.; writing—original draft preparation, J.Z.; writing—review and editing, T.Z.; visualization, J.Z.; supervision, T.Z. All authors have read and agreed to the published version of the manuscript.

Funding

This research received no external funding.

Data Availability Statement

The data presented in this study are available on request from the corresponding author.

Acknowledgments

The authors would like to thank the other researchers who helped with this study, family and friends who were not involved in the editing of this article, and the reviewers for their valuable comments and suggestions.

Conflicts of Interest

The authors declare no conflicts of interest.

References

  1. Cheng, H.; Zhang, S.; Liu, T.; Xu, S.; Huang, H. Review of Autonomous Decision-Making and Planning Techniques for Unmanned Aerial Vehicle. Air Space Def. 2024, 7, 6–15+80. [Google Scholar]
  2. Swaroop, D.; Hedrick, K.; Yip, P.P.; Gerdes, J.C. Dynamic surface control for a class of nonlinear systems. IEEE Trans. Autom. Control 2000, 45, 1893–1899. [Google Scholar] [CrossRef]
  3. Xidias, E.K. A Decision Algorithm for Motion Planning of Car-Like Robots in Dynamic Environments. Cybern. Syst. 2021, 52, 533–552. [Google Scholar] [CrossRef]
  4. Huang, Z.; Li, F.; Yao, J.; Chen, Z. MGCRL: Multi-view graph convolution and multi-agent reinforcement learning for dialogue state tracking. IEEE Trans. Autom. Control 2000, 45, 1893–1899. [Google Scholar] [CrossRef]
  5. Hellaoui, H.; Yang, B.; Taleb, T.; Manner, J. Traffic Steering for Cellular-Enabled UAVs: A Federated Deep Reinforcement Learning Approach. In Proceedings of the 2023 IEEE International Conference on Communications (ICC), Rome, Italy, 28 May–1 June 2023. [Google Scholar]
  6. Xia, B.; Mantegh, I.; Xie, W. UAV Multi-Dynamic Target Interception: A Hybrid Intelligent Method Using Deep Reinforcement Learning and Fuzzy Logic. Drones 2024, 8, 226. [Google Scholar] [CrossRef]
  7. Kaufmann, E.; Bauersfeld, L.; Loquercio, A.; Müller, M.; Koltun, V.; Scaramuzza, D. Champion-level drone racing using deep reinforcement learning. Nature 2023, 620, 982–987. [Google Scholar] [CrossRef]
  8. Cui, Y.; Hou, B.; Wu, Q.; Ren, B.; Wang, S.; Jiao, L.C. Remote Sensing Object Tracking With Deep Reinforcement Learning Under Occlusion. IEEE Trans. Geosci. Remote Sens. 2022, 60, 1–13. [Google Scholar] [CrossRef]
  9. Zhu, Z.D.; Lin, K.X.; Jain, A.K.; Zhou, J.Y. Transfer Learning in Deep Reinforcement Learning: A Survey. IEEE Trans. Pattern Anal. Mach. Intell. 2023, 45, 13344–13362. [Google Scholar] [CrossRef]
  10. Kiran, B.R.; Sobh, I.; Talpaert, V.; Mannion, P.; AI, S.; Yogamani, S.; Pérez, P. Deep Reinforcement Learning for Autonomous Driving: A Survey. IEEE Trans. Intell. Transp. Syst. 2022, 23, 4909–4926. [Google Scholar] [CrossRef]
  11. Minsky, M. Steps toward Artificial Intelligence. Proc. IRE. 1961, 49, 8–20. [Google Scholar] [CrossRef]
  12. Zhao, W.; He, T.; Chen, R.; Wei, T.; Liu, C. Safe Reinforcement Learning: A Survey. Acta Autom. Sin. 2023, 49, 1813–1835. [Google Scholar]
  13. Liu, X.; Nan, Y.; Xie, R.; Zhang, S. DDPG Optimization Based on Dynamic Inverse of Aircraft Attitude Control. Comput. Simul. 2020, 37, 37–43. [Google Scholar]
  14. Hao, C.; Fang, Z.; Li, P. Output feedback reinforcement learning control method based on reference model. J. Zhejiang Univ. Eng. Sci. 2013, 47, 409–414+479. [Google Scholar]
  15. Huang, X.; Liu, J.; Jia, C.; Wang, Z.; Zhang, J. Deep Deterministic policy gradient algorithm for UAV control. Acta Aeronaut. Astronaut. Sin. 2021, 42, 404–414. [Google Scholar]
  16. Choi, J.; Kim, H.M.; Hwang, H.J.; Kim, Y.D.; Kim, C.O. Modular Reinforcement Learning for Autonomous UAV Flight Control. Drones 2023, 7, 418. [Google Scholar] [CrossRef]
  17. Woo, J.; Yu, C.; Kim, N. Deep reinforcement learning-based controller for path following of an unmanned surface vehicle. Ocean Eng. 2019, 183, 155–166. [Google Scholar] [CrossRef]
  18. Tang, J.; Liang, Y.; Li, K. Dynamic Scene Path Planning of UAVs Based on Deep Reinforcement Learning. Drones 2024, 8, 60. [Google Scholar] [CrossRef]
  19. Wang, W.; Gokhan, I. Reinforcement learning based closed-loop reference model adaptive flight control system design. Sci. Technol. Eng. 2023, 23, 14888–14895. [Google Scholar]
  20. Yang, R.; Du, C.; Zheng, Y.; Gao, H.; Wu, Y.; Fang, T. PPO-Based Attitude Controller Design for a Tilt Rotor UAV in Transition Process. Drones 2023, 7, 499. [Google Scholar] [CrossRef]
  21. Burak, Y.; Wu, H.; Liu, H.X.; Yang, Y. An Attitude Controller for Quadrotor Drone Using RM-DDPG. Int. J. Adapt. Control Signal Process. 2021, 35, 420–440. [Google Scholar]
  22. Ma, B.; Liu, Z.; Dang, Q.; Zhao, W.; Wang, J.; Cheng, Y.; Yuan, Z. Deep reinforcement learning of UAV tracking control under wind disturbances environments. IEEE Trans. Instrum. Meas. 2023, 72, 1–13. [Google Scholar] [CrossRef]
  23. Chow, Y.; Nachum, O.; Faust, A.; Ghavamzadeh, M.; DuéñezGuzmán, E. Lyapunov-based safe policy optimization for continuous control. arXiv 2019, arXiv:1901.10031. [Google Scholar]
  24. Yu, X.; Xu, S.; Fan, Y.; Ou, L. Self-Adaptive LSAC-PID Approach Based on Lyapunov Reward Shaping for Mobile Robots. J. Shanghai Jiaotong Univ. (Sci.) 2023, 1–18. [Google Scholar] [CrossRef]
  25. PyFME. Available online: https://pyfme.readthedocs.io/en/latest/ (accessed on 12 April 2024).
  26. Filipe, N. Nonlinear Pose Control and Estimation for Space Proximity Operations: An Approach Based on Dual Quaternions. Ph.D. Thesis, Georgia Institute of Technology, Atlanta, GA, USA, 2014. [Google Scholar]
  27. Qing, Y.Y. Inertial Navigation, 3rd ed.; China Science Publishing & Media Ltd.: Beijing, China, 2020; pp. 252–284. [Google Scholar]
  28. Gazebo. Available online: https://github.com/gazebosim/gz-sim (accessed on 28 July 2024).
  29. Madaan, R.; Gyde, N.; Vemprala, S.; Vemprala, M.; Brown, M.; Nagami, K.; Taubner, T.; Cristofalo, E.; Scaramuzza, D.; Schwager, M.; et al. AirSim drone racing Lab. arXiv 2020, arXiv:2003.05654. [Google Scholar]
  30. FlightGear. Available online: https://wiki.flightgear.org/Main_Page (accessed on 28 July 2024).
  31. X-Plane. Available online: https://developer.x-plane.com/docs/ (accessed on 28 July 2024).
  32. Xu, H.; Wang, S.; Wang, Z.; Zhang, Y.; Zhuo, Q.; Gao, Y.; Zhang, T. Efficient Exploration Using Extra Safety Budget in Constrained Policy Optimization. In Proceedings of the 2023 IEEE International Conference on Intelligent Robots and Systems (IROS), Detroit, MI, USA, 1–5 October 2023. [Google Scholar]
  33. Achiam, J.; Held, D.; Tamar, A.; Abbeel, P. Constrained Policy Optimization. In Proceedings of the 34th International Conference on Machine Learning (ICML), Sydney, Australia, 6–11 August 2017. [Google Scholar]
  34. Schulman, J.; Levine, S.; Moritz, P.; Jordan, M.; Abbeel, P. Trust Region Policy Optimization. In Proceedings of the 32nd International Conference on Machine Learning (ICML), Lille, France, 7–9 July 2015. [Google Scholar]
  35. Haarnoja, T.; Zhou, A.; Abbeel, P.; Levine, S. Soft Actor-Critic: Off-Policy Maximum Entropy Deep Reinforcement Learning with a Stochastic Actor. arXiv 2018, arXiv:1801.01290. [Google Scholar]
  36. Zheng, Q.; Zhao, P.; Zhang, D.; Wang, H. MR-DCAE: Manifold regularization-based deep convolutional autoencoder for unauthorized broadcasting identification. Int. J. Intell. Syst. 2021, 36, 7204–7238. [Google Scholar] [CrossRef]
  37. Gopi, S.P.; Magarini, M.; Alsamhi, S.H.; Shvetsov, A.V. Machine Learning-Assisted Adaptive Modulation for Optimized Drone-User Communication in B5G. Drones 2021, 5, 128. [Google Scholar] [CrossRef]
  38. Zheng, Q.; Saponara, S.; Tian, X.; Yu, Z.; Elhanashi, A.; Yu, R. A real-time constellation image classification method of wireless communication signals based on the lightweight network MobileViT. Cogn. Neurodyn. 2024, 18, 659–671. [Google Scholar] [CrossRef]
Figure 1. The corresponding calculation diagram, where $C_b^n$ is the attitude matrix. $C_{11} = q_0^2 + q_1^2 - q_2^2 - q_3^2$, $C_{12} = 2(q_1 q_2 - q_0 q_3)$, $C_{13} = 2(q_0 q_2 + q_1 q_3)$, $C_{21} = 2(q_1 q_2 + q_0 q_3)$, $C_{22} = q_0^2 - q_1^2 + q_2^2 - q_3^2$, $C_{23} = 2(q_2 q_3 - q_0 q_1)$, $C_{31} = 2(q_1 q_3 - q_0 q_2)$, $C_{32} = 2(q_2 q_3 + q_0 q_1)$, $C_{33} = q_0^2 - q_1^2 - q_2^2 + q_3^2$.
Figure 2. Control system block diagram.
Figure 3. Action network block diagram.
Figure 4. Attitude control mission diagram.
Figure 5. Schematic diagram of online safe flight control method based on constrained reinforcement learning. During the offline training phase of the control law, constrained reinforcement learning with additional safety budget is utilized to update the control law strategy through the interaction between the control law and the offline training environment. In the online optimization stage of the control law, the control law model trained in the offline stage is used to construct a task set by flying online in a virtual environment. Then, real-time interactive data are used to update the task set using a conditional triggered meta-learning online reinforcement learning method and continue fine-tuning the control law strategy.
Figure 6. ESB-CPO algorithm framework. This method first calculates the adaptive factors α i θ and β i θ ( s t ) , and then obtains the LAE value from them. Finally, the approximate trust domain method is used to obtain the new policy π θ through backtracking search to ensure that the constraints are met and update the current policy.
Figure 7. Training results of attitude control tasks.
Figure 8. Offline attitude angle control based on ESB-CPO. Theta, psi, and phi represent the pitch angle, yaw angle, and roll angle, respectively. Delta_elevator, delta_aileron, and delta_rudder represent the deflection amounts of the elevator, aileron, and rudder, respectively. Cur_average is the average of the attitude angle errors over the last 10 time steps.
Figure 9. Online attitude angle control based on ESB-CPO, see Supplementary Material.
Figure 10. Online attitude angle control based on TRPO.
Figure 11. Online attitude angle control based on SAC.
Table 1. Scoring criteria and values for each indicator.
Indicator Subject | Scoring Criteria | Indicator Score
Pitch channel, first incentive | Steady-state error $e_{ss,pitch,1} \le 1.0^\circ$ | 0.1
 | Adjusting time $t_{s,pitch,1} \le 4.0$ s | 0.05
 | Overshoot $\sigma_{pitch,1} \le 3.0^\circ$ | 0.1
Pitch channel, second incentive | Steady-state error $e_{ss,pitch,2} \le 1.0^\circ$ | 0.1
 | Adjusting time $t_{s,pitch,2} \le 2.0$ s | 0.05
 | Overshoot $\sigma_{pitch,2} \le 1.5^\circ$ | 0.1
Yaw channel, first incentive | Steady-state error $e_{ss,yaw,1} \le 1.0^\circ$ | 0.1
 | Adjusting time $t_{s,yaw,1} \le 5.0$ s | 0.05
 | Overshoot $\sigma_{yaw,1} \le 5.0^\circ$ | 0.1
Yaw channel, second incentive | Steady-state error $e_{ss,yaw,2} \le 1.0^\circ$ | 0.1
 | Adjusting time $t_{s,yaw,2} \le 5.0$ s | 0.05
 | Overshoot $\sigma_{yaw,2} \le 5.0^\circ$ | 0.1
Total | | 1.0
Table 2. Full marks for each individual item.
Symbol | Meaning | Value
$C_1$ | Full marks for the stabilization ratio term | 30
$C_2$ | Full marks for the control quality term | 40
$C_3$ | Full marks for the indicator dispersion term | 30
Total | | 100
Table 3. Model parameters and initial state of the aircraft.
Parameter | Symbol | Value | Dimension
Mass | $m$ | 30 | kg
Wingspan | $S_w$ | 3 | m
Reference area | $S_{ref}$ | 1.5 | m²
Reference chord length | $L_{ref}$ | 0.469 | m
Centre-of-mass X-axis coordinate in the theoretical vertex system | $x_c$ | 0.632 | m
Centre-of-mass Y-axis coordinate in the theoretical vertex system | $y_c$ | 0.0473 | m
Centre-of-mass Z-axis coordinate in the theoretical vertex system | $z_c$ | 0.0014 | m
Initial position random range | $H_0$ | (200, 400) | m
Initial speed random range | $V_0$ | (25, 40) | m/s
Initial pitch angle | $\theta_0$ | 0.0 | °
Initial yaw angle random range | $\psi_0$ | (−180, 180) | °
Initial roll angle | $\Phi_0$ | 0.0 | °
Initial pitch rate | $q_0$ | 0.0 | °/s
Initial yaw rate | $r_0$ | 0.0 | °/s
Initial roll rate | $p_0$ | 0.0 | °/s
Table 4. Hyperparameters used in training.
Name | Value | Name | Value
check_freq | 25 | min_rel_budget | 1.0
cost_limit | 8 | safety_budget | 15
entropy_coef | 0.01 | saute_discount_factor | 0.99
epochs | 500 | test_rel_budget | 1.0
gamma | 0.99 | unsafe_reward | −1.0
lam | 0.95 | save_freq | 10
lam_c | 0.95 | seed | 0
max_grad_norm | 0.5 | steps_per_epoch | 10,000
num_mini_batches | 16 | target_kl | 0.01
pi_lr | 0.0003 | train_pi_iterations | 80
max_ep_len | 1000 | train_v_iterations | 40
max_rel_budget | 1.0 | vf_lr | 0.001
Table 5. Hyperparameters used in testing.
Name | Value | Name | Value
max_ep_len | 1500 | qry_size | 80
buffer_size | 1000 | dist_angle | 0.8
batch_size | 200 | learning_rate | 1 × 10⁻⁵
minimal_size | 200 | gamma | 0.99
sup_size | 120 | lam | 0.95
Table 6. Attitude control mission assessment results.
Symbol | Meaning | Value
$S_{off}$ | The total score of the offline flight algorithm | 82.5
$S_{on}$ | The total score of the online flight algorithm | 100
