Article

Deep Deterministic Policy Gradient-Based Active Disturbance Rejection Controller for Quad-Rotor UAVs

1 School of Astronautics, Beihang University (BUAA), Beijing 100191, China
2 School of Automation Science and Electrical Engineering, Beihang University (BUAA), Beijing 100191, China
* Author to whom correspondence should be addressed.
Mathematics 2022, 10(15), 2686; https://doi.org/10.3390/math10152686
Submission received: 9 June 2022 / Revised: 26 July 2022 / Accepted: 27 July 2022 / Published: 29 July 2022
(This article belongs to the Special Issue Deep Learning and Adaptive Control)

Abstract: Thanks to their hovering and vertical take-off and landing abilities, quadrotor unmanned aerial vehicles (UAVs) are receiving a great deal of attention. With the diversified development of UAV functions, the requirements for flight performance with higher stability and maneuverability are increasing. To address parameter uncertainty and external disturbance, a deep deterministic policy gradient-based active disturbance rejection controller (DDPG-ADRC) is proposed. The total disturbance can be compensated dynamically by adjusting the controller bandwidth and the estimate of the system parameters online. In this way, the tradeoff between disturbance rejection and rapidity can be realized better than with traditional ADRC. The process of parameter tuning is demonstrated through simulations of tracking a step command and a sine sweep under ideal and disturbed conditions. Further analysis shows that the proposed DDPG-ADRC achieves better performance.

1. Introduction

Quadrotor unmanned aerial vehicles (UAVs) have attracted attention thanks to their ability to hover and to take off and land vertically. Due to their under-actuated nature, quadrotors' position control is performed by controlling the attitude angles [1]. For this reason, attitude control of quadrotors has been a hot research topic in recent years. However, quadrotors are subject to parameter uncertainty and external disturbance, which threaten flight safety and pose huge challenges to controller design [2]. In addition, with the growing popularity of quadrotors, higher requirements are being placed on their controllers. Thus, it is urgent to design an advanced controller to improve reliability and rapidity.
In the literature, plenty of approaches have been studied for the quadrotor attitude control problem. As a classical controller, proportion integration differentiation (PID) is widely used because of its simple structure and good control effect [3,4,5]. Taybe et al. [6] developed an augmented proportion differentiation (PD) attitude controller that guarantees exponential stability. Cao et al. [7] focused on the position control of quadrotors using an inner–outer loop control structure. The outer loop generates a saturated thrust, reference roll, and pitch angles, while the inner loop is designed to follow these reference angles using a traditional PID controller.
Due to nonlinearity and disturbances, the control effect of PID is unsatisfactory. As one of the most important control techniques, sliding mode control (SMC) is able to handle nonlinear systems with external disturbances. Based on second-order SMC, Zheng et al. [8] designed a controller for a small quadrotor unmanned aerial vehicle (UAV). Xiong et al. [9] designed a highly coupled and nonlinear controller for a fully actuated UAV through a novel robust terminal sliding mode control algorithm. Nevertheless, the oscillation caused by SMC is the main obstacle restricting its application.
To achieve robust performance and stabilization, the robust H∞ control method of George Zames has been widely studied [10]. Due to the uncertain nature of aircraft systems, Babar et al. [11] improved the traditional inner–outer loop strategy and adopted a robust controller for the inner control loop. Liu et al. [12] designed a distributed robust controller consisting of a position controller and an attitude controller for multiple quadrotors with nonlinearities and disturbances.
To deal with nonlinearities and disturbances, the main idea of active disturbance rejection control (ADRC) is to reduce the state feedback, whether linear or non-linear, to a cascade of integrators [13,14]. To solve the problem that UAV tracking control relies too much on mathematical modeling and the accuracy of measurements, Niu et al. [15] proposed a longitudinal pitching angle control system based on a nonlinear ADRC. Lotufo et al. [16] combined ADRC with embedded model control (EMC), relying on the disturbance rejector to bridge the gap between model and reality.
However, there are issues remaining that deserve attention [17].
  • The classical controller design relies on an understanding of the physics of flight and has difficulty handling coupled multi-loop design tasks. In other words, the classical one-loop-at-a-time design cannot guarantee success when more loops are added and coupled.
  • Modern control techniques often require exact knowledge of models and are sensitive to parameter uncertainty and external disturbances [18]. However, different loads in each flight mission lead to uncertainty in system parameters. Meanwhile, parameters may be difficult to obtain, especially aerodynamic parameters. This sometimes leads to unstable behaviors, limiting the application of model-based controllers.
  • For modern robust controllers [12], it is usually difficult to obtain the upper bounds of external disturbance and parameter uncertainty, which causes unsatisfactory performance.
  • In the ADRC algorithm, the predefined bandwidth of the closed-loop system is unable to guarantee the tradeoff between robustness and transient tracking performance. Meanwhile, the estimation of parameters affects the ability of the controller to resist disturbances [14].
To address the controller parameter tuning problem, many optimization algorithms have been used, including genetic algorithms (GA) [19], particle swarm optimization (PSO) [20], and grey wolf optimization (GWO) [21]. Bolandi et al. [22] used an analytical optimization method to tune a conventional PID controller for stabilization and disturbance rejection of quadrotors.
With the development of computer science and technology, reinforcement learning (RL) is able to autonomously learn optimal strategies through continuous interaction with the environment and is considered one of the most likely approaches for achieving general artificial intelligence [23]. Lee et al. [24] proposed an RL-based adaptive PID controller for dynamic positioning systems. The results showed that the system had better station-keeping performance without any deterioration in its control efficiency. Gheisarnejad et al. [25] proposed a deep deterministic policy gradient (DDPG)-based supplementary controller to enhance the adaptive capability of the tracking control problem. Zhao et al. [26] employed RL to update the optimal control weights in the fault-tolerant formation control law design. Zheng et al. [27] used the Q-learning algorithm to select the adaptive parameters for ADRC. However, as Q-learning can only deal with discrete problems, the states need to be stored in the Q table, and the action must be discrete. By itself, Q-learning cannot deal with complex continuous problems such as attitude control of UAVs. RL, which can solve the nonlinear optimal consensus control problem, is widely used in fault-tolerant control. Ma et al. [28] presented an adaptive model-free fault-tolerant control scheme based on integral RL by introducing the integral of the tracking error. Li et al. [29] designed direct adaptive optimal controllers by combining the backstepping technique with RL. The critic network is used to approximate the strategic utility functions and the action network is used to approximate the unknown and desired control input signals.
Motivated by the above discussions, ADRC based on DDPG is proposed in this paper. The main contributions of this paper are as follows:
  • A realistic and nonlinear model of quadrotors is established, considering parameter uncertainty and external disturbances.
  • Online continuous adjustment of the closed-loop bandwidth is realized by DDPG, which is beneficial for balancing robustness and transient tracking performance.
  • DDPG is adopted to achieve fast and accurate compensation for the total disturbance of the system, further improving the response speed and control accuracy.
The remainder of this paper is organized as follows. In Section 2, the proposed dynamic quadrotor model with internal and external disturbances is provided. The proposed DDPG-based ADRC is presented in Section 3. The simulation results are provided and analyzed in Section 4. Finally, Section 5 presents our conclusions.

2. Nonlinear Model of Quadrotors

In this section, a nonlinear dynamic model with internal and external disturbances is provided. Figure 1 shows the structure and coordinate system of the quadrotor.

2.1. Ideal Model of Quadrotors

The ideal dynamic model of quadrotors is established in Formula (1).
$$m\ddot{E}^{I} = R_{b}^{I}F^{b}, \qquad J\ddot{\Theta} = C(J,\dot{\Theta}) + M^{b},$$
where $m$ is the quadrotor mass, $E^{I} = [E_{x}^{I}, E_{y}^{I}, E_{z}^{I}]^{T}$ is the position expressed in the Earth-inertial frame, $R_{b}^{I} \in SO(3)$ denotes the rotation matrix from the body-fixed frame to the Earth-inertial frame, and $F^{b} = [0, 0, f_{t}]^{T} - (R_{b}^{I})^{T}[0, 0, mg]^{T}$ is the force expressed in the body-fixed frame, where $f_{t} = C_{w}\sum_{i=1}^{4} w_{i}^{2}$, $C_{w}$ is the lift coefficient, and $w_{i}\ (i = 1, 2, 3, 4)$ denotes the rotational speed of the $i$-th rotor. Here, $J = \mathrm{diag}\{J_{\phi}, J_{\theta}, J_{\psi}\}$ denotes the inertia matrix, while $\Theta = [\phi, \theta, \psi]^{T}$ contains the Euler angles, i.e., the roll, pitch, and yaw angles, respectively; thus, the rotation matrix can be written in terms of the Euler angles [12]:
$$R_{b}^{I} = \begin{bmatrix} \cos\theta\cos\psi & \cos\psi\sin\phi\sin\theta - \cos\phi\sin\psi & \sin\phi\sin\psi + \cos\phi\cos\psi\sin\theta \\ \cos\theta\sin\psi & \cos\phi\cos\psi + \sin\phi\sin\theta\sin\psi & \cos\phi\sin\theta\sin\psi - \cos\psi\sin\phi \\ -\sin\theta & \cos\theta\sin\phi & \cos\phi\cos\theta \end{bmatrix}$$
$C(J, \dot{\Theta})$ denotes the Coriolis term, where $C(J, \dot{\Theta}) = \begin{bmatrix} (J_{y} - J_{z})\dot{\theta}\dot{\psi} \\ (J_{z} - J_{x})\dot{\phi}\dot{\psi} \\ (J_{x} - J_{y})\dot{\phi}\dot{\theta} \end{bmatrix}$;
$M^{b} = [M_{x}^{b}, M_{y}^{b}, M_{z}^{b}]^{T}$ represents the torque in the body-fixed frame:
$$\begin{bmatrix} M_{x}^{b} \\ M_{y}^{b} \\ M_{z}^{b} \end{bmatrix} = \begin{bmatrix} U_{2} + J_{r}q\,(-w_{1} + w_{2} - w_{3} + w_{4}) \\ U_{3} - J_{r}q\,(-w_{1} + w_{2} - w_{3} + w_{4}) \\ U_{4} \end{bmatrix},$$
where $\begin{bmatrix} U_{2} \\ U_{3} \\ U_{4} \end{bmatrix} = \begin{bmatrix} l\,(F_{2} + F_{3} - F_{1} - F_{4}) \\ l\,(F_{3} + F_{4} - F_{1} - F_{2}) \\ d\,(w_{2}^{2} + w_{4}^{2} - w_{1}^{2} - w_{3}^{2}) \end{bmatrix}$. Here, $l$ and $d$ represent the distance from each motor to the center of mass and the anti-torque coefficient, respectively, while $J_{r}$ is the moment of inertia of the motors and propellers. The nominal model of the quadrotor has now been established.

2.2. Internal and External Disturbances

Quadrotors usually carry various mission payloads to perform different missions, resulting in changes in parameters such as the mass or moment of inertia. This can be modeled as $m^{*} = k_{m}m$ and $J^{*} = k_{J}J$, where $m^{*}$ and $J^{*}$ are the actual mass and inertia matrix, respectively, and $k_{m}$ and $k_{J}$ are scaling factors representing the uncertainty. At the same time, quadrotors are inevitably disturbed by the environment, which introduces an external disturbance torque $M^{e}$.
Thus, the actual dynamic model of quadrotors is expressed as follows:
$$m^{*}\ddot{E}^{I} = R_{b}^{I}F^{b}, \qquad J^{*}\ddot{\Theta} = C^{*}(J^{*}, \dot{\Theta}) + M^{b} + M^{e}$$
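To make the rotational dynamics concrete, the following Python sketch integrates $J^{*}\ddot{\Theta} = C^{*}(J^{*},\dot{\Theta}) + M^{b} + M^{e}$ with forward Euler. The inertia values, uncertainty factor, torques, and step size are illustrative assumptions rather than the paper's simulation settings.

```python
import numpy as np

# Illustrative values only; the paper's parameters are listed in Table 1.
J = np.array([0.01724, 0.01724, 0.03])   # nominal inertia [J_x, J_y, J_z] (kg*m^2); J_z assumed
k_J = 1.1                                # assumed uncertainty scaling factor k_J
J_act = k_J * J                          # actual inertia J* = k_J * J

def coriolis(J, dTheta):
    """Coriolis term C(J, dTheta) from Section 2.1 (diagonal inertia)."""
    dphi, dth, dpsi = dTheta
    return np.array([(J[1] - J[2]) * dth * dpsi,
                     (J[2] - J[0]) * dphi * dpsi,
                     (J[0] - J[1]) * dphi * dth])

def rot_step(Theta, dTheta, M_b, M_e, dt=1e-3):
    """One forward-Euler step of J* Theta_ddot = C(J*, Theta_dot) + M_b + M_e."""
    ddTheta = (coriolis(J_act, dTheta) + M_b + M_e) / J_act
    return Theta + dt * dTheta, dTheta + dt * ddTheta

# Example: constant control torque plus a small external disturbance torque.
Theta, dTheta = np.zeros(3), np.zeros(3)
for _ in range(1000):
    Theta, dTheta = rot_step(Theta, dTheta,
                             M_b=np.array([0.0, 0.02, 0.0]),
                             M_e=np.array([0.0, 0.005, 0.0]))
```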

3. Construction of DDPG-Based ADRC

3.1. ADRC-Based Attitude Controller Design

Only the attitude control of quadrotors is considered here, and the structure of the ADRC is shown in Figure 2. To facilitate the control system design, the quadrotor is reduced to a second-order system with perturbations, which can be written in state-equation form:
$$\dot{x} = Ax + Bu + Eh, \qquad y = Cx$$
where
$$A = \begin{bmatrix} 0 & 1 & 0 \\ 0 & 0 & 1 \\ 0 & 0 & 0 \end{bmatrix}, \quad B = \begin{bmatrix} 0 \\ b \\ 0 \end{bmatrix}, \quad C = \begin{bmatrix} 1 & 0 & 0 \end{bmatrix},$$
$h$ is the unknown total disturbance, and $E = [0 \;\; 0 \;\; 1]^{T}$.
Figure 2. Structure of the ADRC; $\bar{\theta}$ is the desired pitch angle and $b$ is the estimate of the system parameter.
The extended state observer (ESO) is designed from the ideal model of the quadrotor, which can be established as follows:
$$\dot{z} = Az + Bu + L(y - \hat{y}), \qquad \hat{y} = Cz$$
where $y$ is the measured output of the system and $L$ is the observer gain vector, $L = [\beta_{1} \;\; \beta_{2} \;\; \beta_{3}]^{T}$.
Let $e_{i} = x_{i} - z_{i}$ and combine Equations (3) and (4); then, the error dynamics can be written as
$$\dot{e} = A_{e}e + Eh$$
where $A_{e} = A - LC = \begin{bmatrix} -\beta_{1} & 1 & 0 \\ -\beta_{2} & 0 & 1 \\ -\beta_{3} & 0 & 0 \end{bmatrix}$.
It is obvious that the ESO is bounded-input bounded-output stable if the roots of the characteristic polynomial of A e
$$\lambda(s) = s^{3} + \beta_{1}s^{2} + \beta_{2}s + \beta_{3}$$
are all in the left half plane and h is bounded [14,30].
Thus, $\beta_{1}, \beta_{2}, \beta_{3}$ can be designed using the pole placement technique. Let $\lambda(s) = s^{3} + \beta_{1}s^{2} + \beta_{2}s + \beta_{3} = (s + w_{o})^{3}$. It then follows that
$$\beta_{1} = 3w_{o}, \qquad \beta_{2} = 3w_{o}^{2}, \qquad \beta_{3} = w_{o}^{3},$$
where w o is the bandwidth of the observer.
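As a minimal sketch of this bandwidth parameterization, the observer gains follow directly from $w_{o}$, and a discretized third-order ESO can then estimate the state together with the extended (total-disturbance) state. The forward-Euler discretization and the sample time are assumptions made for illustration only.

```python
import numpy as np

def eso_gains(w_o):
    """Pole placement (s + w_o)^3: beta_1 = 3 w_o, beta_2 = 3 w_o^2, beta_3 = w_o^3."""
    return 3.0 * w_o, 3.0 * w_o**2, w_o**3

def eso_step(z, y, u, b, w_o, dt=1e-3):
    """One forward-Euler step of z_dot = A z + B u + L (y - C z) for the third-order ESO.
    z = [z1, z2, z3]; z3 estimates the total disturbance."""
    beta1, beta2, beta3 = eso_gains(w_o)
    e = y - z[0]                              # output estimation error y - y_hat
    z1 = z[0] + dt * (z[1] + beta1 * e)
    z2 = z[1] + dt * (z[2] + b * u + beta2 * e)
    z3 = z[2] + dt * (beta3 * e)
    return np.array([z1, z2, z3])
```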
For the controller, the ideal system can be written as $\ddot{y} = x_{3} + bu$, where $\dot{x}_{3} = h$. According to the proof above, the appropriate values in (6) make $e \to 0$; in other words, $z_{1} \to x_{1}$, $z_{2} \to x_{2}$, $z_{3} \to x_{3}$.
The controller is designed as $u = \dfrac{u_{0} - z_{3}}{b}$. Thus, $\ddot{y} = x_{3} + bu = (x_{3} - z_{3}) + u_{0} \approx u_{0}$, where $u_{0}$ is designed as a PD controller, $u_{0} = k_{p}(r - z_{1}) + k_{d}(\dot{r} - z_{2})$. It can be assumed that $\dot{r} = 0$. Then, $\ddot{y} = k_{p}(r - z_{1}) - k_{d}z_{2}$ and the closed-loop transfer function can be written as
$$G_{cl} = \frac{k_{p}}{s^{2} + k_{d}s + k_{p}}.$$
When $k_{p} = w_{c}^{2}$ and $k_{d} = 2\xi w_{c}$, the closed-loop system reduces to a standard second-order system.
Taken together, the above demonstrates the effectiveness of the ESO and the controller. Normally, $w_{o} \approx 5 \sim 10\, w_{c}$, where $w_{c}$ is the bandwidth of the controller. In this paper,
$$w_{o} = 5w_{c}.$$
However, as described in Section 2.2, when there are internal disturbances, $B = [0 \;\; b \;\; 0]^{T}$ in Formula (3) turns into $B_{0} = [0 \;\; b_{0} \;\; 0]^{T}$. The difference between $b$ and $b_{0}$ reduces the robustness of the system. The observer in ADRC allows the total disturbance to be observed, which means that $z_{3} \to x_{3} + (b_{0} - b)u$. The internal disturbance can then be compensated for.
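Putting the ESO estimate, the PD law $u_{0}$, and the disturbance compensation together gives the complete control update. The sketch below assumes the relation $w_{o} = 5w_{c}$ and the hypothetical `eso_step` helper above; it illustrates the linear ADRC law, not the authors' implementation.

```python
def adrc_control(z, r, w_c, b, xi=1.0):
    """u = (u0 - z3) / b with u0 = k_p (r - z1) + k_d (r_dot - z2), taking r_dot = 0."""
    k_p = w_c**2                 # k_p = w_c^2
    k_d = 2.0 * xi * w_c         # k_d = 2 xi w_c
    u0 = k_p * (r - z[0]) - k_d * z[1]
    return (u0 - z[2]) / b       # subtracting z3 compensates the total disturbance

# Hypothetical closed-loop usage with the eso_step helper sketched above:
# z = eso_step(z, y_measured, u, b, w_o=5.0 * w_c)
# u = adrc_control(z, r, w_c, b)
```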
In practice, a step command has a large impact on the system. To balance increasing the rapidity of the system against reducing the overshoot, a tracking differentiator (TD) can be adopted to track the desired signal; a smooth tracking signal is thus obtained and further used in the controller. In this paper, a standard second-order system is designed,
$$G_{TD} = \frac{w_{n}^{2}}{s^{2} + 2\xi w_{n}s + w_{n}^{2}},$$
where $w_{n} = 20$ is the natural frequency and $\xi = 1$ is the damping ratio.
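A sketch of this second-order TD, discretized with forward Euler under the stated $w_{n} = 20$ and $\xi = 1$; the sample time is an assumption. The first state tracks the command and the second approximates its derivative.

```python
def td_step(v1, v2, r, w_n=20.0, xi=1.0, dt=1e-3):
    """Forward-Euler step of G_TD = w_n^2 / (s^2 + 2 xi w_n s + w_n^2).
    v1 tracks the command r; v2 approximates its derivative."""
    dv2 = w_n**2 * (r - v1) - 2.0 * xi * w_n * v2
    return v1 + dt * v2, v2 + dt * dv2
```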

3.2. Reinforcement Learning Theory

In this paper, the attitude control is regarded as a Markov decision process (MDP), which can be modeled as $(S, A, T, R, \gamma)$, where $S$ represents the state space, $A$ is the action space, $T(s_{t+1} = s' \mid s_t = s, a_t = a)$ is the state transition model, $R: S \times A \to \mathbb{R}$ is the reward function, and $\gamma$ is the discount factor. In an MDP, at every timestep the agent in state $s_t$ takes action $a_t$, receives reward $r_t$, and transitions to state $s_{t+1}$. A generic flowchart of the process is shown in Figure 3.
RL addresses how an agent can maximize its rewards in a complex and uncertain environment. The goal is to learn an optimal policy $\pi^{*}$ that, in every state, enables the agent to obtain the maximum discounted return $G_t = \sum_{i=t}^{T} \gamma^{\,i-t} r(s_i, a_i)$, $\gamma \in [0,1]$. The action-value function is called the Q function and can be written using the Bellman equation:
$$Q^{\pi}(s_t, a_t) = \mathbb{E}_{r_t, s_{t+1} \sim E}\big[\, r(s_t, a_t) + \gamma\, \mathbb{E}_{a_{t+1} \sim \pi}\big[ Q^{\pi}(s_{t+1}, a_{t+1}) \big] \,\big],$$
where the policy $\pi$ maps a state $s_t$ to an action $a_t$; the Q function can be learned by an off-policy algorithm called Q-learning [31].
The strategy used in this paper is the DDPG algorithm, which is an extension of the deep Q network (DQN). A model-free algorithm that is able to operate over continuous action spaces has previously been presented in [32] based on the deterministic policy gradient. The structure is shown in Figure 4.
Such a structure is called Actor–Critic. The policy network, called the Actor, outputs actions based on states, $a = g_{\vartheta}(s)$. The Q network, called the Critic, outputs the action value $q_{w}(s, a)$, and a replay buffer is used to eliminate correlations between inputs. Compared with the Actor network, the Critic network usually has a more complex structure in order to infer the underlying state from the measurements and handle the state transition [33].
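As a rough sketch of this Actor–Critic structure (layer widths and activations are assumptions, not the authors' architecture), the two networks can be defined in PyTorch as follows; the actor's tanh output is later rescaled to the physical parameter ranges.

```python
import torch
import torch.nn as nn

class Actor(nn.Module):
    """Policy network g_theta: maps the 2-D state to a continuous action in [-1, 1]^2."""
    def __init__(self, state_dim=2, action_dim=2, hidden=64):
        super().__init__()
        self.net = nn.Sequential(nn.Linear(state_dim, hidden), nn.ReLU(),
                                 nn.Linear(hidden, hidden), nn.ReLU(),
                                 nn.Linear(hidden, action_dim), nn.Tanh())

    def forward(self, s):
        return self.net(s)

class Critic(nn.Module):
    """Q network q_w: maps a (state, action) pair to a scalar action value."""
    def __init__(self, state_dim=2, action_dim=2, hidden=128):
        super().__init__()
        self.net = nn.Sequential(nn.Linear(state_dim + action_dim, hidden), nn.ReLU(),
                                 nn.Linear(hidden, hidden), nn.ReLU(),
                                 nn.Linear(hidden, 1))

    def forward(self, s, a):
        return self.net(torch.cat([s, a], dim=-1))
```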

3.3. Structure of DDPG-Based ADRC

As a feedback-based controller, the inputs of DDPG-ADRC include the control command and the tracking error. The outputs of DDPG are then used to update the parameters of the ADRC, namely, the estimate of the system parameter $b$ and the controller bandwidth $w_{c}$. On one hand, the parameter $b$ reflects the gain from the input to the output of the system, which is determined by the system parameters. On the other hand, to compensate for the total disturbance, the term $z_{3}/b$ enters the PD control law, which means that $b$ affects the compensation of disturbances. Meanwhile, the controller bandwidth $w_{c}$ directly determines the performance of the PD controller, and the observer bandwidth $w_{o}$ determines the performance of the ESO, where $w_{o} = 5w_{c}$. The overall structure of the proposed DDPG-ADRC controller is shown in Figure 5.
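The DDPG action comes from a tanh output layer and must be rescaled to physically meaningful values of $w_{c}$ and $b$. The sketch below is one possible mapping; the parameter ranges are assumptions chosen around the steady-state values reported in Section 4.1 ($w_{c} \approx 22.2$, $b \approx 12$), not ranges given by the authors.

```python
def action_to_params(a, w_c_range=(5.0, 40.0), b_range=(6.0, 20.0)):
    """Map a tanh action a in [-1, 1]^2 to (w_c, b); w_o then follows as 5 * w_c."""
    def scale(x, lo, hi):
        return lo + 0.5 * (x + 1.0) * (hi - lo)
    w_c = scale(a[0], *w_c_range)
    b = scale(a[1], *b_range)
    return w_c, b, 5.0 * w_c
```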
The reward function is a key element in RL, guiding the agent toward the optimal policy. To alleviate the training difficulty caused by sparse rewards, the reward function is designed as follows:
$$R_{1} = -\left[(\bar{\phi} - \phi)^{2} + (\bar{\theta} - \theta)^{2} + (\bar{\psi} - \psi)^{2}\right].$$
To solve the problem of slow convergence under small errors, step rewards are designed.
$$R_{2} = \begin{cases} R_{1} + 5, & \text{if } |\bar{\phi} - \phi|,\, |\bar{\theta} - \theta|,\, |\bar{\psi} - \psi| \le 0.5 \\ R_{1} + 3, & \text{else if } |\bar{\phi} - \phi|,\, |\bar{\theta} - \theta|,\, |\bar{\psi} - \psi| \le 1 \end{cases}$$
At the same time, a sparse penalty is considered. When the attitude of the agent is too far from the target, the current training episode is terminated early. To reduce ineffective exploration, a large penalty is introduced. Thus, the total reward is
$$R = \begin{cases} R_{2} - 1000, & \text{if the episode is terminated early} \\ R_{2}, & \text{otherwise.} \end{cases}$$
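The shaped reward above translates directly into code. This sketch uses the stated thresholds and treats early termination as a flag supplied by the environment; falling back to $R_{1}$ when neither threshold is met is an assumption, since that branch is not spelled out in the text.

```python
def reward(err_roll, err_pitch, err_yaw, terminated_early=False):
    """Shaped reward: negative squared tracking error (R1), step bonuses (R2),
    and a large penalty when the episode is terminated early."""
    errs = (abs(err_roll), abs(err_pitch), abs(err_yaw))
    r = -sum(e**2 for e in errs)             # R1
    if all(e <= 0.5 for e in errs):
        r += 5.0                             # R2 = R1 + 5
    elif all(e <= 1.0 for e in errs):
        r += 3.0                             # R2 = R1 + 3
    # Otherwise R2 = R1 (assumed; this branch is not spelled out in the text).
    return r - 1000.0 if terminated_early else r
```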
In summary, the algorithm flow is given in Algorithm 1. The state input of DDPG is a two-dimensional vector, namely, the control command $\bar{\theta}$ and the tracking error $e$. The action output is likewise two-dimensional, consisting of $b$ and $w_{c}$.
Algorithm 1 DDPG-based ADRC controller
  • Randomly initialize the Q network $q_{w}$ and the policy network $g_{\vartheta}$
  • Initialize the target network parameters $\bar{w} \leftarrow w$ and $\bar{\vartheta} \leftarrow \vartheta$
  • Initialize the experience pool $D$
  • for $episode = 1, 2, \ldots, N$ do
  •    Randomly initialize the control command and the initial state
  •    for $i = 1, 2, \ldots, T$ do
  •      Obtain the state $s_{t}$
  •      Select the action based on the current state and exploration noise: $a_{t} = g_{\vartheta}(s_{t}) + \xi_{i}$
  •      Perform the action $a_{t}$, observe the reward $r_{t}$, and obtain the next state $s_{t+1}$
  •      Store the sample $(s_{t}, a_{t}, r_{t}, s_{t+1})$ in the experience pool $D$
  •      Sample a random mini-batch of $(s_{t}, a_{t}, r_{t}, s_{t+1})$ from $D$
  •      Optimize the critic network parameters $w$:
  •        $Loss = \mathrm{MSE}\big[\, q_{w}(s_{t}, a_{t}),\; r_{t} + \gamma\, q_{\bar{w}}(s_{t+1}, g_{\bar{\vartheta}}(s_{t+1})) \,\big]$
  •      Optimize the actor network parameters $\vartheta$:
  •        $Loss = -q_{w}(s_{t}, g_{\vartheta}(s_{t}))$
  •      Every $C$ steps, update $\bar{w}$ and $\bar{\vartheta}$:
  •        $\bar{w} = \tau w + (1 - \tau)\bar{w}, \quad \bar{\vartheta} = \tau\vartheta + (1 - \tau)\bar{\vartheta}$
  •    end for
  • end for
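The critic loss, actor loss, and soft target update in Algorithm 1 can be written compactly. This sketch assumes the PyTorch networks sketched in Section 3.2 and illustrative values of $\gamma$ and $\tau$; it is a generic DDPG update step rather than the authors' exact training code.

```python
import torch
import torch.nn.functional as F

def ddpg_update(batch, actor, critic, actor_t, critic_t,
                actor_opt, critic_opt, gamma=0.99, tau=0.005):
    """One DDPG update: critic regression to the bootstrapped target, actor ascent
    on q_w(s, g_theta(s)), then soft update of the target networks."""
    s, a, r, s_next = batch                            # mini-batch tensors from the replay pool

    # Critic loss: MSE[q_w(s, a), r + gamma * q_wbar(s', g_thetabar(s'))]
    with torch.no_grad():
        target = r.view(-1, 1) + gamma * critic_t(s_next, actor_t(s_next))
    critic_loss = F.mse_loss(critic(s, a), target)
    critic_opt.zero_grad(); critic_loss.backward(); critic_opt.step()

    # Actor loss: -q_w(s, g_theta(s)), i.e. gradient ascent on the action value
    actor_loss = -critic(s, actor(s)).mean()
    actor_opt.zero_grad(); actor_loss.backward(); actor_opt.step()

    # Soft target update: wbar = tau * w + (1 - tau) * wbar (and likewise for the actor)
    for target_net, net in ((critic_t, critic), (actor_t, actor)):
        for pt, p in zip(target_net.parameters(), net.parameters()):
            pt.data.mul_(1.0 - tau).add_(tau * p.data)
```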

4. Simulation and Results

To verify the effectiveness of the proposed controller, simulations with ideal conditions and under conditions of internal and external disturbance are presented. The parameters of the quadrotor are shown in Table 1.
An Intel Xeon(R) W-2123 CPU @ 3.60 GHz, an NVIDIA GeForce RTX 1080 Ti GPU, and Windows 10 64-bit were used in the experiments. To evaluate the performance of the proposed method, several common evaluation indicators were adopted: the integral of time-weighted absolute error (ITAE), the integral of time-weighted squared error (ITSE), and the integral of absolute error (IAE).
$$\mathrm{ITAE} = \int_{t_{0}}^{t_{f}} t\,|e(t)|\,dt, \qquad \mathrm{ITSE} = \int_{t_{0}}^{t_{f}} t\,e^{2}(t)\,dt, \qquad \mathrm{IAE} = \int_{t_{0}}^{t_{f}} |e(t)|\,dt$$
These indicators take into account both the control accuracy and convergence speed; smaller values indicate better controller performance.
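The three indicators reduce to numerical integrals over the sampled tracking error; a short sketch using the trapezoidal rule (the integration rule and uniform sampling are assumptions).

```python
import numpy as np

def tracking_indicators(t, e):
    """ITAE, ITSE, and IAE from sampled time t and tracking error e (same length)."""
    t = np.asarray(t, dtype=float)
    e = np.abs(np.asarray(e, dtype=float))

    def trapz(y):
        # Trapezoidal rule, written out to avoid NumPy version differences.
        dt = np.diff(t)
        return float(np.sum(0.5 * dt * (y[1:] + y[:-1])))

    return trapz(t * e), trapz(t * e**2), trapz(e)
```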

4.1. Simulations in the Presence of Internal Disturbances

In order to verify the effectiveness of the proposed DDPG-ADRC method, simulations under internal disturbance conditions are presented. The pitch channel of the quadrotor tracks a step command of 10°, which is reduced to 8° two seconds later. The response is shown in Figure 6a, and the outputs of RL, i.e., the controller bandwidth $w_{c}$ and the estimate of the system parameter $b$, are displayed in Figure 6b.
It can be seen that with DDPG-ADRC the quadrotor can accurately track the instruction. In addition, the controller bandwidth w c and the system parameter b can be adaptively adjusted according to the observations.
In order to demonstrate the advantages of dynamic parameter adjustment, the steady-state results are used as fixed parameters, i.e., $w_{c} = 22.2$ and $b = 12$. Figure 7a shows the response of the system, and Figure 7b reveals the differences between traditional ADRC and DDPG-ADRC in compensating for the total disturbance.
By dynamically adjusting parameters, DDPG-ADRC can compensate for disturbances more accurately and quickly, which is the advantage of DDPG-ADRC compared to traditional ADRC.
In order to explore the influence of parameter uncertainty on controllers, simulations were carried out with different parameter estimates b; the results are shown in Figure 8a, and are compared with model predictive control (MPC), shown in Figure 8b. In the design of the MPC controller, the same second-order system with a gain b is used. ITAE, ITSE, and IAE are adopted to evaluate the tracking process, and are shown in Table 2.
It can be seen from Table 2 and Figure 8 that, under nominal conditions, all three controllers achieve a satisfactory control effect. Meanwhile, when appropriate parameters are selected, MPC gives the best control effect, with lower ITAE, ITSE, and IAE values and higher rewards. However, MPC is less robust against parameter uncertainty than ADRC.

4.2. Simulations in the Presence of External Disturbances

In order to verify the performance of the proposed controller in the face of external disturbances, a disturbance torque caused by a wind gust was added between 1 s and 1.5 s of simulation time. Figure 9a shows the control command and the response, and the action of RL is displayed in Figure 9b.
It can be seen that the designed DDPG-ADRC can respond in time when faced with external disturbances. The performance is compared with the traditional ADRC in Figure 10. The evaluation indicators are shown in Table 3.
Comparing Table 3 with the internal-disturbance case, external disturbances have a greater effect on controller performance, although both traditional ADRC and DDPG-ADRC can counteract the disturbances in time. Similarly, the performance of DDPG-ADRC is more prominent in terms of both control accuracy and convergence speed. Under the ITSE indicator, DDPG-ADRC is improved by 10.4% compared to ADRC in the presence of external disturbances. This means that DDPG-ADRC can achieve better performance than ADRC with fixed parameters in practice, which is also demonstrated in Figure 11a. Although MPC has better control performance under nominal conditions, it diverges when there are large external disturbances, as Figure 11b shows.

4.3. Simulation under Sine Sweep

When designing a control system, it is necessary to carry out frequency sweep experiments in order to know the response of the system to commands of different frequencies. A sine sweep is often used to measure the time-frequency characteristics of the system. Figure 12 shows the control command and response, while the evaluation indicators are shown in Table 4.
It can be seen from Figure 12 and Table 4 that parameter uncertainty affects the control effect of MPC; generally speaking, ADRC has better performance and lower phase delay in the high-frequency band than MPC. Meanwhile, from the point of view of the indicators, DDPG-ADRC has stronger tracking ability thanks to the adaptive adjustment of the compensation.

5. Conclusions

In this paper, a novel DDPG-based ADRC is proposed for the attitude control of quadrotors. First, a nonlinear mathematical model of quadrotors with internal disturbance and external disturbance is established. Then, by properly setting the reward function, online continuous adjustment of the bandwidth is realized to balance the robustness and transient tracking performance. Meanwhile, fast and accurate compensation for the total disturbance is achieved, further improving the response speed and control accuracy. Simulation results show that DDPG-ADRC has advantages on all indicators; in other words, it has advantages in terms of both control accuracy and convergence speed. This paper provides a new solution to the attitude control of quadrotors in the presence of disturbances. In the future, the proposed controller will be used to conduct hardware-in-the-loop simulation experiments to further verify the stability of the algorithm. However, the gap between the simulation and the real world presents additional challenges, such as the oscillation of the controller [34].

Author Contributions

Conceptualization, J.S. and K.Z.; methodology, K.Z. and Y.H.; software, K.Z.; validation, K.Z., X.X. and Y.L.; formal analysis, J.S. and X.X.; investigation, K.Z. and Y.H.; resources, K.Z. and Y.L.; data curation, J.S.; writing—original draft preparation, K.Z.; writing—review and editing, K.Z.; visualization, K.Z.; supervision, J.S.; project administration, J.S.; funding acquisition, J.S. All authors have read and agreed to the published version of the manuscript.

Funding

This work was supported by the National Natural Science Foundation of China under Grants 61473015, 91646108, and 62073020.

Institutional Review Board Statement

Not applicable.

Informed Consent Statement

Not applicable.

Data Availability Statement

Not applicable.

Acknowledgments

The authors thank their colleagues for their constructive suggestions and research assistance throughout this study.

Conflicts of Interest

The authors declare no conflict of interest.

Abbreviations

The following abbreviations are used in this manuscript:
UAV: unmanned aerial vehicle
PID: proportion integration differentiation
PD: proportion differentiation
SMC: sliding mode control
ADRC: active disturbance rejection control
EMC: embedded model control
GA: genetic algorithm
PSO: particle swarm optimization
GWO: grey wolf optimization
RL: reinforcement learning
DDPG: deep deterministic policy gradient
ESO: extended state observer
TD: tracking differentiator
MDP: Markov decision process
DQN: deep Q network
ITAE: integral of time-weighted absolute error
ITSE: integral of time-weighted squared error
IAE: integral of absolute error
MPC: model predictive control

References

  1. Tian, B.; Liu, L.; Lu, H.; Zuo, Z.; Zong, Q.; Zhang, Y. Multivariable finite time attitude control for quadrotor UAV: Theory and experimentation. IEEE Trans. Ind. Electron. 2017, 65, 2567–2577. [Google Scholar] [CrossRef]
  2. Liu, H.; Zhao, W.; Zuo, Z.; Zhong, Y. Robust control for quadrotors with multiple time-varying uncertainties and delays. IEEE Trans. Ind. Electron. 2016, 64, 1303–1312. [Google Scholar] [CrossRef]
  3. Hoffmann, G.M.; Huang, H.; Waslander, S.L.; Tomlin, C.J. Precision flight control for a multi-vehicle quadrotor helicopter testbed. Control Eng. Pract. 2011, 19, 1023–1036. [Google Scholar] [CrossRef]
  4. Mahony, R.; Kumar, V.; Corke, P. Multirotor aerial vehicles: Modeling, estimation, and control of quadrotor. IEEE Robot. Autom. Mag. 2012, 19, 20–32. [Google Scholar] [CrossRef]
  5. Pounds, P.; Mahony, R.; Corke, P. Modelling and control of a large quadrotor robot. Control Eng. Pract. 2010, 18, 691–699. [Google Scholar] [CrossRef] [Green Version]
  6. Tayebi, A.; McGilvray, S. Attitude stabilization of a VTOL quadrotor aircraft. IEEE Trans. Control Syst. Technol. 2006, 14, 562–571. [Google Scholar] [CrossRef] [Green Version]
  7. Cao, N.; Lynch, A.F. Inner–outer loop control for quadrotor UAVs with input and state constraints. IEEE Trans. Control Syst. Technol. 2015, 24, 1797–1804. [Google Scholar] [CrossRef]
  8. Zheng, E.H.; Xiong, J.J.; Luo, J.L. Second order sliding mode control for a quadrotor UAV. ISA Trans. 2014, 53, 1350–1356. [Google Scholar] [CrossRef]
  9. Xiong, J.J.; Zheng, E.H. Position and attitude tracking control for a quadrotor UAV. ISA Trans. 2014, 53, 725–731. [Google Scholar] [CrossRef] [PubMed]
  10. Zames, G.; Francis, B. Feedback, minimax sensitivity, and optimal robustness. IEEE Trans. Autom. Control 1983, 28, 585–601. [Google Scholar] [CrossRef]
  11. Babar, M.; Ali, S.; Shah, M.; Samar, R.; Bhatti, A.; Afzal, W. Robust control of UAVs using H∞ control paradigm. In Proceedings of the 2013 IEEE 9th International Conference on Emerging Technologies (ICET), Islamabad, Pakistan, 9–10 December 2013; IEEE: New York, NY, USA, 2013; pp. 1–5. [Google Scholar]
  12. Liu, H.; Ma, T.; Lewis, F.L.; Wan, Y. Robust formation control for multiple quadrotors with nonlinearities and disturbances. IEEE Trans. Cybern. 2018, 50, 1362–1371. [Google Scholar] [CrossRef] [PubMed]
  13. Song, J.; Zhao, M.; Gao, K.; Su, J. Error Analysis of ADRC Linear Extended State Observer for the System with Measurement Noise. IFAC-PapersOnLine 2020, 53, 1306–1312. [Google Scholar] [CrossRef]
  14. Gao, Z. Scaling and bandwidth-parameterization based controller tuning. In Proceedings of the 2003 American Control Conference, Denver, CO, USA, 4–6 June 2003; pp. 4989–4996. [Google Scholar]
  15. Niu, T.; Xiong, H.; Zhao, S. Based on ADRC UAV longitudinal pitching Angle control research. In Proceedings of the 2016 IEEE Information Technology, Networking, Electronic and Automation Control Conference, Chongqing, China, 20–22 May 2016; IEEE: New York, NY, USA, 2016; pp. 21–25. [Google Scholar]
  16. Lotufo, M.A.; Colangelo, L.; Perez-Montenegro, C.; Canuto, E.; Novara, C. UAV quadrotor attitude control: An ADRC-EMC combined approach. Control Eng. Pract. 2019, 84, 13–22. [Google Scholar] [CrossRef]
  17. Zuo, Z.; Liu, C.; Han, Q.L.; Song, J. Unmanned aerial vehicles: Control methods and future challenges. IEEE/CAA J. Autom. Sin. 2022, 9, 601–614. [Google Scholar] [CrossRef]
  18. Wang, X.; Van Kampen, E.J.; Chu, Q.; Lu, P. Stability analysis for incremental nonlinear dynamic inversion control. J. Guid. Control Dyn. 2019, 42, 1116–1129. [Google Scholar] [CrossRef] [Green Version]
  19. Mudi, J.; Shiva, C.K.; Mukherjee, V. Multi-verse optimization algorithm for LFC of power system with imposed nonlinearities using three-degree-of-freedom PID controller. Iran. J. Sci. Technol. Trans. Electr. Eng. 2019, 43, 837–856. [Google Scholar] [CrossRef]
  20. Dubey, B.K.; Singh, N.; Bhambri, S. Optimization of PID controller parameters using PSO for two area load frequency control. IAES Int. J. Robot. Autom. 2019, 8, 256. [Google Scholar]
  21. Debnath, M.K.; Jena, T.; Sanyal, S.K. Frequency control analysis with PID-fuzzy-PID hybrid controller tuned by modified GWO technique. Int. Trans. Electr. Energy Syst. 2019, 29, e12074. [Google Scholar] [CrossRef]
  22. Bolandi, H.; Rezaei, M.; Mohsenipour, R.; Nemati, H.; Smailzadeh, S.M. Attitude control of a quadrotor with optimized PID controller. Intell. Control Autom. 2013, 4, 335–342. [Google Scholar]
  23. Koch, W.; Mancuso, R.; West, R.; Bestavros, A. Reinforcement learning for UAV attitude control. ACM Trans. Cyber-Phys. Syst. 2019, 3, 1–21. [Google Scholar] [CrossRef] [Green Version]
  24. Lee, D.; Lee, S.J.; Yim, S.C. Reinforcement learning-based adaptive PID controller for DPS. Ocean. Eng. 2020, 216, 108053. [Google Scholar] [CrossRef]
  25. Gheisarnejad, M.; Khooban, M.H. An intelligent non-integer PID controller-based deep reinforcement learning: Implementation and experimental results. IEEE Trans. Ind. Electron. 2020, 68, 3609–3618. [Google Scholar] [CrossRef]
  26. Zhao, W.; Liu, H.; Wan, Y. Data-driven fault-tolerant formation control for nonlinear quadrotors under multiple simultaneous actuator faults. Syst. Control Lett. 2021, 158, 105063. [Google Scholar] [CrossRef]
  27. Zheng, Y.; Chen, Z.; Huang, Z.; Sun, M.; Sun, Q. Active disturbance rejection controller for multi-area interconnected power system based on reinforcement learning. Neurocomputing 2021, 425, 149–159. [Google Scholar] [CrossRef]
  28. Ma, J.; Peng, C. Adaptive model-free fault-tolerant control based on integral reinforcement learning for a highly flexible aircraft with actuator faults. Aerosp. Sci. Technol. 2021, 119, 107204. [Google Scholar] [CrossRef]
  29. Li, H.; Wu, Y.; Chen, M. Adaptive fault-tolerant tracking control for discrete-time multiagent systems via reinforcement learning algorithm. IEEE Trans. Cybern. 2020, 51, 1163–1174. [Google Scholar] [CrossRef] [PubMed]
  30. Gao, K.; Song, J.; Yang, E. Stability analysis of the high-order nonlinear extended state observers for a class of nonlinear control systems. Trans. Inst. Meas. Control 2019, 41, 4370–4379. [Google Scholar] [CrossRef] [Green Version]
  31. Watkins, C.J.; Dayan, P. Q-learning. Mach. Learn. 1992, 8, 279–292. [Google Scholar] [CrossRef]
  32. Lillicrap, T.P.; Hunt, J.J.; Pritzel, A.; Heess, N.; Erez, T.; Tassa, Y.; Silver, D.; Wierstra, D. Continuous control with deep reinforcement learning. arXiv 2015, arXiv:1509.02971. [Google Scholar]
  33. Degrave, J.; Felici, F.; Buchli, J.; Neunert, M.; Tracey, B.; Carpanese, F.; Ewalds, T.; Hafner, R.; Abdolmaleki, A.; de Las Casas, D.; et al. Magnetic control of tokamak plasmas through deep reinforcement learning. Nature 2022, 602, 414–419. [Google Scholar] [CrossRef] [PubMed]
  34. Wada, D.; Araujo-Estrada, S.A.; Windsor, S. Unmanned aerial vehicle pitch control under delay using deep reinforcement learning with continuous action in wind tunnel test. Aerospace 2021, 8, 258. [Google Scholar] [CrossRef]
Figure 1. Schematic of the quadrotor.
Figure 3. Learning process of RL.
Figure 4. The structure of DDPG.
Figure 5. Overall structure of the proposed DDPG-based ADRC controller.
Figure 6. (a) Control instruction and state response via DDPG-ADRC and (b) online parameter adjustment.
Figure 7. (a) Control instruction and state response via traditional ADRC and (b) comparison of the compensation amount $z_{3}/b$ between traditional ADRC and DDPG-ADRC.
Figure 8. (a) Control instruction and state response via ADRC with different b and (b) control instruction and state response via MPC with different b.
Figure 9. (a) Control instruction and state response via DDPG-ADRC in the presence of external disturbances and (b) online parameter adjustment in the presence of external disturbances.
Figure 10. Comparison of compensation between traditional ADRC and DDPG-ADRC in the presence of external disturbances.
Figure 11. (a) Control instruction and state response via ADRC with different b in the presence of external disturbances; (b) control instruction and state response via MPC with different b in the presence of external disturbances.
Figure 12. (a) Control instruction and state response via ADRC and DDPG-ADRC under sine sweep; (b) control instruction and state response via MPC with different b under sine sweep.
Table 1. Quadrotor model parameters.
Variable | Value | Unit
mass | m = 1.4 | kg
acceleration of gravity | g = 9.8 | m/s^2
moments of inertia J_xx and J_yy | J_xx = J_yy = 0.01724 | kg·m^2
radius of the quadrotor | r = 0.24 | m
thrust coefficient C_T = T_p / w^2 | C_T = 1.227 × 10^-5 | N/(rad/s)^2
moment coefficient C_M = M_p / w^2 | C_M = 2.215 × 10^-7 | N·m/(rad/s)^2
moment of inertia of motor and propeller J_r | J_r = 2.13 × 10^-4 | kg·m^2
Table 2. Evaluation indicators of MPC, traditional ADRC, and DDPG-ADRC.
Indicator | MPC (b = 12) | MPC (b = 13) | MPC (b = 14) | MPC (b = 15) | ADRC (b = 12) | ADRC (b = 13) | ADRC (b = 14) | ADRC (b = 15) | DDPG-ADRC
ITAE | 9.5016 | 13.047 | 10.629 | 14.049 | 9.4512 | 9.4512 | 9.4512 | 9.4512 | 9.357
ITSE | 0.0664 | 0.2393 | 0.2504 | 0.2781 | 0.19384 | 0.19337 | 0.19293 | 0.19252 | 0.1848
IAE | 13.895 | 22.555 | 20.976 | 24.614 | 18.868 | 18.868 | 18.868 | 18.868 | 18.273
Rewards | 24,470 | 22,230 | 22,010 | 21,459 | 21,943 | 21,949 | 21,956 | 21,966 | 22,010
Table 3. Evaluation indicators of traditional ADRC and DDPG-ADRC in the presence of disturbances.
Evaluation Indicator | Traditional ADRC | DDPG-ADRC
ITAE | 11.971 | 11.87
ITSE | 0.2064 | 0.1848
IAE | 20.73 | 20.131
Total rewards | 21,829 | 21,904
Table 4. Evaluation indicators of MPC, ADRC, and DDPG-ADRC.
Indicator | MPC (b = 12) | MPC (b = 13) | MPC (b = 14) | MPC (b = 15) | ADRC (b = 12) | ADRC (b = 13) | ADRC (b = 14) | ADRC (b = 15) | DDPG-ADRC
ITAE | 375.85 | 361.27 | 351.32 | 344.28 | 288.38 | 288.62 | 288.89 | 289.18 | 286.3
ITSE | 41.985 | 38.145 | 35.623 | 33.911 | 23.772 | 23.811 | 23.856 | 23.905 | 23.456
IAE | 192.85 | 188.61 | 186.37 | 185.33 | 155.2 | 155.25 | 155.31 | 155.38 | 154.25
Rewards | −19,203 | −18,821 | −18,513 | −18,267 | −16,996 | −17,027 | −17,071 | −17,104 | −16,946