1. Introduction
Quadrotor unmanned aerial vehicles (UAVs) have shown great potential for both civil and military applications because of their low cost, small volume, simple structure, flexible operation, and vertical take-off and landing capability [1,2]. However, quadrotor UAVs present control challenges because of their high-dimensional nonlinear dynamics, the difficulty of measuring system parameters, and ever-changing noise and disturbances. Various control algorithms have been tested on quadrotor UAVs, among which the classic proportional-integral-derivative (PID) controller is the most widely used. For example, paper [3] presents a quadrotor UAV attitude control method that realizes complex acrobatic maneuvers using a double-ring PID algorithm and a Smith predictor. In paper [4], quadrotor attitude and position controllers are designed using a rotation-matrix-based PID algorithm. The linear quadratic regulator (LQR) algorithm has also been used on quadrotor UAVs with some success [5]. Researchers compared the performance of a Qball-X4 quadrotor under PID and LQR controllers and found that the LQR tracking trajectory is smoother and less disturbed [6]. The LQR algorithm has also been applied successfully to a morphing quadrotor [7]. Nonlinear control algorithms, such as back-stepping (BS), sliding mode control, and H-infinity control, have been applied to quadrotors. Researchers propose a back-stepping controller running in parallel with a sliding mode observer for a quadrotor UAV [8,9]. Paper [10] presents quadrotor altitude control with sliding mode control, and paper [11] designs attitude and position controllers using sliding mode control. H-infinity control is another optimal control algorithm that provides a robust, stable closed-loop system with high-performance tracking and disturbance rejection. Paper [12] presents an H-infinity-style quadrotor attitude controller in which the system transfer function is identified from input and output experimental data. In paper [13], an H-infinity observer is designed to compensate for the effects of actuator failures, sensor failures, and external disturbances. However, these algorithms rely heavily on model accuracy, which severely limits their application to real quadrotor platforms.
The active disturbance rejection control (ADRC) technique builds on the traditional PID: a tracking differentiator (TD) arranges the transition process and extracts the differential signals, and an extended state observer (ESO) estimates the state of the controlled object and the unknown disturbances. A nonlinear state error feedback (NLSEF) law is formed by a nonlinear combination of the results from the TD and the ESO, including compensation for external disturbances if needed [14]. Both external disturbances and model errors can be estimated using ESOs. The critical gain parameter (CGP) is the gain in the input channel that characterizes the proportional relation between the input signal and the derivative of the process output of a specific order [15]. In this paper, the CGP is the proportional relation between the control signal and the angular acceleration of each of the three axes. Studies have shown that ADRC can stably control a quadrotor in the presence of external disturbances [16,17,18].
Acceptable control performance has been demonstrated for the control systems proposed by previous researchers. However, model-based control algorithms, such as LQR and BS, rely heavily on model accuracy, which hinders their application to real quadrotor platforms, while the parameters of model-free algorithms, such as PID and ADRC, are difficult to tune. Taking the traditional double-ring PID algorithm as an example, typically three or more parameters must be tuned for each axis, and the tuning process can be time consuming and dangerous. For quadrotor systems, automatic tuning algorithms, such as the genetic algorithm (GA) and particle swarm optimization (PSO), can be tested in simulation environments, but it is unsafe to automatically explore control parameters during real flight experiments. In this paper, we present a reinforcement learning (RL) method to achieve single-parameter-tuned attitude control, thus reducing the difficulty of multi-parameter tuning in PID-based algorithms.
RL is an automated self-optimization method that allows agents to explore the environment and collect experience to improve performance. Agents output actions and receive rewards during the exploration process, and the controller, referred to as the policy in this framework, constantly optimizes itself based on experience to obtain higher rewards. The process can produce optimal, or at least locally optimal, controllers, which is sufficient for solving complex control problems in many cases. RL algorithms have been introduced into the UAV control area with remarkable achievements [19,20]. Andrew Ng et al. used the RL method to design a flight controller for the Stanford autonomous helicopter [21]. A data-driven approach was used to determine the system model; that is, flight data were collected and fitted using a locally weighted linear regression algorithm. These approaches are effective for some specific experimental platforms, but they cannot be directly transferred to other types of platforms, and measuring system parameters or identifying system models can be difficult and expensive. An alternative approach is to train a policy that is sufficiently robust to neglect model errors as well as potential disturbances. For example, computer-aided design (CAD) has been used to obtain the parameters of a quadrupedal robot in [22], and noise and disturbances can be added to either the system models or the sensor data during the training process. We tested this method on a quadrotor in our previous study, and the trained policy was effective for platforms with different system parameters in a simulation environment. However, the performance on each platform may be less ideal than that of a fully tuned PID controller.
In this paper, a new method is presented to achieve simply tuned control with RL: we extract the core difference between platforms as critical system parameters, which are then included in the state list during the RL training process. This method allows the trained controller to be applied to different platforms, and excellent performance can be achieved with simple parameter tuning. The critical system parameters of a quadrotor UAV system include the moment of inertia, the lift coefficient, the torque coefficient, etc. We condense these parameters into the CGPs of the three axes, which is a key step in the ADRC algorithm. This method allows the RL controller to be applied to different quadrotors, where only one parameter, the CGP, needs to be tuned for each axis.
In many previous studies, RL controllers have exhibited excellent performance, including a fast convergence rate with hardly any overshoot, but they have failed to eliminate steady-state errors. These errors remain, irrespective of how the hyperparameters are set and the types of exploration strategies applied [23]. In [23], integral compensators are added to the quadrotor system to eliminate steady-state errors, but our simulation results show that this method may reduce the final performance. Thus, we develop novel methods, namely reward function shaping and mirror sampling, to solve the steady-state error problem incurred by RL algorithms. Ideal results with nearly no steady-state error are obtained, and the training process is considerably accelerated.
External disturbances to the quadrotor are compensated for by designing a set of ESOs to estimate the total disturbance of the roll and pitch axes, where the total controller output is the combination of the RL controller output and the ESO feedback. However, it is difficult to tune the parameters of an ESO. Inappropriate ESO parameters cause high-frequency vibration problems in both simulation and real flight experiments, which may reduce the control quality. In this paper, an efficient and powerful metaheuristic algorithm, referred to as the covariance matrix adaptation evolution strategy (CMA-ES), is introduced to automatically adjust the ESO parameters. The CMA-ES is a stochastic, derivative-free numerical optimization algorithm with a superior efficiency and success rate compared to the GA and PSO [24].
In this paper, the CMA-ES algorithm is designed to reduce the error between the ESO output and the real disturbance. Disturbance measurement is challenging. We meet this challenge by designing a specialized test: the quadrotor is placed on the ground in the absence of an external disturbance, and control signals are sent to the controller; however, the motors are not actuated and remain still during the test. In this situation, from the perspective of the controller, a disturbance compensates for the control signals to keep the quadrotor still, such that the value of the disturbance is simply the negative of the current control signals multiplied by the CGP. This test avoids problems such as safety risks, computational costs, and battery capacity limitations, and it allows the CMA-ES algorithm to run directly on a real quadrotor.
In this paper, the quadrotor attitude control problem is divided into three second-order linear systems with different CGPs and disturbances. We lock the pitch and yaw axes of a quadrotor, leaving a standard second-order linear rotation system, rather than using a full quadrotor during the RL training process. Our RL algorithm is based on the deterministic policy gradient (DPG) algorithm with two actor neural networks (NNs) and two critic NNs. Mirror sampling and reward shaping methods are introduced into the training process to eliminate the steady-state errors of the RL controller and accelerate training. The CGPs of the quadrotor platforms are identified using TDs, and they can also be tuned manually in a real flight environment. The CMA-ES algorithm is used to automatically tune the ESO parameters in both a simulation environment and on a real quadrotor. The controllers are trained, tuned, and tested in a quadrotor simulation system. Finally, the controller is applied to a real F550 quadrotor. The program runs on a Raspberry Pi microcomputer, which is sufficiently fast to run the NNs in real time. The quadrotor can hover and move around stably and accurately in the air even under severe disturbances.
The main contributions of this paper are summarized below:
- (1)
In this paper, RL is used to develop a single-parameter-tuned quadrotor attitude control system that exhibits excellent performance in both simulation and real flight experiments. This method offers the considerable advantage of requiring only one parameter, the CGP, to be tuned for each axis. Critical system parameters are used to extend the RL state list, making the same RL network suitable for platforms with different system parameters, which represents an advance in RL theory.
- (2)
The introduction of reward function shaping and mirror sampling eliminates the steady-state errors incurred by RL control systems and considerably accelerates the training process.
- (3)
ADRC methods are used to eliminate external disturbances to the quadrotor attitude control system in real time. A practical method is developed to automatically tune the ESO parameters using CMA-ES for a real quadrotor.
The remainder of this paper is organized as follows. In Section 2, a nonlinear quadrotor system model is established, and identification methods are introduced. An RL control algorithm and a CMA-ES parameter tuning algorithm are developed in Section 3. In Section 4, the training details and results are presented, and the simulation experiments are discussed. Section 5 presents the ESO tuning details for a real quadrotor platform and the performance of real flight experiments using the RL controller. Section 6 concludes the paper.
2. System Model and Identification
In this paper, an “X” type F550 quadrotor is used and tested, and its structure and frame are presented in Figure 1. The attitude of the quadrotor represents the angular error between the body frame and the earth frame. In Figure 1, $\phi$, $\theta$, and $\psi$ represent the roll, pitch, and yaw angles, respectively, while $\omega_1$ to $\omega_4$ represent the angular velocities of the four rotors.
The rotation of a quadrotor can only be controlled by changing the lift force and torque of the four rotors. The rigid-body dynamic model of the quadrotor in this paper can be derived from the Newton–Euler equation:
$$
\begin{aligned}
I_x \dot{\omega}_x &= (I_y - I_z)\, \omega_y \omega_z + \tau_x + d_x,\\
I_y \dot{\omega}_y &= (I_z - I_x)\, \omega_x \omega_z + \tau_y + d_y,\\
I_z \dot{\omega}_z &= (I_x - I_y)\, \omega_x \omega_y + \tau_z + d_z,
\end{aligned}
$$
where $\omega_x$, $\omega_y$, and $\omega_z$ are the angular velocities of the quadrotor; $I_x$, $I_y$, and $I_z$ are the moments of inertia about the three axes; $\tau_x$, $\tau_y$, and $\tau_z$ are the torques generated by the rotors; and $d_x$, $d_y$, and $d_z$ are the disturbance torques about the three axes, including aerodynamic drag, side wind, center-of-gravity shift, physical impact, and model error. The torques about the three axes are determined by the angular velocities of the four rotors:
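For an “X” configuration with arm length $d$, a representative form of this mapping is the following; the signs depend on the rotor numbering and spin directions of Figure 1 and should be taken as illustrative rather than definitive:
$$
\begin{aligned}
\tau_x &= \tfrac{\sqrt{2}}{2}\, d\, c_T \left(-\omega_1^2 + \omega_2^2 + \omega_3^2 - \omega_4^2\right),\\
\tau_y &= \tfrac{\sqrt{2}}{2}\, d\, c_T \left(\omega_1^2 - \omega_2^2 + \omega_3^2 - \omega_4^2\right),\\
\tau_z &= c_M \left(\omega_1^2 + \omega_2^2 - \omega_3^2 - \omega_4^2\right),
\end{aligned}
$$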
where $c_T$ is the lift coefficient and $c_M$ is the torque coefficient of the rotors, and $d$ denotes the arm length of the quadrotor; the other quantities in the relation are intermediate variables and scale constants. $\omega_1$ to $\omega_4$ are the angular velocities of the four rotors in Figure 1, and $u_x$, $u_y$, and $u_z$ are the outputs of the attitude controller. The relationship between the current rotor angular velocity $\omega$ and the desired angular velocity $\omega_d$ can be modeled as a first-order system:
$$
\dot{\omega} = \frac{\omega_d - \omega}{T_m},
$$
where $T_m$ is the dynamic response constant of the motor-rotor system.
The accuracy of these parameters is critical. The mass and arm length can be measured directly, but measuring the other parameters is difficult. Mechanical parameters, such as the moments of inertia, are calculated directly in the SolidWorks software package in this paper. The physical parameters of the motors and propellers are measured with a motor test bench. The remaining unknown parameters, including the torque coefficient and the motor dynamic response constant, can be found in an online quadrotor database: https://flyeval.com/ [25].
A simulation environment program, in which the quadrotor states can be updated from the motor control signals, is built with the system model and parameters. The attitude of the quadrotor is represented with unit quaternions and updated with the traditional Runge–Kutta method, as shown in Equation (5):
$$
\dot{\mathbf{q}} = \frac{1}{2}\, \mathbf{q} \otimes \left[\, 0,\ \omega_x,\ \omega_y,\ \omega_z \,\right]^{\top}, \tag{5}
$$
where $\mathbf{q} = [q_0, q_1, q_2, q_3]^{\top}$ is a unit quaternion that describes the rotation between the quadrotor body frame and the earth frame. The quaternion in this paper is a unit quaternion:
$$
\mathbf{q} = \left[\cos\frac{\Theta}{2},\ n_x \sin\frac{\Theta}{2},\ n_y \sin\frac{\Theta}{2},\ n_z \sin\frac{\Theta}{2}\right]^{\top},
$$
where $\Theta$ is the total rotation (the combination of the roll, pitch, and yaw rotations) and $n_x$, $n_y$, and $n_z$ are the projections of the rotation axis onto the $x$, $y$, and $z$ axes. We limit $\Theta$ between $-\pi$ and $\pi$, so $q_0$ is positive all the time. When $\Theta$ is not too large, the attitude angles of roll, pitch, and yaw can be approximated as $2q_1$, $2q_2$, and $2q_3$, respectively.
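As a concrete illustration, the following minimal Python sketch integrates the quaternion kinematics of Equation (5) with one fourth-order Runge–Kutta step; the function names are ours, and the body rates are assumed constant over the step:

```python
import numpy as np

def quat_dot(q, w):
    """Quaternion kinematics dq/dt = 0.5 * q ⊗ [0, w] for body rates w."""
    q0, q1, q2, q3 = q
    wx, wy, wz = w
    return 0.5 * np.array([
        -q1 * wx - q2 * wy - q3 * wz,
         q0 * wx + q2 * wz - q3 * wy,
         q0 * wy - q1 * wz + q3 * wx,
         q0 * wz + q1 * wy - q2 * wx,
    ])

def rk4_attitude_step(q, w, h):
    """One Runge-Kutta (RK4) step of the attitude update, then renormalize."""
    k1 = quat_dot(q, w)
    k2 = quat_dot(q + 0.5 * h * k1, w)
    k3 = quat_dot(q + 0.5 * h * k2, w)
    k4 = quat_dot(q + h * k3, w)
    q = q + (h / 6.0) * (k1 + 2 * k2 + 2 * k3 + k4)
    return q / np.linalg.norm(q)  # keep the quaternion unit-norm
```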
3. Quadrotor Attitude Control System Design
The attitude controller receives the desired attitude, the current attitude, and the current angular velocities, and then outputs the control values $u_x$, $u_y$, and $u_z$ of each axis according to these states. The CGPs of each axis are estimated or tuned manually. The disturbances of the roll and pitch axes are estimated by the ESOs. Finally, the control distributor transfers the attitude control values and the thrust command to the four motors and, in return, sends the attenuated control values back to the ESOs, because the actual control value of each axis may be attenuated by the control distributor if the total demand is too large or too small. Figure 2 presents the control diagram.
The typical ADRC comprises three major components: the nonlinear tracking differentiator (TD), the extended state observer (ESO), and the state error feedback (SEF) [14]. In the traditional ADRC, a transient process is set to avoid a sharp deviation between reference signals, and this process is realized through the TD. In this paper, the transient process is handled automatically by the RL method. The CGP is an estimate of the system response capability, and its accuracy is critical to the performance of both the ESO and the SEF. In previous studies, CGPs were manually adjusted parameters. We propose a method to estimate the CGP value for each quadrotor platform with a TD. With ESOs and CGPs, the RL controller is able to control different quadrotors with the same actor network. To optimize the final performance, the CMA-ES algorithm is presented to tune the parameters of the ESOs, and it succeeds in eliminating the high-frequency vibration problem.
3.1. First-Order ESO and SEF for Quadrotor Attitude Control
In a quadrotor attitude control system, taking the $x$ axis as an example, the system can be written as:
$$
\dot{\omega}_x = b_0 u + f,
$$
where $b_0$ is the estimate of the CGP and $f$ is the total disturbance. The ADRC algorithm takes the nonlinearities, uncertainties, and external disturbances together as the total disturbance, which can be observed by the ESO in real time from the quadrotor state input, the control output, and an estimate of the CGP. The quadrotor control system receives angular velocity data directly from the gyroscope sensors, so the ESO for a single axis of the quadrotor attitude control system takes the first-order form of Algorithm 1:
Algorithm 1 Extended state observer for quadrotor attitude control.
Notation: $y$: angular velocity data from sensors; $z_1$: angular velocity output; $z_2$: disturbance output; $h$: time step; $b_0$: critical gain parameter; $u$: control output; $e$: intermediate variable; $\beta_1$, $\beta_2$, $\alpha$, $\delta$: adjustable parameters; $fal(\cdot)$: a nonlinear function [14].
1: $e = z_1 - y$
2: $z_1 \leftarrow z_1 + h\,(z_2 - \beta_1 e + b_0 u)$
3: $z_2 \leftarrow z_2 - h\, \beta_2\, fal(e, \alpha, \delta)$
4: return $z_1$ and $z_2$
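A minimal Python sketch of this first-order ESO, assuming Han's standard $fal$ function and hypothetical default parameter values, might look as follows:

```python
import numpy as np

def fal(e, alpha, delta):
    """Han's nonlinear gain: linear near zero, sublinear farther out."""
    if abs(e) <= delta:
        return e / delta ** (1.0 - alpha)
    return np.sign(e) * abs(e) ** alpha

class FirstOrderESO:
    def __init__(self, b0, beta1, beta2, alpha=0.5, delta=0.01, h=0.005):
        self.b0, self.beta1, self.beta2 = b0, beta1, beta2
        self.alpha, self.delta, self.h = alpha, delta, h
        self.z1 = 0.0  # estimated angular velocity
        self.z2 = 0.0  # estimated total disturbance

    def update(self, y, u):
        """y: measured angular velocity; u: control output for this axis."""
        e = self.z1 - y
        self.z1 += self.h * (self.z2 - self.beta1 * e + self.b0 * u)
        self.z2 += self.h * (-self.beta2 * fal(e, self.alpha, self.delta))
        return self.z1, self.z2
```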
Using the ESOs, the disturbances of the axes are estimated, and the disturbance feedback should be added to the controller output:
$$
a = u - \frac{z_2}{b_0},
$$
where $a$ is the final control value and $u$ is the output of the controller. In this paper, only the roll and pitch ESOs are designed, because the yaw axis is hardly affected by external disturbances in our outdoor flight experience. In severe flight situations, such as propeller breakage and load swinging, the control system should focus on maintaining the balance of the roll and pitch axes, and the performance of the yaw axis is less important.
3.2. Estimation of the CGPs
In this paper, the CGP is the proportional relation between the control signal of each axis and the corresponding angular acceleration. The accuracy of the CGPs is critical to both the RL controller and the ESO feedback. The CGPs of the F550 quadrotor can be calculated directly with Equation (8), but this measurement can be difficult and expensive for other quadrotors or platforms; thus, a method to estimate the CGPs is presented in this section.
As Equation (7) shows, the CGP can be calculated given the state of a quadrotor, including the angular acceleration, the control value, and the disturbance. Considering the ill-conditioned equation problem and sensor noise, the system should receive enough excitation during the estimation process, but too much excitation may crash the quadrotor. To precisely estimate the angular acceleration, a TD is designed to differentiate the angular velocity data, as shown in Algorithm 2:
Algorithm 2 Tracking differentiators for angular velocity data.
Notation: $v$: angular velocity data from sensors; $x_1$: angular velocity output; $x_2$: angular acceleration output; $fhan(\cdot)$: a nonlinear function [14]; $fh$: intermediate variable; $h$: time step; $r$, $h_0$: adjustable parameters, with $h_0$ usually two or three times $h$.
1: $fh = fhan(x_1 - v,\ x_2,\ r,\ h_0)$
2: $x_1 \leftarrow x_1 + h\, x_2$
3: $x_2 \leftarrow x_2 + h\, fh$
4: return $x_1$ and $x_2$
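A minimal Python sketch of this tracking differentiator, using Han's standard $fhan$ function and hypothetical default values, is:

```python
import math

def fhan(x1, x2, r, h0):
    """Han's time-optimal synthesis function used by the TD."""
    d = r * h0
    d0 = d * h0
    y = x1 + h0 * x2
    a0 = math.sqrt(d * d + 8.0 * r * abs(y))
    if abs(y) > d0:
        a = x2 + 0.5 * (a0 - d) * math.copysign(1.0, y)
    else:
        a = x2 + y / h0
    if abs(a) > d:
        return -r * math.copysign(1.0, a)
    return -r * a / d

class TrackingDifferentiator:
    def __init__(self, r, h=0.005, h0=None):
        self.r = r              # speed factor: larger tracks faster
        self.h = h              # integration time step
        self.h0 = h0 or 3 * h   # filter factor, two or three times h
        self.x1 = 0.0           # filtered angular velocity
        self.x2 = 0.0           # estimated angular acceleration

    def update(self, v):
        fh = fhan(self.x1 - v, self.x2, self.r, self.h0)
        self.x1 += self.h * self.x2
        self.x2 += self.h * fh
        return self.x1, self.x2
```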
The estimation process starts when the quadrotor hovers stably above open, flat ground. The controller produces an excitation on a single axis for a short period, and the CGP of that axis can then be calculated from the recorded flight data.
3.3. Single-Parameter-Tuned Attitude Control Method
The RL method is a framework with these key elements: the agent, the environment, the state $s$, the policy $\pi$, the action $a$, the reward $r$, and the state-action value $Q$. During the training process, the agent takes action $a_t$ according to state $s_t$ and policy $\pi$ in the environment, and obtains a reward $r_t$ in return. The value $Q^{\pi}(s_t, a_t)$ is the sum of the expected future rewards:
$$
Q^{\pi}(s_t, a_t) = \mathbb{E}\left[\sum_{k=0}^{\infty} \gamma^{k}\, r_{t+k}\right],
$$
where the discount factor $\gamma \in (0, 1)$ is introduced to prevent the sum of future rewards from becoming infinite. However, the future rewards are unknown, so we assume an optimal policy $\pi^{*}$ that maximizes $Q$:
$$
\pi^{*}(s) = \arg\max_{a \in A} Q^{\pi^{*}}(s, a),
$$
where $A$ is the action space. During the training process, we use the current policy $\pi$ instead, and this process gradually brings the performance of $\pi$ close to that of $\pi^{*}$.
With the ADRC methods, the quadrotor attitude control problem is divided into three second-order linear systems with different CGPs and disturbances, so we lock the pitch and yaw axes of a quadrotor, leaving a standard second-order linear rotation system, rather than training on a full quadrotor. The agent is similar to an inverted pendulum: it explores the rotation space while discovering the optimal policy and tracking the desired roll angle. Accordingly, the system in the training process is represented as $\dot{\theta} = \omega$, $\dot{\omega} = b_0 u$, where $\theta$ is the attitude angle, $\omega$ is the angular velocity, $b_0$ is the CGP of this axis, and $u$ is the control output. The actuator delay is also modeled as the first-order system of Equation (4). During the training process, no disturbance is added in the simulation.
3.3.1. Network Structures
The RL controller in this paper is based on the actor-critic scheme [26], in which the critic network judges the value of a state-action pair, and the actor network outputs the control values according to the state input. The state in the attitude control process consists of the error between the desired attitude and the current attitude, along with the current angular velocity. The CGP is added to the state in this paper, so the state list is written as $s = [e, \omega, b_0]$. The action list is expressed as $a = [u]$, where $u$ is the control value of the controlled axis.
We have two critic networks with the same structure: a target critic network and an online critic network [26]. The critic networks are fully connected networks (FCNs) with a hidden layer of 32 nodes and a ReLU activation function. All of the experience is used to train the online critic network to fully observe the system and compute the value $Q$. We define the cost function as a squared error function and minimize it using the Adam optimizer. The target critic network is soft updated from the online critic network. The online actor network and target actor network are also FCNs with a 32-node ReLU hidden layer. A tanh activation function is set in the output layer to limit the output between −1 and 1. The state lists are fed directly to the actor network, which computes the action lists. The gradients of the online actor network are computed directly with sampled policy gradients, and the weights are updated with the Adam optimizer [27]. The target actor network is soft updated [26] from the online actor network as well. Figure 3 presents the structures of the actor and critic networks.
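The following PyTorch sketch matches the structures described above; the deep learning framework, the soft-update rate tau, and the helper names are our assumptions, while the layer sizes and activations follow the text:

```python
import torch
import torch.nn as nn

STATE_DIM, ACTION_DIM = 3, 1  # state [e, omega, b0], action [u]

class Actor(nn.Module):
    def __init__(self):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(STATE_DIM, 32), nn.ReLU(),
            nn.Linear(32, ACTION_DIM), nn.Tanh(),  # output limited to [-1, 1]
        )

    def forward(self, state):
        return self.net(state)

class Critic(nn.Module):
    def __init__(self):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(STATE_DIM + ACTION_DIM, 32), nn.ReLU(),
            nn.Linear(32, 1),  # scalar Q-value of the state-action pair
        )

    def forward(self, state, action):
        return self.net(torch.cat([state, action], dim=-1))

def soft_update(target, online, tau=0.01):
    """Slowly track the online weights: w_t <- tau*w + (1 - tau)*w_t."""
    for t_param, param in zip(target.parameters(), online.parameters()):
        t_param.data.mul_(1.0 - tau).add_(tau * param.data)
```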
3.3.2. Reward Function Shaping and Mirror Sampling Methods
This section presents the reward function shaping and mirror sampling methods, which eliminate the steady-state error problem of RL algorithms.
Traditionally, the reward functions are squared error functions of the form $r = -(k_1 e^2 + k_2 \omega^2)$, where $k_1$ and $k_2$ are positive parameters. However, these squared error reward functions led to slow convergence rates and severe steady-state error problems in our earlier experiments. From the perspective of this paper, a state with a low angle error or low angular velocity should be rewarded positively rather than merely punished less. This paper presents a simple but powerful piecewise function $f(x, x_{max}, x_b)$ to meet this demand, where $x$ is the angle error or angular velocity to be rewarded, $x_{max}$ is the upper level of $x$, and $x_b$ is the boundary line: if $|x| < x_b$, then $x$ is rewarded positively. With $f$, the reward function is designed as:
$$
r = k_1\, f(e, e_{max}, e_b) + k_2\, f(\omega, \omega_{max}, \omega_b), \tag{13}
$$
where $k_1$ and $k_2$ are positive parameters to be tuned.
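The following Python sketch shows one piecewise function with the stated properties (positive reward inside the boundary, penalty outside) together with the combined reward of Equation (13); the specific shape and the constants are illustrative assumptions, not the exact definition used in training:

```python
def f(x, x_max, x_b):
    """Piecewise shaping: positive reward when |x| < x_b, penalty otherwise."""
    ax = min(abs(x), x_max)
    if ax < x_b:
        return 1.0 - ax / x_b            # in (0, 1]: reward near the target
    return -(ax - x_b) / (x_max - x_b)   # in [-1, 0]: growing penalty

def reward(e, omega, k1=1.0, k2=0.1):
    """Shaped reward in the spirit of Equation (13); all values illustrative."""
    return k1 * f(e, x_max=1.0, x_b=0.1) + k2 * f(omega, x_max=5.0, x_b=0.5)
```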
Because the quadrotor attitude control system is symmetrical, whenever an experience is explored, its mirrored sample is also valid. Moreover, the zero state-action point is an obviously correct sample, but it may never be explored during training, which is an important cause of the steady-state error problem. These samples should be added to the memory in order to accelerate the training process and eliminate steady-state errors, as sketched below.
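A minimal sketch of both ideas, assuming the state layout $[e, \omega, b_0]$ from Section 3.3.1 and hypothetical placeholder values:

```python
def mirror_sample(s, a, r, s_next):
    """Mirror a transition of the symmetric attitude system.

    s = (e, omega, b0); the CGP b0 is a physical constant and is not negated.
    The shaped reward depends only on |e| and |omega|, so it is unchanged.
    """
    e, w, b0 = s
    e2, w2, _ = s_next
    return (-e, -w, b0), -a, r, (-e2, -w2, b0)

B0 = 1.0     # hypothetical CGP value for illustration
R_MAX = 1.1  # hypothetical maximum shaped reward

# The zero state-action point is also injected as a known-correct sample:
zero_sample = ((0.0, 0.0, B0), 0.0, R_MAX, (0.0, 0.0, B0))
```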
3.3.3. Training Algorithm
The training algorithm in this paper is the DPG algorithm [26], and we add Gaussian noise to the action as the exploration strategy. Algorithm 3 lists the whole algorithm.
Algorithm 3 Training algorithm with mirror sampling.
Notation: $M$: max number of episodes; $T$: max number of steps in an episode; $N$: number of samples in a minibatch; $r(\cdot)$: reward function in Equation (13); $\tau$: soft update rate.
1: initialize the weights $\theta^{Q}$ of the online critic network and $\theta^{\mu}$ of the online actor network randomly
2: initialize the weights of the target critic and target actor networks: $\theta^{Q'} \leftarrow \theta^{Q}$, $\theta^{\mu'} \leftarrow \theta^{\mu}$
3: initialize the memory box $B$
4: for episode $= 1, \dots, M$ do
5: initialize the agent with a random state
6: put the zero state-action point into the memory box
7: for $t = 1, \dots, T$ do
8: take action $a_t$ according to the exploration strategy and the current state $s_t$
9: receive the next state $s_{t+1}$ from the environment
10: put $E = (s_t, a_t, s_{t+1})$ and the mirror sample $E'$ into the memory box
11: sample $N$ experiences $(s_i, a_i, s_{i+1})$ from $B$
12: compute the reward $r_i$ of each experience with $r(\cdot)$
13: compute the targets $y_i = r_i + \gamma\, Q'(s_{i+1}, \mu'(s_{i+1}))$
14: update $\theta^{Q}$ by minimizing the loss $L = \frac{1}{N}\sum_i \left(y_i - Q(s_i, a_i)\right)^2$
15: update $\theta^{\mu}$ using the sampled policy gradients
16: soft update $\theta^{Q'}$ with $\theta^{Q}$ and rate $\tau$
17: soft update $\theta^{\mu'}$ with $\theta^{\mu}$ and rate $\tau$
18: end for
19: end for
3.4. ESO Parameter Tuning with CMA-ES Algorithm
Although there are some rules for ESO parameter tuning, it remains repetitive work, even for designers with considerable experience. The parameters to tune are the adjustable parameters $\beta_1$, $\beta_2$, $\alpha$, and $\delta$ of Algorithm 1 for the roll and pitch axes. In this section, only the ESO of the roll axis is optimized.
In the simulation environment, the disturbance can be obtained directly, so we optimize the process during flight; the controller is the actor network trained in Section 3.3, along with the disturbance feedback calculated by the ESO. A task is designed to test and quantify the performance of the ESO: the system starts at the zero state, and the controller tries to hold the system at the zero state for a period of time, with the same time-varying square wave disturbance added to the system in each test. The cost of each task is computed from the error between the ESO output $z_2$ and the disturbance at every time step during the task.
On the real quadrotor platform, we designed a specific test: the quadrotor is placed on the ground with no external disturbance, and a control signal $u$ is sent to the controller, but the motors are not actuated and remain still during the test. It is worth noting that the control signal $u$ here is not computed by the controller of Section 3.3; we set $u$ as a square wave. In this situation, from the perspective of the controller, a disturbance compensates for the control signals to keep the quadrotor still, so the value of the disturbance is simply the negative of the current control signal $u$ multiplied by the CGP $b_0$. Using this test, we avoid problems such as safety risks, computational costs, and battery capacity limitations. The task for real quadrotor optimization is that the system starts at the zero state and holds still for a period of time, with the same time-varying control signal $u$ added to the system in each test. The cost of each task is likewise computed from the error between the ESO output and the ground-truth disturbance at every time step, as sketched below.
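A minimal Python sketch of this cost evaluation, reusing the FirstOrderESO sketch from Section 3.1 and assuming recorded test signals as inputs:

```python
def eso_cost(params, b0, u_log, gyro_log, h=0.005):
    """Total squared error between the ESO disturbance estimate z2 and the
    ground-truth disturbance d = -b0 * u of the motors-off ground test."""
    beta1, beta2, alpha, delta = params
    eso = FirstOrderESO(b0, beta1, beta2, alpha, delta, h)
    cost = 0.0
    for u, y in zip(u_log, gyro_log):   # square-wave command, gyro readings
        _, z2 = eso.update(y, u)
        cost += (z2 - (-b0 * u)) ** 2   # motors still: d cancels the command
    return cost
```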
This section presents ESO parameter tuning with the CMA-ES algorithm. The CMA-ES algorithm is a stochastic, randomized method for real-parameter (continuous domain) optimization of nonlinear, nonconvex functions. It consists of five parts: sampling, calculating costs, updating the mean value, adapting the covariance matrix, and step-size control [24].
3.4.1. Sampling
In the CMA-ES, a population of $\lambda$ new search points (individuals, offspring) is generated by sampling a multivariate normal distribution:
$$
x_k = m + \sigma\, B D z_k, \qquad z_k \sim \mathcal{N}(0, I), \qquad k = 1, \dots, \lambda,
$$
where $m$ is the mean value, $\sigma$ is the step-size (scale value), $x_k$ is a search point sampled from the multivariate normal distribution, and $z_k$ is sampled from $\mathcal{N}(0, I)$. $B D$ results from an eigendecomposition of the covariance matrix $C = B D^2 B^{\top}$: the columns of $B$ are an orthonormal basis of eigenvectors, and the diagonal elements of the diagonal matrix $D$ are the square roots of the corresponding positive eigenvalues. In effect, CMA-ES controls the search directions with the matrix $B$ and controls the search range with $\sigma$ and $D$.
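A minimal NumPy sketch of this sampling step, assuming a current mean m, step-size sigma, and covariance matrix C are given:

```python
import numpy as np

def sample_population(m, sigma, C, lam, rng=None):
    """Draw lam offspring x_k = m + sigma * B D z_k with z_k ~ N(0, I)."""
    if rng is None:
        rng = np.random.default_rng()
    eigvals, B = np.linalg.eigh(C)          # C = B diag(eigvals) B^T
    D = np.sqrt(np.maximum(eigvals, 0.0))   # D: square roots of eigenvalues
    z = rng.standard_normal((lam, len(m)))  # rows are z_k ~ N(0, I)
    return m + sigma * (z * D) @ B.T        # row k is the search point x_k
```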
3.4.2. Calculating Costs
To eliminate the high-frequency vibration problem, it is critical to lower the error between the real disturbance and the estimation output of the ESO. The tasks in the simulation and on the real quadrotor platform presented above are run for all search points. For every search point $x_k$, the cost at every time step is computed from the squared error between the ESO output and the disturbance, and the total cost is the sum of the costs over all time steps.
3.4.3. Updating the Mean Value
The mean value $m$ of the next generation is updated based on the costs above. We sort the costs, and only the first $\mu$ samples are recorded: $x_{1:\lambda}, \dots, x_{\mu:\lambda}$. A list of weights $w_1, \dots, w_{\mu}$ is designed at the beginning of the algorithm, and $m$ is calculated with these weights:
$$
m \leftarrow \sum_{i=1}^{\mu} w_i\, x_{i:\lambda}, \qquad \sum_{i=1}^{\mu} w_i = 1. \tag{15}
$$
3.4.4. Adapting the Covariance Matrix
The covariance matrix $C$ determines the search directions. Two methods to update $C$, the rank-1 update and the rank-$\mu$ update, are presented in this section. The rank-1 update means that new points should be searched in the direction of the new mean value:
$$
p_c \leftarrow (1 - c_c)\, p_c + \sqrt{c_c (2 - c_c)\, \mu_{\mathrm{eff}}}\; \frac{m - m_{\mathrm{old}}}{\sigma}, \tag{16}
$$
where $p_c$ is the evolution path, $c_c$ is the learning rate of the evolution path update, and $\mu_{\mathrm{eff}}$ is the variance effective selection mass. The rank-$\mu$ update means that new points should be searched in the directions of the search points with low costs. As a result, the covariance matrix $C$ is updated with the equation below:
$$
C \leftarrow (1 - c_1 - c_{\mu})\, C + c_1\, p_c p_c^{\top} + c_{\mu} \sum_{i=1}^{\mu} w_i\, \frac{(x_{i:\lambda} - m_{\mathrm{old}})(x_{i:\lambda} - m_{\mathrm{old}})^{\top}}{\sigma^{2}}, \tag{17}
$$
where $c_1$ and $c_{\mu}$ are the learning rates for the rank-1 update and the rank-$\mu$ update, respectively.
3.4.5. Step-Size Control
This section presents a simple method to control the step size $\sigma$: we scale $\sigma$ according to the cost of the mean value $m$,
$$
\sigma = \sigma_{max} \cdot \min\!\left(\frac{J(m)}{J_{max}},\ 1\right), \tag{18}
$$
where $J(\cdot)$ is the task cost, $\sigma_{max}$ is the maximum step-size, and $J_{max}$ is the maximum cost we set.
Algorithm 4 lists the complete algorithm:
Algorithm 4 Covariance matrix adaptation evolution strategy (CMA-ES) algorithm for extended state observer (ESO) parameter tuning.
Notation: $G$: max number of generations.
1: for generation $= 1, \dots, G$ do
2: compute $B$ and $D$ from $C$
3: for $k = 1, \dots, \lambda$ do
4: sample point $x_k$ with $m$, $\sigma$, $B$, and $D$
5: compute the cost of sample point $x_k$
6: end for
7: sort the points based on the costs
8: update the mean value $m$ using Equation (15)
9: update the evolution path $p_c$ using Equation (16)
10: update the covariance matrix $C$ using Equation (17)
11: update the step-size $\sigma$ using Equation (18)
12: end for
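In practice, rather than implementing Algorithm 4 by hand, the same cost function can be driven by Nikolaus Hansen's reference cma package; the sketch below assumes the eso_cost helper from Section 3.4.2 and uses illustrative placeholder values:

```python
import cma  # pip install cma

# B0, U_LOG, GYRO_LOG are placeholders for the CGP and the recorded test logs.
x0 = [1.0, 1.0, 0.5, 0.01]   # illustrative initial [beta1, beta2, alpha, delta]
es = cma.CMAEvolutionStrategy(x0, 0.5)   # initial mean and step-size sigma
while not es.stop():
    candidates = es.ask()                # sample a population of search points
    costs = [eso_cost(p, B0, U_LOG, GYRO_LOG) for p in candidates]
    es.tell(candidates, costs)           # update mean, covariance, step-size
best_params = es.result.xbest
```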