Article

A Supervised Reinforcement Learning Algorithm for Controlling Drone Hovering

1 College of Automation Engineering, Nanjing University of Aeronautics and Astronautics, Nanjing 211106, China
2 Electric Power Research Institute of Guangxi Power Grid Co., Ltd., Nanning 530023, China
3 College of Electronic Engineering, Nanjing XiaoZhuang University, Nanjing 211171, China
* Author to whom correspondence should be addressed.
Drones 2024, 8(3), 69; https://doi.org/10.3390/drones8030069
Submission received: 31 December 2023 / Revised: 28 January 2024 / Accepted: 6 February 2024 / Published: 20 February 2024

Abstract

The application of drones carrying different devices for aerial hovering operations is becoming increasingly widespread, but there is currently very little research that relies on reinforcement learning methods for hover control, and such methods have not been implemented on physical vehicles. The drone's behavior space for hover control is continuous and large-scale, making it difficult for basic algorithms and value-based reinforcement learning (RL) algorithms to achieve good results. In response to this issue, this article applies a watcher-actor-critic (WAC) algorithm to drone hover control, which can quickly lock the exploration direction and achieve high robustness of the drone's hover control while improving learning efficiency and reducing learning costs. This article first utilizes the actor-critic algorithm based on behavioral value Q (QAC) and the deep deterministic policy gradient algorithm (DDPG) for drone hover control learning. Subsequently, an actor-critic algorithm with an added watcher is proposed, in which the watcher uses a PID controller with parameters provided by a neural network as a dynamic monitor, transforming the learning process into supervised learning. Finally, this article uses a classic reinforcement learning environment library, Gym, and a current mainstream reinforcement learning framework, PARL, for simulation, and deploys the algorithm to a practical environment. A multi-sensor fusion strategy-based autonomous localization method for unmanned aerial vehicles is used for the practical exercises. The simulation and experimental results show that the training episodes of the WAC are reduced by 20% compared to the DDPG and 55% compared to the QAC, and the proposed algorithm has a higher learning efficiency, faster convergence speed, and smoother hovering effect than the QAC and DDPG.

1. Introduction

With the continuous development of drone control technology, the application mode of “drone+” is becoming increasingly widespread [1]. Aerial robots can become flight carriers for various aerial operations in different industries by carrying various devices and targeted configurations that meet application requirements. Many drones are used as flight platforms for operational tasks such as inspection, monitoring, obstacle removal, and cargo handling [2,3,4,5,6,7], which require the drone to be able to hover for long periods of time. In addition to high requirements for the platform's translational stability, the large inertia of the pitch and yaw channels caused by the operational tools loaded on the platform, as well as the impact of contact forces during operation, also require the platform to have strong anti-interference abilities. The use of flight control strategies with disturbance resistance and higher efficiency has therefore received attention. At present, the flight control methods for drones mainly include linear control methods [8,9,10], nonlinear control methods [11,12,13], and intelligent control methods [14,15,16,17]. Learning-based robot control has received widespread attention in the field of automatic control [18,19,20], as it ignores the dynamic model of the robot and learns control behavior from a large amount of motion data. In a recent Nature paper, Elia Kaufmann et al. [21], from the Robotics and Perception Group at the University of Zurich, presented the autonomous drone racing system Swift, which used deep reinforcement learning (DRL) to defeat human champions in racing competitions and set new records. It can be seen that reinforcement learning frameworks have become a promising algorithmic tool for improving the autonomous behavior of aerial robots [22,23,24,25]. Deep reinforcement learning, as a currently popular unsupervised learning framework, combines the perception ability of deep learning with the decision-making ability of reinforcement learning to achieve end-to-end control from input to output. It can be effectively applied to high-level decision-making systems and is becoming a common framework for researchers in many fields to study aerial robot control algorithms [26].
At present, many researchers are trying to combine reinforcement learning with unmanned aerial vehicle systems to solve the autonomous control problem of flying robot systems [27,28,29,30]. Most reinforcement learning algorithms applied to unmanned aerial vehicle systems cannot be applied to real scenarios because the simulation scenario settings are too simple, while overly complex simulation scenarios reduce learning efficiency. In response to this issue, Lee Myoung Hoon et al. [31] proposed a soft actor-critic (SAC) algorithm with hindsight experience replay (HER), called SACHER, which is a type of deep reinforcement learning (DRL) algorithm. SACHER improves the learning performance of SAC and was adopted for the path planning and collision avoidance control of drones. Their paper also demonstrates the effectiveness of SACHER in terms of success rate, learning speed, and collision avoidance performance in drone operation. Zhongxing Li et al. [32] proposed using reinforcement learning to control the motor speed of quadcopter drones for hover control in order to improve the intelligent control level of quadcopter drones. Jia Zhenyu et al. [33] proposed introducing traditional control models to improve the training speed of reinforcement learning algorithms. By optimizing the proximal policy in deep reinforcement learning, the control learning task of attitude stabilization was achieved, and it was verified that the algorithm can effectively stabilize quadcopters in any attitude and is more versatile and converges faster than general reinforcement learning methods. The above references provide ideas for the innovative algorithm in this article.
In this article, we apply a new reinforcement learning algorithm, the WAC (watcher-actor-critic), to drone hover control, which can achieve high robustness of drone hover control while improving learning efficiency and reducing learning costs, preventing human manipulation from being affected by uncontrollable factors. The hover control of drones belongs to low-level flight control with a continuous and large-scale behavior space. Basic algorithms such as Monte Carlo reinforcement learning based on complete sampling and temporal difference reinforcement learning based on incomplete sampling have low efficiency here and may even fail to find good solutions. Even the deep Q-network algorithm (DQN) finds it difficult to learn a good result. The policy gradient algorithm based on Monte Carlo learning is not widely used in practical applications, mainly because it requires a complete state sequence to calculate the return. At the same time, using the return instead of the behavioral value introduces high variance, which may cause many parameter updates to move in a direction that is not the true direction of the policy gradient. The DQN proposed by Mnih, V. et al. [22] is value-based reinforcement learning. Although it can solve many problems excellently, it still appears inadequate when facing learning problems with continuous behavior spaces, limited observations, and stochastic strategies. To address this issue, Lillicrap, T.P. et al. [34] proposed the DDPG, a joint algorithm based on a value function and a policy function and belonging to the actor-critic family. It combines the DQN with the PG algorithm proposed by Peters, J. et al. [35] and the DPG proposed by Silver, D. et al. [36]. In this context, this article directly learns the strategy, which is viewed as a parameterized policy function mapping the drone state to the propeller power output. By establishing an appropriate objective function and utilizing the rewards generated by the drone's interaction with the environment, the parameters of the policy function can be learned.
This article first uses the actor-critic algorithm based on behavioral value (QAC) and the deep deterministic policy gradient algorithm (DDPG) for drone hover control learning. By establishing an approximate value function for the drone state and a policy function, policy evaluation and optimization can be based on the value function on the one hand, and on the other hand, the optimized policy function makes the value function reflect the value of the state more accurately. The two mutually promote each other and ultimately yield the optimal hovering strategy for the drone. However, because such reinforcement learning is unsupervised, the initial learning stage begins with a random exploration strategy without any correct external guidance. The update of the strategy is also arbitrary and may even explore in the opposite direction, resulting in a lower initial iteration speed and ultimately reducing the convergence speed of the objective function. In response to the above issues, this article proposes a deep deterministic policy gradient algorithm with an added supervised network. Considering that most aircraft currently use PID as the control algorithm in practical applications with quite good control performance, and that excessive supervision can affect the flexibility of the learning process, a PID controller whose parameters are provided by a three-layer neural network is used as the monitor in this article. The monitor performs dynamic supervision during the strategy learning stage; that is, it changes the supervision intensity as the strategy learning progresses. At the beginning of each learning episode, the supervision intensity is highest, and by the end of the episode, the supervision is mostly removed. After the learning process reaches a certain stage, the supervision task is withdrawn, achieving unsupervised online learning. Finally, this article uses a classic reinforcement learning environment library, Gym, and a current mainstream reinforcement learning framework, PARL, for simulation and deploys the algorithm to the actual environment. A lightweight, drift-free multi-sensor fusion strategy for unmanned aerial vehicles is used as the autonomous localization method for the practical exercises. The simulation and experimental results show that the reinforcement learning algorithm designed in this paper has a higher learning efficiency, faster convergence speed, and smoother hovering effect than the QAC and DDPG. This article takes improving the learning efficiency of drone hovering tasks as the starting point and adopts a PID controller combined with a neural network for supervised learning, with reinforcement learning as the main approach and supervised control as the auxiliary, to achieve rapid learning and smooth hovering of the drone. From the above description, it can be seen that the advantages of the proposed algorithm are its fast convergence speed and high training efficiency. Figure 1 illustrates the control strategy of this article. The right section describes how the algorithm guides the drone in simulation by providing the predicted actions for the drone propellers. This process is presented in the visualization window through the render() function in Gym. The left part shows the deployment of the trained strategy in the experimental environment.
The rest of this article is organized as follows. Section 2 introduces the proposed control method. The performance of this method is verified through simulation results in Section 3. Section 4 deploys this method to real drone systems. Section 5 focuses on concluding remarks.

2. Models and Methods

Section 2.1 describes the problem formulation of policy search in the reinforcement learning (RL) framework based on policy gradients. Section 2.2 describes the QAC and DDPG algorithms and introduces the proposed WAC method. In this article, the strategy is the controller learned by RL, which is formulated by combining the RL framework with a supervised controller.

2.1. Problem Formulation

Let the flight system of the drone be a continuous dynamic system with state feature, $s_t$, and system input, $a$, set up as follows:
$$s_{t+1} = g(s_t, a) + N_t$$
Among them, $s_{t+1}$ is the prediction of the next state based on the current state feature, $s_t$, after taking action $a$, where $g$ is an unknown function, $N_t$ is a random noise function that explores a small range around the control variable, and $t$ denotes the time step counted over all previous steps during the training process.
Assuming that each temporal process during the drone flight has the Markov property, the Markov decision process tuple, $\langle S, A, P, R, \gamma \rangle$, for the drone is established as follows:
$$
\begin{aligned}
& s_t = [p_{bx}, p_{by}, p_{bz}, \dot{p}_{bx}, \dot{p}_{by}, \dot{p}_{bz}, \ddot{p}_{bx}, \ddot{p}_{by}, \ddot{p}_{bz}, \omega_{bx}, \omega_{by}, \omega_{bz}, \theta, \phi, \varphi, h]^{T}, \quad s_t \in S \\
& a = (f_{p1}, f_{p2}, f_{p3}, f_{p4}), \quad a \in A \\
& P_{s_t s_{t+1}}^{a} = P\left[ S_{t+1} = s_{t+1} \mid S_t = s_t, A_t = a \right] \\
& R_{s_t}^{a} = E\left[ R_{t+1} \mid S_t = s_t, A_t = a \right] \\
& \gamma \in [0, 1]
\end{aligned}
$$
where $S$ in the tuple $\langle S, A, P, R, \gamma \rangle$ is a finite state set, and $s_t$ is one state representation in $S$. $p_{bx}$, $p_{by}$, and $p_{bz}$ are the position coordinates of the drone along the three axes of the body coordinate system; $\dot{p}_{bx}$, $\dot{p}_{by}$, and $\dot{p}_{bz}$ are the velocities of the drone along the three axes of the body coordinate system; and $\ddot{p}_{bx}$, $\ddot{p}_{by}$, and $\ddot{p}_{bz}$ are the accelerations of the drone along the three axes of the body coordinate system. $h$ represents the flight altitude of the drone relative to the ground, and the state vector $s_t$ describes the flight status of the drone in the aircraft coordinate system, composed of the drone's current position $(p_{bx}, p_{by}, p_{bz})$, velocity $(\dot{p}_{bx}, \dot{p}_{by}, \dot{p}_{bz})$, acceleration $(\ddot{p}_{bx}, \ddot{p}_{by}, \ddot{p}_{bz})$, angular velocity $(\omega_{bx}, \omega_{by}, \omega_{bz})$, Euler angles $(\theta, \phi, \varphi)$, and flight altitude, $h$. $A$ in the tuple is a finite set of control variables, where $a$ represents one set of control variables in $A$: $f_{p1}$, $f_{p2}$, $f_{p3}$, and $f_{p4}$ represent the lift provided by the four propellers of the drone, and $a = (f_{p1}, f_{p2}, f_{p3}, f_{p4})$ describes the control variables executed along the four propeller axes under system control. $P$ in the tuple is a state transition probability matrix conditioned on the control variable: $P_{s_t s_{t+1}}^{a}$ represents the transition probability from the current state, $s_t$, to the next state, $s_{t+1}$, under $a$, and the reward obtained during this process is $R_{t+1}$; the subscripts $t$ and $t+1$ denote the current time step and the next time step, respectively. $R$ in the tuple is a reward function based on the state and control variables; $R_{s_t}^{a}$ is the reward obtained from $s_t$ after executing $a$. $\gamma$ in the tuple is the decay factor, $S_t$ is the finite state set at the current time step, $t$, and $A_t$ is the finite control variable set at the current time step, $t$.
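As a concrete illustration of the state and action spaces defined above, the following hedged Python sketch packs the 16-dimensional state and the 4-dimensional action into arrays; the field order mirrors the definition of $s_t$, and everything else is an assumption for illustration only.

```python
# Illustrative packing of the state s_t and action a defined above; the field
# order follows the definition of s_t, everything else is an assumption.
import numpy as np

def pack_state(p, v, acc, omega, euler, h):
    """p, v, acc, omega: 3-vectors; euler = (theta, phi, psi); h: altitude."""
    return np.concatenate([p, v, acc, omega, euler, [h]]).astype(np.float32)  # shape (16,)

def pack_action(f_p1, f_p2, f_p3, f_p4):
    """Lift forces of the four propellers."""
    return np.array([f_p1, f_p2, f_p3, f_p4], dtype=np.float32)               # shape (4,)
```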
The state is defined in this way because an actual drone can obtain the above state values from onboard sensors, such as gyroscopes and accelerometers, which facilitates the subsequent algorithm deployment work. The motion state update equation for drone hovering flight is established as follows:
$$s_{t+1} = g\left( S_t = s_t, A_t = a \mid \pi_\theta(s_t, a) \right) + N_t, \qquad \pi_\theta(s_t, a) = P\left[ A_t = a \mid S_t = s_t, \theta \right]$$
where $\pi_\theta(s_t, a)$ is the hover flight control strategy, $\theta$ is the strategy parameter, $P[A_t = a \mid S_t = s_t, \theta]$ is the probability distribution for executing the control variable, $a$, given the state, $s_t$, and $\theta$, $g(S_t = s_t, A_t = a \mid \pi_\theta(s_t, a))$ is the dynamic equation obtained by continuously updating $(s_t, a, s_{t+1})$ under the strategy, and $N_t$ is the noise function.

2.2. Proposed Strategies

2.2.1. Actor-Critic Algorithm

The actor-critic algorithm relies on policy gradients. When a temporal difference (TD) algorithm is used to estimate the behavior value during this process, the method is called QAC, which learns by following an approximate policy gradient. First, the objective function, $J(\theta)$, is designed based on the policy, $\pi_\theta$, and then optimized accordingly [35]:
$$\theta_{t+1} = \theta_t + \gamma \nabla_\theta J(\theta_t), \qquad \nabla_\theta J(\theta_t) \approx E_{\pi_\theta}\left[ \nabla_\theta \log \pi_\theta(S, A \mid \theta_t)\, Q_{\pi_\theta}(S, A) \right]$$
The objective function gradient is transformed through sampling as follows:
$$\theta_{t+1} = \theta_t + \gamma \nabla_\theta \log \pi_\theta(s_t, a_t \mid \theta_t)\, Q_\omega(s_t, a_t)$$
where $\theta$ is the policy parameter, $\nabla$ is the gradient operator, $\gamma$ is the learning rate, $\nabla_\theta \log \pi_\theta(S, A \mid \theta_t)$ is the score function, and $Q_\omega(S, A)$ is the critic's approximate value function based on the policy $\pi_\theta$, i.e., $Q_\omega(S, A) \approx Q_{\pi_\theta}(S, A)$, with $\omega$ as its parameter.
The QAC algorithm utilizes both value-based and policy-based approaches, comprising a policy function and a behavioral value function. The actor interacts with the environment as a policy function to generate behavior, while the critic evaluates the policy as a value function. The QAC method learns an increasingly accurate critic from its interaction with the environment, enabling single-step updates. Compared with the simple policy gradient, the QAC can make fuller use of the data. However, the critic in the QAC algorithm is still an approximate value function, which may introduce bias and deviate the update direction from the true direction of the policy gradient. The DDPG also combines value function approximation (the critic) with policy function approximation (the actor). Its use of target networks alongside the online networks makes the learning process more stable and the convergence more reliable. It is therefore based on the actor-critic algorithm [35] but differs from it (Figure 2).
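As a rough illustration of the score-function update in the equations above, the following PyTorch-style Python sketch performs one actor update step; the `policy_net` and `q_value` objects are placeholders for the actor and the critic's TD-learned estimate, and none of this is the authors' implementation.

```python
# Hypothetical sketch of the QAC actor update (gradient ascent on log pi * Q);
# the policy network and the critic's Q estimate are placeholders, and the
# critic's own TD update is omitted.
import torch

def qac_actor_update(policy_net, policy_opt, s_t, a_t, q_value):
    dist = policy_net(s_t)                         # policy_net is assumed to return a torch.distributions object
    log_prob = dist.log_prob(a_t).sum(-1)          # log pi_theta(a_t | s_t)
    loss = -(log_prob * q_value.detach()).mean()   # minimize the negative of log pi * Q
    policy_opt.zero_grad()
    loss.backward()
    policy_opt.step()
```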

2.2.2. WAC Algorithm

Supervised learning generates corresponding outputs for a given input based on a predictive model. For example, in reference [33], the authors introduced traditional control as an aid, which falls under supervised learning. In the algorithm proposed in this article, the watcher, acting as a supervisor, plays this role. The proposed WAC algorithm is therefore a supervised reinforcement learning algorithm and also belongs to the family of actor-critic algorithms. The specific control strategy is as follows:
$$\pi_{\mathrm{WAC}}(s_t, a) = P\left[ A_t = \alpha a_\mu + \beta a_W \mid S_t = s_t, \theta^\mu, \theta^W \right]$$
where $\theta^\mu$ is the policy (actor) network parameter and $\theta^W$ is the supervised (watcher) network parameter. The WAC strategy takes the executed behavior to be the dynamically weighted combination, $A_t = \alpha a_\mu + \beta a_W$, of the behavior given by the actor and the output behavior of the watcher network. The dynamic weights implement dynamic supervision, while the critic network estimates the behavior value of the dynamically weighted behavior from the actor and watcher, calculates the gradient of the objective function and the critic loss, and updates the parameters of each network.
Assume the actor network function is $\mu(s \mid \theta^\mu)$, the critic network function is $Q(s, a \mid \theta^Q)$, the watcher network function is $W(s \mid \theta^W)$, the target actor network function is $\mu'(s \mid \theta^{\mu'})$, and the target critic network function is $Q'(s, a \mid \theta^{Q'})$. The specific control process of the WAC strategy is shown in Figure 3. For a transition $(s_i, a_i, r_{i+1}, s_{i+1})$ randomly sampled from the experience pool, the state $s_i$ passes through the actor network to generate a specific behavior, $a_\mu$, which serves as one of the action inputs of the critic network. The next state, $s_{i+1}$, passes through the target actor network to generate a predicted behavior, $a'$, which serves as the action input of the target critic network for calculating the estimated value. The watcher network feeds the pre- and post-state values and the error values from historical data into the PID supervisor and generates a specific behavior, $a_W$, as the other action input of the critic network. The critic network calculates the behavior value $Q(s_i, a_i)$ for the state, $s_i$, and the weighted behavior, $a_i$, while the target critic network generates the behavior value $Q'(s_{i+1}, a')$ from the subsequent state, $s_{i+1}$, and $a'$ to calculate the target value, $Q_{\mathrm{Target}}$.
In the above process, if the reward obtained by interacting with the environment under the strategy π θ is R s , a , then the objective function is designed as follows:
$$J(\theta) = E_{\pi_\theta}\left[ R(s, a) \right]$$
The task of the actor and the watcher is to find the optimal behavior, $a_i = \alpha a_\mu + \beta a_W$ (where $\alpha$ and $\beta$ are dynamic weights), that maximizes the value of the output behavior. The reward obtained by executing $a_i$ is $r_{i+1}$, and this transition is stored as $(s_i, a_i, r_{i+1}, s_{i+1})$. After the system has stored enough transitions, $N$ random samples are drawn to calculate the target value, $Q_{\mathrm{Target}}$, and the loss, $L$, of the critic network [35]:
$$Q_{\mathrm{Target}} = r_{i+1} + \chi\, Q'\left( s_{i+1}, \mu'(s_{i+1} \mid \theta^{\mu'}) \mid \theta^{Q'} \right)$$
$$L = \frac{1}{N} \sum_i \left( Q_{\mathrm{Target}} - Q(s_i, a_i \mid \theta^Q) \right)^2$$
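A minimal PyTorch-style sketch of the critic update above is given below; the network and optimizer objects, and the assumption that the critic takes $(s, a)$ as two inputs, are placeholders rather than the authors' PARL implementation.

```python
# Hypothetical sketch of the critic update (Q_Target and the MSE loss above);
# networks are assumed to be torch.nn.Module objects, chi is the weight from
# the Q_Target equation, and nothing here is the authors' released code.
import torch
import torch.nn.functional as F

def critic_update(critic, target_critic, target_actor, critic_opt,
                  s, a, r, s_next, chi):
    with torch.no_grad():
        a_next = target_actor(s_next)                        # mu'(s_{i+1} | theta^{mu'})
        q_target = r + chi * target_critic(s_next, a_next)   # Q_Target
    q_pred = critic(s, a)                                    # Q(s_i, a_i | theta^Q)
    loss = F.mse_loss(q_pred, q_target)                      # loss L
    critic_opt.zero_grad()
    loss.backward()
    critic_opt.step()
    return loss.item()
```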
The update method of the actor network adopts the gradient ascent method.
$$\nabla_{\theta^\mu} J \approx \frac{1}{N} \sum_i \nabla_a Q(s, a \mid \theta^Q) \Big|_{s = s_i,\, a = \alpha \mu(s_i) + \beta W(s_i)}\; \nabla_{\theta^\mu} \mu(s \mid \theta^\mu) \Big|_{s_i}$$
The weighted behavior executed by the system is as follows:
$$a_t = \alpha\, \mu(s_t \mid \theta^\mu) + \beta\, W(s_t \mid \theta^W) + N_t$$
α and β are constantly changing according to their learning progress, meeting the following requirements:
α + β = 1
The target network update is as follows:
$$\theta^{Q'} \leftarrow \varsigma\, \theta^{Q} + (1 - \varsigma)\, \theta^{Q'}$$
$$\theta^{\mu'} \leftarrow \varsigma\, \theta^{\mu} + (1 - \varsigma)\, \theta^{\mu'}$$
where $N_t$ is the noise function, and $\chi$ and $\varsigma$ are weights. $\theta^Q$, $\theta^\mu$, and $\theta^W$ are the weights of the critic network, $Q(s, a \mid \theta^Q)$, the actor network, $\mu(s \mid \theta^\mu)$, and the supervision (watcher) network, $W(s \mid \theta^W)$, respectively. $\theta^{Q'}$ and $\theta^{\mu'}$ are the weights of the target critic network, $Q'(s, a \mid \theta^{Q'})$, and the target actor network, $\mu'(s \mid \theta^{\mu'})$.
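The soft target-network update above is standard; a minimal sketch, assuming PyTorch modules, is shown below.

```python
# Minimal sketch of the soft target-network update above; varsigma is the
# weight from the update equations, and the networks are torch.nn.Module objects.
import torch

@torch.no_grad()
def soft_update(online_net, target_net, varsigma):
    for p, p_tgt in zip(online_net.parameters(), target_net.parameters()):
        # theta_target <- varsigma * theta_online + (1 - varsigma) * theta_target
        p_tgt.mul_(1.0 - varsigma).add_(varsigma * p)
```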
The PID controller corresponding to the watcher network is defined as follows:
$$W(s_i) = K_{PID}\, e_k, \qquad e_k = s_i - s_d$$
where
$$K_{PID} = \begin{bmatrix}
K_P & K_I & K_D & 0 & 0 & 0 & 0 & 0 & 0 \\
0 & 0 & 0 & K_P & K_I & K_D & 0 & 0 & 0 \\
0 & 0 & 0 & 0 & 0 & 0 & K_P & K_I & K_D
\end{bmatrix}_{a \times s}$$
$K_P$, $K_I$, and $K_D$ are the adjustment parameters corresponding to the state error, the cumulative state error, and the derivative of the state error, respectively. $e_k$ is the state error, and $s_i$ and $s_d$ are the sampled state and its reference value, respectively. The supervised neural network trains the three parameters and passes them to the PID controller, which calculates the specific behavior values and outputs them to the critic network. The training of the supervised network applies the kinematic and dynamic analysis used in traditional flight control, which is not repeated in this article.
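The following Python sketch illustrates one plausible form of the watcher: a small network supplies the PID gains and a PID law maps state errors to propeller commands. The per-channel application and the `mixer` matrix that maps channel commands to the four propeller forces are assumptions for illustration, not the authors' exact formulation.

```python
# Illustrative watcher sketch (not the authors' code): a pretrained gain
# network proposes (K_P, K_I, K_D), a PID law acts on the state error
# e_k = s_i - s_d, and an assumed mixer matrix maps the result to the four
# propeller lift commands a_W.
import numpy as np

class PIDWatcher:
    def __init__(self, gain_net, s_ref, mixer, dt=0.01):
        self.gain_net = gain_net            # assumed pretrained: state -> (K_P, K_I, K_D)
        self.s_ref = np.asarray(s_ref)      # desired hover state s_d
        self.mixer = np.asarray(mixer)      # (4 x state_dim) mapping to propeller forces (assumption)
        self.dt = dt
        self.err_int = np.zeros_like(self.s_ref)
        self.err_prev = np.zeros_like(self.s_ref)

    def act(self, s):
        k_p, k_i, k_d = self.gain_net(s)            # gains from the supervised network
        err = np.asarray(s) - self.s_ref            # e_k = s_i - s_d
        self.err_int += err * self.dt               # cumulative state error
        err_dot = (err - self.err_prev) / self.dt   # derivative of the state error
        self.err_prev = err
        u = k_p * err + k_i * self.err_int + k_d * err_dot  # PID term per state channel
        return self.mixer @ u                       # a_W: four propeller commands
```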
Algorithm 1 provides an overview of the proposed algorithm.
Algorithm 1: WAC Algorithm
Input: χ, ς, γ, θ^Q, θ^μ, θ^W
Output: optimized θ^Q, θ^μ, θ^W
Initialize the watcher network W(s | θ^W) with weights θ^W
Complete the initial training of the PID controller parameters via the watcher network using Equations (15) and (16)
Randomly initialize the critic Q(s, a | θ^Q) and the actor μ(s | θ^μ) with weights θ^Q and θ^μ
Initialize the target networks Q′(s, a | θ^{Q′}) and μ′(s | θ^{μ′}) with weights θ^{Q′} ← θ^Q, θ^{μ′} ← θ^μ
Initialize the experience cache space M
for episode = 1 to Limit do
  Initialize a noise function and receive the initial state
  for t = 1 to T do
    Perform action a_t via Equation (11), obtain reward r_{t+1} and next state s_{t+1}, and store the transition (s_t, a_t, r_{t+1}, s_{t+1}) in M
    Sample a random batch of N transitions (s_i, a_i, r_{i+1}, s_{i+1}) from M, where i = 0, 1, 2, …, N
    Set the target value function via Equation (7)
    Update the critic by minimizing the loss via Equations (8) and (9)
    Update the actor policy by the policy gradient via Equation (10)
    Update the target networks via Equations (13) and (14)
  end for
end for
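Under the assumptions stated earlier (a Gym-style hover environment, PyTorch network modules, and the `critic_update`, `soft_update`, and `PIDWatcher` sketches above), the main loop of Algorithm 1 might look roughly as follows; the hyperparameter values and the Gaussian stand-in for the OU noise are illustrative only.

```python
# Rough sketch of Algorithm 1's main loop under the stated assumptions; this
# reuses the critic_update, soft_update, and PIDWatcher sketches above and is
# not the authors' released implementation.
import random
import torch

def train_wac(env, actor, critic, actor_tgt, critic_tgt, watcher,
              actor_opt, critic_opt, episodes, T, batch_size, chi, varsigma,
              alpha0=0.5, noise_std=0.1):
    M = []                                            # experience cache space
    for episode in range(episodes):
        s = torch.as_tensor(env.reset(), dtype=torch.float32)
        for t in range(T):
            alpha = alpha0 + (1.0 - alpha0) * t / T   # actor weight grows within the episode
            beta = 1.0 - alpha                        # watcher weight decays toward zero
            with torch.no_grad():
                a_mu = actor(s)
            a_w = torch.as_tensor(watcher.act(s.numpy()), dtype=torch.float32)
            a = alpha * a_mu + beta * a_w + noise_std * torch.randn_like(a_mu)  # weighted action + noise
            s_next, r, done, _ = env.step(a.numpy())
            s_next = torch.as_tensor(s_next, dtype=torch.float32)
            M.append((s, a, torch.tensor([float(r)]), s_next))
            s = s_next
            if len(M) >= batch_size:
                sb, ab, rb, snb = (torch.stack(x) for x in zip(*random.sample(M, batch_size)))
                critic_update(critic, critic_tgt, actor_tgt, critic_opt, sb, ab, rb, snb, chi)
                actor_loss = -critic(sb, actor(sb)).mean()   # ascend Q(s, mu(s))
                actor_opt.zero_grad(); actor_loss.backward(); actor_opt.step()
                soft_update(critic, critic_tgt, varsigma)    # target-network updates
                soft_update(actor, actor_tgt, varsigma)
            if done:
                break
```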
We compared the performance of using the QAC algorithm, DDPG algorithm, and WAC algorithm in Section 3.

3. Simulations and Results Discussion

The open-source toolkit OpenAI Gym for reinforcement learning algorithms and the current mainstream reinforcement learning framework PARL are our simulation tools. PARL is a reinforcement learning framework based on PaddlePaddle, characterized by the following: (1) high flexibility and support for most reinforcement learning algorithms; (2) an open-source reinforcement learning algorithm library with large-scale distributed capabilities; (3) strong usability of the algorithm library in industrial-level scenarios. The render function in the Gym library can be used to render the simulation environment and visually display the simulation results. The drone in the PARL framework is a quadcopter equipped with an inertial module, which obtains its real-time position $(p_{bx}, p_{by}, p_{bz})$, velocity $(\dot{p}_{bx}, \dot{p}_{by}, \dot{p}_{bz})$, acceleration $(\ddot{p}_{bx}, \ddot{p}_{by}, \ddot{p}_{bz})$, angular velocity $(\omega_{bx}, \omega_{by}, \omega_{bz})$, Euler angles $(\theta, \phi, \varphi)$, and flight altitude, $h$, in the body coordinate system by simulating sensors such as accelerometers and gyroscopes. This makes it easy to obtain the drone's current flight status and feed it to the input of the WAC algorithm. The output behavior, $a = (f_{p1}, f_{p2}, f_{p3}, f_{p4})$, describes the control variables executed along the four propeller axes under system control, and its learning process is the iterative solution of the strategy function. During drone hovering, the inertial module provides the current state data at each time step and determines the reward value: the greater the hovering error, the lower the reward value. The simulation computer is configured with an Intel(R) Core(TM) i5-7300HQ, and the simulation environment is Python 3.7.
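The interaction pattern just described follows the standard Gym interface; the sketch below illustrates it with a placeholder `make_hover_env()` constructor (not a real Gym or PARL API), a 16-dimensional observation, and a 4-dimensional propeller action.

```python
# Sketch of the Gym-style interaction described above. make_hover_env() is a
# placeholder for however the hovering environment is constructed; it is not a
# real Gym or PARL API call.
def rollout(make_hover_env, policy, max_steps=1000, render=False):
    env = make_hover_env()
    obs = env.reset()                        # 16-dimensional state s_t
    total_reward = 0.0
    for _ in range(max_steps):
        action = policy(obs)                 # 4 propeller lift commands
        obs, reward, done, info = env.step(action)
        total_reward += reward
        if render:
            env.render()                     # visualization window (cf. Figure 1)
        if done:                             # crash (h <= 0) or step limit reached
            break
    return total_reward
```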
In the WAC, the actor and critic network structures are the same as those of the target actor and target critic networks, respectively. The architecture of the actor-critic network (AC Net) is shown in Figure 4. Considering that the drone needs to extract features from the observed states, we designed a total of three hidden layers for the critic. The hidden layers that process states and behaviors operate separately at first, and the final hidden layer is fully connected to output the value of the state-behavior pair. The input received by the actor network is the set of features observed by the drone, and the output is the specific value of each behavioral feature. The actor network is designed with three hidden, fully connected layers. To achieve exploration, the algorithm uses the Ornstein–Uhlenbeck process as the noise model and adds random noise to the generated behavior to explore within a certain range around the exact behavior. The neural network in the watcher is designed with three hidden, fully connected layers. Its architecture is shown in Figure 5.
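For reference, a minimal implementation of the Ornstein–Uhlenbeck noise process mentioned above might look as follows; the theta, sigma, and dt values here are common defaults used for illustration, not the paper's settings.

```python
# Minimal Ornstein-Uhlenbeck exploration noise; the theta, sigma, and dt values
# are common defaults used for illustration, not the paper's settings.
import numpy as np

class OUNoise:
    def __init__(self, size, mu=0.0, theta=0.15, sigma=0.2, dt=1e-2):
        self.mu, self.theta, self.sigma, self.dt = mu, theta, sigma, dt
        self.x = np.ones(size) * mu

    def reset(self):
        self.x[:] = self.mu

    def sample(self):
        dx = (self.theta * (self.mu - self.x) * self.dt
              + self.sigma * np.sqrt(self.dt) * np.random.randn(*self.x.shape))
        self.x = self.x + dx
        return self.x
```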
In the simulation, the reward rule is that the closer the drone's hovering height, $h$, is to the set hovering height (5 m), the greater the reward obtained. The target range is set to 0.5 m: if the hovering height is within 0.5 m of the target height, a reward value of 10 is obtained; if it leaves the target range, a penalty value of −20 is obtained; and if the height difference exceeds 20 m, the negative distance difference is used as the reward. The time-step limit for one episode is $T = 1000$. If the time step exceeds the limit or the drone crashes ($h \le 0$), the current episode is stopped and reset to the next episode. The original intention was to adopt a dynamically supervised intervention behavior strategy, in which the weights of the watcher behavior and the actor behavior gradually change with the training progress. Since our goal is to train a high-quality actor network, we place more emphasis on the actor behavior in the simulations. At the start of training, both behaviors account for 50% each. Over time, the actor behavior gradually becomes dominant with an increasing weight, while the watcher behavior weight gradually decreases and eventually tends to zero. The simulation time step is denoted by $t$ and the total number of simulation training steps by $\Gamma$, with $\Gamma = 1 \times 10^6$. The learning rates are set as $\chi = 0.001$ and $\varsigma = 0.002$, and we assume $\alpha_0 = 0.5$ and $\beta_0 = 1 - \alpha_0$. The weights in Equation (12) are set as follows:
$$\alpha = \alpha_0 + \beta_0 \frac{t}{T}, \qquad \beta = \beta_0 - \beta_0 \frac{t}{T}$$
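A minimal Python sketch of the reward rule and the weight schedule described above is given below; the precedence of the reward thresholds follows our reading of the text, and the function names are placeholders.

```python
# Sketch of the simulation reward rule and the alpha/beta weight schedule above;
# the threshold precedence is our reading of the text, and names are placeholders.
def hover_reward(h, h_target=5.0, tol=0.5):
    diff = abs(h - h_target)
    if diff <= tol:
        return 10.0        # within 0.5 m of the 5 m target
    if diff > 20.0:
        return -diff       # very large deviation: negative distance difference
    return -20.0           # outside the target range

def supervision_weights(t, T=1000, alpha0=0.5):
    beta0 = 1.0 - alpha0
    alpha = alpha0 + beta0 * t / T   # actor weight grows over the episode
    beta = beta0 - beta0 * t / T     # watcher weight decays to zero
    return alpha, beta
```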
During the simulation, the more time steps (iterations) an episode lasts (an episode ends when the hovering time-step limit is reached or the drone crashes), the better the hovering effect; in episodes that reach the specified time-step limit, the drone remains in the hovering state throughout. If a crash occurs, the next episode starts with fewer than 1000 steps completed in the current one. Therefore, for the same total number of training steps, more crashes mean more episodes, fewer time steps per episode, and fewer rewards received. Figure 6 shows the reward values obtained during training with the three algorithms, with the horizontal axis representing the number of episodes. For the same total number of time steps, fewer episodes mean that the hover state lasts longer in each training episode and that the time-step limit is reached more often. Figure 7 shows the number of iterations taken per episode, while Figure 8 shows the total number of iterations. The training episodes of the WAC are reduced by 20% compared to the DDPG and 55% compared to the QAC. It is evident that the WAC algorithm has the fewest episodes, the highest reward values, and the most time steps per episode, followed by the DDPG and the QAC, which confirms the above statement. This indicates that our algorithm experiences significantly fewer crashes during training than the other two; since crashes during learning directly reduce training efficiency and effectiveness, this indirectly indicates that the WAC has the highest training efficiency and effectiveness.
After simulation training, the network models of the three algorithms are separately loaded and flight data collected from the drone hovering state for 1000 iterations for analysis. Related simulation videos can be found in the Supplementary Materials section at the end of the article.
Figure 9a–c show the hovering trajectories of the drone after loading the three network models, respectively. It can be seen that none of the three algorithms have experienced a crash, while the hovering trajectory using the QAC algorithm has significant drift. From the trajectory integrated display in Figure 9d, it can also be seen that the drone under the WAC algorithm framework has the smallest hovering range, the best effect, and the lowest degree of drift. The following is a detailed introduction to the simulation results.
Figure 10a,c,e depict the displacement of the three algorithms along the various coordinate axes. From the graphs, it can be seen that the QAC has an increasing displacement deviation in the X and Y directions (exceeding 4 m along the X axis and 10 m along the Y axis), while the other two algorithms are significantly more stable and hover near the specified point. The hovering altitudes in Figure 10e indicate that the DDPG and WAC have better altitude stability than the QAC, and, compared to the DDPG, the average hovering altitude of the WAC is closer to the target altitude. To demonstrate these deviations more clearly, we plotted a bar chart in Figure 11 comparing the average deviations along each axis under the three algorithms. It is not difficult to see from the figure that the WAC exhibits good hovering characteristics, followed by the DDPG, while the QAC is the most unstable. During drone hovering, the comparison of velocities along the various axes under the different algorithm frameworks in the geodetic (Cartesian) coordinate system is shown in Figure 10b,d,f. From the figures, it can be concluded that the WAC and DDPG are equally stable in hovering flight speed and significantly superior to the QAC. Regarding the height aspect that we are most concerned about, all three algorithms show more stable velocities along the Z axis than along the other two axes, and the Z-axis velocity of the WAC is basically stable at zero.
Since our research focuses on the hovering performance of the drone, and variation of the yaw angle has little effect on hovering stability, we assume the yaw angle is always zero. Figure 12a,b show a comparison of the pitch angle and roll angle changes during hovering with the three algorithms. The changes in pitch angle and roll angle reflect the steadiness of the drone's hovering flight. It can be seen that the WAC shows the smallest variation amplitude and is superior in the steadiness of hovering flight. The above simulation results all indicate that the WAC algorithm applied to drone hover control is more stable, and the hovering performance is greatly improved compared to the DDPG and QAC. Next, we studied the training time cost of the three algorithms.
Figure 13 shows the cumulative training time curves of the various algorithms on the computer. The WAC algorithm has a slightly longer training time than the other two algorithms due to the addition of the target networks and the supervised network. Compared to the QAC, the DDPG also increases the computational cost due to the addition of two backup networks. To quantitatively analyze the computational time cost of each algorithm, Figure 14 compares the computational cost per 10,000 iterations: the training time costs of the three algorithms roughly form an arithmetic sequence, and the growth in computational cost is within an acceptable range. The simulation results above show that the learning efficiency and hovering performance of the WAC algorithm are significantly better than those of the other algorithms; our algorithm does not excessively increase the time cost because of the added supervised network, and it is worth investing a small amount of computational cost to obtain larger learning gains. As the training frequency increases in the later stage and high-quality models are obtained in the early stage, the training results of the three algorithms all improve, but the WAC is undoubtedly the first model to meet the requirements. The three trained models were loaded separately and the optimization time for each step analyzed, as shown in Figure 15. The single-step optimization time of the three algorithms remains below 1 millisecond with no significant difference, indicating that the proposed algorithm does not lag behind in optimization time.
We will deploy the WAC algorithm on a real drone in Section 4 and demonstrate its practical performance.

4. Algorithm Deployment

To verify the performance of our algorithm, we deployed the WAC algorithm on a self-developed quadcopter. The computing platform on the drone is an NVIDIA Jetson Xavier NX, and the flight controller is a Pixhawk 4; the two communicate over Ethernet. Because IMU integration tends to cause attitude drift, which affects the experimental results, and in order to describe the position and attitude of the drone more accurately, we added a visual system in the experiment. The IMU can assist positioning during rapid motion, while the visual positioning information can be used to correct IMU drift; the two are complementary. We adopted a lightweight and drift-free fusion strategy for drone autonomous and safe navigation [37]. The drone is equipped with a front-facing camera, which can create visual geometric constraints and generate 3D environmental maps. The GNSS receiver continuously provides pseudorange, Doppler frequency shift, and UTC time pulse signals to the drone navigation system. The Kanade–Lucas algorithm is used to track multiple visual features in each input image. The position and attitude of the drone are then monitored through the multiple sensors installed on the body. The specific algorithm deployment process is shown in Figure 16. Our deployment environment is a laboratory measuring 10 m × 10 m × 4 m (length × width × height). With the help of the SLAM map already built by the drone, hovering experiments are carried out in this environment. The desired state is obtained as the state input of the WAC strategy through the combination of the inertial, visual, and GNSS modules, forming a cyclic iteration of state-decision-action, and the hover flight verification is completed. The detailed configuration of the deployed drone is shown in Table 1. Related experimental videos can be found in the Supplementary Materials section at the end of the article.
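The state-decision-action cycle just described can be sketched as follows; `fusion.get_state()` and `flight_controller.send_thrust()` are placeholder interfaces for the multi-sensor fusion module and the Pixhawk link, not real APIs.

```python
# Sketch of the deployment-time state-decision-action loop described above;
# fusion.get_state() and flight_controller.send_thrust() are placeholder
# interfaces, not real APIs, and the loop rate is an assumption.
import time

def hover_loop(fusion, flight_controller, actor, hold_time_s=20.0, rate_hz=50.0):
    t_end = time.time() + hold_time_s
    while time.time() < t_end:
        s = fusion.get_state()               # fused IMU/visual/GNSS state estimate
        a = actor(s)                         # trained WAC actor: four propeller commands
        flight_controller.send_thrust(a)     # forwarded to the flight controller
        time.sleep(1.0 / rate_hz)
```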
To verify the stability of the proposed algorithm, we loaded the trained network model into the quadcopter system and conducted three hovering experiments. Due to the height limitation of the experimental environment, the hovering altitude was set to 2 m, and the first 20 s of hovering flight data were selected for analysis. Figure 17a–c show the three hovering flight trajectories, while Figure 17d depicts the hovering trajectories in the same space for macroscopic observation. It can be seen that the three hovering flights of the drone remain around the hovering target point, and the flights proceed well without any crashes. The hovering altitude curves of the three experiments (Figure 18c) also indicate that the hovering flight is very stable, with the hovering altitude maintained around 2 m and a small fluctuation amplitude, which is basically consistent with the simulation results.
Figure 18d–f illustrate the velocity changes along the three coordinate axes of the three sets of experiments in the world coordinate system. The variation patterns of the three sets of flight data are fundamentally consistent: there are noticeable velocity changes along the X and Y axes, but they soon settle to a roughly stable state. The Z-axis velocity that we are concerned about remains largely unchanged, which also contributes to the stability of the drone's hovering and achieves our expected hovering effect. Figure 19a,b show the pitch angles and roll angles of the drone in the three sets of experiments, respectively. The variation range of the two Euler angles (±20°) is within our tolerance. From the above curves, it can be seen that our algorithm has been deployed to a real drone and performs well in all aspects, which is fundamentally consistent with the simulation results.

5. Conclusions

In this article, a new reinforcement learning algorithm (watcher-actor-critic, WAC) is applied to the hover control of drones. After using the actor-critic algorithm based on behavioral value (QAC) and the deep deterministic policy gradient algorithm (DDPG) for drone hover control learning, we proposed a supervised deep deterministic policy gradient algorithm that can quickly lock the exploration direction. Using a PID controller with parameters provided by a neural network as a dynamic monitor, the supervision intensity can be changed as the policy learning progresses, ensuring that the exploration capability of the policy network is maximally maintained. Finally, a simulation was conducted using a classic reinforcement learning environment library, Gym, and a current mainstream reinforcement learning framework, PARL. The training results show that the training episodes of the WAC are reduced by 20% compared to the DDPG and 55% compared to the QAC, after which the algorithm was deployed to a practical environment. A multi-sensor fusion strategy-based autonomous localization method for unmanned aerial vehicles was used for the practical exercises. The simulation and experimental results show that the reinforcement learning algorithm designed in this paper has a higher learning efficiency, faster convergence speed, and smoother hovering effect than the QAC and DDPG without consuming too much computational cost or optimization time.
While this article proposes an improved algorithm, there are inevitably some aspects that need improvement or could make the proposed algorithm better, such as the following: (1) the noticeable drift during hovering needs improvement; (2) the fluctuation of the Euler angles indicates that the flight was not smooth enough and needs to be strengthened; (3) the PID controller in the watcher could be replaced by other, superior traditional controllers to achieve better hovering performance. We will focus on addressing these areas in future research.

Supplementary Materials

The following supporting information can be downloaded at: https://github.com/keepfoolisher/drone_hover_control/releases/tag/s1, accessed on 28 January 2024. Video S1: Simulation and physical deployment experiments video.

Author Contributions

Conceptualization, J.W. and Z.Y.; methodology, J.W.; software, J.W. and H.Z.; validation, N.H., L.L. and Z.W.; formal analysis, J.W.; investigation, H.Z., C.Z. and J.W.; resources, Z.Y.; data curation, J.W.; writing—original draft preparation, J.W.; writing—review and editing, J.W., C.X. and Z.Y.; visualization, J.W., C.X. and C.Z.; supervision, Z.Y.; project administration, Z.Y.; funding acquisition, Z.Y. All authors have read and agreed to the published version of the manuscript.

Funding

This research was funded in part by the Guangxi Power Grid Company’s 2023 Science and Technology Innovation Project under Grant GXKJXM20230169 and in part by the Guizhou Provincial Science and Technology Projects, grant number Guizhou-Sci-Co-Supp [2020]2Y044. The authors fully appreciate their financial support.

Data Availability Statement

The data supporting the findings of this study are available within the Supplementary Materials.

Acknowledgments

The authors would like to thank the reviewers for their constructive comments and suggestions that helped improve this paper.

Conflicts of Interest

Author Haoze Zhuo was employed by the company Electric Power Research Institute of Guangxi Power Grid Co., Ltd. The remaining authors declare that the research was conducted in the absence of any commercial or financial relationships that could be construed as a potential conflict of interest.

References

  1. Ding, X.; Guo, P.; Kun, X.U.; Yushu, Y.U. A review of aerial manipulation of small-scale rotorcraft unmanned robotic systems. Chin. J. Aeronaut. 2019, 32, 15. [Google Scholar] [CrossRef]
  2. Seo, J.; Duque, L.; Wacker, J. Drone-enabled bridge inspection methodology and application. Autom. Constr. 2018, 94, 112–126. [Google Scholar] [CrossRef]
  3. Sebastian, D.O.; Irene, M.; Klaus, P.; Johannes, R. Unmanned Aerial Vehicle (UAV) for Monitoring Soil Erosion in Morocco. Remote Sens. 2012, 4, 3390–3416. [Google Scholar]
  4. Liao, L.; Yang, Z.; Wang, C.; Xu, C.; Xu, H.; Wang, Z.; Zhang, Q. Flight Control Method of Aerial Robot for Tree Obstacle Clearing with Hanging Telescopic Cutter. Control Theory Appl. 2023, 40, 343–352. [Google Scholar]
  5. Xu, W.; Wu, X.; Liu, J.; Yan, Y. Design of anti-load perturbation flight trajectory stability controller for agricultural UAV. Front. Plant Sci. 2023, 14, 1030203. [Google Scholar]
  6. Villa, D.K.D.; Brandão, A.S.; Sarcinelli-Filho, M. A Survey on Load Transportation Using Multirotor UAVs. J. Intell. Robot. Syst. 2020, 98, 267–296. [Google Scholar] [CrossRef]
  7. Kang, K.; Prasad, J.V.R.; Johnson, E. Active Control of a UAV Helicopter with a Slung Load for Precision Airborne Cargo Delivery. Unmanned Syst. 2016, 4, 213–226. [Google Scholar] [CrossRef]
  8. Shehzad, M.F.; Bilal, A.; Ahmad, H. Position & Attitude Control of an Aerial Robot Quadrotor with Intelligent PID and State feedback LQR Controller: A Comparative Approach. In Proceedings of the 2019 16th International Bhurban Conference on Applied Sciences and Technology (IBCAST), Islamabad, Pakistan, 8–12 January 2019. [Google Scholar]
  9. López, J.; Dormido, R.; Dormido, S.; Gómez, J.P. A Robust H∞ Controller for an UAV Flight Control System. Sci. World J. 2015, 2015, 403236. [Google Scholar] [CrossRef]
  10. Rahimi, M.R.; Hajighasemi, S.; Sanaei, D. Designing and simulation for vertical moving control of uav system using pid, lqr and fuzzy logic. Int. J. Electr. Comput. Eng. 2013, 3, 651–659. [Google Scholar]
  11. Ahmed, B.; Pota, H.R.; Garratt, M. Flight control of a rotary wing UAV using backstepping. Int. J. Robust Nonlinear Control 2010, 20, 639–658. [Google Scholar] [CrossRef]
  12. Xu, C.; Yang, Z.; Jiang, Y.; Xu, H.; Zhou, D.; Liao, L.; Zhang, Q. Active disturbance rejection control of a novel multi-rotor aerial robot with a manipulator. Control Theory Appl. 2022, 39, 581–592. [Google Scholar]
  13. Kang, B.; Miao, Y.; Liu, F.; Duan, J.; Wang, K.; Jiang, S. A Second-Order Sliding Mode Controller of Quad-Rotor UAV Based on PID Sliding Mode Surface with Unbalanced Load. J. Syst. Sci. Complex. 2021, 34, 520–536. [Google Scholar] [CrossRef]
  14. Irmawan, E.; Eko Prasetiyo, E. Kendali Adaptif Neuro Fuzzy PID untuk Kestabilan Terbang Fixed Wing UAV (Adaptive Control of Neuro Fuzzy PID for Fixed Wing UAV Flight Stability). J. Nas. Tek. Elektro Dan Teknol. Inf. 2020, 9, 73–78. [Google Scholar] [CrossRef]
  15. Yue, Q.; Baitong, Z.; Chao, C. Self-learning method of uav track planning strategy in complex environment with multiple constraints. Comput. Eng. 2021, 47, 44–51. [Google Scholar]
  16. Hwangbo, J.; Sa, I.; Siegwart, R.; Hutter, M. Control of a Quadrotor With Reinforcement Learning. IEEE Robot. Autom. Lett. 2017, 2, 2096–2103. [Google Scholar] [CrossRef]
  17. Zhang, T.; Kahn, G.; Levine, S.; Abbeel, P. Learning deep control policies for autonomous aerial vehicles with mpc-guided policy search. In Proceedings of the 2016 IEEE International Conference on Robotics and Automation (ICRA), Stockholm, Sweden, 16–21 May 2016; Volume 2016, pp. 528–535. [Google Scholar]
  18. He, N.; Yang, Z.; Fan, X.; Wu, J.; Sui, Y.; Zhang, Q. A Self-Adaptive Double Q-Backstepping Trajectory Tracking Control Approach Based on Reinforcement Learning for Mobile Robots. Actuators 2023, 12, 326. [Google Scholar] [CrossRef]
  19. Yeh, Y.L.; Yang, P.K. Design and Comparison of Reinforcement-Learning-Based Time-Varying PID Controllers with Gain-Scheduled Actions. Machines 2021, 9, 319. [Google Scholar] [CrossRef]
  20. Wang, M.; Zeng, B.; Wang, Q. Research on Motion Planning Based on Flocking Control and Reinforcement Learning for Multi-Robot Systems. Machines 2021, 9, 77. [Google Scholar] [CrossRef]
  21. Kaufmann, E.; Bauersfeld, L.; Loquercio, A.; Müller, M.; Koltun, V.; Scaramuzza, D. Champion-level drone racing using deep reinforcement learning. Nature 2023, 620, 982–987. [Google Scholar] [CrossRef]
  22. Mnih, V.; Kavukcuoglu, K.; Silver, D.; Graves, A.; Antonoglou, I.; Wierstra, D.; Riedmiller, M. Playing Atari with Deep Reinforcement Learning. arXiv 2013, arXiv:1312.5602. [Google Scholar] [CrossRef]
  23. Mnih, V.; Kavukcuoglu, K.; Silver, D.; Rusu, A.A.; Veness, J.; Bellemare, M.G.; Graves, A.; Riedmiller, M.; Fidjeland, A.K.; Ostrovski, G.; et al. Human-level control through deep reinforcement learning. Nature 2015, 518, 529. [Google Scholar] [CrossRef]
  24. Wu, Q.; Geng, Z.; Ren, Y.; Feng, Q.; Zhong, J. Multi-UAV Redeployment Optimization Based on Multi-Agent Deep Reinforcement Learning Oriented to Swarm Performance Restoration. Sensors 2023, 23, 9484. [Google Scholar] [CrossRef]
  25. Hasselt, H.V.; Guez, A.; Silver, D. Deep reinforcement learning with double q-learning. arXiv 2015, arXiv:1509.06461. [Google Scholar] [CrossRef]
  26. Li, B.; Huang, J.; Wan, K.; Song, C. A review of research on the application of UAV system based on deep reinforcement learning. Tactical Missile Technol. 2023, 1, 58–68. [Google Scholar]
  27. Yang, Z.; Li, B.; Gan, Z.; Liang, S. Route Following of Quadrotor UAV Based on Deep Reinforcement Learning. J. Command. Control 2022, 8, 477–482. [Google Scholar]
  28. Kong, W.; Zhou, D.; Yang, Z.; Zhao, Y.; Zhang, K. UAV Autonomous Aerial Combat Maneuver Strategy Generation with Observation Error Based on State-Adversarial Deep Deterministic Policy Gradient and Inverse Reinforcement Learning. Electronics 2020, 9, 1121. [Google Scholar] [CrossRef]
  29. Qi, H.; Hu, Z.; Huang, H.; Wen, X.; Lu, Z. Energy Efficient 3-D UAV Control for Persistent Communication Service and Fairness: A Deep Reinforcement Learning Approach. IEEE Access 2020, 36, 53172–53184. [Google Scholar] [CrossRef]
  30. Wu, J.; Yang, Z.; Liao, L.; He, N.; Wang, Z.; Wang, C. A State-Compensated Deep Deterministic Policy Gradient Algorithm for UAV Trajectory Tracking. Machines 2022, 10, 496. [Google Scholar] [CrossRef]
  31. Lee, M.H.; Moon, J. Deep reinforcement learning-based model-free path planning and collision avoidance for UAVs: A soft actor-critic with hindsight experience replay approach. ICT Express 2023, 9, 403–408. [Google Scholar] [CrossRef]
  32. Li, Z.; He, J.; Ma, H.; Huang, G.; Li, J. Research on motor speed control algorithm of UAV based on reinforcement learning. In Proceedings of the International Conference on Advanced Mechanical, Electronic and Electrical Engineering (ICAMEE 2022), Qingdao, China, 26–28 August 2022. [Google Scholar] [CrossRef]
  33. Zhenyu, J.; Zilong, L. Quadrotor Attitude Control Algorithm Based on Reinforcement Learning. J. Chin. Comput. Syst. 2021, 42, 2074–2078. [Google Scholar]
  34. Lillicrap, T.P.; Hunt, J.J.; Pritzel, A.; Heess, N.; Erez, T.; Tassa, Y.; Silver, D.; Wierstra, D. Continuous control with deep reinforcement learning. arXiv 2015, arXiv:1509.02971. [Google Scholar]
  35. Peters, J.; Schaal, S. Policy Gradient Methods for Robotics. In Proceedings of the 2006 IEEE/RSJ International Conference on Intelligent Robots and Systems, Beijing, China, 9–15 October 2006; pp. 2219–2225. [Google Scholar]
  36. Silver, D.; Lever, G.; Heess, N.; Degris, T.; Wierstra, D.; Riedmiller, M. Deterministic Policy Gradient Algorithms. In Proceedings of the 31st International Conference on Machine Learning (ICML-14), Beijing, China, 21–26 June 2014; pp. 387–395. [Google Scholar]
  37. Zhang, C.; Yang, Z.; Zhuo, H.; Liao, L.; Yang, X.; Zhu, T.; Li, G. A Lightweight and Drift-Free Fusion Strategy for Drone Autonomous and Safe Navigation. Drones 2023, 7, 34. [Google Scholar] [CrossRef]
Figure 1. WAC control policy.
Figure 2. Schematic diagram of DDPG algorithm.
Figure 3. Schematic diagram of WAC algorithm.
Figure 4. Actor-critic network architecture settings in simulation.
Figure 5. Neural network structure in watcher.
Figure 6. Training episode rewards of three algorithms.
Figure 7. Iterations taken per episode.
Figure 8. Accumulation of all iterations for three algorithms.
Figure 9. Simulation trajectories: (a) QAC algorithm, (b) DDPG algorithm, (c) WAC algorithm, (d) integrated display.
Figure 10. Comparison of simulation results of three algorithms: (a) X-axis displacement, (b) X-axis velocity, (c) Y-axis displacement, (d) Y-axis velocity, (e) Altitude, (f) Z-axis velocity.
Figure 11. Average deviation of three algorithms.
Figure 12. Comparison of simulation results of three algorithms: (a) pitch angles, (b) roll angles.
Figure 13. Training time cost for the total iterations.
Figure 14. Training time cost per 10,000 iterations.
Figure 15. Optimization running time for each step.
Figure 16. Algorithm deployment process.
Figure 17. Hover experiment trajectories: (a) Expt.1, (b) Expt.2, (c) Expt.3, (d) integrated display.
Figure 18. Results of experiments: (a) X-axis displacement, (b) Y-axis displacement, (c) altitude, (d) Expt.1 velocity, (e) Expt.2 velocity, (f) Expt.3 velocity.
Figure 19. Results of experiments: (a) pitch angles, (b) roll angles.
Table 1. Deployed drone configuration.

Module   | Sensor        | Type
-------- | ------------- | ----------------
Visual   | Depth Camera  | Intel D435i
Inertia  | IMU           | MPU-9250
GNSS     | Antenna       | BenTian-3070
GNSS     | Receiver      | u-blox ZED-F9P
