Article

Reinforcement Learning-Based Energy-Saving Path Planning for UAVs in Turbulent Wind

Electric Power Science Research Institute of Guangxi Power Grid Co., Ltd., Nanning 530023, China
* Author to whom correspondence should be addressed.
Electronics 2024, 13(16), 3190; https://doi.org/10.3390/electronics13163190
Submission received: 10 July 2024 / Revised: 8 August 2024 / Accepted: 9 August 2024 / Published: 12 August 2024
(This article belongs to the Special Issue Intelligent Mobile Robotic Systems: Decision, Planning and Control)

Abstract

The unmanned aerial vehicle (UAV) is widely used in power inspection. However, limited battery life, turbulent wind, and the UAV's own motion pose challenges to such missions. To address these problems, a reinforcement learning-based energy-saving path-planning algorithm (ESPP-RL) for a turbulent wind environment is proposed. The algorithm dynamically adjusts the UAV's flight strategy based on reinforcement learning to find the most energy-saving flight path, so that the UAV can navigate under real-world constraints while saving energy. Firstly, an observation processing module is designed to combine battery energy consumption prediction with multi-target path planning. Then, the multi-target path-planning problem is decomposed into iteratively and dynamically optimized single-target subproblems, from which the optimal discrete path solution for the predicted energy consumption is derived. Additionally, an adaptive path-planning reward function based on reinforcement learning is designed. Finally, a simulation scenario for a quadcopter UAV is set up in a 3-D turbulent wind environment. Several simulations show that the proposed algorithm can effectively resist the disturbance of turbulent wind and improve convergence.

1. Introduction

1.1. Background

In recent years, unmanned aerial vehicle (UAV) technology has gradually matured, demonstrating diverse potential and gaining wide attention in engineering research and market applications [1,2]. The endurance time and path planning of UAVs are crucial for information acquisition and task execution. In practical applications, such as power inspection, aerial photography, and logistics, UAVs need sufficient flight range and endurance time to complete their planned tasks [3]. However, the endurance issue has always been a major factor restricting the development of UAV technology research and market applications. Therefore, how to effectively improve the endurance time of UAVs has become an urgent problem to be solved in the industry.
UAVs can be classified into several common types based on their mechanical structures, including multi-rotor drones, fixed-wing drones, and unmanned helicopters. Multi-rotor drones and unmanned helicopters are characterized by a flexible attitude, the capability to ascend and descend vertically, the ability to hover freely, and low requirements on the environment and the take-off and landing site [4]. However, they have some limitations in terms of load capacity, flight speed, and endurance. Therefore, understanding how the power consumption of drones changes with different flight states becomes crucial for UAV energy-saving path planning.
Traditional battery energy consumption management systems cannot be directly applied to drones because the practical application scenarios differ; for example, the UAV has different motion modes and thus different energy consumption during autonomous landing and horizontal flight. Furthermore, the influence of three-dimensional (3-D) turbulent wind during flight makes traditional battery energy consumption management systems even less applicable to the UAV. At present, researchers have established battery energy consumption management systems for fixed-wing UAVs and single-rotor UAVs or provided heuristic battery energy consumption management solutions for multi-rotor UAVs [5]. The power consumption model of the UAV has been extensively studied by domestic and foreign scholars [6] and can be divided into three main categories. ① The data-driven heuristic battery energy consumption management model. In [7], a quadcopter battery energy consumption management model based on a data-driven approach is proposed. By collecting information such as flight speed and heading angle, the algorithm infers the battery energy-loss model under flight conditions and then designs an energy consumption management model. However, this algorithm fails to consider designing the battery energy consumption management model from the perspective of task optimization, such as path-planning tasks. Similarly, for a UAV, a power consumption model involving its speed and working conditions is presented in [8] based on experimental results. However, the power consumption models presented in [8,9] are solely related to speed, overlooking the impact of other crucial factors on the power consumption of UAVs, such as the number of rotors and the payload weight. Therefore, the lack of more accurate closed-form expressions limits their applications. In addition, by analyzing battery performance, some power consumption models for electric UAVs have been derived in [5,10,11], and the aforementioned shortcomings have been identified. Therefore, in order to theoretically investigate the power consumption of UAVs and improve the applicability of the obtained models, theoretical power consumption models based on kinematics and aircraft theory are considered. ② The battery energy consumption management model based on UAV kinematics theory. In [9], the authors propose a general fixed-wing energy-loss model based on speed and acceleration from the perspective of UAV kinematics theory, but this model is not suitable for rotor drones. In [12], a closed-form power consumption model is derived for a single-rotor UAV under constant-speed 1-D forward flight, which is used to study energy-efficient UAV communications. This model is then extended in [13,14] to derive analytical models for a single-rotor UAV performing 2-D forward flight. Furthermore, the model in [12] was validated in [15] by fitting measured data collected from extensive UAV experiments. Additionally, a power consumption model for a single-rotor UAV during vertical flight was derived in [14]. All these works provide valuable guidance for establishing UAV power consumption models. However, considering that multi-rotor UAVs are the most popular type of UAV today, the existing works [5,7,8,10] only provide heuristic power consumption models without rigorous mathematical derivations, which limits their application in many research areas.
Therefore, it becomes necessary to derive a theoretical power consumption model for multi-rotor UAVs. Furthermore, previous studies [9,13,14] only considered the flight status of UAVs in 1-D or 2-D scenarios, overlooking the general 3-D scenario. This prompts the need to investigate corresponding power consumption characteristics in order to study the power consumption behavior of UAVs in 3-D scenarios. ③ Battery energy consumption management model for multi-rotor drones. The authors in [16] proposed a closed-form mathematical expression model for battery energy loss in 1-D or 2-D real-world flight scenarios, which can be extended to 3-D real-world scenarios. However, this algorithm neglects the research on the closed-loop learning framework between path planning and data collection (factors such as the UAV’s motion model and 3-D turbulent wind), making it difficult to effectively guide the UAV in adaptive path planning under limited energy consumption. Evidently, the battery energy consumption management system for UAVs has evolved from 1-D to 3-D real-world scenarios, extending from a data-collection and data-driven energy management model to a generalizable closed-form mathematical expression for calculating battery energy loss, thereby effectively optimizing energy consumption.
With the advancement of artificial intelligence algorithms, the optimization of existing UAV power consumption model strategies has also become more intelligent. For instance, Hung et al. [17] controlled the UAV's propulsion system using a fuzzy-based equivalent energy consumption minimization strategy and a control strategy based on the optimal operating curve of the engine. Lee et al. [18] and Khayyam et al. [19] designed rule-based power management systems and neural network control structures, respectively. Bongermino et al. [20,21] proposed a real-time iterative algorithm based on dynamic programming. Although Lei et al. [22] analyzed various management strategies such as state machines, fuzzy logic, dynamic programming, and minimum equivalent fuel consumption, their methods are still limited to rule-based energy management strategies, and the shortcomings of such methods have not been fully addressed. In [23], the authors adopted the multiagent deep deterministic policy gradient (MADDPG) algorithm and prioritized experience replay (PER) to optimize the energy consumption and cooperation strategy of a UAV-assisted multiaccess edge computing (MEC) system.
In summary, researchers have devoted significant effort to studying the management strategies for UAV battery energy consumption and have achieved remarkable results. However, both the complex 3-D turbulent wind environmental factors and the changes in the UAV’s own flight status can affect the battery energy consumption management system. Traditional research has rarely combined path-planning targets with known acquisition parameters for autonomous learning.
Reinforcement learning (RL) is an artificial intelligence (AI) technique that describes and solves the problem of an agent achieving specific goals through interaction with the environment by learning to maximize rewards. Therefore, this paper proposes a reinforcement learning-based energy-saving path-planning algorithm (ESPP-RL) for the UAV in turbulent wind. Based on RL, the ESPP-RL algorithm considers the 3-D turbulent wind environment and the UAV's flight movement patterns (including horizontal flight, vertical descent, and vertical ascent), and introduces local observation conditions to enable the policy network to efficiently process observation features of dynamic dimensions, thus making path planning adaptive.

1.2. Motivation and Contribution

In recent years, RL has been successfully applied in the fields of robot planning, control, and navigation, and has been shown to be a powerful tool for solving path-planning problems. Through interaction with the environment, the agent can learn how to make optimal decisions in uncertain environments. However, most existing RL methods still face challenges when dealing with high-dimensional state spaces and dynamic environments. Moreover, ensuring the robustness of the algorithm under various wind intensities while maintaining real-time performance is also a current research hotspot. To this end, a reinforcement learning-based energy-saving path-planning algorithm (ESPP-RL) for a turbulent wind environment is proposed. The main contributions of this paper are as follows:
  • Considering the complex factors of 3-D turbulent wind and the UAV's motion models (including horizontal flight, vertical descent, and vertical ascent), an optimal energy-saving path-planning method for the UAV is proposed on the basis of the RL framework.
  • Utilizing end-to-end training combined with carefully crafted state features and reward functions, an efficient single-policy network is introduced, which not only achieves adaptive path planning but also optimizes the dynamic battery management strategy to tackle challenges under partially observable conditions. The policy network adeptly handles observation features of dynamic dimensions, significantly enhancing the UAV's adaptability in path planning under limited battery capacity and improving energy-saving efficiency.
  • Sets of experiments conducted on the built simulation platform show that, compared to other algorithms, the ESPP-RL algorithm demonstrates superior performance and notable robustness advantages in a complex 3-D turbulent wind environment.

2. System Model and Problem Formulation

2.1. UAV Model

Firstly, a dynamic analysis of the autonomous drone is crucial. In this paper, the dynamics model of a four-rotor UAV with six degrees of freedom is taken as the analysis object. As shown in Figure 1, a geodetic coordinate system and a body coordinate system are set up to describe the motion and forces of the UAV: $O_E X_E Y_E Z_E$ is the geodetic coordinate system and $O_B X_B Y_B Z_B$ is the body coordinate system used to describe the motion of the drone in the 3-D environment.
It is assumed that for small roll and pitch rotations, the angular dynamics are considered to be consistent across frames. For the 3-D environment, the linear position of the UAV in the inertial coordinate system is $d = [x, y, z]^T$, the linear velocity in the body coordinate system is $v = [u, v, w]^T$, the attitude vector is $\Theta = [\varphi, \theta, \psi]^T$, and the angular speed vector is $\Omega$. The kinetics can be simplified into [24,25,26]:
$$\dot{d} = R v$$
$$\dot{\Theta} = \Omega$$
$$f = T_h e_3 - R^{-1} G e_3 + R^{-1} \delta$$
$$G \dot{v} = f g - G\,(\Omega \times v)$$
$$\dot{\Omega} = J^{-1}(\tau - \Omega \times J \Omega)$$
where $G$ is the gravity of the vehicle and $g$ is the acceleration of gravity. The transformation matrix $R$ converts data from the body coordinate system to the inertial coordinate system. The vector $f$ contains all the linear forces acting on the quadrotor, the direction vector is $e_3 = [0, 0, 1]^T$, and $\delta = [\delta_x, \delta_y, \delta_z]^T$ describes all the external disturbances acting on the vehicle. The total thrust $T_h$ is generated by the four motors and can be expressed as $T_h = \sum_{j=1}^{4} T_{h_j}$.
In the hovering state of the four-rotor UAV, it is presumed that all rotors are identical. Since the four-rotor UAV flies in a 3-D scene, its own gravity and the total thrust are balanced, so the thrust on each rotor is equal. To simplify the calculation, the rotor-related parameters (rotor solidity $h$ and rotor disk area $A_s$) are assumed to be the same for every rotor, where $C_T$ is the thrust coefficient and $\rho$ is the air density. The main notations used in this section are summarized in Table 1.
The quadrotor UAV flies horizontally with a constant speed $V_u$ in the 3-D environment, which is decomposed into three speeds along the world coordinate axes, $V_x$, $V_y$, and $V_z$. When the quadrotor UAV is hovering, the energy consumption of each rotor is equal, so the total energy consumption during hovering can be expressed as the sum of the energy consumption of the four rotors. The energy consumption of the $j$th rotor, $P_{hov}^j$, and the total energy consumption during hovering can be formulated as [16]:
$$P_{hov}^j = \left(\frac{G}{4 C_T}\right)^{3/2}\frac{\lambda h}{8\sqrt{\rho A_s}} + (1+k)\left(\frac{G}{4}\right)^{3/2}\frac{1}{\sqrt{2\rho A_s}}$$
$$\sum_{j=1}^{4} P_{hov}^j = 4\times\left[\left(\frac{G}{4 C_T}\right)^{3/2}\frac{\lambda h}{8\sqrt{\rho A_s}} + (1+k)\left(\frac{G}{4}\right)^{3/2}\frac{1}{\sqrt{2\rho A_s}}\right] = \underbrace{\left(\frac{G}{C_T}\right)^{3/2}\frac{\lambda h}{16\sqrt{\rho A_s}}}_{P_{bl}} + \underbrace{(1+k)\frac{G^{3/2}}{2\sqrt{2\rho A_s}}}_{P_{in}}$$
where k represents the incremental correction factor to induced power. The total energy consumption during hovering can be composed of two parts: one is the rotor profile power P b l and the other is the induced power P i n .
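For illustration, the hover-power expression above can be evaluated numerically with the Table 1 parameters. The short Python sketch below does this; it relies on the formula as reconstructed here (the split into $P_{bl}$ and $P_{in}$), so the printed values should be treated as indicative rather than as results reported by the authors.

```python
# A numeric check of the hover-power expression above, evaluated with the Table 1
# parameters. It relies on the formula as reconstructed here, so the printed values
# are indicative only.
import math

rho, h, A_s, C_T = 1.168, 0.045, 0.214, 0.001195
k, lam, G = 0.11, 0.011, 20.0

P_bl = (G / C_T) ** 1.5 * lam * h / (16.0 * math.sqrt(rho * A_s))    # blade profile power
P_in = (1.0 + k) * G ** 1.5 / (2.0 * math.sqrt(2.0 * rho * A_s))     # induced power
print(f"P_bl = {P_bl:.1f} W, P_in = {P_in:.1f} W, P_hov = {P_bl + P_in:.1f} W")
```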
When a quadrotor UAV flies horizontally with a constant speed, its energy consumption can be expressed as [16]:
$$\Delta P_h(V_h) = \frac{3}{4}\,\delta\sqrt{\frac{G\rho A_s}{C_T}}\,h\,V_h^2 + P_{in}\left[\left(\sqrt{1+\frac{V_h^4}{4 v_0^4}} - \frac{V_h^2}{2 v_0^2}\right)^{1/2} - 1\right] + 4\rho\gamma_h V_h^3$$
When a quadcopter UAV is flying vertically, there are two situations: vertical ascent and vertical descent. In both cases, the problem of velocity direction and thrust direction is dealt with by the sgn function. Therefore, in vertical flight, due to the influence of fuselage resistance, the energy loss is [16]:
$$\Delta P_v(V_v) = \frac{1}{2}\left[G V_v + 2\,\mathrm{sgn}(V_v)\,\rho\gamma_v V_v^3 + \big(G + 2\,\mathrm{sgn}(V_v)\,\rho\gamma_v V_v^2\big)\sqrt{\frac{\big(1+2\gamma_v\,\mathrm{sgn}(V_v)\big)V_v^2}{\rho A_s} + \frac{G}{2\rho A_s}}\right] + \big(\mathrm{sgn}(V_v)-1\big)\frac{G}{2}\sqrt{\frac{G}{2\rho A_s}}$$

2.2. Turbulent Wind Model

In changeable and challenging 3-D space, the flight trajectory and attitude of autonomous drones are not only limited by their internal dynamics model but also by various factors in the surrounding environment. In particular, the wind field plays a crucial role in the course control and stability of the UAV. The following content will explore the theoretical basis of constructing a turbulent wind model in depth. In a turbulent wind scenario, the motion of a quadrotor UAV is affected by wind resistance. The turbulent wind disturbance model is noted as follows [27]:
$$\delta_{tur} = \frac{1}{2}\rho B_{tur} A_s\big(\dot{d} - v_{\omega}^{tur}\big)^2\,\mathrm{sign}\big(\dot{d} - v_{\omega}^{tur}\big)$$
where $B_{tur} = [x_{tur}, y_{tur}, z_{tur}]^T$ represents the drag coefficient and $v_{\omega}^{tur}$ is the turbulent wind speed in the body coordinate system.

2.3. Reinforcement Learning Model

Reinforcement learning algorithms acquire rewards in the environment through continuous trial and error, guiding the training and optimization direction of the algorithms based on the reward values, ultimately enabling the RL algorithms to prioritize actions or paths that maximize the reward values. RL is a type of learning approach that acquires corresponding reward values by observing the environment and taking actions and ensures the convergence of the algorithm through maximizing reward expectations.
As shown in Figure 2, RL is generally described by a Markov process, which consists of components such as the state $s$, action $a$, policy $\pi$, and reward $r$, and achieves state transitions in a probabilistic manner based on the state transition function. At each time step $t$, the agent acquires the current state information $s_t$ from the environment; based on the state $s_t$ and the policy $\pi$, the agent selects an action $a_t$ from the action space. According to the state transition function $p(s_{t+1}\,|\,s_t, a_t)$, the environment then returns a reward value $r$ and the state for the next time step $s_{t+1}$. Ultimately, the policy $\pi$ is optimized by maximizing the expected reward.
A Markov decision process (MDP) is the foundation of RL algorithms, describing the state transition probabilities and process after a decision is made [28]. The MDP is given by the tuple $(S, A, p, r)$, including the state space $S$, action space $A$, state transition probability $p$, and reward value $r$. Among them, the state transition probability, denoted as $p: S \times S \times A \to [0, \infty)$, represents the probability density of transitioning to the next state when an action $a_t \in A$ is taken in the current state $s_t \in S$. For every completed state transition, the environment generates an immediate reward value $r$ based on the state and action, and the range of reward values can be expressed as $r: S \times A \to [r_{\min}, r_{\max}]$.

3. Algorithm

Based on a UAV motion model and 3-D turbulent wind model, this paper presents the development of a UAV’s motion simulation platform and introduces a UAV path-planning scheme ESPP-RL. The specific technical content is as follows.

3.1. Turbulent Wind Environment

To increase the realism of the simulation environment and enhance the practical deployability of the algorithm on a UAV, this paper constructs a turbulent wind environment based on PyBullet. As a physics simulation library, PyBullet integrates rich physical and mathematical constraints, enabling complex real-time physical calculations and rendering. It has been widely applied in the fields of robot motion simulation and deep learning. Therefore, PyBullet serves as the foundational tool for environment construction, catering to the dynamic environmental computation needs of UAVs during complex task execution.
A continuous 3-D environment space of dimensions 100 × 100 × 100 is constructed within PyBullet, and a 3-D turbulent wind environment is then simulated within it. The turbulent wind data are generated using Formula (10), and interpolation is applied to these data so that a corresponding turbulent wind strength $I(u_I, v_I, w_I)$ is assigned to every position in the spatial domain, where $u_I$, $v_I$, and $w_I$ represent the linear velocities of the turbulent wind along the x-, y-, and z-axes, respectively.
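As an illustration of this construction, the following Python sketch builds a coarse 3-D wind grid and interpolates it so that a wind vector $I(u_I, v_I, w_I)$ can be queried at any continuous position. The use of SciPy's RegularGridInterpolator, the grid resolution, and the random placeholder gusts are assumptions of the sketch, not the authors' implementation.

```python
# A sketch (an assumption, not the authors' implementation) of assigning a turbulent
# wind vector I = (u_I, v_I, w_I) to any continuous position of the 100 x 100 x 100
# space by interpolating a coarse wind grid.
import numpy as np
from scipy.interpolate import RegularGridInterpolator

ENV_SIZE, GRID_N = 100.0, 21                 # illustrative grid resolution
axes = [np.linspace(0.0, ENV_SIZE, GRID_N)] * 3

# Placeholder turbulence samples on the coarse grid; in the paper they come from the
# turbulence model of Section 2.2, here random gusts stand in for them.
rng = np.random.default_rng(0)
wind_grid = rng.normal(0.0, 2.0, size=(GRID_N, GRID_N, GRID_N, 3))

# One trilinear interpolator per wind component (u_I, v_I, w_I).
interps = [RegularGridInterpolator(axes, wind_grid[..., k], bounds_error=False, fill_value=0.0)
           for k in range(3)]

def wind_at(position):
    """Return the interpolated wind velocity (u_I, v_I, w_I) at a 3-D position."""
    p = np.asarray(position, dtype=float).reshape(1, 3)
    return np.array([float(f(p)[0]) for f in interps])

print(wind_at([12.5, 40.0, 73.2]))
```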

3.2. State Transition Function

For a quadcopter UAV, this paper sets the action space to four dimensions, $a = (\Omega_u, \Omega_v, \Omega_w, T_a)$, including the angular velocity outputs $\Omega_u$, $\Omega_v$, $\Omega_w$ about the x-, y-, and z-axes in the body coordinate system and the thrust $T_a$. The thrust component on each axis is expressed as:
$$[T_u, T_v, T_w]^T = [\Omega_u, \Omega_v, \Omega_w]^T\, T_a$$
$$m\,[a_u, a_v, a_w]^T = [T_u, T_v, T_w]^T - [G, G, G]^T$$
where $a_u$, $a_v$, $a_w$ represent the accelerations of the UAV produced by the thrust $[T_u, T_v, T_w]$ along the three body axes. In the context of environmental observation, the state space of the environment is assumed to be partially observable, and the state acquired by the UAV at a given time step is defined as:
$$S = [\,d,\ v,\ \Theta,\ \Omega,\ p_{tar},\ I\,]$$
which includes the linear position $d$, linear velocity $v$, attitude angle $\Theta$, angular velocity $\Omega$, and target position information $p_{tar}$ of the UAV. It also includes the turbulent wind strength $I(u_I, v_I, w_I)$ at the current location. The state transition is influenced by the UAV state transition equation, the algorithm's action output, and the turbulent wind interference. Upon the execution of an action, the subsequent position $d' = [x', y', z']^T$ of the UAV is determined by a confluence of factors: the UAV's current velocity $[v_u, v_v, v_w]^T$, the acceleration generated by the algorithmic output, and the turbulent wind strength.
$$\begin{bmatrix} x' \\ y' \\ z' \end{bmatrix} = \begin{bmatrix} x \\ y \\ z \end{bmatrix} + \begin{bmatrix} v_u + a_u \\ v_v + a_v \\ v_w + a_w \end{bmatrix} + \begin{bmatrix} u_I \\ v_I \\ w_I \end{bmatrix}$$
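A minimal Python sketch of this state transition is given below; the per-axis thrust, the subtraction of the weight, and the addition of the local wind follow the equations above, while the unit time step and the mass implied by the 20 N weight are illustrative assumptions.

```python
# A minimal sketch of the state transition above: next position = current position
# + (velocity + thrust-induced acceleration) + local turbulent wind, over one unit
# time step. The unit step and the mass implied by the 20 N weight are assumptions.
import numpy as np

G_WEIGHT = 20.0                 # UAV weight in Newtons (Table 1)
MASS = G_WEIGHT / 9.81          # implied mass in kg

def step_position(pos, vel, action, wind):
    """One transition; action = (Omega_u, Omega_v, Omega_w, T_a) as in the text."""
    omega, thrust_total = np.asarray(action[:3], dtype=float), float(action[3])
    thrust = omega * thrust_total                  # per-axis thrust [T_u, T_v, T_w]
    accel = (thrust - G_WEIGHT) / MASS             # m*a = T - G on each axis
    return (np.asarray(pos, dtype=float)
            + (np.asarray(vel, dtype=float) + accel)
            + np.asarray(wind, dtype=float))

next_pos = step_position(pos=[10.0, 10.0, 5.0], vel=[0.5, 0.0, 0.1],
                         action=[0.1, 0.0, 0.9, 25.0], wind=[0.3, -0.2, 0.0])
print(next_pos)
```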

3.3. Reward Function

In reinforcement learning, the purpose of algorithm optimization is to maximize the reward expectation and then affect the gradient updating direction of the neural network. Therefore, the design of a reward function is very important to the RL algorithm, which directly determines the effect of environment-guided algorithm training. To this end, this paper carefully designed the reward function, including the following four items: distance reward r e w d i s , collision reward r e w c o l l i s i o n , energy consumption reward r e w e n e r g y , and goal reward r e w t a r g e t .
In order to improve the efficiency of the algorithm to explore the environment, distance reward is designed to guide the algorithm to approach the target point region. The distance reward r e w d i s can be expressed by the formula:
$$rew_{dis} = -\min_{i}\ DIS(P_{UAV}, P_{tar}^{\,i})\,/\,size_{env},\qquad i = 1, 2, \ldots, n$$
where $DIS(P_{UAV}, P_{tar}^{\,i})$ represents the Euclidean distance between the UAV and the $i$th target and $size_{env}$ corresponds to the environmental dimension. The collision reward helps the algorithm perceive the environmental boundary and obstacles, and its value changes according to the agent's perception, where $r_{perception}$ is the obstacle perception radius of the UAV and $r_{fail}$ is the collision radius. The collision reward $rew_{collision}$ can be expressed by the formula:
$$rew_{collision} = \begin{cases} 0, & DIS(P_{UAV}, P_{obstacle}) > r_{perception} \\ -1/DIS(P_{UAV}, P_{obstacle}), & r_{fail} < DIS(P_{UAV}, P_{obstacle}) < r_{perception} \\ -100, & DIS(P_{UAV}, P_{obstacle}) < r_{fail} \end{cases}$$
In order to make the UAV complete the task more efficiently, this paper designs the energy consumption reward $rew_{energy}$, which is used to evaluate the number of running steps and the action quality of the algorithm. $\Delta h(P_{UAV}, P_{tar})$ is the vertical update component between the UAV and the target position after each step, $\Delta v(P_{UAV}, P_{tar})$ is the corresponding horizontal update component, and $\psi = 1.2$ is a weighting coefficient. The energy consumption reward $rew_{energy}$ can be expressed by the formula:
$$rew_{energy} = -0.1 + \psi\,\Delta h(P_{UAV}, P_{tar}) + \Delta v(P_{UAV}, P_{tar})$$
When the UAV reaches the target point, a larger positive reward will be given, for example, r e w t a r g e t = 100 , and the episode will end. Therefore, the comprehensive reward function is expressed as:
$$rew_{total} = rew_{dis} + rew_{collision} + rew_{energy} + rew_{target}$$
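The composite reward can be summarized by the following Python sketch. It follows the four terms defined above; the perception and collision radii, and the sign conventions recovered from the text (collision penalties negative, target bonus +100), are assumptions of the sketch rather than exact values from the paper.

```python
# A sketch of the composite reward defined in this subsection. The perception and
# collision radii and the exact sign conventions are assumptions consistent with the
# text (collisions end the episode with a large negative reward, targets give +100).
import numpy as np

SIZE_ENV, R_PERCEPTION, R_FAIL, PSI = 100.0, 5.0, 0.5, 1.2

def dis(a, b):
    return float(np.linalg.norm(np.asarray(a, dtype=float) - np.asarray(b, dtype=float)))

def total_reward(p_uav, targets, obstacles, delta_h, delta_v, reached_target):
    rew_dis = -min(dis(p_uav, t) for t in targets) / SIZE_ENV       # closer to a target is better
    rew_collision = 0.0
    for obs in obstacles:
        d = dis(p_uav, obs)
        if d < R_FAIL:                                              # collision: large penalty
            rew_collision = -100.0
            break
        if d < R_PERCEPTION:                                        # penalty grows near obstacles
            rew_collision = min(rew_collision, -1.0 / d)
    rew_energy = -0.1 + PSI * delta_h + delta_v                     # per-step cost + motion terms
    rew_target = 100.0 if reached_target else 0.0
    return rew_dis + rew_collision + rew_energy + rew_target

print(total_reward(p_uav=[10, 10, 5], targets=[[40, 60, 20]], obstacles=[[12, 10, 5]],
                   delta_h=0.2, delta_v=0.8, reached_target=False))
```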

3.4. Reinforcement Learning Algorithm

A significant feature of UAV systems is their limited energy, which requires the path planning of the UAV to be highly flexible. The ability to adapt the battery management, select the movement path, and respond to changes in the wind field can effectively improve the intelligence and adaptability of the algorithm. Traditional path-planning algorithms have limited capability and struggle to deal with the multi-target path-planning problem of a UAV in a dynamic environment. Therefore, a reinforcement learning-based energy-saving path-planning algorithm for a turbulent wind environment is designed. The framework of the proposed ESPP-RL algorithm is shown in Figure 3.
In general, the ESPP-RL algorithm includes an observation processing module, a neural network module, and an action optimizer. This section will explain each module of the ESPP-RL algorithm. Algorithm 1 summarizes the training process.
Firstly, the power consumption of the UAV in 3-D space is expressed as $P_{total}(V_{total}) = P_{hov} + \Delta P_h(V_h) + \Delta P_v(V_v)$, where the total power is decomposed into three components: the hover power $P_{hov}$, the horizontal-flight power increment $\Delta P_h(V_h)$, and the vertical ascent and descent power increment $\Delta P_v(V_v)$. The energy consumed by the UAV moving to the target point is:
$$E = P_{hov}\,(\Delta t_h + \Delta t_v) + \Delta P_h(V_h)\,\Delta t_h + \Delta P_v(V_v)\,\Delta t_v$$
Under multi-target path planning, the position of the UAV changes dynamically and the corresponding expectation of total energy consumption can be expressed as:
$$E_{total} = \min \sum_{i}\Big[ P_{hov}^{\,i}\,(\Delta t_h^i + \Delta t_v^i) + \Delta P_h^i(V_h^i)\,\Delta t_h^i + \Delta P_v^i(V_v^i)\,\Delta t_v^i \Big]$$
$$E_{total} = \min \sum_{i}\Bigg\{ P_{hov}^{\,i}(\Delta t_h^i + \Delta t_v^i) + \left[\frac{3}{4}\,\delta\sqrt{\frac{G\rho A_s}{C_T}}\,h\,(V_h^i)^2 + P_{in}\left(\Big(\sqrt{1+\frac{(V_h^i)^4}{4 v_0^4}} - \frac{(V_h^i)^2}{2 v_0^2}\Big)^{1/2} - 1\right) + 4\rho\gamma_h (V_h^i)^3\right]\Delta t_h^i + \left[\frac{1}{2}\Big(G V_v^i + 2\,\mathrm{sgn}(V_v^i)\,\rho\gamma_v (V_v^i)^3 + \big(G + 2\,\mathrm{sgn}(V_v^i)\,\rho\gamma_v (V_v^i)^2\big)\sqrt{\frac{\big(1+2\gamma_v\,\mathrm{sgn}(V_v^i)\big)(V_v^i)^2}{\rho A_s} + \frac{G}{2\rho A_s}}\Big) + \big(\mathrm{sgn}(V_v^i)-1\big)\frac{G}{2}\sqrt{\frac{G}{2\rho A_s}}\right]\Delta t_v^i \Bigg\}$$
According to the energy consumption expectation $E_{total}$ and the position state transition of the UAV, the observation processing module decomposes the multi-target path-planning problem into dynamic single-target path-planning subproblems, calculates and stores the solutions of the subproblems, and iteratively optimizes them to obtain the optimal discrete-path solution of the energy consumption expectation $E_{total}$. At the same time, the observation processing module concatenates the optimal discrete-path solution with the environment observation and inputs it into the neural network module.
Algorithm 1: ESPP-RL Algorithm
1: Initialize the multi-target path-planning environment for the UAV
2: Initialize the state action value networks $Q_{\theta_1}$, $Q_{\theta_2}$, the target state action value networks $Q_{\theta_1'}$, $Q_{\theta_2'}$, the policy network $\pi_\phi$, the target policy network $\pi_{\phi'}$, and random parameters $\theta_1$, $\theta_2$, $\phi$
3: Initialize the replay buffer $B$
4: Initialize the target network parameters $\theta_1' \leftarrow \theta_1$, $\theta_2' \leftarrow \theta_2$, $\phi' \leftarrow \phi$
5: Obtain information from the observation processing module
6: for $t = 1$ to $T$ do
7:   Select an action with exploration noise $a \sim \pi_\phi(s) + \varepsilon$, $\varepsilon \sim \mathrm{clip}(N(0, \sigma), -c, c)$, and observe the reward $r$ and the new state $s'$
8:   Store the transition tuple $(s, a, r, s')$ in $B$
9:   Sample a mini-batch of $N$ transitions $(s, a, r, s')$ from $B$
10:   $\tilde{a} \leftarrow \pi_{\phi'}(s') + \varepsilon$, $\varepsilon \sim \mathrm{clip}(N(0, \tilde{\sigma}), -c, c)$
11:   $y \leftarrow r + \gamma \min_{i=1,2} Q_{\theta_i'}(s', \tilde{a})$
12:   Update the critics: $\theta_i \leftarrow \arg\min_{\theta_i} N^{-1} \sum (y - Q_{\theta_i}(s, a))^2$
13:   if $t \bmod d == 0$ then
14:     Update $\phi$ by the deterministic policy gradient:
15:       $\nabla_\phi J(\phi) = N^{-1} \sum \nabla_a Q_{\theta_1}(s, a)\big|_{a=\pi_\phi(s)} \nabla_\phi \pi_\phi(s)$
16:     Update the target networks:
17:       $\phi' \leftarrow \tau \phi + (1 - \tau)\phi'$
18:       $\theta_i' \leftarrow \tau \theta_i + (1 - \tau)\theta_i'$
19:   end if
20: end for
21: Input the action to the PID controller
The information from the observation processing module is input into the neural network module, which includes the state action value networks $Q_{\theta_i}(s_t, a_t),\ i=1,2$, the target state action value networks $Q_{\theta_i'}(s_t, a_t),\ i=1,2$, the policy network $\pi_\phi$, and the target policy network $\pi_{\phi'}$, with the corresponding network parameters $\theta_1$, $\theta_2$, $\theta_1'$, $\theta_2'$, $\phi$, and $\phi'$, respectively. The updated observation is fed to the policy network $\pi_\phi$, which seeks the maximum reward expectation and is expressed as:
$$\pi_\phi = \arg\max_{\pi_\phi} \sum_{t} \mathbb{E}_{(s_t, a_t)\sim\rho_{\pi_\phi}}\big[\,r(s_t, a_t)\,\big]$$
The network $\pi_\phi$ selects the action output based on the observation, $a \sim \pi_\phi(s) + \varepsilon$, where the action noise $\varepsilon \sim \mathrm{clip}(N(0, \sigma), -c, c)$ is added to increase the environmental exploration rate. The algorithm interacts with the environment and receives the corresponding reward value $r(s_t, a_t)$ for the action. The state action value networks $Q_{\theta_i}(s_t, a_t),\ i=1,2$ in the neural network module evaluate the action output of the current policy network according to the reward value $r(s_t, a_t)$. They are defined as:
$$Q_{\theta_i}(s_t, a_t) \leftarrow r(s_t, a_t) + \gamma\,\mathbb{E}_{s_{t+1}\sim p}\big[V(s_{t+1})\big],\qquad i = 1, 2$$
$$V(s_t) \leftarrow \mathbb{E}_{a_t\sim\pi_\phi}\big[Q_{\theta_i}(s_t, a_t)\big]$$
where $\gamma$ represents the discount factor, which expresses the impact of the future $n$ steps on the value function of the current action $a_{t+n}$, and $V(s_t)$ is the state value function, which evaluates the expected value of the policy network's action output in the current state. Furthermore, this method introduces the target state action value networks $Q_{\theta_i'}(s_t, a_t),\ i=1,2$ and designs an objective function to stabilize training and reduce the overestimation caused by the influence of the state action value networks on the policy network. The specific objective function $y_{tar}$ is expressed as:
$$y_{tar} = r(s_t, a_t) + \gamma \min_{i=1,2} Q_{\theta_i'}\big(s_{t+1}, \pi_{\phi'}(s_{t+1})\big)$$
The state action value network parameters $\theta_i,\ i=1,2$ are updated according to the objective function $y_{tar}$:
$$\theta_i \leftarrow \arg\min_{\theta_i} N^{-1}\sum \big(y_{tar} - Q_{\theta_i}(s_t, a_t)\big)^2$$
At the same time, the strategy parameters and the target action value network parameters in the algorithm are updated in a delayed way to smooth the training process. Policy parameters ϕ are updated by gradient optimization:
$$\nabla_\phi J(\phi) = N^{-1}\sum \nabla_a Q_{\theta_1}(s, a)\big|_{a=\pi_\phi(s)}\,\nabla_\phi\,\pi_\phi(s)$$
Finally, the target action value network parameters $\theta_i',\ i=1,2$ and the target policy parameters $\phi'$ are updated, where $\eta$ represents the soft update weight:
$$\theta_i' \leftarrow \eta\,\theta_i + (1-\eta)\,\theta_i'$$
$$\phi' \leftarrow \eta\,\phi + (1-\eta)\,\phi'$$
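For concreteness, a compact TensorFlow 2 sketch of this TD3-style update (clipped double-Q targets, delayed policy updates, and soft target updates) is shown below. The network sizes, state and action dimensions, and hyperparameters are illustrative assumptions, and the code is not the authors' implementation.

```python
# A TD3-style update sketch (not the authors' code): clipped double-Q targets,
# delayed policy updates, and soft target updates, following Algorithm 1.
# State/action dimensions, network sizes, and hyperparameters are assumptions.
import tensorflow as tf

STATE_DIM, ACTION_DIM, ACT_LIMIT = 18, 4, 1.0
GAMMA, ETA, POLICY_NOISE, NOISE_CLIP, DELAY = 0.99, 0.005, 0.2, 0.5, 2

def mlp(out_dim, out_act=None):
    return tf.keras.Sequential([
        tf.keras.layers.Dense(256, activation="relu"),
        tf.keras.layers.Dense(256, activation="relu"),
        tf.keras.layers.Dense(out_dim, activation=out_act),
    ])

actor, actor_targ = mlp(ACTION_DIM, "tanh"), mlp(ACTION_DIM, "tanh")
critics, critics_targ = [mlp(1), mlp(1)], [mlp(1), mlp(1)]
pi_opt, q_opt = tf.keras.optimizers.Adam(1e-3), tf.keras.optimizers.Adam(1e-3)

# Build the networks and copy the online weights into the targets (Algorithm 1, lines 2-4).
s0, sa0 = tf.zeros([1, STATE_DIM]), tf.zeros([1, STATE_DIM + ACTION_DIM])
actor(s0); actor_targ(s0); actor_targ.set_weights(actor.get_weights())
for q, qt in zip(critics, critics_targ):
    q(sa0); qt(sa0); qt.set_weights(q.get_weights())

def soft_update(target, source, eta=ETA):
    for wt, ws in zip(target.weights, source.weights):
        wt.assign(eta * ws + (1.0 - eta) * wt)          # theta' <- eta*theta + (1-eta)*theta'

def update(batch, step):
    s, a, r, s2, done = batch                           # tensors of shape [N, ...]
    # Smoothed, clipped target action (Algorithm 1, line 10)
    noise = tf.clip_by_value(POLICY_NOISE * tf.random.normal(tf.shape(a)),
                             -NOISE_CLIP, NOISE_CLIP)
    a2 = tf.clip_by_value(actor_targ(s2) + noise, -ACT_LIMIT, ACT_LIMIT)
    # Clipped double-Q target y (Algorithm 1, line 11)
    q_targ = tf.minimum(*(tf.squeeze(q(tf.concat([s2, a2], -1)), -1) for q in critics_targ))
    y = r + GAMMA * (1.0 - done) * q_targ
    # Critic update (Algorithm 1, line 12)
    with tf.GradientTape() as tape:
        q_loss = sum(tf.reduce_mean((tf.squeeze(q(tf.concat([s, a], -1)), -1) - y) ** 2)
                     for q in critics)
    q_vars = critics[0].trainable_variables + critics[1].trainable_variables
    q_opt.apply_gradients(zip(tape.gradient(q_loss, q_vars), q_vars))
    # Delayed actor and target updates (Algorithm 1, lines 13-18)
    if step % DELAY == 0:
        with tf.GradientTape() as tape:
            pi_loss = -tf.reduce_mean(critics[0](tf.concat([s, actor(s)], -1)))
        pi_opt.apply_gradients(zip(tape.gradient(pi_loss, actor.trainable_variables),
                                   actor.trainable_variables))
        soft_update(actor_targ, actor)
        for q, qt in zip(critics, critics_targ):
            soft_update(qt, q)

# Example call with a random placeholder batch of N = 64 transitions.
N = 64
batch = (tf.random.normal([N, STATE_DIM]), tf.random.uniform([N, ACTION_DIM], -1.0, 1.0),
         tf.random.normal([N]), tf.random.normal([N, STATE_DIM]), tf.zeros([N]))
update(batch, step=2)
```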
In addition, the action output of the neural network module will pass through an action optimizer. The action optimizer is composed of a PID controller, and the action output of the algorithm will be converted into the actual output parameters after the PID controller, in order to smooth the action output and ensure the relative stability of the UAV attitude.
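A minimal sketch of such a PID-based action optimizer is given below; the gains follow Table 2, while the control period and the way the smoothed command is applied are assumptions.

```python
# A minimal sketch of the PID-based action optimizer; the gains follow Table 2,
# while the control period and how the smoothed command is used are assumptions.
class PID:
    def __init__(self, kp=4e-2, ki=5e-7, kd=1e-4):
        self.kp, self.ki, self.kd = kp, ki, kd
        self.integral, self.prev_error = 0.0, 0.0

    def step(self, setpoint, measurement, dt=0.02):
        error = setpoint - measurement
        self.integral += error * dt
        derivative = (error - self.prev_error) / dt
        self.prev_error = error
        return self.kp * error + self.ki * self.integral + self.kd * derivative

# Example: smooth the commanded thrust toward the raw policy output.
pid = PID()
print(pid.step(setpoint=0.8, measurement=0.55))
```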

4. Experiment

4.1. Experimental Setup

In this section, we focus on optimizing the path-planning performance of a four-axis UAV in a 3-D turbulent wind environment while also optimizing the energy consumption, ensuring that path planning is completed. The turbulent wind model is imported into a 3-D continuous space of dimensions 100 × 100 × 100 using the PyBullet physics simulation library. By integrating the motion space model, state transition model, and reward function of the four-axis UAV, a comprehensive RL simulation environment is constructed. The algorithms are implemented on Ubuntu 18.04, with a hardware configuration including an Intel i9-11900K processor and an NVIDIA RTX 3080 graphics card, and software based on Python 3.8, Ray 1.8.0, and TensorFlow 2.3.1. The soft actor-critic (SAC) and twin delayed deep deterministic policy gradient (TD3) algorithms are implemented within the same framework as comparison algorithms. The algorithm parameters are shown in Table 2.
The performance of the algorithms in path planning is quantified through four key metrics, which serve as the basis for evaluating the results:
  • Steps: The UAV selects actions at fixed intervals according to the policy and then moves to the next position according to the state transition function. The number of steps reflects the number of interactions and the time consumed.
  • Path length: This paper records the cumulative position offset of the agent in the environment as the path length. The path length reflects the algorithm's path-planning ability and its ability to resist wind-field interference.
  • Target reach number: The target reach number can accurately reflect the adaptability of the agent to the environment and the quality of the decision. A high goal reach number means that the agent can effectively identify and utilize information in the environment to reach the goal quickly and with minimal cost. Secondly, this index is also an important basis for evaluating the stability of the algorithm.
  • Reward value: The reward value is a core indicator to measure the performance of the agent. The reward value reflects the agent’s performance in the environment, i.e., the total reward it receives after performing a series of actions.

4.2. Analysis of Experimental Results

Sufficient training and experiments have been conducted on the SAC, TD3, and ESPP-RL algorithms. Since training is an oscillating process, appropriate smoothing is applied to better visualize the results. The training process is recorded in Figure 4, where Figure 4a shows the average number of actions during training, Figure 4b corresponds to the reward value of the UAV in an episode, Figure 4c reflects the average distance traveled by the UAV in each episode, and Figure 4d shows the average number of target points that the algorithm reaches in the environment.
In the initial stage of training, the agent randomly selects actions to explore the environment; when it touches the environmental boundary or obstacles, the episode ends and the agent obtains a large negative reward, which makes the average number of steps and the reward value low at the beginning of training. Under the influence of the distance reward and random exploration, the UAV occasionally discovers the target point and obtains a large target reward, and these exploration experiences are stored in the experience replay buffer. The algorithm regularly samples experience from the replay buffer for training and updates its network parameters. It can be seen that the ESPP-RL algorithm benefits from the observation processing module, yielding a faster training speed and a higher cumulative reward value.

4.2.1. Single-Target Path-Planning Environment

To facilitate observation of the algorithm’s action strategy output performance, the environment is refined into a single-target path-planning scenario. The algorithm does not need to assign tasks to multiple objective points and mainly considers the path-planning effect when the target is determined. Specifically, a static target point and multiple obstacles are configured within the environment, requiring the UAV to navigate from a designated starting point, bypass the obstacles, and ultimately reach the target location. The experimental results show that all algorithms can avoid obstacles to reach the target point in the single-target path-planning task.
Based on the algorithmic execution outcomes presented in Figure 5 and Table 3, within the context of single-target path-planning tasks, all algorithms successfully avoided obstacles and reached the target. Notably, the TD3 and ESPP-RL algorithms exhibited comparable results, whereas the SAC algorithm demonstrated lower reward values and required more steps, which is indicative of higher energy consumption. More specifically, despite requiring a comparable number of steps as TD3, the ESPP-RL algorithm achieved higher reward values, attributed to its PID-regulated action output, which facilitated optimal maneuvers, resulting in reduced vertical deviations and consequently higher rewards. Conversely, as depicted in Figure 5a, while the SAC algorithm initially demonstrated comparable performance to other algorithms during the initial stages of path planning, it struggled to reach the target’s decision boundary towards the end due to disturbances caused by turbulent wind, hindering its ability to precisely navigate within the target’s vicinity.

4.2.2. Multi-Target Path-Planning Environment

To evaluate the decision-making capability and path optimization efficiency of the UAV, the number of target points is varied within the multi-target path-planning environment. The training results are shown in Figure 6 and Table 4. In single-target path planning, emphasis is placed on the algorithm’s resilience against disturbances in turbulent wind. Conversely, in multi-target scenarios, the algorithm dynamically selects path nodes based on the locations of the targets, necessitating intricate decision-making.
As depicted in Table 4, the ESPP-RL algorithm surpasses other methods in terms of both step count and reward value when tasked with 3, 4, and 5 target points. This superiority stems from the algorithm’s imposition of energy consumption constraints, which are decomposed via Equation (17) to minimize overall energy expenditure during path planning, thereby yielding superior traversal solutions. In contrast, the SAC algorithm’s action output is significantly perturbed by turbulent wind, hindering its ability to swiftly reach target points, resulting in a higher cumulative step count and lower reward value. While the TD3 algorithm exhibits comparable performance to ESPP-RL in single-target path planning, it notably underperforms in multi-target tasks due to its focus on optimizing individual target paths, neglecting intertarget relationships, leading to locally optimal solutions in the broader context.
The proposed algorithm outputs actions by combining the environment information with the PID controller to ensure more accurate actions. Therefore, the algorithm converges faster under turbulent wind interference and maintains a high path-planning success rate, which reflects its strong adaptability. The proposed algorithm also produces a smoother path; when approaching a target, its strong robustness enables it to reach the target location accurately, reducing the number of steps required and the cumulative distance. The experimental results demonstrate that in a multi-target path-planning environment, the ESPP-RL algorithm efficiently switches between multiple targets and plans short and safe paths, resulting in a reduced accumulated distance. In summary, the ESPP-RL algorithm, with its precise action control, fast convergence, high path-planning success rate, smooth paths, and robustness, demonstrates remarkable superiority in multi-target scenarios.

5. Conclusions

This paper proposes a reinforcement learning-based energy-saving path-planning algorithm for a UAV in complex 3-D turbulent wind. Based on the RL framework, the algorithm takes into account various flight models of the UAV, including vertical descent, vertical ascent, and horizontal flight, while paying particular attention to battery energy limitations. Through end-to-end training methods, the algorithm designs state features and reward functions for the entire mission process, utilizing a single policy network to achieve adaptive path planning and dynamic battery management. An optimization strategy suitable for locally observable conditions is also introduced, enabling the policy network to efficiently process dynamic dimensional observation features, thus improving the adaptability of path planning for drones with limited energy consumption. Experimental simulation results demonstrate that the proposed algorithm outperforms other drone algorithms in terms of robustness, especially in complex 3-D turbulent wind, ensuring that drones can complete tasks in a time-efficient and collision-free manner, further bridging the gap between drone theoretical research and practical applications.
Future research will explore the collaboration of multiple UAVs in complex 3-D turbulent wind to keep the UAVs stable, achieve better energy consumption, and realize autonomous obstacle avoidance and path planning. Additionally, we will strive to improve the versatility of our algorithms to accommodate a wider range of UAV types, ensuring their applicability across diverse platforms and missions, and to verify the proposed control strategies on real UAVs.

Author Contributions

Conceptualization, S.C.; methodology, S.C. and Y.M.; software, Y.M.; validation, S.C., Y.M. and X.W.; formal analysis, X.W.; investigation, Q.L.; data curation, J.X.; writing—original draft preparation, S.C. and X.W.; writing—review and editing, X.W. and J.X.; visualization, Q.L.; supervision, S.C. and Y.M. All authors have read and agreed to the published version of the manuscript.

Funding

This research was funded by the Key Technology Project of China Southern Power Grid Company Limited, specifically under the grant titled “Development and Application of Distribution Network Inspection Drone Based on Electric Field Coupled Wireless Charging Technology”, grant number GXKJXM20222172.

Data Availability Statement

The raw data supporting the conclusions of this article will be made available by the authors on request.

Conflicts of Interest

Shaonan Chen, Yuhong Mo, Xiaorui Wu, Jing Xiao and Quan Liu were employed by Electric Power Science Research Institute of Guangxi Power Grid Co., Ltd. The remaining authors declare that the research was conducted in the absence of any commercial or financial relationships that could be construed as a potential conflict of interest.

References

  1. Zeng, Y.; Zhang, R.; Lim, T.J. Wireless communications with unmanned aerial vehicles: Opportunities and challenges. IEEE Commun. Mag. 2016, 54, 36–42. [Google Scholar] [CrossRef]
  2. Agrawal, N.; Bansal, A.; Singh, K.; Li, C.-P.; Mumtaz, S. Finite block length analysis of RIS-assisted UAV-based multiuser IoT communication system with non-linear EH. IEEE Trans. Commun. 2022, 70, 3542–3557. [Google Scholar] [CrossRef]
  3. Zhang, Y.; Li, J.; Zhang, L.; Zhao, N.; Tang, W.; Wang, R.; Xiong, K. Energy consumption optimal design of power grid inspection trajectory for UAV mobile edge computing node. In Proceedings of the 2021 6th Asia Conference on Power and Electrical Engineering (ACPEE), Chongqing, China, 8–11 April 2021; pp. 1316–1321. [Google Scholar]
  4. Aboudonia, A.; Rashad, R.; El-Badawy, A. Composite hierarchical anti-disturbance control of a quadrotor UAV in the presence of matched and mismatched disturbances. J. Intell. Robot. Syst. 2018, 90, 201–216. [Google Scholar] [CrossRef]
  5. Tseng, C.-M.; Chau, C.-K. Personalized prediction of vehicle energy consumption based on participatory sensing. IEEE Trans. Intell. Transp. Syst. 2017, 18, 3103–3113. [Google Scholar] [CrossRef]
  6. Aharon, I.; Kuperman, A. Topological Overview of Powertrains for Battery-Powered Vehicles with Range Extenders. IEEE Trans. Power Electron. 2011, 26, 868–876. [Google Scholar] [CrossRef]
  7. Ahmed, S.; Mohamed, A.; Harras, K.; Kholief, M.; Mesbah, S. Energy efficient path planning techniques for UAV-based systems with space discretization. In Proceedings of the 2016 IEEE Wireless Communications and Networking Conference, Doha, Qatar, 3–6 April 2016; pp. 1–6. [Google Scholar]
  8. Di Franco, C.; Buttazzo, G. Energy-aware coverage path planning of UAVs. In Proceedings of the 2015 IEEE International Conference on Autonomous Robot Systems and Competitions, Vila Real, Portugal, 8–10 April 2015; pp. 111–117. [Google Scholar]
  9. Zeng, Y.; Zhang, R. Energy-efficient UAV communication with trajectory optimization. IEEE Trans. Wirel. Commun. 2017, 16, 3747–3760. [Google Scholar] [CrossRef]
  10. Chan, C.; Kam, T. A procedure for power consumption estimation of multi-rotor unmanned aerial vehicle. Proc. J. Phys. Conf. Ser. 2020, 1509, 012015. [Google Scholar] [CrossRef]
  11. Abeywickrama, H.V.; Jayawickrama, B.A.; He, Y.; Dutkiewicz, E. Comprehensive energy consumption model for unmanned aerial vehicles, based on empirical studies of battery performance. IEEE Access 2018, 6, 58383–58394. [Google Scholar] [CrossRef]
  12. Zeng, Y.; Xu, J.; Zhang, R. Energy minimization for wireless communication with rotary-wing UAV. IEEE Trans. Wirel. Commun. 2019, 18, 2329–2345. [Google Scholar] [CrossRef]
  13. Yan, H.; Chen, Y.; Yang, S.-H. New energy consumption model for rotary-wing UAV propulsion. IEEE Wirel. Commun. Lett. 2021, 10, 2009–2012. [Google Scholar] [CrossRef]
  14. Yang, Z.; Xu, W.; Shikh-Bahaei, M. Energy efficient UAV communication with energy harvesting. IEEE Trans. Veh. Technol. 2020, 69, 1913–1927. [Google Scholar] [CrossRef]
  15. Gao, N. Energy model for UAV communications: Experimental validation and model generalization. China Commun. 2021, 18, 253–264. [Google Scholar] [CrossRef]
  16. Gong, H.; Huang, B.; Jia, B.; Dai, H. Modelling Power Consumptions for Multi-rotor UAVs. arXiv 2022, arXiv:2209.04128v1. [Google Scholar]
  17. Hung, J.Y.; Gonzalez, L.F. On parallel hybrid-electric propulsion system for unmanned aerial vehicles. Prog. Aerosp. Sci. 2012, 8, 1–17. [Google Scholar] [CrossRef]
  18. Lee, B.; Kwon, S.; Park, P. Active power management system for an unmanned aerial vehicle powered by solar cells, a fuel cell, and batteries. IEEE Trans. Aerosp. Electron. Syst. 2014, 50, 3167–3177. [Google Scholar] [CrossRef]
  19. Khayyam, H.; Bab-Hadiashar, A. Adaptive intelligent energy management system of plug in hybrid electric vehicle. Energy 2014, 69, 319–335. [Google Scholar] [CrossRef]
  20. Bongermino, E.; Mastrorocco, F.; Tomaselli, M.; Monopoli, V.G.; Naso, D. Model and energy management system for a parallel hybrid electric unmanned aerial vehicle. In Proceedings of the 2017 IEEE 26th International Symposium on Industrial Electronics (ISIE), Edinburgh, UK, 19–21 June 2017. [Google Scholar]
  21. Bongermino, E.; Tomaselli, M.; Monopoli, V.G. Hybrid aeronautical propulsion: Control and energy management. IFAC Pap. Online 2017, 50, 169–174. [Google Scholar] [CrossRef]
  22. Lei, T.; Min, Z.; Fu, H. Dynamic balance energy management strategy for hybrid Power Supply of fuel cell UAV. Acta Aeronaut. Sin. 2020, 41, 324048. (In Chinese) [Google Scholar]
  23. Yan, M.; Zhang, L.; Jiang, W.; Chan, C.A.; Gygax, A.F.; Nirmalathas, A. Energy Consumption Modeling and Optimization of UAV-Assisted MEC Networks Using Deep Reinforcement Learning. IEEE Sens. J. 2024, 24, 13629–13639. [Google Scholar] [CrossRef]
  24. Khaghani, M.; Skaloud, J. VDM-based UAV attitude determination in absence of IMU data. In Proceedings of the European Navigation Conference, ENC 2018, Gothenburg, Sweden, 14–17 May 2018; pp. 84–90. [Google Scholar]
  25. Mahony, R.; Hamel, T.; Pflimlin, J.-M. Nonlinear complementary filters on the special orthogonal group. IEEE Trans. Autom. Control 2008, 53, 1203–1217. [Google Scholar] [CrossRef]
  26. Lyu, P.; Lai, J.; Liu, J.; Liu, H.H.T.; Zhang, Q. A thrust model aided fault diagnosis method for the altitude estimation of a quadrotor. IEEE Trans. Aerosp. Electron. Syst. 2018, 54, 1008–1019. [Google Scholar] [CrossRef]
  27. Miranda-Moya, A.; Castañeda, H.; Wang, H. Fixed-Time Extended Observer-Based Adaptive Sliding Mode Control for a Quadrotor UAV under Severe Turbulent Wind. Drones 2023, 7, 700. [Google Scholar] [CrossRef]
  28. Altman, E. Constrained Markov Decision Processes: Stochastic Modeling, 1st ed.; Routledge: New York, NY, USA, 1999. [Google Scholar]
Figure 1. Coordinate systems for a quadrotor UAV.
Figure 2. An agent interacting with its environment.
Figure 3. The framework of the ESPP-RL algorithm.
Figure 4. The training process of the algorithms.
Figure 5. The results of SAC, TD3, and ESPP-RL in a single-target path-planning environment. (a) SAC; (b) TD3; (c) ESPP-RL.
Figure 6. The results of SAC, TD3, and ESPP-RL in a multi-target path-planning environment. On the left is the SAC algorithm, in the middle is the TD3 algorithm, and on the right is the ESPP-RL algorithm. The number of target points from top to bottom is 3, 4, and 5.
Table 1. List of main notations.
Parameter | Meaning | Value
ρ | Air density (kg/m³) | 1.168
h | Rotor solidity | 0.045
A_s | Rotor disk area (m²) | 0.214
C_T | Thrust coefficient | 0.001195
k | Incremental correction factor to induced power | 0.11
γ_v | Fuselage drag coefficient during vertical flight (kg/m) | 0.220168
γ_h | Fuselage drag coefficient in horizontal flight (kg/m) | 0.005256
λ | Profile drag coefficient | 0.011
G | UAV weight (N) | 20
Table 2. Algorithm parameters.
Algorithm | SAC | TD3 | ESPP-RL
Learning rate | 3 × 10^−4 | 1 × 10^−3 | 1 × 10^−3
Total training steps (T) | 1 × 10^6 | 1 × 10^6 | 1 × 10^6
Batch size | 1 × 10^−5 | 2 × 10^−6 | 1 × 10^−5
Update weight (η) | 0.005 | 0.005 | 0.005
Policy noise | 1 | 1 | 1
Discount factor (γ) | 0.99 | 0.99 | 0.99
Delayed update (d) | None | 2 | 2
PID parameters k_p, k_i, k_d | None | None | 4 × 10^−2, 5 × 10^−7, 1 × 10^−4
Table 3. The reward and steps of algorithms under a single-target path-planning environment task.
Metric | SAC | TD3 | ESPP-RL
Reward | 123.96 | 156.86 | 160.32
Steps | 135 | 66 | 68
Table 4. The reward and steps of algorithms under a multi-target path-planning environment task.
Targets | Metric | SAC | TD3 | ESPP-RL
Three-target | Reward | 326.85 | 397.66 | 421.14
Three-target | Steps | 313 | 248 | 222
Four-target | Reward | 407.31 | 487.98 | 528.40
Four-target | Steps | 458 | 339 | 326
Five-target | Reward | 461.73 | 587.59 | 615.77
Five-target | Steps | 494 | 370 | 353
Disclaimer/Publisher’s Note: The statements, opinions and data contained in all publications are solely those of the individual author(s) and contributor(s) and not of MDPI and/or the editor(s). MDPI and/or the editor(s) disclaim responsibility for any injury to people or property resulting from any ideas, methods, instructions or products referred to in the content.
