Article

Energy Scheduling of Hydrogen Hybrid UAV Based on Model Predictive Control and Deep Deterministic Policy Gradient Algorithm

1 State Grid Changzhou Power Supply Company, Changzhou 213200, China
2 National Engineering Research Center of Power Generation Control and Safety, Liyang Research Institute, Southeast University, Liyang 213300, China
* Author to whom correspondence should be addressed.
Algorithms 2025, 18(2), 80; https://doi.org/10.3390/a18020080
Submission received: 3 December 2024 / Revised: 25 January 2025 / Accepted: 28 January 2025 / Published: 2 February 2025

Abstract

Energy scheduling for hybrid unmanned aerial vehicles (UAVs) is of critical importance to their safe and stable operation. However, traditional approaches, predominantly rule-based, often lack the dynamic adaptability and stability necessary to address the complexities of changing operational environments. To overcome these limitations, this paper proposes a novel energy scheduling framework that integrates the Model Predictive Control (MPC) with a Deep Reinforcement Learning algorithm, specifically the Deep Deterministic Policy Gradient (DDPG). The proposed method is designed to optimize energy management in hydrogen-powered UAVs across diverse flight missions. The energy system comprises a proton exchange membrane fuel cell (PEMFC), a lithium-ion battery, and a hydrogen storage tank, enabling robust optimization through the synergistic application of MPC and DDPG. The simulation results demonstrate that the MPC effectively minimizes electric power consumption under various flight conditions, while the DDPG achieves convergence and facilitates efficient scheduling. By leveraging advanced mechanisms, including continuous action space representation, efficient policy learning, experience replay, and target networks, the proposed approach significantly enhances optimization performance and system stability in complex, continuous decision-making scenarios.

1. Introduction

With the rapid advancements in UAV technology, the deployment of UAVs has expanded significantly across diverse domains, including logistics, environmental monitoring, agriculture, and defense [1,2]. Traditionally, UAVs have predominantly relied on lithium-ion batteries as their primary power source. However, the limited energy density of lithium-ion batteries imposes constraints on both endurance and payload capacity, hindering their applicability in more demanding scenarios [3]. To overcome these limitations, hydrogen fuel cells have emerged as a promising alternative, offering advantages such as high energy density, extended endurance, and environmental sustainability [4,5]. Despite these benefits, the energy management of hydrogen fuel cells remains a critical challenge. In operational scenarios involving multiple flight missions, UAV energy consumption patterns exhibit significant uncertainty and complexity. This complexity necessitates the development of intelligent energy scheduling strategies to optimize energy utilization, thereby improving UAV endurance and efficiency of operations.
Energy scheduling for hybrid power systems can be broadly categorized into three primary approaches: rule-based methods, optimization-based methods, and learning-based methods [6]. Rule-based methods manage energy allocation based on predefined rules and logical frameworks, including time-based, state-based, and priority-based strategies. These methods operate by comparing real-time variable values against set thresholds, allowing integrated energy systems to adapt and function effectively under varying conditions. For example, Jin et al. [7] proposed a rule-based energy management strategy that utilizes experimental data analysis to develop a control method for fuel cell-based compensation of battery charge in hybrid fuel cell trucks, achieving a 24.8% reduction in hydrogen consumption and a 20% reduction in battery state of charge (SOC) variation. In addition, Wang Y et al. [8] implemented a rule-based power allocation strategy, considering power demand, remaining capacity, and power capacity, which improved fuel efficiency, power, and extended the lifespan of hybrid energy systems.
Optimization-based methods employ mathematical optimization techniques to solve energy scheduling challenges in integrated energy systems [9]. By formulating objective functions, defining constraints, and selecting suitable optimization algorithms, these methods aim to achieve optimal or near-optimal solutions. Such approaches are extensively applied in UAV operations to optimize flight paths, task assignments, flight speed, and altitude adjustments, thereby enhancing overall system efficiency and performance. Recently, Chen et al. [10] developed an off-grid biofuel micro-CHP system with hybrid energy storage, using an optimization-based energy management strategy that increased energy efficiency from 45.77% to 57.97% with diesel biofuel. Additionally, Salehpour et al. [11] proposed a model to optimize both the energy system configuration—comprising fuel cells, batteries, electric motors, and thermal engines—and the flight trajectory of UAVs, using a nonlinear programming approach solved by decomposition techniques in GAMS.
Learning-based methods utilize machine learning and artificial intelligence techniques to facilitate energy scheduling in integrated energy systems [12]. These methods use various machine learning algorithms, including supervised learning, unsupervised learning, and reinforcement learning, to train models or develop strategies. By optimizing based on predefined objectives, learning-based approaches enable adaptive and intelligent energy management in complex and dynamic system environments. The Deep Deterministic Policy Gradient (DDPG) algorithm is a reinforcement learning algorithm based on the Actor–Critic framework, suitable for continuous action–space control and scheduling, as it directly outputs deterministic actions instead of a probability distribution [13,14]. For instance, Li et al. [15] discussed the application of reinforcement learning (RL) in energy management for fuel cell hybrid energy systems (FCHEVs), emphasizing gaps in existing literature regarding training environments, reward function settings, and RL agent evolution. The authors highlight that reward functions are typically derived from optimization objectives, with additional penalties/rewards introduced to improve training outcomes. Although RL-based energy management strategies (EMSs) outperform conventional approaches when trained under specific conditions, they struggle with adaptability in unknown scenarios—a challenge that can be mitigated by training with multi-operating conditions. Advanced RL methods like DQNs, DDPGs, and Double DQNs have been shown to address issues like discretized state spaces and Q-function overestimation, offering better performance than basic RL algorithms. Future research directions include leveraging machine learning to model FCHEV dynamics more accurately and exploring novel RL algorithms to address current limitations. Additionally, Fu et al. [16] compared the ECMS energy management strategy with the Adaptive-ECMS (A-ECMS) learning-based equivalent fuel consumption strategy, showing fuel consumption reductions of 0.96% and 1.37% under NEDC and CHTC-LT conditions, respectively.
While rule-based and optimization-based methods exhibit robust scheduling performance in specific scenarios, they often fall short when confronted with complex and uncertain environmental changes. Rule-based methods rely on predefined rules and logical frameworks designed for known or simple situations. These methods allocate energy by comparing real-time variables against set thresholds, working effectively under stable conditions. However, when faced with uncertainties such as drastic environmental fluctuations, sudden task changes, or dynamic vehicle performance, rule-based systems may lack the flexibility to adapt, leading to decreased efficiency or energy waste. Optimization-based methods, though capable of finding optimal solutions through mathematical models and constraints, also have limitations. They rely on known system models and conditions, so if the environment or system dynamics change (e.g., battery degradation or varying wind speeds), these methods may fail to deliver real-time optimal solutions. Additionally, many optimization algorithms are computationally intensive, particularly for complex, multi-objective problems, making them unsuitable for real-time applications. Furthermore, optimization methods typically assume that system changes can be predicted, but in practice, environmental and system behaviors often vary unpredictably, reducing the effectiveness of these methods in uncertain and dynamic environments.
In contrast, learning-based methods excel in adaptability and generalization, making them well-suited to address complex models, high nonlinearity, and multiple environmental uncertainties.
Therefore, this paper combines the Model Predictive Control (MPC), an optimization-based method, with the DDPG algorithm, a learning-based method, to achieve optimal power scheduling and efficient energy management for hydrogen-powered hybrid UAVs. The main contributions of this study are listed as follows:
(1)
By adopting the Deep Deterministic Policy Gradient (DDPG) reinforcement learning algorithm, the issue of insufficient adaptability in existing methods under uncertain environments can be effectively addressed. Through continuous learning and adjustment from the environment, the DDPG can automatically optimize energy scheduling strategies when facing unforeseen changes, thereby improving the system’s dynamic adaptability and robustness.
(2)
Traditional rule-based and optimization-based methods often struggle with highly nonlinear and complex system models, especially in dynamic environments. By introducing learning-based energy scheduling methods, particularly reinforcement learning, which offers powerful modeling capabilities, the complexities that current methods fail to handle can be tackled. More accurate system modeling enables optimized energy management under various environmental and task conditions.
(3)
Many optimization methods require lengthy computation times, making them unsuitable for real-time applications. In contrast, DDPG can perform online learning and decision-making in a shorter time, greatly improving the real-time efficiency of energy scheduling. This is particularly crucial for energy management in hydrogen-powered hybrid UAVs operating in uncertain environments.
The remaining structure of the paper is organized as follows: Section 2 introduces the force analysis and power calculation of UAVs across different mission phases. In Section 3, the modeling process of the proposed hybrid energy system mathematical model will be elaborated in detail. Section 4 will describe in detail the principles of the MPC and DDPG algorithms and their integration. Then, in Section 5, the power optimization results and optimized scheduling results will be presented for various flight missions. Finally, Section 6 concludes the paper and outlines future work.

2. Force Analysis and Power Calculation

The flight modes of the UAV, illustrated in Figure 1, consist of takeoff, climb, cruise, descent, and landing. Specifically, the cruise phase encompasses level flight and acceleration, while the landing phase includes descent, leveling off, deceleration, touchdown, and taxiing.

2.1. Takeoff Phase

The takeoff phase begins with the UAV accelerating from a stationary state on the runway to its takeoff speed, followed by liftoff and continued acceleration in the air until it achieves a safe altitude. Throughout this phase, the UAV’s motion can be mathematically approximated as uniform acceleration:
$$F_{a,tf} = T_{a,tf} - D_{tf} - \mu (W - L_{tf}) = M \times a_{tf}$$
where $F_{a,tf}$ is the net accelerating force acting on the UAV during its ground roll, $T_{a,tf}$ is the available thrust during the UAV's operation, $D_{tf}$ is the air resistance acting on the UAV, $\mu$ is the friction resistance factor, $W$ is the weight of the UAV, $L_{tf}$ is the lift force acting on the UAV, $M$ is the total mass of the UAV, and $a_{tf}$ is the acceleration.
The takeoff speed, at which the UAV lifts off the ground, can be calculated from the lift equation and the vertical equilibrium condition:
$$L_{tf} = \frac{1}{2}\rho V^2 S C_L, \qquad W = L_{tf}, \qquad V_r = \sqrt{\frac{2W}{\rho S C_L}}$$
where $\rho$ is the air density, $V$ is the UAV's velocity, $S$ is the wing area, $C_L$ is the lift coefficient, and $V_r$ is the UAV's takeoff speed.
During the takeoff phase, the UAV's motor operates at maximum power, $P = P_{tf} = P_{\max}$, where $P_{\max}$ represents the maximum power and $P_{tf}$ is the takeoff power.
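To make the takeoff relations concrete, the short sketch below evaluates the ground-roll force balance and the takeoff speed $V_r$ from the lift equation. All numerical values (mass, wing area, lift coefficient, thrust, drag, and friction factor) are illustrative assumptions rather than parameters of the UAV studied in this paper.

```python
import math

g = 9.81      # gravitational acceleration, m/s^2
rho = 1.225   # air density, kg/m^3
M = 15.0      # assumed total UAV mass, kg
S = 1.2       # assumed wing area, m^2
C_L = 1.1     # assumed lift coefficient at takeoff
mu = 0.04     # assumed rolling-friction factor
T_a = 60.0    # assumed available thrust during the ground roll, N
D = 8.0       # assumed aerodynamic drag during the ground roll, N
L = 30.0      # assumed lift during the ground roll, N

W = M * g                                   # weight, N
V_r = math.sqrt(2.0 * W / (rho * S * C_L))  # takeoff speed from W = 0.5*rho*V^2*S*C_L
F_a = T_a - D - mu * (W - L)                # net accelerating force on the runway
a_tf = F_a / M                              # ground-roll acceleration

print(f"V_r = {V_r:.1f} m/s, a_tf = {a_tf:.2f} m/s^2")
```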

2.2. Climb Phase

The climb phase involves the UAV ascending from ground level to its designated cruising altitude at a specified climb angle and climb rate. During this phase, the forces acting on the UAV can be simplified into components along the vertical and horizontal directions, corresponding to the UAV’s climb trajectory.
During steady, straight-line climbing, the forces acting on the UAV in both the vertical and horizontal directions are balanced. In this phase, the UAV’s motor operates at a constant power output to maintain the steady climb.
(1)
Force analysis:
$$L_c = W\cos\gamma \ \ (\text{vertical direction}), \qquad T_{a,c} = D_c + W\sin\gamma \ \ (\text{parallel direction})$$
where $\gamma$ is the UAV's climb angle.
(2)
Power situation:
$$P = P_{c,r}$$
where $P_{c,r}$ is the rated power during the UAV's climb process.
If the UAV performs an unsteady straight-line climb, the force and power conditions can be described as follows:
(1)
Force analysis:
$$L_c = W\cos\gamma \ \ (\text{vertical direction}), \qquad F_{a,c} = T_{a,c} - D_c - W\sin\gamma \ \ (\text{parallel direction})$$
(2)
Power situation:
$$P = P_c = T_{a,c}\, V(t)$$
If the UAV performs an unsteady, non-straight-line climb, the force and power conditions are as follows:
(1)
Force analysis:
$$F_{a,c,y} = L_c - W\cos\gamma(t) \ \ (\text{vertical direction}), \qquad F_{a,c,x} = T_{a,c} - D_c - W\sin\gamma(t) \ \ (\text{parallel direction})$$
(2)
Power situation:
$$P = P_c = T_{a,c}\, V(t)$$

2.3. Cruise Phase

The cruise phase can be primarily divided into two sub-phases: the level flight acceleration phase and the steady-speed cruise phase. In the level flight acceleration phase, after the UAV ascends to its cruising altitude, it transitions to level flight and accelerates until reaching the designated cruising speed. The force and power characteristics during this phase are as follows:
(1)
Force analysis:
$$L_{cr} = W \ \ (\text{vertical direction}), \qquad F_{a,cr} = T_{a,cr} - D_{cr} \ \ (\text{horizontal direction})$$
(2)
Power situation:
$$P = P_{cr} = T_{a,cr}\, V(t)$$
If the acceleration is uniform, the power can also be expressed as $P = P_{cr,r}(a_{cr})$, where $a_{cr}$ is the constant acceleration during the cruise acceleration phase, so the power $P_{cr,r}$ is also constant in this case.
During the steady-speed cruise phase, the UAV maintains its cruising speed, which can be categorized into two operational modes: maximum range speed and maximum endurance speed. The maximum range speed is the flight speed at which the UAV minimizes fuel or energy consumption per unit distance, enabling it to cover the greatest distance with the available fuel or energy. Conversely, the maximum endurance speed is the flight speed at which the UAV minimizes fuel or energy consumption per unit time, allowing it to remain airborne for the longest duration at a given altitude. Notably, the speed corresponding to the maximum lift-to-drag ratio is typically the same as the maximum range speed. Since both operational modes involve steady-speed cruise, the aerodynamic forces are balanced, ensuring stable flight dynamics.
(1)
Force analysis:
$$L_{cr} = W \ \ (\text{vertical direction}), \qquad T_{a,cr} = D_{cr} \ \ (\text{horizontal direction})$$
(2)
Power situation:
$$P = P_{cr,t} = T_{a,cr}\, V_{t,\max}, \qquad P = P_{cr,R} = T_{a,cr}\, V_{R,\max}$$
where $P_{cr,t}$ represents the power corresponding to the maximum endurance speed and $P_{cr,R}$ the power corresponding to the maximum range speed. $V_{t,\max}$ and $V_{R,\max}$ denote the maximum endurance speed and maximum range speed, respectively, with $P_{cr,t} < P_{cr,R}$.
Under normal conditions, the speed remains constant during the cruise phase. However, in certain scenarios, the UAV may enter a non-equilibrium state while maintaining its cruising altitude. In such cases, the UAV experiences either acceleration or deceleration. The force and power characteristics of the acceleration process have already been discussed; the following provides an analysis of the forces and power requirements during the deceleration process.
(1)
Force analysis:
$$L_{cr} = W \ \ (\text{vertical direction}), \qquad F_{a,cr} = D_{cr} - T_{a,cr} \ \ (\text{horizontal direction})$$
(2)
Power situation:
$$P = P_{cr} = T_{a,cr}\, V(t)$$
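The cruise-phase relations above reduce to $P = T_{a,cr} V$ with $T_{a,cr} = D_{cr}$ in steady flight. The sketch below uses an assumed parabolic drag polar and assumed airframe parameters (none of which come from this paper) to illustrate why the maximum-endurance power $P_{cr,t}$ is lower than the maximum-range power $P_{cr,R}$: the former minimizes power, the latter minimizes energy per unit distance.

```python
import numpy as np

g, rho = 9.81, 1.225
M, S = 15.0, 1.2        # assumed mass (kg) and wing area (m^2)
C_D0, k = 0.03, 0.05    # assumed parabolic drag-polar coefficients
W = M * g

def cruise_power(V):
    """Required power at speed V in level flight: L = W, T_a,cr = D_cr, P = T*V."""
    C_L = 2.0 * W / (rho * V ** 2 * S)
    C_D = C_D0 + k * C_L ** 2
    return 0.5 * rho * V ** 2 * S * C_D * V

V = np.linspace(8.0, 40.0, 400)
P = np.array([cruise_power(v) for v in V])
V_t_max = V[np.argmin(P)]        # minimum power -> maximum endurance speed
V_R_max = V[np.argmin(P / V)]    # minimum energy per distance -> maximum range speed
print(f"V_t,max = {V_t_max:.1f} m/s, P_cr,t = {cruise_power(V_t_max):.0f} W")
print(f"V_R,max = {V_R_max:.1f} m/s, P_cr,R = {cruise_power(V_R_max):.0f} W")
```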

2.4. Descent Phase

During the descent process, the UAV’s motor operates in idle mode, producing minimal thrust output, causing the UAV to gradually descend due to gravity. Under standard conditions, the descent angle and descent speed remain constant. The force and power conditions for the UAV in this phase are as follows:
(1)
Force analysis:
$$L_d = W\cos\gamma \ \ (\text{vertical direction}), \qquad T_{a,d} + W\sin\gamma = D_d \ \ (\text{parallel direction})$$
where $\gamma$ is the UAV's descent angle.
(2)
Power situation:
$$P = P_d = T_{a,d}\, V_d$$
where $V_d$ represents the descent speed.

2.5. Landing Phase

The landing phase primarily consists of four segments: the descent segment from cruising altitude, the flare and deceleration segment, the touchdown segment, and the ground roll segment. The descent segment has already been analyzed in detail.
When the UAV descends from cruising altitude to the final descent altitude (buffer altitude), it transitions into the flare and deceleration segment. The specific force and power conditions during this segment are as follows:
(1)
Force analysis:
$$L_{fd} = W \ \ (\text{vertical direction}), \qquad F_{a,fd} = D_{fd} - T_{a,fd} \ \ (\text{horizontal direction})$$
(2)
Power situation:
$$P = P_{fd} = T_{a,fd}\, V(t)$$
During the landing segment, the UAV decelerates and descends from the buffer altitude to the ground. Due to the relatively low buffer altitude, this motion can be approximated as near-horizontal linear movement. The corresponding force and power conditions are as follows:
(1)
Force analysis:
$$L_l \approx W \ \ (\text{vertical direction}), \qquad F_{a,l} = D_l - T_{a,l} \ \ (\text{horizontal direction})$$
(2)
Power situation:
$$P = P_l = T_{a,l}\, V(t)$$
During the ground roll phase, after the UAV has landed, the motor is turned off, and the UAV gradually comes to a stop due to ground friction and air resistance. At this stage, no power is consumed, and the force conditions are as follows:
Force analysis:
$$F_{a,s} = \mu W + D_s$$

3. Hybrid Energy System Mathematical Model

The structure of the proposed hybrid energy system is shown in Figure 2. The main equipment includes the proton exchange membrane fuel cell, lithium-ion battery, high-pressure hydrogen storage tank, direct-current (DC) motor, fuel cell balance of plant (BOP) system, and communication system.

3.1. Proton Exchange Membrane Fuel Cell (PEMFC)

The hydrogen consumption of the proton exchange membrane fuel cell (PEMFC) [17,18] is calculated using Faraday’s law:
$$H_{FC} = \frac{I_{FC}\, n_{FC}}{2F}$$
where $H_{FC}$ is the hydrogen consumption mass flow, $I_{FC}$ is the operating current, $n_{FC}$ is the number of cells, and $F$ represents Faraday's constant. The reaction rate of the proton exchange membrane fuel cell is represented by the current density $j$:
$$j = \frac{I_{FC}}{A_{cell}}$$
where $A_{cell}$ is the reaction area of a single cell.
The thermal power output of the PEMFC, $P_H^{FC}$, is represented as follows:
$$P_H^{FC} = P_E^{FC}\,\frac{\eta_T}{\eta_E}$$
where $\eta_T$ is the thermal efficiency of the fuel cell, and $\eta_E$ is the electrical efficiency of the fuel cell.

3.2. Lithium Battery

In this paper, a lithium-ion battery is selected as the energy storage device [19], which typically operates in a battery pack configuration. Its total power can be expressed as:
$$P_E^B = P_E^b \times n_1 \times n_2$$
where $P_E^B$ is the total power output of the battery pack, $P_E^b$ is the power of a single battery cell, and $n_1$ and $n_2$ represent the number of rows and columns of the battery cells, respectively.
The current of a single battery cell is as follows:
$$I_b = \frac{V_{oc} - \sqrt{V_{oc}^2 - 4 R_b P_E^b}}{2 R_b}$$
where the open-circuit voltage $V_{oc}$ and the internal resistance of the battery cell $R_b$ can both be considered constant values.
The dynamic characteristic of the battery cell, the SOC, can be expressed as:
$$SOC_{t+1} = SOC_t - \frac{I_{b,t}}{Q_B}\,\Delta T$$
where $t$ ($0 \le t \le T$) is the scheduling time, $T$ is the scheduling period, $Q_B$ is the battery cell capacity, and $I_{b,t}$ is the current, with charging being negative and discharging being positive.
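A minimal sketch of the battery relations above is given below: the per-cell power is obtained from the pack power, the cell current from the open-circuit-voltage model, and the SOC from the discrete recursion. The cell internal resistance and the example operating point are assumptions; the capacity, pack size, and open-circuit voltage follow Section 4.2.3.

```python
import math

V_oc = 4.2        # open-circuit voltage, V (Section 4.2.3)
R_b = 0.05        # assumed cell internal resistance, ohm
Q_B = 20.0        # cell capacity, Ah (Section 4.2.3)
n1, n2 = 10, 10   # battery pack rows x columns (Section 4.2.3)

def cell_current(P_pack_W):
    """Single-cell current for a given pack power (discharge > 0, charge < 0)."""
    P_cell = P_pack_W / (n1 * n2)                        # P_E^b from P_E^B
    return (V_oc - math.sqrt(V_oc ** 2 - 4.0 * R_b * P_cell)) / (2.0 * R_b)

def soc_step(soc, P_pack_W, dT_h):
    """SOC_{t+1} = SOC_t - I_b,t / Q_B * dT, with dT in hours and Q_B in Ah."""
    return soc - cell_current(P_pack_W) / Q_B * dT_h

soc = 0.55
soc = soc_step(soc, P_pack_W=500.0, dT_h=1.0 / 60.0)     # one-minute discharge step
print(f"SOC after one step: {soc:.4f}")
```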

3.3. High-Pressure Hydrogen Storage Tank

The hydrogen storage tank uses the high-pressure hydrogen storage tank (HST) model [20], which provides the hydrogen source for the fuel cell. The pressure inside the storage tank changes dynamically as hydrogen is released, and it is represented as:
$$TP^H_{t+1} = TP^H_t - \frac{z_t R T_H}{V_H}\, H^{HST}_t$$
where $TP^H_t$ and $H^{HST}_t$ are the pressure inside the hydrogen storage tank and the hydrogen flow rate (mol/s) at time $t$, respectively. $V_H$ is the volume of the storage tank, $T_H$ is the operating temperature, $R$ is the universal gas constant, and $z_t$ is the compressibility factor of hydrogen, which can be calculated based on NIST data and the Lemmon equation [21]:
$$z_t\!\left(TP^H_t\right) = 1 + \sum_{i=1}^{9} a_i \left(\frac{T_0}{T_H}\right)^{b_i} \left(TP^H_t\right)^{c_i}$$
where $a_i$, $b_i$, and $c_i$ are three sets of constants given in Table 1, $T_0 = 100$ K, and $R = 8.3145$ J/(mol·K).
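The following sketch implements the tank-pressure update and the compressibility correlation with the Table 1 coefficients. The tank volume, operating temperature, initial pressure, hydrogen draw rate, and one-second update interval are illustrative assumptions; expressing the pressure in MPa inside the correlation follows the standardized equation of [20] and should likewise be treated as an assumption here.

```python
R_GAS = 8.3145   # universal gas constant, J/(mol*K)
T0 = 100.0       # reference temperature, K
T_H = 293.15     # assumed operating temperature, K
V_H = 0.009      # assumed tank volume, m^3 (9 L)

# Table 1 coefficients (a_i, b_i, c_i); pressure is expressed in MPa inside the
# correlation (assumed unit convention, following the standardized equation [20]).
COEFFS = [
    (0.0588460, 1.325, 1.0),      (-0.06136111, 1.87, 1.0),
    (-0.002650473, 2.5, 2.0),     (0.002731125, 2.8, 2.0),
    (0.001802374, 2.938, 2.42),   (-0.0012150707, 3.14, 2.63),
    (0.958842e-4, 3.37, 3.0),     (-0.1109040e-6, 3.75, 4.0),
    (0.1264403e-9, 4.0, 5.0),
]

def z_factor(p_MPa, T=T_H):
    """Compressibility factor z = 1 + sum_i a_i (T0/T)^b_i * p^c_i."""
    return 1.0 + sum(a * (T0 / T) ** b * p_MPa ** c for a, b, c in COEFFS)

def tank_pressure_step(p_Pa, h_flow_mol_s, dt_s=1.0):
    """One pressure update: TP drops by z*R*T_H/V_H times the moles drawn in dt_s."""
    z = z_factor(p_Pa / 1e6)
    return p_Pa - z * R_GAS * T_H / V_H * h_flow_mol_s * dt_s

p = 35e6                                        # assumed initial pressure, 35 MPa
p = tank_pressure_step(p, h_flow_mol_s=0.01)    # assumed hydrogen draw, mol/s
print(f"z(35 MPa) = {z_factor(35.0):.3f}, pressure after 1 s = {p / 1e6:.4f} MPa")
```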

3.4. DC Motor

A brushless DC motor is utilized as the primary power source, characterized by low electrical losses, extended lifespan, and minimal maintenance requirements. The equivalent voltage and current equations for the brushless DC motor are expressed as follows:
$$I_{motor} = \frac{M_{motor}\, K_{V0}\, U_{m0}}{9.55\,(U_{m0} - I_{m0} R_{motor})} + I_{m0}, \qquad U_{motor} = R_{motor} I_{motor} + \frac{U_{m0} - I_{m0} R_{motor}}{K_{V0}\, U_{m0}}\, N_{motor}$$
where $I_{motor}$ is the equivalent current of the motor, $U_{motor}$ is the equivalent voltage of the motor, $U_{m0}$ is the no-load voltage of the motor, $I_{m0}$ is the no-load current of the motor, $K_{V0}$ is the no-load speed constant of the motor, $R_{motor}$ is the internal resistance of the motor, $M_{motor}$ is the load torque of the motor, and $N_{motor}$ is the speed of the motor.
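A small numerical sketch of the motor equivalent-circuit relations is shown below. All motor constants (no-load voltage and current, no-load speed constant, internal resistance) and the load torque and speed are illustrative assumptions, not parameters of the UAV in this study.

```python
def motor_current(M_load, K_V0, U_m0, I_m0, R_m):
    """Equivalent motor current for load torque M_load (N*m)."""
    return M_load * K_V0 * U_m0 / (9.55 * (U_m0 - I_m0 * R_m)) + I_m0

def motor_voltage(I_m, N_m, K_V0, U_m0, I_m0, R_m):
    """Equivalent motor voltage for current I_m (A) and speed N_m (rpm)."""
    return R_m * I_m + (U_m0 - I_m0 * R_m) / (K_V0 * U_m0) * N_m

# Assumed motor constants and operating point (illustrative only).
K_V0, U_m0, I_m0, R_m = 900.0, 10.0, 0.5, 0.1
I_m = motor_current(M_load=0.35, K_V0=K_V0, U_m0=U_m0, I_m0=I_m0, R_m=R_m)
U_m = motor_voltage(I_m, N_m=6000.0, K_V0=K_V0, U_m0=U_m0, I_m0=I_m0, R_m=R_m)
print(f"I_motor = {I_m:.1f} A, U_motor = {U_m:.1f} V, electrical power = {I_m * U_m:.0f} W")
```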

3.5. Fuel Cell BOP

The fuel cell BOP primarily consists of consumable devices, including compressors, hydrogen circulation devices, and thermal management systems. The power consumption is approximately linearly related to the output power of the hydrogen fuel cell and is expressed as follows:
$$P_{BOP} = A\,P_E^{FC} + B$$
where $P_{BOP}$ represents the consumed power of the fuel cell BOP, and $A$ and $B$ indicate the parameters of the linear relationship.

3.6. Communication Equipment

The communication equipment onboard the drone primarily consists of the data transmission radio and the GPS module. When data transmission is active, it operates at a constant power PCD.

4. Model Predictive Control for Power Optimization and DDPG-Based Optimization Scheduling

To achieve UAV power optimization and optimal scheduling, the flight state parameters of the UAV are collected and fed into the MPC model to obtain the predicted electric power, and the predicted electric power is then used as the input of the DDPG model to train the DDPG network and obtain the optimal scheduling results. The detailed process is illustrated in Figure 3. First, the prediction model is established, including the state space, objective function, control variables, and related matrices. Then, rolling optimization and feedback correction are performed to obtain the optimized output and the UAV's optimized power. Next, the neural network is initialized, which involves setting up the energy scheduling-related equations (energy scheduling model), the number of training episodes, the learning rate, the network update rate, and other DDPG parameters. The optimized power (training set) is compiled into training scenarios, followed by the MDP training loop, in which the network is continuously updated; constant tuning is required to obtain a well-trained network. Finally, the optimized power that needs scheduling (test set) is input into the trained network to obtain the power scheduling results.

4.1. Model Predictive Control for Power Optimization

The MPC process comprises three key stages: model development, rolling optimization, and feedback correction. The model includes four state variables and two input variables, with electrical power as the output, and is designed to optimize power consumption [21].
MPC is an advanced control strategy extensively applied in fields such as process control, automation, and robotics. Its core principle involves utilizing the system’s mathematical model to predict future system behavior. Optimization algorithms are then employed to determine the optimal control inputs that ensure the system achieves peak performance while adhering to specified constraints.
In the context of UAV operations, MPC leverages flight state parameters to predict and optimize power consumption. The detailed process is illustrated in Figure 4.

4.1.1. System Development

Based on the actual flight state of the UAV, a multi-task hydrogen fuel cell UAV flight state model is developed. The flight state parameters mainly include flight speed, flight altitude, attitude, and so on. Then, a function relating the flight state x ( t ) to power consumption P ( t ) is established based on the following empirical formula:
$$P(t) = f(x(t), u(t))$$
where $x(t)$ is the flight state vector and $u(t)$ is the control input vector.
The flight state vector mainly includes the flight altitude $h(t)$, flight speed $v(t)$, pitch $\theta(t)$, and engine temperature $T_E(t)$. The control input vector mainly includes the thrust command $u_{throttle}(t)$ and pitch control $u_{pitch}(t)$.
The experimental data, including flight altitude, flight speed, pitch angle, and engine temperature, are used as the state vector. Due to the large amount of data, only the range is shown in Table 2.

4.1.2. System Model Prediction

Based on the current flight state vector and control input vector, the changes in the flight state over a short future time horizon are predicted. Using the predicted flight states and their corresponding control inputs, the electrical power consumption at future time instances is forecasted.

4.1.3. Optimize Control Inputs

The objective is to minimize electrical power consumption while ensuring compliance with the requirements of the flight mission.
$$f(t) = \min_{u(t)} \sum_{t=k}^{k+N} \left[ P(t) + \lambda\, J(x(t), u(t)) \right]$$
where $P(t)$ is the electrical power consumption, $J(x(t), u(t))$ is the cost function of the control objective, and $\lambda$ is the weighting parameter.
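The sketch below illustrates this rolling-horizon idea in a deliberately simplified form: a one-dimensional toy longitudinal model stands in for the full flight-state model, the horizon cost sums predicted power plus a weighted tracking term, and only the first optimized input is applied. The dynamics, throttle-to-thrust map, and all numerical values are assumptions for illustration only.

```python
import numpy as np
from scipy.optimize import minimize

N = 10          # prediction horizon, steps
lam = 0.5       # weighting parameter lambda
v_ref = 20.0    # assumed reference speed to track, m/s
dt = 1.0        # prediction interval, s

def horizon_cost(u, v0):
    """Sum over the horizon of predicted power P(t) plus lambda * tracking cost J."""
    v, cost = v0, 0.0
    for u_t in u:
        thrust = 40.0 * u_t                        # assumed throttle-to-thrust map, N
        v = v + (thrust - 0.8 * v) / 15.0 * dt     # toy longitudinal dynamics
        cost += thrust * v + lam * (v - v_ref) ** 2
    return cost

def mpc_step(v0):
    """One rolling-optimization step: optimize the whole horizon, apply only u[0]."""
    res = minimize(horizon_cost, x0=np.full(N, 0.5), args=(v0,),
                   bounds=[(0.0, 1.0)] * N, method="SLSQP")
    return res.x[0]

u0 = mpc_step(v0=15.0)
print(f"first throttle command applied at this step: {u0:.3f}")
```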

4.1.4. Optimal Control Inputs

By solving the optimization problem, the optimal control input for the current time instant is determined and applied to the actual control system. In MPC-based rolling optimization, only the optimal control input for the present moment is implemented, while future control inputs are recalculated in subsequent iterations, ensuring adaptability to dynamic changes in the system state.

4.1.5. Feedback and Update

The actual flight state of the UAV is fed back to the MPC in real time. Using this updated state information, the MPC refines the system model and predictions, subsequently adjusting the control input for the next time instant.
The UAV’s flight is divided into five distinct phases: takeoff, climb, cruise, descent, and landing. The MPC algorithm is applied individually to each phase, with tailored configurations including state matrices, input matrices, state weight matrices, terminal state weight matrices, input weight matrices, prediction horizons, and prediction intervals. Upon completing the MPC process, the output is the optimized electrical power consumption, ensuring efficient energy management across all flight phases.

4.2. DDPG-Based Optimization Scheduling

DDPG is a classic deep reinforcement learning algorithm that can be used to solve continuous control problems in integrated energy systems. The specific process is shown in Figure 5.

4.2.1. Exploration Noise and Markov Decision Process

In order to better explore the unknown environment in the decision-making process, Ornstein–Uhlenbeck (OU) noise, which is suitable for inertial systems, is added [22], with noise decay set to achieve a balance between exploration and exploitation. The specific mathematical expression is as follows:
$$dN_t^{OU} = \theta\,(\mu - N_t^{OU})\,dt + \sigma\,dW_t, \qquad a_t = P(s_t \mid \theta^p) + N_t^{OU}$$
where $t$ denotes the time step, $N_t^{OU}$ is the OU noise value, $\theta$ is the mean-reversion rate, $\mu$ is the mean, $\sigma$ is the volatility, and $W_t$ is the Brownian motion. $a_t$ is the action, $P(s_t \mid \theta^p)$ is the actor network in the DDPG, $\theta^p$ are the network parameters, and $s_t$ is the system state. The temporal correlation of the OU process allows the system to explore multiple steps in the same direction to accumulate training experience, thereby effectively improving the exploration efficiency of the inertial system.
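A minimal sketch of the annealed OU exploration noise is given below, using an Euler–Maruyama discretization of the stochastic differential equation above. The parameter values follow Table 3 ($\theta = 0.15$, $\mu = 0$, $\sigma$ decaying from 0.35 to 0.001 over the decay period); the linear decay schedule and the placeholder actor output are assumptions.

```python
import numpy as np

class OUNoise:
    def __init__(self, theta=0.15, mu=0.0, sigma=0.35, sigma_min=0.001,
                 decay_steps=4800, dt=1.0, seed=0):
        self.theta, self.mu, self.dt = theta, mu, dt
        self.sigma, self.sigma_min = sigma, sigma_min
        self.decay = (sigma - sigma_min) / decay_steps   # assumed linear annealing
        self.rng = np.random.default_rng(seed)
        self.n = 0.0

    def sample(self):
        """One Euler-Maruyama step of dN = theta*(mu - N)*dt + sigma*dW, then anneal sigma."""
        dw = self.rng.normal(0.0, np.sqrt(self.dt))
        self.n += self.theta * (self.mu - self.n) * self.dt + self.sigma * dw
        self.sigma = max(self.sigma - self.decay, self.sigma_min)
        return self.n

noise = OUNoise()
actor_output = 0.6                 # placeholder for P(s_t | theta^p)
a_t = actor_output + noise.sample()
print(f"noisy action a_t = {a_t:.3f}")
```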
Reinforcement learning tasks are typically formulated as a Markov decision process (MDP), defined as a five-tuple (S, A, P, R, γ), where S is the state space, A represents the action space, P is the transition matrix, R indicates the reward, and γ is the discount factor. In this paper, the model-free Deep Deterministic Policy Gradient (DDPG) algorithm is employed, eliminating the need for a transition matrix P. Based on the characteristics of the hydrogen-powered hybrid UAV energy system, the defined objective function, operating constraints, state space, action space, and reward function are as follows:
(1)
Objective function
The objective function is primarily defined by the systemic operating cost and the state of the energy storage device, ensuring an optimal balance between energy efficiency and operational reliability. It is expressed as follows:
$$J = \min\left( \sum_{t=1}^{T} C_t^{OM} + \sum_{i} k_i\, C^{TS} \right) = \min\left( k_1 \sum_{t=1}^{T} C_1\!\left(P_{E,t}^{FC}\right) + k_2\,\bigl|SOC_T - SOC_0\bigr| \right)$$
where $C_1$ and $P_{E,t}^{FC}$ are the equivalent hydrogen consumption cost of the system devices and the fuel cell power at time $t$, respectively, and $k_1$ and $k_2$ are the weighting coefficients.
(2)
Operating constraints
The constraint of electrical balance can be described as follows:
$$P_{E,t}^{FC} + P_{E,t}^{B} = P_{E,t}^{L}$$
The constraint of PEMFC power is shown as follows:
$$P_{E,\min}^{FC} \le P_{E,t}^{FC} \le P_{E,\max}^{FC}$$
The upper and lower boundary of SOC are constrained as follows:
$$SOC_{\min} \le SOC_t \le SOC_{\max}$$
The initial SOC value of battery can be expressed as follows:
$$SOC_0 = 0.5\,(SOC_{\min} + SOC_{\max})$$
(3)
State space
The state space matrix $S$ can be expressed as follows:
$$S = \left[\,\sin t,\ \cos t,\ P_{E,t}^{L},\ SOC_t\,\right]^{T}$$
where $t$ indicates the time step, $P_{E,t}^{L}$ is the electrical load (i.e., motor power) at time $t$, and $SOC_t$ is the battery state of charge at time $t$.
(4)
Action space
The action space matrix $A$ can be expressed as follows:
$$A = \left[\,P_{E,t}^{FC}\,\right]^{T}$$
where $P_{E,t}^{FC}$ is the electrical power of the hydrogen fuel cell.
(5)
Reward function
The reward function $R$ is defined as follows:
$$R = C_t^{OM} + \alpha\, C^{TS} + l_1\, P_{out,t}^{E}$$
where $P_{out,t}^{E}$ is the power imbalance resulting from the scheduling action of the policy network at time $t$, representing the surplus (greater than 0) or deficiency (less than 0) when the power balance cannot be satisfied, and $l_1$ is a weight adjusted based on training experience and results.
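The sketch below ties the MDP ingredients together for a single step: the battery covers the residual load from the electrical balance constraint, any power beyond an assumed battery limit becomes the imbalance term $P_{out,t}^{E}$, and the step reward combines the operating cost, the terminal SOC deviation, and the imbalance penalty. The hydrogen cost model, the battery power limit, and the convention of negating the cost so that larger rewards correspond to lower cost are all assumptions made for illustration.

```python
def step_reward(p_fc_W, p_load_W, soc, soc_0, is_last_step,
                alpha=1.0, l1=20.0, h2_cost_per_kWh=0.2, dt_h=1.0 / 60.0):
    """Negated cost: operating cost + terminal SOC deviation + power-imbalance penalty."""
    p_batt = p_load_W - p_fc_W                     # battery covers the residual load
    p_batt_max = 800.0                             # assumed feasible battery power, W
    p_out = max(abs(p_batt) - p_batt_max, 0.0)     # surplus/deficiency P^E_out,t (kept >= 0 here)
    c_om = h2_cost_per_kWh * (p_fc_W / 1000.0) * dt_h   # assumed hydrogen-equivalent cost C_1
    c_ts = abs(soc - soc_0) if is_last_step else 0.0    # SOC restoration term at the horizon end
    return -(c_om + alpha * c_ts + l1 * p_out)

r = step_reward(p_fc_W=1500.0, p_load_W=2000.0, soc=0.52, soc_0=0.55, is_last_step=False)
print(f"reward for this step: {r:.4f}")
```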

4.2.2. DDPG Algorithm Training Process

(1)
Neural network initialization
Based on the state space and action space, a policy network (P network) and a value network (Q network) are established, as shown in Figure 6. Except for the input layer, the P network and Q network have 11 and 10 layers, respectively. The P network has four inputs (state) and one output (action), with the activation function of the output layer being Sigmoid to map the output to the range of 0–1, facilitating the scaling and restoring of actions. The Q network has five inputs (state + action) and one output (reward). FC represents fully connected layer, and LN represents layer normalization.
(2)
Algorithm parameter settings
Training parameter settings: the learning rates of the policy and value networks $\alpha_P$ and $\alpha_Q$, the target network update coefficient $\tau$, the scheduling period $T$, the number of training episodes $e_{\max}$, the discount factor $\gamma$, the experience replay memory size, and the training batch size.
OU noise parameter settings: the mean-reversion rate $\theta$, mean $\mu$, volatility $\sigma$, maximum volatility $\sigma_{\max}$, minimum volatility $\sigma_{\min}$, and noise decay period $T_{Nd}$.
Reward function parameter settings: the coefficients $\alpha$ and $l_1$.
(3)
Training scenario setup
Before the start of each training episode, a normal random fluctuation of no more than 10% is added to the mean of the load source data within the typical day scenario and the feasible operating domain of the equipment. This approach continuously generates new typical day training scenarios until the training process is complete.
The total number of training episodes is set to 200, necessitating the random generation of 200 unique training scenarios. An example illustrating the electrical power mean and fluctuation range is presented in Figure 7, where the solid line represents the load mean, and the shaded area depicts the range of data fluctuation.
(4)
MDP training loop
Before each training episode begins, the initial state $S_0$ of the system is obtained. For each time step, given the state $s_t$, the policy network outputs an action $P(s_t \mid \theta^P)$. After the system executes the action, a reward $r_t$ is obtained and the next state $s_{t+1}$ is determined. The training experience $(s_t, a_t, r_t, s_{t+1})$ is stored in the experience replay pool, the current state is updated, and the MDP is repeated until the final time step, completing one training episode.
(5)
Network parameter update
When the experience replay pool accumulates sufficient experiences to meet the batch size requirement, the network parameters are updated at each time step. This process involves updating the policy network and value network, while the target network undergoes updates using a soft update mechanism.
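Steps (1)–(5) can be condensed into the training skeleton below, written here with PyTorch as an assumed implementation framework. The network widths, the way transitions are appended to the replay pool, and the omission of the exploration noise and environment model are simplifications; the hyperparameters (learning rates, soft-update coefficient, batch size, replay size, discount factor) follow Table 3.

```python
import random
from collections import deque
import torch
import torch.nn as nn

STATE_DIM, ACTION_DIM = 4, 1
GAMMA, TAU, LR, BATCH = 0.99, 0.001, 0.001, 96

def mlp(in_dim, out_dim, out_act):
    """Small fully connected network; the 128-unit widths are assumptions."""
    return nn.Sequential(nn.Linear(in_dim, 128), nn.ReLU(),
                         nn.Linear(128, 128), nn.ReLU(),
                         nn.Linear(128, out_dim), out_act)

actor = mlp(STATE_DIM, ACTION_DIM, nn.Sigmoid())            # action mapped to [0, 1]
critic = mlp(STATE_DIM + ACTION_DIM, 1, nn.Identity())
actor_t = mlp(STATE_DIM, ACTION_DIM, nn.Sigmoid())          # target networks
critic_t = mlp(STATE_DIM + ACTION_DIM, 1, nn.Identity())
actor_t.load_state_dict(actor.state_dict())
critic_t.load_state_dict(critic.state_dict())
opt_a = torch.optim.Adam(actor.parameters(), lr=LR)
opt_c = torch.optim.Adam(critic.parameters(), lr=LR)
memory = deque(maxlen=4000)   # experience replay pool; append (s, [a], r, s_next) in the MDP loop

def soft_update(target, source, tau=TAU):
    """Target-network soft update: theta_target <- tau*theta + (1 - tau)*theta_target."""
    for p_t, p in zip(target.parameters(), source.parameters()):
        p_t.data.mul_(1.0 - tau).add_(tau * p.data)

def train_step():
    """One parameter update once the replay pool holds at least one batch."""
    if len(memory) < BATCH:
        return
    batch = random.sample(memory, BATCH)
    s = torch.tensor([b[0] for b in batch], dtype=torch.float32)
    a = torch.tensor([b[1] for b in batch], dtype=torch.float32)
    r = torch.tensor([[b[2]] for b in batch], dtype=torch.float32)
    s2 = torch.tensor([b[3] for b in batch], dtype=torch.float32)
    with torch.no_grad():                                    # TD target from the target networks
        q_target = r + GAMMA * critic_t(torch.cat([s2, actor_t(s2)], dim=1))
    critic_loss = nn.functional.mse_loss(critic(torch.cat([s, a], dim=1)), q_target)
    opt_c.zero_grad(); critic_loss.backward(); opt_c.step()
    actor_loss = -critic(torch.cat([s, actor(s)], dim=1)).mean()   # deterministic policy gradient
    opt_a.zero_grad(); actor_loss.backward(); opt_a.step()
    soft_update(actor_t, actor)
    soft_update(critic_t, critic)
```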

4.2.3. Device Parameters and Algorithm Parameter Settings

The main equipment parameters involved in energy scheduling are listed as follows:
Hydrogen fuel cell: rated power 3 kW; power range 0–3 kW.
Lithium-ion battery: battery cell capacity 20 Ah; battery pack size 10 × 10; state of charge range 0.2–0.9; open-circuit voltage 4.2 V.
The parameters of the DDPG algorithm are provided in Table 3, including key values used to tune and optimize the performance of the control and learning algorithms in this study, such as learning rates, discount factors, and other hyperparameters that directly impact the efficiency and convergence of the model.

5. Power Optimization and Optimized Scheduling Results

The optimized electrical power is shown in Figure 8 and is generally lower than the reference value.
Because the scheduling period is 360 min, the time span is relatively long, and since the drone is in cruising mode from minute 19 to minute 345, the energy scheduling shows periodic variations. Therefore, three time periods are selected to examine the energy scheduling behavior: 0–18 min, 19–36 min, and 346–360 min. As shown in Figure 9, the cumulative rewards for energy scheduling during the three time periods are denoted DDPG-1, DDPG-2, and DDPG-3, respectively, while DDPG-M-1, DDPG-M-2, and DDPG-M-3 are the average cumulative rewards over every five episodes. The cumulative rewards for all three time periods converge. As shown in Figure 10, Figure 11 and Figure 12, the hydrogen fuel cell and SOC scheduling results for the three time periods all remain within a reasonable range, achieving efficient utilization of electrical energy.

6. Conclusions

In this paper, a hydrogen hybrid UAV energy system model, comprising proton exchange membrane fuel cells, lithium batteries, and hydrogen storage tanks, is developed to address the challenges of strong coupling and uncertainty in system operation. The primary objective is to minimize electrical power consumption and control costs for the hydrogen hybrid UAV. Electrical power is optimized using the MPC method, while the DDPG algorithm is employed to achieve efficient energy scheduling optimization. This approach aims to minimize scheduling costs, ensure system sustainability, and account for the operational constraints of the system components. The main conclusions of the study are as follows:
(1)
The hydrogen hybrid UAV energy scheduling model developed in this study fully considers the system’s operational state and continuous device scheduling, allowing for a comprehensive representation of the dynamic operational process of the system’s devices;
(2)
The MPC method is used to achieve lower electrical power consumption for the multi-task hydrogen hybrid UAV, while the DDPG algorithm minimizes the energy scheduling cost;
(3)
Uncertainty is introduced into the minimum electrical power to simulate the uncertainty in the UAV’s flight process. The DDPG algorithm demonstrates high real-time scheduling capability and system adaptability;
(4)
The combination of MPC and DDPG algorithms effectively solves the electrical energy scheduling problem for multi-task hydrogen fuel cell UAVs. While the MPC algorithm alone cannot address the system’s uncertainty, and the DDPG algorithm alone cannot optimize the electrical power side, the combination of both algorithms achieves optimization of the electrical power side and efficient scheduling of electrical power.
Future research will extend the proposed energy scheduling framework by integrating more advanced machine learning algorithms, such as reinforcement learning with attention mechanisms, multi-agent systems, or deep neural networks designed specifically for real-time decision-making in UAVs. These advanced algorithms will aim to improve the optimization of energy management in hydrogen hybrid UAVs, enhancing both the efficiency and robustness of the scheduling process under more dynamic and complex operational conditions.
Additionally, real-world flight experiments will be conducted to validate the effectiveness of the proposed methods in practical settings. These experiments will focus on testing the energy scheduling framework across a range of flight missions, with the goal of demonstrating its applicability and reliability under various environmental factors, including variable weather conditions, payload variations, and flight trajectory changes. This phase of the research will provide critical insights into the real-time performance and scalability of the system, paving the way for the industrial implementation of energy management solutions for hydrogen-powered UAVs in commercial applications.
Furthermore, we plan to collaborate with industry partners to explore the integration of the proposed energy scheduling framework into existing UAV platforms, with potential applications in surveillance, logistics, and environmental monitoring. This collaboration will help bridge the gap between theoretical advancements and real-world implementation, ensuring that the proposed solution can be effectively deployed in large-scale UAV systems.

Author Contributions

Conceptualization, H.L. and L.S.; methodology, C.W. and L.S.; software, C.W. and S.Y.; validation, H.L., C.W., S.Y. and L.S.; formal analysis, S.Y.; investigation, H.L. and C.W.; resources, H.Z. and B.L.; data curation, Y.L.; writing—original draft preparation, C.W. and S.Y.; writing—review and editing, C.W. and S.Y.; visualization, C.W.; supervision, H.L. and L.S.; funding acquisition, H.L. and L.S. All authors have read and agreed to the published version of the manuscript.

Funding

This research was funded by the key technology research project on long-endurance hydrogen-powered hybrid UAVs for air–ground–space collaborative power inspection, grant number J2024028.

Data Availability Statement

Data will be made available on request.

Acknowledgments

The authors would like to acknowledge the State Grid Jiangsu Electric Power Company for their financial support of the scientific research project, project number J2024028.

Conflicts of Interest

Authors Haitao Li, Hui Zhu, Bo Li and Yuexin Liu were employed by the company “State Grid Changzhou Power Supply Company”. The remaining authors declare that the research was conducted in the absence of any commercial or financial relationships that could be construed as a potential conflict of interest.

References

  1. Fan, B.; Li, Y.; Zhang, R.; Fu, Q. Review on the technological development and application of UAV systems. Chin. J. Electron. 2020, 29, 199–207.
  2. Li, Y.; Zhu, Q.; Elahi, A. Sequential Convex Programming for Nonlinear Optimal Control in UAV Trajectory Planning. Algorithms 2024, 17, 304.
  3. Pan, Z.F.; An, L.; Wen, C.Y. Recent advances in fuel cells based propulsion systems for unmanned aerial vehicles. Appl. Energy 2019, 240, 473–485.
  4. Aminudin, M.A.; Kamarudin, S.K.; Lim, B.H.; Majilan, E.H.; Masdar, M.S.; Shaari, N. An overview: Current progress on hydrogen fuel cell vehicles. Int. J. Hydrogen Energy 2023, 48, 4371–4388.
  5. Salah, O.; Shamayleh, A.; Mukhopadhyay, S. Energy management of a multi-source power system. Algorithms 2021, 14, 206.
  6. Urooj, A.; Nasir, A. Review of intelligent energy management techniques for hybrid electric vehicles. J. Energy Storage 2024, 92, 112132.
  7. Jin, B.; Zhang, L.; Chen, Q.; Fu, Z. Energy management strategy of fuzzy logic control for fuel cell truck. Energy Rep. 2023, 9, 247–255.
  8. Wang, Y.; Sun, Z.; Chen, Z. Development of energy management system based on a rule-based power distribution strategy for hybrid power sources. Energy 2019, 175, 1055–1066.
  9. Thirunavukkarasu, M.; Sawle, Y.; Lala, H. A comprehensive review on optimization of hybrid renewable energy systems using various optimization techniques. Renew. Sustain. Energy Rev. 2023, 176, 113192.
  10. Chen, X.P.; Hewitt, N.; Li, Z.T.; Wu, Q.M.; Yuan, X.; Roskilly, T. Dynamic programming for optimal operation of a biofuel micro CHP-HES system. Appl. Energy 2017, 208, 132–141.
  11. Salehpour, M.J.; Zarenia, O.; Wang, J.; Yu, X. Simultaneous components sizing and flight scheduling for an hybrid aerial vehicle as a multi-energy mobile microgrid. Int. Trans. Electr. Energy Syst. 2021, 31, e12925.
  12. Alabi, T.M.; Aghimien, E.I.; Agbajor, F.D.; Yang, Z.; Lu, L.; Adeoye, A.R.; Gopaluni, B. A review on the integrated optimization techniques and machine learning approaches for modeling, prediction, and decision making on integrated energy systems. Renew. Energy 2022, 194, 822–849.
  13. Fan, P.; Ke, S.; Yang, J.; Wen, Y.; Xie, L.; Li, Y.; Kamel, S. A frequency cooperative control strategy for multi microgrids with EVs based on improved evolutionary-deep reinforcement learning. Int. J. Electr. Power Energy Syst. 2024, 159, 109991.
  14. Oleh, U.; Obermaisser, R.; Ahammed, A.S. A Review of Recent Techniques for Human Activity Recognition: Multimodality, Reinforcement Learning, and Language Models. Algorithms 2024, 17, 434.
  15. Li, Q.; Meng, X.; Gao, F.; Zhang, G.; Chen, W.; Rajashekara, K. Reinforcement learning energy management for fuel cell hybrid systems: A review. IEEE Ind. Electron. Mag. 2022, 17, 45–54.
  16. Fu, Z.; Wang, H.; Tao, F.; Ji, B.; Dong, Y.; Song, S. Energy management strategy for fuel cell/battery/ultracapacitor hybrid electric vehicles using deep reinforcement learning with action trimming. IEEE Trans. Veh. Technol. 2022, 71, 7171–7185.
  17. Staffell, I. Zero carbon infinite COP heat from fuel cell CHP. Appl. Energy 2015, 147, 373–385.
  18. Saco, A.; Sundari, P.S.; Paul, A. An optimized data analysis on a real-time application of PEM fuel cell design by using machine learning algorithms. Algorithms 2022, 15, 346.
  19. Gong, X.; Xiong, R.; Mi, C.C. Study of the characteristics of battery packs in electric vehicles with parallel-connected lithium-ion battery cells. IEEE Trans. Ind. Appl. 2015, 51, 1872–1879.
  20. Zheng, J.; Zhang, X.; Xu, P.; Gu, C.; Wu, B.; Hou, Y. Standardized equation for hydrogen gas compressibility factor for fuel consumption applications. Int. J. Hydrogen Energy 2016, 41, 6610–6617.
  21. Lemmon, E.W.; Huber, M.L.; McLinden, M.O. NIST Standard Reference Database 23: Reference Fluid Thermodynamic and Transport Properties (REFPROP), Version 10.0; National Institute of Standards and Technology: Gaithersburg, MD, USA, 2010; Volume 9.
  22. Bingol, M.C. Investigation of the Standard Deviation of Ornstein-Uhlenbeck Noise in the DDPG Algorithm. Gazi Univ. J. Sci. Part C Des. Technol. 2021, 9, 200–210.
Figure 1. An UAV flight profile diagram.
Figure 2. Hydrogen hybrid UAV energy system structure diagram.
Figure 3. The MPC and DDPG flowchart.
Figure 4. The MPC flowchart.
Figure 5. The DDPG flowchart.
Figure 6. Schematic diagram of the DDPG policy network and value network structure.
Figure 7. Mean and range of random operating conditions.
Figure 8. Comparison between MPC predicted power and reference value.
Figure 9. DDPG training cumulative reward.
Figure 10. Hydrogen fuel cell and SOC scheduling during 0–19 min.
Figure 11. Hydrogen fuel cell and SOC scheduling during 19–36 min.
Figure 12. Hydrogen fuel cell and SOC scheduling during 346–360 min.
Table 1. Nine sets of values for a_i, b_i, and c_i.

i | a_i                 | b_i   | c_i
1 | 0.0588460           | 1.325 | 1.0
2 | −0.06136111         | 1.87  | 1.0
3 | −0.002650473        | 2.5   | 2.0
4 | 0.002731125         | 2.8   | 2.0
5 | 0.001802374         | 2.938 | 2.42
6 | −0.0012150707       | 3.14  | 2.63
7 | 0.958842 × 10^−4    | 3.37  | 3.0
8 | −0.1109040 × 10^−6  | 3.75  | 4.0
9 | 0.1264403 × 10^−9   | 4.0   | 5.0
Table 2. The range of flight state parameters.

Moment (min) | Flight Altitude (m) | Flight Speed (m/s) | Pitch Angle (°) | Engine Temperature (°C)
1–3          | 0–20                | 0–1                | 5–10            | 55–60
4–18         | 20–50               | 1–1.5              | 3–5             | 50–55
19–30        | 50                  | 1.5–2              | 0–3             | 45–50
31–40        | 50                  | 1.5–2              | 0–2             | 50–55
41–120       | 50                  | 1.5–2              | 0–2             | 45–50
121–140      | 50                  | 1.5–2              | 0–2             | 50–55
141–250      | 50                  | 1.5–2              | 0–2             | 45–50
251–270      | 50                  | 1.5–2              | 0–2             | 50–55
271–330      | 50                  | 1.5–2              | 0–2             | 45–50
331–345      | 50–20               | 1–1.5              | (−2)–0          | 40–45
346–357      | 20–1                | 0.5–1              | (−5)–(−2)       | 40–45
358–360      | 0                   | 0–0.5              | (−2)–0          | 30–40
Table 3. Parameter values of the DDPG algorithm.

Parameter | Value | Parameter  | Value | Parameter | Value
α_P       | 0.001 | Memory     | 4000  | σ_max     | 0.35
α_Q       | 0.001 | Batch size | 96    | T_Nd      | 4800
τ         | 0.001 | μ          | 0     | k_1       | 100
episode   | 200   | θ          | 0.15  | k_2       | 50
γ         | 0.99  | σ_min      | 0.001 | l_1       | 20

