Article

Parametric Dueling DQN- and DDPG-Based Approach for Optimal Operation of Microgrids

1 School of Automation and Electrical Engineering, University of Science and Technology Beijing, Beijing 100083, China
2 Key Laboratory of Industrial Process Knowledge Automation, Ministry of Education, University of Science and Technology Beijing, Beijing 100083, China
* Author to whom correspondence should be addressed.
Processes 2024, 12(9), 1822; https://doi.org/10.3390/pr12091822
Submission received: 21 July 2024 / Revised: 24 August 2024 / Accepted: 26 August 2024 / Published: 27 August 2024

Abstract

This study is aimed at addressing the problem of optimizing microgrid operations to improve local renewable energy consumption and ensure the stability of multi-energy systems. Microgrids are localized power systems that integrate distributed energy sources, storage, and controllable loads to enhance energy efficiency and reliability. The proposed approach introduces a novel microgrid optimization method that leverages the parameterized Dueling Deep Q-Network (Dueling DQN) and Deep Deterministic Policy Gradient (DDPG) algorithms. The method employs a parametric hybrid action-space reinforcement learning technique, where the DDPG is utilized to convert discrete actions into continuous action values corresponding to each discrete action, while the Dueling DQN uses the current observation states and these continuous action values to predict the discrete actions that maximize Q-values. This integrated strategy is designed to tackle the co-scheduling challenge in microgrids, enabling them to dynamically select the most favorable control strategies based on their specific states and the actions of other intelligent entities. The ultimate objective is to minimize the overall operational costs of microgrids while ensuring the efficient local consumption of renewable energy and maintaining the stability of multi-energy systems. Simulation experiments were conducted to validate the efficacy and superiority of the proposed method in achieving the optimal microgrid operation, showcasing its potential to improve service quality and reduce operational expenses. Average rewards increased by 30% and 15% compared to the use of the Dueling DQN or DDPG only.

1. Introduction

With the advancement of the new energy sector, the proportion of renewable energy sources is gradually increasing. In order to utilize these clean energy sources more efficiently and meet the growing demand for electricity, microgrid technology has emerged. A microgrid is a small-scale power system with autonomous operation capability, capable of integrating multiple distributed energy resources, such as renewable energy sources, energy storage devices, and controllable loads, to satisfy the local power demand. Microgrids can operate either in grid-connected or islanded mode, enhancing the flexibility and reliability of the power system. However, the operation of microgrids also faces several challenges, including energy uncertainty, load volatility, and dynamic tariff structures, which increase the complexity and difficulty of energy scheduling and management within microgrids. In this context, realizing the economic, environmentally friendly, and reliable operation of microgrids has become an important and complex issue.
Currently, research on microgrid scheduling problems primarily encompasses traditional algorithms [1,2], heuristic algorithms [3,4,5], and reinforcement learning algorithms [6], through which optimization issues such as microgrid capacity allocation and operational economics are addressed. Wu et al. [7] incorporated the principle of chaos to effectively improve the Imperialist Competitive Algorithm (ICA) and proposed a dynamic economic scheduling method for microgrids based on the improved imperialist competitive algorithm. Vergara et al. [8] proposed a Non-dominated Sorting Genetic Algorithm II (NSGA-II) built on the Multi-Objective Genetic Algorithm (MOGA) that jointly considers the Unit Commitment (UC) and Economic Load Dispatch (ELD) problems in microgrids. Shuai et al. [9] designed a shared decision module architecture that combines Long Short-Term Memory (LSTM) with the Branching Dueling Q-Network to extract features from historical data, and formulated the microgrid online optimization problem as a mixed-integer second-order cone programming (MISOCP) problem. Fan et al. [10] considered the problem of increasing total cost due to the disconnect between optimization and real-time control and studied the optimal generation control problem as a constrained nonconvex problem with nonlinearity; based on multi-agent deep reinforcement learning, a dynamic nonlinear nonconvex optimal generation control algorithm for DC microgrids was proposed to effectively reduce the generation cost. Chen et al. [11] proposed a real-time optimization method for microgrids based on the DDPG, which formulates the real-time optimal scheduling problem as a Markov decision process and obtains the optimal scheduling strategy through offline training and online decision-making. Xia et al. [12] used the DDPG algorithm to obtain an optimal solution for microgrid control and designed a global reward function.
However, traditional algorithms require the establishment of an accurate mathematical model, while the relationships between distributed power sources, energy storage devices, loads, and the grid within a microgrid are nonlinear, time-varying, and stochastic, making it challenging to dynamically adjust optimization objectives and constraints to adapt to changing environments and user demands. Heuristic algorithms, on the other hand, are susceptible to becoming trapped in local optimal solutions, and their performance is typically dependent on the characteristics and instances of the problem, with the same heuristic algorithm potentially yielding different results for different types or sizes of microgrid scheduling problems. This issue can be effectively addressed using reinforcement learning techniques. In contrast to traditional algorithms, reinforcement learning does not require the prior establishment of an accurate mathematical model, giving it the flexibility to adapt to nonlinear, time-varying, and stochastic relationships between subsystems within the microgrid. The microgrid environment involves complex interactions, and reinforcement learning is capable of dynamically adapting the optimization objectives and constraints to the changing environment and user requirements by conducting real-world trials and learning from the environment. Compared to heuristic algorithms, reinforcement learning can more effectively cope with the problem of local optimal solutions. Reinforcement learning algorithms accumulate experience and learn how to make optimal decisions in the face of different situations by constantly interacting with the environment. This enables reinforcement learning algorithms to better handle complex, dynamic, and uncertain environments when dealing with microgrid scheduling problems.
Islam et al. [13] used multi-objective optimization algorithms to control microgrid systems. Similar to reinforcement learning, multi-objective optimization algorithms aim to find a set of solutions that balance multiple objectives. Bjerland et al. [14] used the two-stage stochastic programming approach to solve the conflict problem between different distributed energy sources in the system. Song et al. [15] optimized the comprehensive cost of microgrids by combining multiple heuristic algorithms.
However, traditional reinforcement learning algorithms, such as the Deep Q-Network (DQN), suffer from the overestimation of true values, which can lead to suboptimal performance and instability during the training process [16]. The Dueling DQN addresses this issue by decomposing the Q-value function into a state-value function and an advantage function, where the state-value function solely predicts the state value, and the advantage function solely predicts the effect of the action on the environment in the current state. The DDPG algorithm used in [11,12] is primarily designed for continuous action spaces, which can limit its effectiveness in handling discrete actions within microgrid operations. The multi-objective optimization algorithm employed in [13] may be less effective in highly dynamic and uncertain microgrid environments. The two-stage stochastic programming approach adopted in [14] may struggle to address real-time decision-making in rapidly changing environments. While the approach in [15], which combines multiple heuristic algorithms, improves performance to some extent, it still inherits the inherent limitations of heuristic methods. In addition, there is a lack of comprehensive approaches that effectively integrate the strengths of different algorithm types, particularly in addressing the challenges posed by the coexistence of discrete and continuous action spaces in microgrid management. The problem that we aim to solve is a microgrid architecture with a hybrid action space, which encompasses not only discrete actions (battery charging and discharging, gas turbine start-up and shut-down, etc.) but also continuous actions (generator output power, energy distribution ratios, etc.). Reinforcement learning methods that employ discrete or continuous action spaces alone are inadequate to address this problem. Lin et al. [17] combined the DDPG algorithm, which deals with continuous action spaces, and the Dueling DQN algorithm, which deals with discrete action spaces, to solve the path-planning problem for UAVs, with good results compared to using only the DDPG or Dueling DQN.
Therefore, we propose a method that combines the Dueling DQN algorithm, which deals with discrete actions, with the DDPG algorithm, which handles continuous actions, and parameterizes the action space in order to solve the microgrid scheduling problem.

2. System Model

Microgrids usually contain clean energy generation systems, conventional energy generation systems, energy storage systems, and power consumption facilities [18]. Therefore, microgrid subsystems are modeled as photovoltaic power generation, wind power generation, micro gas turbine power generation, and battery charging and discharge models. Since different energy devices have different output characteristics, microgrids must be able to efficiently utilize renewable energy resources while meeting the power demand for reasons of economic efficiency. Therefore, effective energy management systems are needed to optimize their combined use. The system architecture is shown in Figure 1.

2.1. Photovoltaic Power Model

Photovoltaic power generation involves the direct transformation of light energy into electrical energy. Rooted in the photovoltaic effect, this phenomenon occurs when particular materials are subjected to light, resulting in the production of an electric current. The mathematical representation of the photovoltaic power generation process is shown in Equation (1):
P_V = \frac{H_A \cos\theta \, P_{AZ}}{E_S}\left(1 + \delta\,(T - T_0)\right)
In the equation, P_V represents the power output of the photovoltaic system; H_A denotes the solar irradiance incident on a horizontal surface per unit time; θ is the angle separating the sun from the horizontal plane; P_AZ is the installed power capacity under standard operating conditions; E_S is the irradiance under those standard conditions; δ is the temperature coefficient of power; T is the current temperature; and T_0 is the reference standard operating temperature.
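To make Equation (1) concrete, a minimal Python sketch is given below; the function name and the example values in the call are illustrative assumptions, not parameters reported in the paper.

```python
import math

def pv_power(h_a, theta_deg, p_az, e_s, delta, temp, temp_ref):
    """Photovoltaic output following Equation (1).

    h_a:       solar irradiance on a horizontal surface
    theta_deg: angle between the sun and the horizontal plane (degrees)
    p_az:      installed capacity under standard operating conditions
    e_s:       irradiance under standard operating conditions
    delta:     temperature coefficient of power
    temp:      current temperature
    temp_ref:  reference standard operating temperature
    """
    standard_ratio = h_a * math.cos(math.radians(theta_deg)) * p_az / e_s
    return standard_ratio * (1 + delta * (temp - temp_ref))

# Illustrative call only; none of these numbers come from the paper.
print(pv_power(h_a=800, theta_deg=30, p_az=50, e_s=1000, delta=-0.004,
               temp=35, temp_ref=25))
```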

2.2. Wind Energy Model

Wind energy conversion systems harness the kinetic energy of wind to generate electricity. Wind turbines can modulate their output power by adjusting parameters such as the wind speed, wind direction, and blade pitch angle. Furthermore, they can operate either independently or in grid-connected mode, depending on the electricity demand. The mathematical model for wind power generation is expressed as Equation (2):
P_{WP} = \frac{1}{2}\,\rho\, A\, C_p(\lambda, \beta)\, V^3
where P_WP is the wind power output; ρ represents the density of the air; A is the wind-swept area of the wind turbine; C_p is the power coefficient of the wind turbine; λ is the tip speed ratio of the wind turbine; β is the pitch angle of the wind turbine; and V is the wind speed.
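A minimal sketch of Equation (2) follows, treating the power coefficient C_p(λ, β) as a precomputed number; the function name and example values are illustrative assumptions.

```python
def wind_power(rho, swept_area, c_p, wind_speed):
    """Wind turbine output following Equation (2).

    rho:        air density (kg/m^3)
    swept_area: rotor swept area A (m^2)
    c_p:        power coefficient Cp(lambda, beta), passed here as a number
    wind_speed: wind speed V (m/s)
    """
    return 0.5 * rho * swept_area * c_p * wind_speed ** 3

# Illustrative call; the values are placeholders, not data from the paper.
print(wind_power(rho=1.225, swept_area=80.0, c_p=0.4, wind_speed=8.0))
```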

2.3. Micro Gas Turbine Model

A micro gas turbine is a small thermal power generation device that utilizes the principle of the gas turbine cycle to generate high-temperature and high-pressure working fluid through the mixed combustion of fuel and air, which is expanded by the turbine to perform work and drive the generator to convert it into electricity. The mathematical model of the micro gas turbine is defined by Equations (3) and (4):
P_{TU} = Q\, C_g\, (T_{out} - T_{in})
\dot{m} = \frac{P_{TU}}{\eta_C\, Q_{LHV}}
where P_TU denotes the base output power of the gas turbine, Q represents the gas flow rate, C_g is the constant-pressure specific heat capacity of the gas, T_out is the exhaust temperature, and T_in is the inlet temperature. Furthermore, η_C symbolizes the efficiency of the gas turbine, ṁ represents the mass flow rate of the gas, and Q_LHV is the lower heating value of the fuel. Consequently, the fuel cost C_MT associated with the micro gas turbine can be expressed as Equation (5):
C_{MT} = \frac{Q\, C_g\, (T_{out} - T_{in})}{\eta_C\, Q_{LHV}}
where C_g is the price of fuel.
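The sketch below strings Equations (3)-(5) together. To avoid reusing the symbol C_g for both the specific heat in Equation (3) and the fuel price in Equation (5), the cost function takes an explicit fuel_price argument; this separation is an assumption about the intended reading of Equation (5).

```python
def turbine_power(q_flow, c_g, t_out, t_in):
    """Base output power following Equation (3): P_TU = Q * C_g * (T_out - T_in)."""
    return q_flow * c_g * (t_out - t_in)

def fuel_mass_flow(p_tu, eta_c, q_lhv):
    """Fuel mass flow rate following Equation (4): m_dot = P_TU / (eta_C * Q_LHV)."""
    return p_tu / (eta_c * q_lhv)

def fuel_cost(p_tu, eta_c, q_lhv, fuel_price):
    """Fuel cost in the spirit of Equation (5): the fuel price applied to the
    fuel consumed according to Equation (4)."""
    return fuel_mass_flow(p_tu, eta_c, q_lhv) * fuel_price
```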

2.4. Storage Battery Model

Storage batteries are electrochemical energy storage devices capable of storing electrical energy generated from renewable sources [19]. They can provide supplementary electrical power to mitigate peak demand, thereby reducing the impact of high power consumption on the electrical grid and enhancing output stability. The mathematical model governing battery behavior is expressed as Equation (6):
E_{SOC}(t) = E_{SOC}(t-1) + \eta_k\, P_{BA}(t)
where E_SOC(t) denotes the battery's state of charge at time t, E_SOC(t − 1) represents the state of charge at the previous time step t − 1, η_k is the power conversion efficiency coefficient, which takes a negative value during battery discharge and a positive value during charging, and P_BA(t) symbolizes the charging or discharging power.
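A minimal sketch of the update in Equation (6) is shown below; the default efficiency, the optional clipping bounds, and the example call are assumptions made for illustration.

```python
def update_soc(soc_prev, p_ba, charging, eta=0.9, soc_min=0.0, soc_max=200.0):
    """State-of-charge update following Equation (6).

    p_ba is the (non-negative) charging or discharging power for the time step;
    as described in the text, the efficiency coefficient eta_k is positive while
    charging and negative while discharging. Clipping to [soc_min, soc_max]
    anticipates the storage constraint discussed in Section 3; the bounds here
    are placeholders.
    """
    eta_k = eta if charging else -eta
    soc = soc_prev + eta_k * p_ba
    return min(max(soc, soc_min), soc_max)

# Charge a battery holding 100 kWh at 20 kW for one step: 100 + 0.9 * 20 = 118
print(update_soc(soc_prev=100.0, p_ba=20.0, charging=True))
```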

3. Optimal Scheduling Model for Microgrids Based on Parametric Dueling DQN and DDPG

The main purpose of this research is to reduce the operational expenses associated with microgrid systems. Consequently, we propose an optimal dispatch model for microgrids that incorporates factors such as renewable energy generation, energy storage systems, load demand, grid tariffs, and the ability to switch between grid-connected and islanded modes of operation. Our model is formulated based on the parameterized Dueling DQN and DDPG algorithms. The ultimate objective is to ascertain the optimal scheduling approach that diminishes the cumulative operational costs of the microgrid across a defined temporal span. The problem is mathematically formulated as Equation (7):
\min f = \sum_{t=1}^{T}\left[\, P_{grid}(t)\, c_{grid}(t) + C_{MT} + \sum_{i=1}^{n} C_{G_i}\big(P_{G_i}\big) \right]
where P_grid(t) denotes the power input from the external grid at time t, c_grid(t) represents the electricity price at time t, P_Gi symbolizes the output power of the i-th distributed generation source, and C_Gi(P_Gi) is the cost function associated with the operation of the i-th generating unit, subject to the following constraints (a computational sketch follows the list):
  • Binary constraint: The binary constraint is used to make decisions about the activation and deactivation of distributed energy sources. Its importance lies in the ability to promptly shut down wind turbines when wind speeds are excessively high, thereby preventing equipment damage. Furthermore, when the generation from renewable energy sources can fully cover the current electricity demand, the gas turbine can be deactivated to reduce operational costs.
    a_k[n] \in \{0, 1\}
    where a_k[n] is a binary decision, such as starting or stopping the gas turbine.
  • Load constraint: The power provided to the load by the microgrid needs to satisfy its power demand, and the output power of the distributed power supply is limited to values between its minimum output power and maximum output power:
    \sum_{i=1}^{n} P_{G_i} + P_{grid} + \eta\, P_{BA} \geq P_L
    P_{G_i,\min} \leq P_{G_i} \leq P_{G_i,\max}
    where P_Gi,min represents the minimum permissible output power of the i-th distributed generation source, and P_Gi,max denotes the maximum allowable output power of the i-th distributed generation source.
  • Energy storage constraints: To guarantee the longevity of the energy storage system, constraints are imposed on the depth of charge and discharge, as well as the output power of the energy storage device, which is constrained according to Equations (11) and (12):
    SOC_{\min} \leq SOC \leq SOC_{\max}
    P_{BA,\min} \leq P_{BA} \leq P_{BA,\max}
    where SOC represents the state of charge of the energy storage system, and SOC_min and SOC_max denote the minimum and maximum permissible states of charge, respectively. Additionally, P_BA symbolizes the output power of the energy storage system, with P_BA,min and P_BA,max representing the minimum and maximum allowable output power limits, respectively.
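For reference, a small sketch of the objective in Equation (7) and a feasibility check for the constraints above; the data layout (plain Python sequences) and the function names are assumptions made for illustration.

```python
def operating_cost(p_grid, c_grid, c_mt, gen_costs):
    """Objective of Equation (7): grid purchases, gas turbine fuel cost, and the
    cost of each distributed generator, summed over all T time slots.

    p_grid, c_grid, c_mt: sequences of length T.
    gen_costs: length-T sequence of per-unit cost lists, i.e. C_Gi(P_Gi).
    """
    return sum(p_grid[t] * c_grid[t] + c_mt[t] + sum(gen_costs[t])
               for t in range(len(p_grid)))

def feasible(p_g, p_g_min, p_g_max, p_grid, p_ba, eta, p_load,
             soc, soc_min, soc_max, p_ba_min, p_ba_max):
    """Check the load, output-power, and storage constraints (Equations (9)-(12))."""
    load_ok = sum(p_g) + p_grid + eta * p_ba >= p_load                          # Equation (9)
    output_ok = all(lo <= p <= hi for p, lo, hi in zip(p_g, p_g_min, p_g_max))  # Equation (10)
    soc_ok = soc_min <= soc <= soc_max                                          # Equation (11)
    power_ok = p_ba_min <= p_ba <= p_ba_max                                     # Equation (12)
    return load_ok and output_ok and soc_ok and power_ok
```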

3.1. A Scheduling Method Based on DRL

Deep reinforcement learning (DRL) approximates the policy or value function of a Markov Decision Process (MDP) with deep neural networks. An MDP consists of five elements: the state set S, the action set A, the state transition probability P, the reward function R, and the discount factor γ. Q-learning is used to optimize the microgrid scheduling problem, considering the operational costs associated with each internal component of the microgrid system, the electricity consumption costs in the grid-connected mode, and the costs related to the operation of the gas turbine. The microgrid's operational expenditure is primarily influenced by real-time electricity prices, fuel costs, and the costs associated with utilizing renewable energy facilities. Consequently, to minimize operational costs, the output of various power sources must be judiciously scheduled, ensuring that grid power purchases and fuel consumption are minimized while simultaneously leveraging the advantages of renewable energy facilities to increase their contribution to the overall power output.
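To fix notation, a minimal sketch of the five-element MDP tuple for this scheduling problem is shown below; the field names are illustrative assumptions, while the example discount value matches Table 3.

```python
from dataclasses import dataclass
from typing import Callable, Sequence

@dataclass
class SchedulingMDP:
    """The five MDP elements (S, A, P, R, gamma) for microgrid scheduling."""
    states: Sequence        # state set S: generation, storage level, tariff, ...
    actions: Sequence       # action set A: hybrid discrete/continuous decisions
    transition: Callable    # state transition probability P(s' | s, a)
    reward: Callable        # reward function R(s, a), e.g. negative operating cost
    gamma: float = 0.95     # discount factor
```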

3.2. Parameterized Dueling DQN and DDPG Algorithms

The Dueling DQN algorithm decomposes the Q-function into two subfunctions: the state-value function and the advantage function, which estimate the value of the state itself and the relative advantage of taking different actions in that state, respectively. This decomposition effectively distinguishes the contributions of the state and action to the Q-function, thereby improving the estimation accuracy and learning efficiency of the Q-function, which, in turn, enhances the algorithm’s performance. Conversely, the DDPG algorithm can be applied to continuous action spaces and directly outputs action values through a deterministic policy network, thus improving the flexibility and adaptability of actions. Combining the Dueling DQN and DDPG algorithms yields a hybrid reinforcement learning algorithm that can leverage both the efficient Q-value estimation of the Dueling DQN and the continuous action output of the DDPG, thereby improving the algorithm’s performance and applicability across both discrete and continuous action spaces.
Utilizing the actor–critic framework, which amalgamates the Dueling DQN and DDPG algorithms, the Dueling DQN is deployed within the actor network, whereas the DDPG is incorporated into the critic network. The DDPG is responsible for serializing the discrete actions and generating continuous action values corresponding to each discrete action. The Dueling DQN then takes the current observation state and these continuous action values as input, outputs the Q-values, and selects the discrete action that maximizes the Q-value. Through this mechanism, the optimal discrete actions and their corresponding continuous action values are determined. The architecture of the network is illustrated in Figure 2.
The state–action function is denoted by
Q\big(s_t[n], d[n], a_d[n]\big) = \mathbb{E}_{Q_{t-1},\, s_t[n+1]}\Big[\, Q_{t-1}\big(s_t[n], d[n], a_d[n]\big) + \gamma \max_{d \in D}\, \sup_{a_d \in A_d} Q\big(s_t[n+1], d, a_d\big) \Big]
The action a[n] is parameterized as a[n] = (d[n], a_d[n]), where d[n] denotes the discrete action drawn from the set of discrete actions D, a_d[n] represents the continuous action corresponding to the discrete action d[n], and γ symbolizes the discount factor used to assess the efficacy of the action.
After obtaining the continuous action from the DDPG, the optimal discrete action d^* is selected by the parameterized state–action function Q(s, d, a_d) of the Dueling DQN:
d^* = \arg\max_{d \in D} Q\big(s, d, a_d\big), \quad Q\big(s, d, a_d\big) = Q\big(s, d, a_d; \omega_Q\big), \quad Q\big(s, d, a_d; \omega_Q\big) = V\big(s; \omega_V\big) + A\big(s, d, a_d; \omega_A\big)
where ω_Q denotes the parameter set of the Dueling DQN, ω_V represents the parameters of the state-value function, and ω_A symbolizes the parameters of the advantage function.
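A PyTorch sketch of the decomposition Q(s, d, a_d) = V(s; ω_V) + A(s, d, a_d; ω_A) is given below; the hidden sizes follow Table 4, while the exact input layout and the mean-subtraction aggregation are assumptions, since the paper does not spell them out.

```python
import torch
import torch.nn as nn

class ParamDuelingQNet(nn.Module):
    """Dueling Q-network over parameterized actions: Q(s, d, a_d) = V(s) + A(s, d, a_d).

    The continuous parameters produced by the DDPG actor (one vector per discrete
    action, flattened) are concatenated with the state before the shared layers.
    """

    def __init__(self, state_dim, n_discrete, param_dim):
        super().__init__()
        self.shared = nn.Sequential(
            nn.Linear(state_dim + n_discrete * param_dim, 256), nn.ReLU(),
            nn.Linear(256, 128), nn.ReLU(),
            nn.Linear(128, 64), nn.ReLU(),
        )
        self.value = nn.Linear(64, 1)                # V(s; omega_V)
        self.advantage = nn.Linear(64, n_discrete)   # A(s, d, a_d; omega_A)

    def forward(self, state, action_params):
        x = self.shared(torch.cat([state, action_params], dim=-1))
        v = self.value(x)
        a = self.advantage(x)
        # Standard dueling aggregation; subtracting the mean advantage keeps V and A identifiable.
        return v + a - a.mean(dim=-1, keepdim=True)

# d* = argmax_d Q(s, d, a_d), with a_d supplied by the DDPG actor:
# q_values = net(state, actor(state)); d_star = q_values.argmax(dim=-1)
```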
The parameter update is grounded in the temporal difference (TD) algorithm, which merges the concepts of Dynamic Programming with Monte Carlo techniques to refine the value function of the current state using the known state distribution. First, the policy function of the DDPG is refined, with the parameters of the DQN held constant; subsequently, the parameters of the DQN are updated, with the DDPG parameters remaining fixed. The overall parameter-updating process can be formally expressed as follows:
y_{\exp} = r + \gamma \max_{d \in D} Q\big(s', d, \mu(s'; \omega_\mu); \omega_Q\big),
\zeta_Q(\omega_Q) = \tfrac{1}{2}\big[\, Q\big(s, d, \mu(s; \omega_\mu); \omega_Q\big) - y_{\exp} \big]^2,
\zeta_\mu(\omega_\mu) = \sum_{d=1}^{D} Q\big(s, d, \mu(s; \omega_\mu); \omega_Q\big),
\omega = \omega - \delta \cdot \nabla_\omega \zeta(\omega),
\nabla_\omega \zeta(\omega) = \mathbb{E}_{\tau \sim \pi_\omega}\left[ \sum_{t=1}^{T} \nabla_\omega \log \pi_\omega\big(a_t \mid s_t\big) \sum_{t'=t}^{T} Q\big(s_{t'}[n], d[n], a_d[n]\big) \right]
where y_exp represents the expected value of the state–action pair, s' denotes the subsequent state, ζ_Q and ζ_μ symbolize the loss functions of the DQN and DDPG components, respectively, μ(·; ω_μ) is the DDPG policy with parameters ω_μ, ω denotes the updated parameter set, δ is the update step size, and ∇_ω ζ(ω) signifies the stochastic gradient of the loss with respect to ω.
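A condensed PyTorch sketch of one alternating update in the spirit of the equations above is shown below; the target networks (q_target, actor_target), the optimizer objects, and the exact loss forms are assumptions, not details reported in the paper.

```python
import torch
import torch.nn.functional as F

def update_step(batch, q_net, q_target, actor, actor_target,
                q_optim, actor_optim, gamma=0.95):
    """One alternating TD update: the DDPG policy first, then the DQN.

    batch holds tensors (s, d, a_d, r, s_next); q_net(s, a_d) returns one
    Q-value per discrete action, and actor(s) returns the (flattened)
    continuous parameters for every discrete action.
    """
    s, d, a_d, r, s_next = batch

    # DDPG step (Q-network parameters held constant): push the summed
    # Q-values upward; the minus sign turns maximization into a descent loss.
    actor_loss = -q_net(s, actor(s)).sum(dim=-1).mean()
    actor_optim.zero_grad()
    actor_loss.backward()
    actor_optim.step()

    # DQN step (DDPG parameters held constant): squared TD error against
    # y_exp = r + gamma * max_d' Q(s', d', mu(s'; omega_mu); omega_Q).
    with torch.no_grad():
        y_exp = r + gamma * q_target(s_next, actor_target(s_next)).max(dim=-1).values
    q_taken = q_net(s, a_d).gather(-1, d.unsqueeze(-1)).squeeze(-1)
    q_loss = 0.5 * F.mse_loss(q_taken, y_exp)
    q_optim.zero_grad()
    q_loss.backward()
    q_optim.step()
```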
The system state space is defined as follows:
s(t) = \big[\, P_V,\; P_{WP},\; P_{TU},\; Q_g,\; E_{SOC},\; c_{grid} \,\big]
The action space is defined as follows:
a(t) = \big[\, a_V(t),\; a_{WP}(t),\; a_{TU}(t),\; a_{SOC}(t),\; a_k[n] \,\big]
where a_V(t) represents the decision action for photovoltaic power generation to transmit power to the load and battery, a_WP(t) denotes the decision action for wind power generation to transmit power to the load and battery, a_TU(t) symbolizes the decision action for the micro gas turbine power output, a_SOC(t) represents the battery charging and discharging decision action, and a_k[n] is the binary decision variable for the microgrid components' output power.
The reward is defined as follows:
R(t) = C_{PV} \cdot a_V(t) + C_{WP} \cdot a_{WP}(t) + C_{TU} \cdot a_{TU}(t) + C_{grid} \cdot P_{grid}(t) - \big|\, SOC_{target} - SOC(t) \,\big| - \alpha \cdot \big|\, a_{SOC}(t) \,\big|
where C_PV, C_WP, C_TU, and C_grid represent the power generation revenue or operating costs for photovoltaic power generation, wind power generation, micro gas turbines, and grid power, respectively. SOC_target denotes the target state of charge, SOC(t) is the current state of charge, and α is a penalty coefficient. This reward encourages the battery to maintain an ideal state of charge and reduces frequent charging and discharging.
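For readability, a direct transcription of the reward into Python is given below; the sign convention on the penalty terms follows the stated intent (keeping the battery near its target state of charge and discouraging frequent cycling), and the argument names are illustrative.

```python
def reward(a_v, a_wp, a_tu, p_grid, a_soc, soc,
           c_pv, c_wp, c_tu, c_grid, soc_target, alpha):
    """Reward as defined above: revenue/cost terms for PV, wind, the micro gas
    turbine, and grid exchange, minus a penalty for deviating from the target
    state of charge and a penalty on battery cycling."""
    energy_term = c_pv * a_v + c_wp * a_wp + c_tu * a_tu + c_grid * p_grid
    soc_penalty = abs(soc_target - soc)
    cycling_penalty = alpha * abs(a_soc)
    return energy_term - soc_penalty - cycling_penalty
```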
The parameterized Dueling DQN and DDPG algorithm flow is shown in Figure 3.
The microgrid operation is optimized by the parameterized Dueling DQN and DDPG algorithms through the following steps (a compact training-loop sketch follows the list):
  • Time Slot Division: The entire operation period of the microgrid is divided into discrete time slots, each of length Δ t . This discretization allows for the modeling of microgrid dynamics and decision-making at specific intervals.
  • Network Initialization: The neural network parameters are initialized, which includes setting up the weights of the Dueling DQN and DDPG networks. An experience replay buffer is also initialized to store past experiences (state, action, reward, next state) for use in training.
  • Initial State Observation: The initial state of the microgrid is observed. This state typically includes information about the current load demand, renewable energy generation, storage levels, and other relevant factors affecting the microgrid’s operation.
  • Discrete Action Selection: Based on the current state, the Dueling DQN selects a discrete action. This discrete action could represent high-level operational decisions, such as switching on or off certain distributed energy resources or selecting between predefined operational modes.
  • Action Space Parameterization: The selected discrete action is parameterized to include corresponding continuous actions. These continuous actions represent finer-grained control, such as adjusting the power output of distributed energy resources or modulating the charge/discharge rates of storage devices.
  • Network Strategy Evaluation: The DDPG algorithm evaluates the strategy of the Dueling DQN network by calculating the Q-values for each action in the parameterized action space. This evaluation involves assessing how well the selected actions are expected to perform in terms of the reward signal.
  • Continuous Action Output: The DDPG algorithm then outputs deterministic values for the continuous actions. These values are precise control signals that specify the exact level of adjustment required for the distributed energy resources and other controllable elements within the microgrid.
  • Action Vector Formation: The discrete actions selected by the Dueling DQN and the continuous actions determined by the DDPG are combined to form a complete action vector. This vector represents the full set of operational decisions to be implemented in the microgrid during the current time slot.
  • Reward and State Transition: After the action vector is executed, the microgrid environment transitions to a new state. The immediate reward is obtained based on the operational cost, efficiency, and reliability of the microgrid in the new state. The reward reflects how well the microgrid is performing according to the optimization objectives.
  • Gradient Calculation: A random batch of past experiences is sampled from the experience replay buffer. The gradient of the loss function is calculated using this batch, which helps in reducing the temporal difference error between the predicted Q-values and the target Q-values.
  • Network Weight Update: The calculated gradients are used to update the weights of the Dueling DQN and DDPG networks. This weight update process enables the networks to improve their performance over time as they learn from the cumulative experience stored in the replay buffer.
  • Iteration: The process from step (3) to step (11) is repeated iteratively. This loop continues until the network converges (i.e., when the Q-values stabilize and the reward no longer significantly improves) or until a maximum number of training iterations is reached. Convergence indicates that the network has effectively learned the optimal control strategy for the microgrid.
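The list above maps onto a compact training skeleton, sketched below. The environment interface (env.reset/env.step), the helper q_net.best_discrete, and update_step_fn are hypothetical placeholders; exploration noise and target-network synchronization are omitted; episodes and horizon are illustrative values, while the batch size and buffer size follow Table 3.

```python
import random
from collections import deque

def train(env, actor, q_net, update_step_fn, episodes=200, horizon=24,
          batch_size=128, buffer_size=20_000):
    """Skeleton of the step list above for the hybrid action-space agent."""
    replay = deque(maxlen=buffer_size)              # experience replay buffer
    for _ in range(episodes):
        s = env.reset()                             # observe the initial state
        for _ in range(horizon):                    # one pass over the time slots
            a_d = actor(s)                          # continuous action parameters (DDPG)
            d = q_net.best_discrete(s, a_d)         # discrete action with the best Q-value
            s_next, r, done = env.step((d, a_d))    # apply the action vector, observe reward
            replay.append((s, d, a_d, r, s_next))
            if len(replay) >= batch_size:           # sample a batch and update network weights
                update_step_fn(random.sample(replay, batch_size))
            s = s_next
            if done:
                break                               # repeat until convergence / max iterations
    return actor, q_net
```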

4. Simulation and Result Analysis

4.1. Simulation Parameters

Within the simulation framework, a microgrid was established that includes a photovoltaic power generation system, a wind power generation system, a micro gas turbine power generation system, and a battery storage system. The efficacy of our proposed methodology was assessed and contrasted with the DQN and DDPG methods. The experimental results illustrate that our strategy can significantly reduce the operational costs of the microgrid, increase the utilization of renewable energy resources, and guarantee adherence to load demand and power balance constraints. The relevant operational parameters are detailed in Table 1, Table 2 and Table 3.
The hyperparameters are shown in the table below:
Table 3. Hyperparameter settings.
Parameter | Value
Size of the experience pool | 20,000
Batch size | 128
δ_a | 1 × 10⁻⁵
δ_c | 1 × 10⁻⁵
γ | 0.95
Maximum iterations | 20,000
The DRL parameters are shown in Table 4, including the actor network of the DDPG and the critic network of the Dueling DQN. The batch size of each network is 128, the learning rate is 1 × 10⁻⁴, and the memory capacity is 2000. The input layers are s(t).dim + a_d.dim (DDPG) and s(t).dim (Dueling DQN), respectively, and the hidden-layer structure is [256, 128, 64]. The output layer of the DDPG is a(t).dim, while in the critic network of the Dueling DQN it is 1 for V(s; ω_V) and a(t).dim for A(s, d, a_d; ω_A).
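A small constructor reflecting the layer structure of Table 4 is sketched below; the concrete integer dimensions are assumptions read off the state and action vectors defined in Section 3.2, and treating a_d.dim as the full action dimension is a simplification.

```python
import torch.nn as nn

def mlp(in_dim, out_dim, hidden=(256, 128, 64)):
    """Fully connected stack with the hidden-layer structure reported in Table 4."""
    layers, d = [], in_dim
    for h in hidden:
        layers += [nn.Linear(d, h), nn.ReLU()]
        d = h
    layers.append(nn.Linear(d, out_dim))
    return nn.Sequential(*layers)

# Illustrative sizes only.
state_dim, action_dim = 6, 5
ddpg_actor  = mlp(state_dim + action_dim, action_dim)  # DDPG (actor): s(t).dim + a_d.dim -> a(t).dim
dueling_v   = mlp(state_dim, 1)                        # Dueling DQN (critic, V): s(t).dim -> 1
dueling_adv = mlp(state_dim, action_dim)               # Dueling DQN (critic, A): s(t).dim -> a(t).dim
```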

4.2. Discussion and Result Analysis

In this study, we conducted simulation experiments to evaluate the performance of the parameterized Dueling DQN and DDPG algorithms compared to the DQN algorithm using varying learning rates. The simulation results are presented in Figure 4.
As shown in Figure 4, a smaller learning rate increases the reliability of convergence and avoids oscillations or divergence, but it can also cause the network to fall into local minima and fail to escape the local optimum. Larger learning rates lead to oscillations in the reward function, which may prevent convergence, but they also widen the search range, giving the network a greater possibility of finding the global optimum. Consequently, the maximum reward of a network with a larger learning rate is higher than that of a network with a smaller learning rate, as is the amplitude of its oscillations, as shown in Figure 5.
It can be observed in Figure 4 that, when the learning rate is 1 × 10⁻⁵, the Parametric Dueling DQN and DDPG algorithm begins to surpass the other algorithms after approximately 20 k steps. As the number of training steps increases, the average reward gradually rises and remains at a relatively high level (around 160). Despite some fluctuations around 80 k steps, this algorithm ultimately exhibits the best performance. In comparison, the average reward of the DDPG remains relatively stable throughout the training process, ranging between 90 and 110 after 40 k steps. Although the DDPG's performance is stable, it does not reach the heights achieved by the Parametric Dueling DQN and DDPG. The DQN curve shows a rapid increase in the initial stages of training, but the growth in average reward tends to plateau after approximately 40 k steps, ultimately fluctuating between 70 and 90, which is significantly lower than the other two algorithms.
As observed in Figure 5, the algorithm continues to exhibit the best performance even at a learning rate of 1 × 10⁻⁴. In the initial stages of training (0–20 k steps), the curve rises rapidly, achieving an average reward close to 140. Subsequently, the curve peaks between 20 k and 40 k steps (near 160), but as the number of training steps increases, the reward slightly decreases and stabilizes, eventually remaining between 120 and 140. This indicates that the algorithm can still learn effectively at a higher learning rate, albeit with potentially greater fluctuations. Overall, the parameterized Dueling DQN and DDPG combination algorithm demonstrates strong learning capabilities and relatively stable performance across different learning rates, while the performance of the standalone DDPG and DQN algorithms tends to degrade when using higher learning rates. This suggests that the parametric mixed action-space combination algorithm is more effective in complex environments.
The parameter comparison between the Parametric Dueling DQN and DDPG and other methods is shown in Table 5. It can be seen that the Parametric Dueling DQN and DDPG method has a higher average reward, but its convergence stability is slightly inferior to the DQN and the DDPG. The greedy algorithm exhibits smaller convergence variance, but it is prone to becoming trapped in local optima, resulting in lower rewards than those achieved by the Parametric Dueling DQN and DDPG algorithms. Although the genetic algorithm addresses the issue of local optima associated with the greedy algorithm, it suffers from a larger convergence variance and is less likely to achieve stable convergence.
The scheduling diagram is depicted in Figure 6. During the period from 0:00 to 10:00, which represents the trough period, wind and photovoltaic power generation can essentially satisfy the demand for electricity load, and surplus energy can be stored in the energy storage device. Between 6:00 and 14:00, as the photovoltaic power generation increases, while the wind power generation remains relatively stable, the charging power of the storage facility reaches its peak. From 14:00 to 20:00, as the photovoltaic power decreases and the load pressure gradually increases, the micro gas turbine commences power production, and the energy storage device gradually transitions from the charging state to the discharging state. It can be observed that the optimal scheduling scheme for the microgrid proposed in this paper can allocate energy based on the environmental conditions, fulfilling the roles of peak shaving and valley filling, thereby reducing grid fluctuations and dependence on the main grid and further diminishing the microgrid’s operational costs.

5. Conclusions

This paper proposes a microgrid optimization operation method based on the parameterized Dueling DQN and DDPG for the scheduling optimization problem of microgrids. This method can effectively solve complex optimization scheduling problems in microgrids, improve the quality of grid services, and minimize operating costs.
This method models the microgrid optimization problem as a hybrid action-space reinforcement learning problem. With this approach, discrete actions, such as battery charging and discharging and gas turbine start-up and shut-down, are handled by the parameterized Dueling DQN algorithm, while continuous actions, such as the generator output power and energy allocation ratios, are processed by the DDPG algorithm. The parameterized Dueling DQN avoids the value-estimation bias of traditional DQN algorithms by decomposing the Q-value function. At the same time, the DDPG effectively addresses the curse-of-dimensionality challenges brought about by continuous action spaces. The organic combination of the two enables this method to efficiently learn the optimal scheduling strategy for microgrids.
To verify the effectiveness and superiority of this method, this study conducted a large number of simulation experiments in different microgrid scenarios. The experimental results show that, compared with using the DQN or DDPG algorithm alone, this method significantly improved the average reward value, exhibiting better convergence and optimality. Meanwhile, in a comparative analysis with other heuristic optimization algorithms and rule-based scheduling strategies, this method achieved excellent results based on multiple evaluation indicators, such as power grid service quality, operating costs, and renewable energy utilization rate.
In summary, the proposed microgrid optimization operation method based on the parameterized Dueling DQN and DDPG can achieve intelligent optimization scheduling of microgrids in different scenarios, improve the overall efficiency and economy of the grid, and provide strong support for the green and intelligent development of microgrids.

Author Contributions

Research methodology, W.H. and Q.L.; coding, W.H. and Q.L.; result analysis and discussion, W.H. and Y.J.; investigation, Y.J. and X.L.; writing, W.H. and X.L. All authors have read and agreed to the published version of the manuscript.

Funding

This research was funded by the National Natural Science Foundation of China (No. 51937005).

Data Availability Statement

The original contributions presented in this study are included in the article; further inquiries can be directed to the corresponding author.

Conflicts of Interest

The funders had no role in the design of the study; in the collection, analyses, or interpretation of data; in the writing of the manuscript; or in the decision to publish the results.

References

  1. Wu, D.; Dragicevic, T.; Vasquez, J.C.; Guerrero, J.M.; Guan, Y. Secondary coordinated control of islanded microgrids based on consensus algorithms. In Proceedings of the 2014 IEEE Energy Conversion Congress and Exposition (ECCE), Pittsburgh, PA, USA, 14–18 September 2014; pp. 4290–4297. [Google Scholar] [CrossRef]
  2. Chen, B.; Jiang, J.; Shao, Y. Integrated Scheduling and Control System of Microgrid Based on Dynamic Programming Algorithm. In Proceedings of the 2023 IEEE International Conference on Integrated Circuits and Communication Systems (ICICACS), Raichur, India, 24–25 February 2023; pp. 1–5. [Google Scholar] [CrossRef]
  3. Dai, X.; Tang, Y.; Yao, S. Application of genetic algorithm and particle swarm algorithm in microgrid dispatch model considering energy storage. In Proceedings of the 2023 IEEE 6th International Conference on Automation, Electronics and Electrical Engineering (AUTEEE), Shenyang, China, 15–17 December 2023; pp. 850–854. [Google Scholar] [CrossRef]
  4. Wang, Y.; Liu, Y.; Zhao, K.; Deng, H.; Wang, F.; Zhuo, F. PEDF (Photovoltaics, Energy Storage, Direct Current, Flexibility) Microgrid Cost Optimization Based on Improved Whale Optimization Algorithm. In Proceedings of the 2023 IEEE 14th International Symposium on Power Electronics for Distributed Generation Systems (PEDG), Shanghai, China, 9–12 June 2023; pp. 598–603. [Google Scholar] [CrossRef]
  5. Ghavifekr, A.A.; Mohammadzadeh, A.; Ardashir, J.F. Optimal Placement and Sizing of Energy-related Devices in Microgrids Using Grasshopper Optimization Algorithm. In Proceedings of the 2021 12th Power Electronics, Drive Systems, and Technologies Conference (PEDSTC), Tabriz, Iran, 2–4 February 2021; pp. 1–4. [Google Scholar] [CrossRef]
  6. Wan, L.; Liu, L.; Cai, D.; Chen, R.; Rao, Y.; Liu, H.; Xie, L. Load frequency control of isolated microgrid based on soft actor-critic algorithm. In Proceedings of the 2022 Power System and Green Energy Conference (PSGEC), Shanghai, China, 25–27 August 2022; pp. 710–715. [Google Scholar] [CrossRef]
  7. Wu, J.; Birong, X.; Shulei, D. Dynamic Economic Dispatch of MicroGrid Using Improved Imperialist Competitive Algorithm. In Proceedings of the 2015 8th International Conference on Intelligent Computation Technology and Automation (ICICTA), Nanchang, China, 14–15 June 2015; pp. 397–401. [Google Scholar] [CrossRef]
  8. Vergara, P.P.; Torquato, R.; da Silva, L.C.P. Towards a real-time Energy Management System for a Microgrid using a multi-objective genetic algorithm. In Proceedings of the 2015 IEEE Power & Energy Society General Meeting, Denver, CO, USA, 26–30 July 2015; pp. 1–5. [Google Scholar] [CrossRef]
  9. Shuai, H.; Li, F.; Pulgar-Painemal, H.; Xue, Y. Branching Dueling Q-Network-Based Online Scheduling of a Microgrid with Distributed Energy Storage Systems. IEEE Trans. Smart Grid 2021, 12, 5479–5482. [Google Scholar] [CrossRef]
  10. Fan, Z.; Zhang, W.; Liu, W. Multi-Agent Deep Reinforcement Learning-Based Distributed Optimal Generation Control of DC Microgrids. IEEE Trans. Smart Grid 2023, 14, 3337–3351. [Google Scholar] [CrossRef]
  11. Chen, W.; Wu, N.; Huang, Y. Real-Time Optimal Dispatch of Microgrid Based on Deep Deterministic Policy Gradient Algorithm. In Proceedings of the 2021 International Conference on Big Data and Intelligent Decision Making (BDIDM), Guilin, China, 23–25 July 2021; pp. 24–28. [Google Scholar] [CrossRef]
  12. Xia, Y.; Xu, Y.; Wang, Y.; Dasgupta, S. A Distributed Control in Islanded DC Microgrid based on Multi-Agent Deep Reinforcement Learning. In Proceedings of the IECON 2020 The 46th Annual Conference of the IEEE Industrial Electronics Society, Singapore, 18–21 October 2020; pp. 2359–2363. [Google Scholar] [CrossRef]
  13. Islam, S.; Mostaghim, S.; Hartmann, M. A Survey on Multi-Objective Optimization in Microgrid Systems. In Proceedings of the 2024 IEEE Congress on Evolutionary Computation (CEC), Yokohama, Japan, 30 June–5 July 2024; pp. 1–8. [Google Scholar] [CrossRef]
  14. Bjerland, S.; Del Granado, P.C.; GrØttum, H.; Nokandi, E. TSO-DSO Coordination Under Wind and Solar Power Uncertainty: A Two-Stage Stochastic Programming Approach. In Proceedings of the 2024 20th International Conference on the European Energy Market (EEM), Istanbul, Turkiye, 10–12 June 2024; pp. 1–8. [Google Scholar] [CrossRef]
  15. Song, K.; Feng, J.; Ying, Z. Optimized operation of microgrid based on improved honey badger algorithm. In Proceedings of the 2024 3rd International Conference on Energy, Power and Electrical Technology (ICEPET), Chengdu, China, 17–19 May 2024; pp. 654–660. [Google Scholar] [CrossRef]
  16. Wang, X.; Vinel, A. Cross learning in deep q-networks. arXiv 2020, arXiv:2009.13780. [Google Scholar]
  17. Lin, N.; Tang, H.; Zhao, L.; Wan, S.; Hawbani, A.; Guizani, M. A PDDQNLP Algorithm for Energy Efficient Computation Offloading in UAV-Assisted MEC. IEEE Trans. Wirel. Commun. 2023, 22, 8876–8890. [Google Scholar] [CrossRef]
  18. Molotov, P.; Vaskov, A.; Tyagunov, M. Modeling Processes in Microgrids with Renewable Energy Sources. In Proceedings of the 2018 International Ural Conference on Green Energy (UralCon), Chelyabinsk, Russia, 4–6 October 2018; pp. 203–208. [Google Scholar] [CrossRef]
  19. Quan, D.; Tang, L.; Wang, X.; Xie, H. Battery-storage-centered Microgrids: Modelling and Simulation Demonstration. In Proceedings of the 2023 IEEE Sustainable Power and Energy Conference (iSPEC), Chongqing, China, 29–30 November 2023; pp. 1–6. [Google Scholar] [CrossRef]
Figure 1. Microgrid system architecture.
Figure 2. Parameterized Dueling DQN and DDPG network architecture.
Figure 3. Optimal scheduling process for microgrids based on Parametric Dueling DQN and DDPG.
Figure 4. Comparison of the performance of the algorithms at a learning rate of 1 × 10⁻⁵.
Figure 5. Comparison of the performance of the algorithms at a learning rate of 1 × 10⁻⁴.
Figure 6. Microgrid day-ahead scheduling.
Table 1. Parameters of storage batteries.
Maximum Capacity/kW·h | Minimum Capacity/kW·h | Maximum Charging Power/kW | Maximum Discharging Power/kW | Charge/Discharge Factor
200 | 200 | 20 | 20 | 0.9
Table 2. Parameters of micro gas turbine.
Maximum Output Power/kW | Electrical Efficiency
100 | 28%
Table 4. Parameters of DRL.
Network | Batch Size | Learning Rate | Memory Size | Input Layer | Hidden Layer | Output Layer
DDPG (actor) | 128 | 1 × 10⁻⁴ | 2000 | s(t).dim + a_d.dim | [256, 128, 64] | a(t).dim
Dueling DQN (critic, V) | 128 | 1 × 10⁻⁴ | 2000 | s(t).dim | [256, 128, 64] | 1
Dueling DQN (critic, A) | 128 | 1 × 10⁻⁴ | 2000 | s(t).dim | [256, 128, 64] | a(t).dim
Table 5. Comparison of parameters between Parametric Dueling DQN and DDPG and other methods.
Algorithm | Average Reward | Convergence Variance
Parameterized Dueling DQN and DDPG | 157.65 | 0.254
DDPG | 131.89 | 0.185
DQN | 109.45 | 0.148
Greedy algorithm | 85.32 | 0.076
Genetic algorithm | 94.43 | 0.463