Article

Model-Free Approach to DC Microgrid Optimal Operation under System Uncertainty Based on Reinforcement Learning

1 Department of Electrical Engineering and Information Technology, Universitas Gadjah Mada, Yogyakarta 55281, Indonesia
2 Center for Energy Studies, Universitas Gadjah Mada, Yogyakarta 55281, Indonesia
* Authors to whom correspondence should be addressed.
Energies 2023, 16(14), 5369; https://doi.org/10.3390/en16145369
Submission received: 12 May 2023 / Revised: 19 June 2023 / Accepted: 4 July 2023 / Published: 14 July 2023

Abstract

There has been tremendous interest in the development of DC microgrid systems, which consist of interconnected DC renewable energy sources. However, operating a DC microgrid system optimally by minimizing operational cost while ensuring stability remains a problem when the system's model is not available. In this paper, a novel model-free approach to the operation control of DC microgrids based on reinforcement learning algorithms, specifically Q-learning and Q-network, is proposed. This approach circumvents the need for an accurate model of the DC grid by exploiting interaction with the DC microgrid to learn the best policy, which leads to more optimal operation. The proposed approach has been compared with mixed-integer quadratic programming (MIQP) as the baseline deterministic method that requires an accurate system model. The results show that, in a system of three nodes, both Q-learning (74.2707) and Q-network (74.4254) are able to learn to make control decisions that are close to the MIQP solution (75.0489). With the introduction of both model uncertainty and noisy sensor measurements, the Q-network performs better (72.3714) than MIQP (72.1596), whereas Q-learning fails to learn.

1. Introduction

There has been tremendous interest in generating electric energy from renewable energy resources [1,2] owing to environmental concerns and current advances in power electronics. Unlike conventional power plants, which often come in large sizes, the capacity of renewable power plants strongly depends on local potential and ranges from a few kilowatts to several megawatts.
Currently, small-sized renewable energy power plants, such as micro hydro, photo-voltaic (PV), or wind turbines, are often connected to local loads at the distribution level to form autonomous alternating current (AC) microgrids [3]. These systems are expected to be able to work with and without a grid connection. However, the problems of reactive power compensation and frequency control in AC microgrids (ACMG) can be more challenging than in conventional AC systems due to the intermittent nature of wind and solar power sources [4].
As most renewable energy sources, especially type-4 wind turbines and PV, inherently produce direct current (DC) outputs, it is natural to interconnect them using DC voltage to form direct current microgrids (DCMG). Unlike AC microgrids, DC microgrids have no problems with reactive power and frequency control [4]. Moreover, in a DC microgrid system, the use of an energy storage system (ESS), notably a battery energy storage system (BESS), becomes indispensable to balance the intermittency. This has been demonstrated in [5], where a method to coordinate several ESSs has been proposed.
DC voltage in a DC grid can be considered the power balance indicator, i.e., similar to frequency in an AC grid [4,6,7,8,9]. By controlling the DC voltage level within the grid, a certain power flow can be achieved. A DC source connected to a DC grid can be operated in three different modes, i.e., droop-controlled voltage, constant power, or constant voltage mode. The response of these modes when subjected to a disturbance is illustrated in Figure 1.
Wind or PV power plants are usually operated at their maximum power point; hence, these DC sources are often operated in active power control, i.e., injecting active power into the DC microgrid system regardless of the DC voltage condition of the DC microgrid. This also means that the intermittent operation of these power plants is reflected in DC voltage fluctuations. In order to counter the changes in the DC voltage, a BESS can be operated in the DC voltage mode to maintain a constant DC voltage. This means that the BESS can absorb or inject power whenever there is a surplus or deficiency of power in the DC microgrid. However, the operation of a BESS is a complex problem, as it has special constraints such as depth of discharge (DOD) requirements, state of charge (SOC) limitations, and charge and discharge rates, among others [10]. Therefore, it might happen that the BESS switches to the DC voltage droop mode to limit its contribution to maintaining the DC voltage within the DC microgrid, or even operates in the constant power mode when reaching its charge or discharge limit.

1.1. Related Work

In order to operate DC microgrids in an optimal way, a method to minimize operating costs has been proposed in [11] using a particle swarm optimization (PSO) algorithm. A similar idea of minimizing operating cost while improving stability has been proposed in [12] using a switched-system approach called supervisory control (SC), which appropriately selects the operating mode to optimize the control objectives. In other studies using adaptive dynamic programming (ADP) [13] and tabu search (TS) [14], researchers found that a central control system coordinating distributed control devices across multiple nodes in the microgrid can help optimize a common objective. The switching problem of voltage sources across multiple nodes has been further studied in [15] to guarantee the analytic performance of the proposed controller. The authors in [15] formulated the switching problem of multi-node voltage sources as a mixed-integer linear programming (MILP) problem. The stability study used in [15] is based on the earlier small-signal model for low-voltage DCMG (LVDCMG) in [16].
Although the proposed methods in the literature have demonstrated satisfactory performance in the scenarios presented in the corresponding papers, they necessitate an exact model of the DC grid. This assumption is very restrictive for several reasons. Firstly, from the practical aspect, DC microgrids consist of several components coming from different vendors. Each vendor is normally unwilling to share the details of its components, e.g., the transfer functions of the components and their controllers, to safeguard its intellectual property. Secondly, even if accurate models are available, some components experience wear and tear after a certain time, which leads to model uncertainty. The case of an incomplete state model is explored in [17] by considering communication delay. The authors overcame the lack of an exact system model by implementing a heuristic-based approach, namely a genetic algorithm (GA), because a GA does not require an exact model to obtain a sufficiently good approximate solution. Hence, the natural remedy is to use a method that does not rely on an exact model of the DC grid.
In the current power system hierarchy, large amounts of data are generated by, for example, smart meters and phasor measurement units (PMUs). These data contain very rich information regarding the operating status of the system at each time instant. On the other hand, recent progress in applied mathematics and data science makes it possible to extract useful information for power system operation without prior supervisory knowledge of how to interpret the data. An example of this is the reinforcement learning (RL) approach [18], a method in machine learning that allows an agent to take a sequence of appropriate actions to maximize a certain objective. This method has recently been adopted in power systems [19], for instance to solve the Volt/Var control (VVC) problem in medium-voltage AC (MVAC) distribution systems [20], real-time wide-area stabilizing control in a high-voltage AC (HVAC) system [21], and data-driven load frequency control in a multi-area HVAC system [22].
The purpose of this study is to propose a novel method to optimally solve the DC microgrid switching problem, given limited information about the exact model of the DC grid, based on a reinforcement learning algorithm. The limitation is modeled as noise and error in the system matrix model, simulating noise in sensor measurements and an imprecise power system model. Table 1 summarizes the differences between our study and related works. The data challenges shown in the table refer to communication delay, measurement noise, and model error.

1.2. Contributions

Due to the fact that accurate models of DC grids are not always readily available, this paper proposes a novel model-free approach to the operation control of DC microgrids based on reinforcement learning. Variants of the reinforcement learning algorithm are proposed in this paper because of their ability to learn efficiently from past state and action data, such as the current and operation cost data corresponding to each mode of operation in the context of a DC microgrid system. In addition, we show that our proposed approach can cope with model uncertainty due to noise in sensor measurements and an imprecise power system model, while producing a near-optimal policy that takes both stability and operational cost into consideration.
The contributions of this paper can be summarized as follows:
  • Propose a novel model-free approach for solving the LVDCMG optimal switching problem
  • Demonstrate the ability of a reinforcement learning algorithm to solve the LVDCMG optimal switching problem under measurement noise and an imprecise power system model
  • Provide a minimal working example, including parameter settings, for applying reinforcement learning to the LVDCMG optimal switching problem

2. Operation Control of DC Microgrids

2.1. Models of DC Microgrids

Each power source in a distributed droop-controlled LVDCMG can be operated in different modes, that is, as a droop-controlled voltage source (DVS), a constant power source (CPS), or a constant power load (CPL) [15,16], as can be seen in Figure 2. For instance, a battery can be operated as a CPS when discharging, a CPL when charging, or a DVS during both charging and discharging.
For each droop-controlled voltage source $j \in \mathcal{N}_d$,
$$V_j = V_{j0} - d_j P_{Dj},$$
where $V_j$ is the droop controller output voltage, $V_{j0}$ is the droop controller nominal voltage value, $d_j$ is the droop controller gain, and $P_{Dj}$ is the droop controller output power. The small-signal representation of the droop-controlled voltage source can be modeled as
$$v_j = -\left( \frac{d_j \bar{V}_j}{1 + d_j \bar{I}_j} \right) i_{Dj},$$
where $v_j$ is the small-signal perturbation of the voltage, $\bar{V}_j$ and $\bar{I}_j$ are the DC voltage and current at the steady-state operating point, respectively, and $i_{Dj}$ is the corresponding small-signal perturbation of the current. The individual small-signal models for each node are aggregated, which leads to
$$\mathbf{v} = -\tilde{\mathbf{d}}\, \mathbf{i}_D,$$
where $\mathbf{v} := [v_1, v_2, \ldots, v_{N_d}]$, $\mathbf{i}_D := [i_1, i_2, \ldots, i_{N_d}]$, and $\tilde{\mathbf{d}}$ is a diagonal matrix with diagonal elements $d_j \bar{V}_j / (1 + d_j \bar{I}_j)$, $j = 1, 2, \ldots, N_d$.
Suppose that the microgrid consists of $N_{CPS}$ CPSs out of $N_s$ sources, where the set of CPSs is given by $\mathcal{N}_{CPS} := \{1, 2, \ldots, N_{CPS}\}$. By modeling a CPS as a current source in parallel with a conductance, for each CPS $j \in \mathcal{N}_{CPS}$ we have the following model
$$i_{CPSj} = g_{CPSj} v_j + \tilde{i}_{CPSj},$$
where $i_{CPSj}$ and $v_j$ are the corresponding CPS small-signal output current and voltage. The values of the corresponding conductance $g_{CPSj}$ and current source $I_{CPSj}$ are determined by the power and voltage at the operating point as follows
$$g_{CPSj} = \frac{P_{CPSj}}{V_{DCj}^2},$$
$$I_{CPSj} = \frac{2 P_{CPSj}}{V_{DCj}},$$
where $P_{CPSj}$ and $V_{DCj}$ are the output power and nominal voltage at the operating point of the CPS. It should be noted that $I_{CPSj}$ is the nominal current source value, while $\tilde{i}_{CPSj}$ is the small-signal perturbation of the current source. All the individual small-signal models can be aggregated into the following equation
$$\mathbf{i}_{CPS} = g_{CPS}\, \mathbf{v}_{CPS} + \tilde{\mathbf{i}}_{CPS},$$
where $\mathbf{i}_{CPS} := [i_{CPS1}, i_{CPS2}, \ldots, i_{CPS N_{CPS}}]$, $\mathbf{v}_{CPS} = [v_1, v_2, \ldots, v_{N_{CPS}}]$, $\tilde{\mathbf{i}}_{CPS} = [\tilde{i}_{CPS1}, \tilde{i}_{CPS2}, \ldots, \tilde{i}_{CPS N_{CPS}}]$, and $g_{CPS} = \mathrm{diag}(g_{CPS1}, \ldots, g_{CPS N_{CPS}})$.
Similar to a CPS, a CPL is modeled as a negative conductance in parallel with a current sink. Suppose that the set of CPLs is given by $\mathcal{N}_{CPL} := \{1, 2, \ldots, N_{CPL}\}$, where $N_{CPL}$ refers to the number of CPLs in the microgrid. Each CPL $j \in \mathcal{N}_{CPL}$ can be modeled as follows
$$i_{CPLj} = g_{CPLj} v_j + \tilde{i}_{CPLj},$$
where $i_{CPLj}$ and $v_j$ are the corresponding small-signal output current and voltage of the CPL, respectively. The values of the corresponding conductance $g_{CPLj}$ and current sink $I_{CPLj}$ are determined by the power and voltage at the operating point as follows:
$$g_{CPLj} = -\frac{P_{CPLj}}{V_{DCj}^2},$$
$$I_{CPLj} = \frac{2 P_{CPLj}}{V_{DCj}},$$
where $P_{CPLj}$ and $V_{DCj}$ are the output power and nominal DC voltage at the operating point of the CPL. The small-signal models can be aggregated into
$$\mathbf{i}_{CPL} = g_{CPL}\, \mathbf{v}_{CPL} + \tilde{\mathbf{i}}_{CPL},$$
where $\mathbf{i}_{CPL} := [i_{CPL1}, i_{CPL2}, \ldots, i_{CPL N_{CPL}}]$, $\mathbf{v}_{CPL} = [v_1, v_2, \ldots, v_{N_{CPL}}]$, $\tilde{\mathbf{i}}_{CPL} = [\tilde{i}_{CPL1}, \tilde{i}_{CPL2}, \ldots, \tilde{i}_{CPL N_{CPL}}]$, and $g_{CPL} = \mathrm{diag}(g_{CPL1}, \ldots, g_{CPL N_{CPL}})$.
Suppose that the set of $N_b$ power lines connecting adjacent vertices (sources and loads) in the microgrid is given by $\mathcal{N}_b := \{1, 2, \ldots, N_b\}$. Each power line $j \in \mathcal{N}_b$ can be modeled as follows
$$v_{bj} = l_{bj} \frac{d i_{bj}}{dt} + r_{bj} i_{bj},$$
where $v_{bj}$ and $i_{bj}$ are the power line voltage and current, respectively, and $l_{bj}$ and $r_{bj}$ are the inductance and resistance of line $j$, respectively. The aggregate of all individual small-signal models can be expressed as
$$\mathbf{v}_b = l_b \frac{d \mathbf{i}_b}{dt} + r_b \mathbf{i}_b,$$
where $\mathbf{v}_b := [v_{b1}, v_{b2}, \ldots, v_{b N_b}]$, $\mathbf{i}_b = [i_{b1}, i_{b2}, \ldots, i_{b N_b}]$, $l_b = \mathrm{diag}(l_{b1}, \ldots, l_{b N_b})$, and $r_b = \mathrm{diag}(r_{b1}, \ldots, r_{b N_b})$.
Finally, the DC microgrid obeys Kirchhoff's voltage law (KVL) and Kirchhoff's current law (KCL):
$$\mathbf{v}_b = M \mathbf{v},$$
$$\mathbf{i}_D + \mathbf{i}_{CP} = M^T \mathbf{i}_b,$$
where $M$ is the line-to-node incidence matrix, and $g_{CP}$ and $\tilde{\mathbf{i}}_{CP}$ below collect the CPS and CPL conductances and current sources defined above. By combining (3), (7), (11) and (13)–(15), the dynamics of the DC microgrid can be summarized in the following state equation:
$$\dot{x} = A x + B,$$
where $x = \mathbf{i}_b \in \mathbb{R}^{N_x}$ and
$$A = -l_b^{-1} \left( M \left( \tilde{\mathbf{d}}^{-1} - g_{CP} \right)^{-1} M^T + r_b \right),$$
$$B = l_b^{-1} M \left( \tilde{\mathbf{d}}^{-1} - g_{CP} \right)^{-1} \tilde{\mathbf{i}}_{CP}.$$
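To make the aggregation concrete, the following minimal NumPy sketch assembles $A$ and $B$ for a hypothetical three-node, two-line system. All parameter values and the incidence matrix are illustrative assumptions, and the signs follow the reconstruction of (17) and (18) above rather than a specific implementation.

```python
import numpy as np

# Hypothetical 3-node, 2-line system (values for illustration only).
M = np.array([[1.0, -1.0, 0.0],          # line-to-node incidence matrix (rows = lines)
              [0.0, 1.0, -1.0]])
d_tilde = np.diag([0.05, 0.04, 0.06])    # droop terms d_j*Vbar_j / (1 + d_j*Ibar_j)
g_cp    = np.diag([-0.02, 0.01, -0.03])  # aggregated CPS/CPL conductances
i_cp    = np.array([0.1, -0.2, 0.05])    # aggregated CPS/CPL current sources
l_b     = np.diag([1e-3, 1.2e-3])        # line inductances
r_b     = np.diag([0.1, 0.12])           # line resistances

# Equations (17)-(18): node impedance seen by the lines, then A and B.
Z_node = np.linalg.inv(np.linalg.inv(d_tilde) - g_cp)
A = -np.linalg.inv(l_b) @ (M @ Z_node @ M.T + r_b)
B = np.linalg.inv(l_b) @ M @ Z_node @ i_cp
print(A.shape, B.shape)   # (2, 2) (2,)
```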
Any operational mode (i.e., DVS, CPS, or CPL) follows the linear state equation described in (16), albeit with different values of $A$ and $B$. Therefore, the dynamics can be slightly modified as follows
$$\dot{x} = A_\sigma x + B_\sigma,$$
where $A_\sigma$ and $B_\sigma$ stand for the matrices $A$ and $B$, respectively, for mode $\sigma \in \{1, \ldots, N\}$, where $N$ refers to the total number of modes under consideration.
The DC microgrid's dynamics can be reformulated in discrete form as follows
$$x(k+1) = x(k) + \dot{x}(k)\, \Delta t,$$
where $\Delta t$ is a time-integration constant. Substituting (19) into (20), the complete dynamics in discrete form can be rewritten as
$$x(k+1) = A_k^\sigma x(k) + B_k^\sigma,$$
where $A_k^\sigma = A_\sigma \Delta t + I$ and $B_k^\sigma = B_\sigma \Delta t$. Therefore, the decision variable that we can give to the system is $u(k) = [u_1(k)\; u_2(k)\; \cdots\; u_N(k)]^T$, where $u_\sigma \in \{0, 1\}$ refers to the status of mode $\sigma$ of the DC microgrid and is given by
$$u_\sigma = \begin{cases} 1 & \text{if mode } \sigma \text{ is active} \\ 0 & \text{if mode } \sigma \text{ is inactive.} \end{cases}$$
Note that only one mode can be active at a time.
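As an illustration of the discrete switched dynamics in (21), the short Python sketch below performs the forward-Euler discretization and rolls out an arbitrary mode schedule; the matrices and mode sequence are placeholder values, not the case-study parameters.

```python
import numpy as np

def discretize(A, B, dt):
    """Forward-Euler discretization: A_k = I + A*dt, B_k = B*dt, cf. Eq. (21)."""
    n = A.shape[0]
    return np.eye(n) + A * dt, B * dt

def simulate(modes, A_list, B_list, x0, dt):
    """Roll out x(k+1) = A_k^sigma x(k) + B_k^sigma for a given mode sequence."""
    x = np.array(x0, dtype=float)
    traj = [x.copy()]
    for sigma in modes:                      # sigma indexes the active mode at step k
        A_k, B_k = discretize(A_list[sigma], B_list[sigma], dt)
        x = A_k @ x + B_k
        traj.append(x.copy())
    return np.array(traj)

# Hypothetical two-mode, two-line example (placeholder numbers).
A_list = [np.array([[-1.0, 0.2], [0.2, -1.5]]), np.array([[-3.0, 0.1], [0.1, -2.0]])]
B_list = [np.array([0.5, 0.0]), np.array([0.0, 0.3])]
traj = simulate(modes=[0, 0, 1, 1, 0], A_list=A_list, B_list=B_list, x0=[1.0, -0.5], dt=0.1)
print(traj.shape)  # (6, 2): initial state plus five steps
```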

2.2. Problem Statement

In this paper, we try to find the optimal operational control for the DC microgrids modeled in Section 2.1. The problem can be formulated as follows:
Problem 1
(Optimal operation control). Given a microgrid system following a certain dynamics $x(k+1) = f(x(k), u(k))$ which produces an operational cost $c(x(k), u(k))$, what is the input signal $u^*(k) = [u_1^*(k)\; u_2^*(k)\; \cdots\; u_N^*(k)]^T$ which will minimize the total cost $C$ defined as
$$C = \sum_{k=0}^{M} c(x(k), u(k)).$$
Here, $N$ denotes the number of available modes of operation, $u_i \in \{0, 1\}$ for every $i \in \{1, \ldots, N\}$, and $M$ denotes the number of steps.
Suppose that the cost function at iteration $k$ is given by
$$c(k) = \frac{1}{2} x(k)^T Q x(k) + \gamma_\sigma u_\sigma(k),$$
where $Q \in \mathbb{R}^{N_x \times N_x}$ is a positive definite matrix and $\gamma_\sigma$ is the cost of operating the currently active mode $\sigma$. Here, the first term corresponds to the stability of the system (i.e., an unstable system causing large values of the state $x$ will be penalized), while the second term corresponds to the operating cost of the mode being employed. If the system dynamics $f(x(k), u(k))$ are fully known and linear and the cost function $c(x(k), u(k))$ is linear or quadratic, one way to solve the presented optimization problem is to employ linear/quadratic programming. In this case, the system dynamics in (19) can be written as a constraint of the optimization problem. One can then employ an algorithm such as mixed-integer quadratic programming (MIQP) to solve the optimization problem, as reported in [15].
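For intuition, the stage cost in (24) and the total cost in (23) can be evaluated directly for small horizons; the sketch below exhaustively enumerates mode schedules for a toy two-mode, one-state system. It is an illustration only, not the MIQP formulation of [15], and all numerical values are placeholders.

```python
import numpy as np
from itertools import product

def stage_cost(x, sigma, Q, gamma):
    """Quadratic stability penalty plus mode operating cost, Eq. (24)."""
    return 0.5 * x @ Q @ x + gamma[sigma]

def total_cost(modes, A_k, B_k, x0, Q, gamma):
    """Accumulate c(k) along the trajectory produced by a mode schedule, Eq. (23)."""
    x, C = np.array(x0, float), 0.0
    for sigma in modes:
        C += stage_cost(x, sigma, Q, gamma)
        x = A_k[sigma] @ x + B_k[sigma]
    return C

# Toy two-mode, one-state system (placeholder discretized matrices).
A_k = [np.array([[0.9]]), np.array([[0.5]])]
B_k = [np.array([0.1]), np.array([0.1])]
Q, gamma = np.array([[1.0]]), [0.1, 0.3]

M = 6  # short horizon so exhaustive search (2^M schedules) stays cheap
best = min(product(range(2), repeat=M),
           key=lambda m: total_cost(m, A_k, B_k, [1.0], Q, gamma))
print(best, total_cost(best, A_k, B_k, [1.0], Q, gamma))
```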
However, the requirement to fully know the system dynamics accurately might be difficult to realize in many practical situations. Some parameters in a DC microgrid system might not be known beforehand due to communication delay [17]. Meanwhile, system identification to acquire the model parameters could be complicated, especially if the system contains a large number of components.

3. Reinforcement-Learning-Based Operation Control

3.1. DC Microgrids as a Markov Decision Process

To solve the optimization problem in Problem 1, we can consider the DC microgrid with unknown system dynamics as a Markov Decision Process (MDP) [23]. By doing so, we assume that the DC microgrid system fulfills the following criteria:
  • the future state $s(k+1)$ depends only on the current state $s(k)$, not on the previous state history,
  • the system accepts a finite set of actions $a(k)$ at every step,
  • the system provides state information $s(k)$ and a reward $r(k)$ at every step.
The reward function $r(k)$ denotes how good the performance of the state $s(k)$ at iteration $k$ is. Therefore, the optimization problem described in Problem 1, where the cost function $C$ needs to be minimized, can be transformed into an optimization problem of finding a policy $\pi = \{u^*(k)\}$ which maximizes the total reward $R = \sum_{k=0}^{M} r(k)$ over $M$ steps.
The discrete dynamics equation in (21) can be transformed into an MDP where the state $s$ is given by $s = [x\; c]^T$ and the action is $a = u$. We assume that the cost $c(k)$ follows Equation (24). The reward function $r(k)$ can be derived from the cost function $c(k)$ as follows:
$$r(k) = \begin{cases} c_s - c(k) & \text{if } c_s > c(k) \\ 0 & \text{if } c_s \le c(k), \end{cases}$$
where $c_s$ denotes a positive constant. Thus, the objective function used to evaluate each model is
$$\max_u \sum_{k=0}^{M} r(k).$$
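A minimal environment wrapper that exposes the switched dynamics as an MDP with the reward shaping of (25) is sketched below; the class name, interface, and numerical values are illustrative assumptions rather than the authors' implementation.

```python
import numpy as np

class DCMicrogridEnv:
    """Toy MDP view of the switched dynamics: state s = [x, c], action = mode index."""
    def __init__(self, A_k, B_k, Q, gamma, c_s, x0, horizon):
        self.A_k, self.B_k, self.Q, self.gamma = A_k, B_k, Q, gamma
        self.c_s, self.x0, self.horizon = c_s, np.array(x0, float), horizon

    def reset(self):
        self.k, self.x = 0, self.x0.copy()
        return np.append(self.x, 0.0)                               # s = [x, c]

    def step(self, sigma):
        cost = 0.5 * self.x @ self.Q @ self.x + self.gamma[sigma]   # Eq. (24)
        reward = max(self.c_s - cost, 0.0)                          # Eq. (25)
        self.x = self.A_k[sigma] @ self.x + self.B_k[sigma]         # Eq. (21)
        self.k += 1
        done = self.k >= self.horizon
        return np.append(self.x, cost), reward, done

# Example instantiation with placeholder values.
env = DCMicrogridEnv(A_k=[np.array([[0.9]]), np.array([[0.5]])],
                     B_k=[np.array([0.1]), np.array([0.1])],
                     Q=np.array([[1.0]]), gamma=[0.1, 0.3], c_s=0.2,
                     x0=[1.0], horizon=28)
s = env.reset()
```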

3.2. Q-Learning for Near-Optimal Operation Control

To solve the optimization problem for the MDP modeling the dynamics of the DC microgrids, a reinforcement learning approach is employed. This approach allows an agent to interact with an environment (i.e., the system) without knowledge of the system's dynamics by applying an action $a$ and receiving state information $s$ as well as a reward $r$, as shown in Figure 3. The agent then learns from this information to decide the optimal policy $\pi = \{a(k)\}$, where $k \in \{0, \ldots, M\}$, which maximizes the cumulative reward $R = \sum_{k=0}^{M} r(k)$ over $M$ steps.
The reinforcement learning algorithm employed in this paper falls under the class of algorithms called Q-learning [18], which has been successfully employed for various engineering problems, such as gaming [24] and robotics [25]. The algorithm works by predicting an action-value function $Q(s, a)$ for every state $s$ and action $a$, which reflects the expected final reward when choosing action $a$ in state $s$. Once $Q(s, a)$ is established, the optimal policy $\pi(s)$ for every state $s$ selects the action with the highest action value (i.e., the one that will yield the most reward in the long term), as given by [18],
$$\pi(s) = \arg\max_a Q(s, a).$$
For a discrete number of states, the way to build an estimate of the action-value function $Q(s, a)$ is by constructing a table which maps a state $s$ and action $a$ into a value $Q(s, a)$, as shown in Figure 4. This algorithm, called tabular Q-learning, starts with an empty table. The table is updated upon interaction with the environment by taking into account the current state $s$, the given action $a$, the current reward $r(s)$, and the next state $s' = s(k+1)$. The update equation is based on Bellman's optimality equation and is given by [18],
$$Q(s, a) \leftarrow Q(s, a) + \alpha \left( r(s) + \gamma \max_{a'} Q(s', a') - Q(s, a) \right).$$
Here, $\alpha$ stands for a learning rate constant, which determines how much we allow the value of $Q(s, a)$ to change at every iteration, and $\gamma$ stands for the reward discount factor, which determines how much we value the contribution of future rewards to the value of the current state. The iteration process stops once the cumulative reward $R$ over $M$ steps reaches a certain threshold $R_s$.
One of the main reasons for choosing Q-learning to solve the optimization problem in Problem 1 is that it is an off-policy algorithm, i.e., it is able to learn from and reuse any past experience regardless of the policy used to gather this experience. In the context of DC microgrids, Q-learning is able to learn from past data of state and action pairs, such as the current and operation cost data corresponding to any mode of operation. Due to this fact, Q-learning has better data sampling efficiency than online reinforcement learning approaches such as policy gradient methods [26]. Another reason is that it is designed specifically for problems with a discrete action space, which is characteristic of the operation control problem, i.e., determining which mode of operation is optimal among a discrete number of possible modes.
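The tabular update in (28) with an epsilon-greedy behavior policy can be written compactly as below; the sketch reuses the illustrative environment from Section 3.1, discretizes the continuous state by rounding (an assumption made only for this example), and all hyperparameter values are placeholders.

```python
import numpy as np
from collections import defaultdict

def q_learning(env, n_modes, episodes=2000, alpha=0.2, gamma=0.9, eps=0.1):
    """Tabular Q-learning with an epsilon-greedy behavior policy, Eq. (28)."""
    Q = defaultdict(lambda: np.zeros(n_modes))
    key = lambda s: tuple(np.round(s, 2))          # coarse state discretization
    for _ in range(episodes):
        s, done = env.reset(), False
        while not done:
            a = (np.random.randint(n_modes) if np.random.rand() < eps
                 else int(np.argmax(Q[key(s)])))
            s_next, r, done = env.step(a)
            target = r + (0.0 if done else gamma * np.max(Q[key(s_next)]))
            Q[key(s)][a] += alpha * (target - Q[key(s)][a])   # Bellman update
            s = s_next
    return Q

# Greedy policy extraction, Eq. (27): pi(s) = argmax_a Q(s, a).
# Q = q_learning(env, n_modes=2); a_star = int(np.argmax(Q[tuple(np.round(s, 2))]))
```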

3.3. Q-Network for Operation Control under Uncertainty

The tabular Q-learning approach has been found to be extremely effective in various applications when the number of states $s$ is discrete. However, this is mostly not the case for DC microgrid systems, which can have a large number of possible current values and operational costs as states. The presence of uncertainty in the measurement of the state value adds further complications. To handle a system with a large number of states, or even a continuous state, the Q-table can be replaced with a function approximator which directly maps state $s$ and action $a$ into $Q(s, a)$. One such function approximator which has gained prominence in recent years is the neural network, due to its ability to approximate highly non-linear functions solely from raw data [27]. A recent study employing neural networks in Q-learning as a function approximator of the action-value function $Q(s, a)$ was carried out in [28]. This approach, called Q-network, works essentially the same as Q-learning, except that the Q-table is replaced with a neural network and the update equation in (28) is replaced with training steps of the neural network.
The neural network $\phi(s)$, as depicted in Figure 5, accepts a state $s$ as an input and returns the value $Q(s, a)$ of each possible discrete action $a$. The target output of the network is derived from Bellman's equation, similar to the term in Equation (28), as follows:
$$Q_t(s, a) = r(s) + \gamma \max_{a'} Q(s', a').$$
The loss, given by $L = Q(s, a) - Q_t(s, a)$, is used to update the network's parameters using the gradient descent algorithm.
To facilitate better training of the neural network (which ideally requires uncorrelated training data), a strategy called an experience buffer is employed. Upon interacting with the environment, the agent does not update the neural network's parameters solely based on the current information, but saves a tuple $\tau$ defined as $\tau = (s, a, s', r)$ into a buffer of length $L_b$. In every iteration, a total of $N_t < L_b$ tuples are randomly sampled from the experience buffer and used as training data. The training process stops once the average cumulative reward over $M$ steps, $R_{avg}$, in the last $N_E$ completed episodes reaches a certain threshold $R_s$. To obtain diverse data at the initial stage of training, the agent initially starts with purely random actions, i.e., the probability $\epsilon$ of choosing a random action is set to 1. This parameter $\epsilon$ is then linearly decreased at every iteration until it reaches a certain minimum value $\epsilon_{min}$ after $N_\epsilon$ iterations. Afterwards, the agent has a probability of $\epsilon_{min}$ of choosing a random action and a probability of $1 - \epsilon_{min}$ of choosing the optimal action in (27) from the output $Q(s, a)$ provided by the neural network.
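A compact PyTorch sketch of the Q-network with an experience buffer is given below; it mirrors the one-hidden-layer, four-node architecture mentioned in Section 4, but the buffer size, batch size, optimizer, and learning rate are placeholder assumptions rather than the authors' settings.

```python
import random
from collections import deque
import numpy as np
import torch
import torch.nn as nn

class QNetwork(nn.Module):
    """State -> Q(s, a) for every discrete action; one hidden layer of 4 units."""
    def __init__(self, state_dim, n_actions, hidden=4):
        super().__init__()
        self.net = nn.Sequential(nn.Linear(state_dim, hidden), nn.ReLU(),
                                 nn.Linear(hidden, n_actions))

    def forward(self, s):
        return self.net(s)

def train_step(qnet, optimizer, buffer, batch_size=32, gamma=0.99):
    """Sample uncorrelated transitions and regress Q(s,a) onto the Bellman target Q_t(s,a)."""
    if len(buffer) < batch_size:
        return
    s, a, s_next, r = map(np.array, zip(*random.sample(buffer, batch_size)))
    s = torch.as_tensor(s, dtype=torch.float32)
    s_next = torch.as_tensor(s_next, dtype=torch.float32)
    a = torch.as_tensor(a, dtype=torch.int64)
    r = torch.as_tensor(r, dtype=torch.float32)
    with torch.no_grad():
        target = r + gamma * qnet(s_next).max(dim=1).values   # Bellman target
    q_sa = qnet(s).gather(1, a.unsqueeze(1)).squeeze(1)
    loss = nn.functional.mse_loss(q_sa, target)               # L = Q(s,a) - Q_t(s,a), squared
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()

# Usage outline: interact with the environment, push (s, a, s', r) tuples into the
# buffer, decay epsilon linearly, and call train_step at every iteration.
buffer = deque(maxlen=50_000)
qnet = QNetwork(state_dim=2, n_actions=2)
optimizer = torch.optim.Adam(qnet.parameters(), lr=1e-3)
```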

4. Methodologies

To evaluate the performance of the algorithms, simulations in the following two environments are used:
  • A simple environment: a simplified version of a DC microgrid system consisting of a single source and a single load connected via a transmission line ($N_x = 1$). The system can be operated in 2 different modes ($N = 2$). The number of update steps in a single episode of operation is set to $M = 28$. For simplicity, the state equation matrices are assumed to be $A = B = 1$ for mode $\sigma = 1$ and $A = 10$, $B = 1$ for mode $\sigma = 2$.
  • A complex environment: a more realistic DC microgrid system consisting of 3 nodes. Each node consists of a source and/or a load. Two transmission lines ($N_x = 2$) connect Node 1 to Node 2 and Node 2 to Node 3, as shown in Figure 6. The system can be operated in 2 different modes ($N = 2$). The number of update steps in a single episode of operation is set to $M = 40$. In this environment, it is assumed that the system's mode of operation can only be updated once every 1 s. The state equation matrices used in this environment are derived from (16)–(22).
The two environments are chosen because of their simplicity, while still being representative of a DCMG power system model formulated as an MIQP problem. The first environment is the simplest possible representation of the system model, while the second is a basic building block for any DCMG system topology. For future research, this second environment can be used to build larger and more complex DCMG system topologies.
To evaluate the performance of the algorithms, we design 3 different scenarios. In the first, the model is assumed to be perfectly known to the MIQP algorithm and the current measurement is perfectly accurate, i.e., there is no modeling or measurement uncertainty. In the second, there is a modeling error, i.e., the model information used by the MIQP algorithm does not perfectly match the real model of the system. In this scenario, the real transition matrix $A_\sigma$ and input vector $B_\sigma$ for each mode $\sigma$ have offsets of $\Delta A$ and $\Delta B$, respectively, with respect to the model information known to the MIQP algorithm. Finally, in the third scenario, Gaussian noise with mean $\mu$ and standard deviation $\upsilon$ is added to the current measurement $x$; thus, the third scenario contains both the Gaussian noise and the modeling error at the same time.
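The modeling-error and measurement-noise scenarios can be emulated with a small helper such as the sketch below; the relative offset and noise parameters here are illustrative placeholders and only loosely mirror the values in Table 2.

```python
import numpy as np

rng = np.random.default_rng(0)

def perturb_model(A, B, rel_offset=0.30):
    """Scenario 2: the real plant differs from the nominal model by a relative offset."""
    return A * (1.0 + rel_offset), B * (1.0 + rel_offset)

def noisy_measurement(x, mu=0.0, sigma=0.01):
    """Scenario 3: Gaussian noise added to the current measurement x."""
    return x + rng.normal(mu, sigma, size=np.shape(x))

A_nom, B_nom = np.array([[-1.0]]), np.array([1.0])
A_real, B_real = perturb_model(A_nom, B_nom)         # used by the simulator only
x_meas = noisy_measurement(np.array([0.8]))          # what the agent/MIQP observes
```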
Apart from implementing Q-learning and the Q-network, we also implement an optimization technique based on mixed-integer quadratic programming (MIQP), as employed in [15]. The MIQP approach is used as the baseline and is simulated under the assumption that the dynamics model of the DC microgrid is known. We tested the Q-learning, Q-network, and MIQP algorithms to find the optimal action for each scenario in the given environment. Note that Q-learning and the Q-network do not know anything about the dynamics of the system, while MIQP requires information regarding the model's dynamics. The neural network employed in the Q-network algorithm contains 1 hidden layer with 4 nodes. The parameters used in the algorithms are listed in detail in Table 2.

5. Results

5.1. Simple Environment

First, we present the results of the simulation in the first scenario, where there is no modeling or measurement error. For this problem, the Q-learning and Q-network algorithms produce identical solutions, shown in Figure 7a,b. The best policy produced by both algorithms is shown in Figure 7a and the resulting state dynamics are shown in Figure 7b.
To evaluate the performance of both algorithms, we compare the cumulative reward per episode as a function of the training iteration number in Figure 8a. The red, blue, and green lines indicate the cumulative reward produced by the Q-learning, Q-network, and MIQP algorithms, respectively. Note that the cumulative reward of the Q-network is presented as an average cumulative reward over the last 100 episodes to take into account the stochastic nature of the Q-network algorithm. We can observe that the Q-learning algorithm (red line) converges very quickly towards its best solution. The produced cumulative reward is also very close to the optimal solution produced by MIQP (green line). The Q-network algorithm (blue line) produces a solution with a lower average cumulative reward than Q-learning and MIQP. However, as can be observed in Figure 7a, the best solution produced by the Q-network matches the one produced by Q-learning. Consequently, as shown in Figure 8b, the reward accumulated by the Q-network over 28 steps matches the reward accumulated by Q-learning. Overall, the total reward accumulated by both of these algorithms (red line for Q-learning and blue line for Q-network) over 28 steps is only slightly smaller than the optimal solution predicted by MIQP (green line), as shown in Figure 8b. This can also be concluded from Table 3, where the final reward of Q-learning and Q-network (3.5867) is shown to be very close to the reward of MIQP (3.5871). Note that this performance is achieved by Q-learning and the Q-network without prior information on the system's dynamics, contrary to MIQP.
The simulation results of the second scenario, where there is a modeling error in the system, are shown in Figure 9. From the plot of cumulative reward per episode as a function of training iteration number in Figure 9a, we can observe that the Q-learning algorithm (red line) once again converges rapidly to its final solution. However, we can observe that the solution reached by this algorithm is no longer close to the solution provided by MIQP (green line), even in the presence of model error. The average cumulative reward produced by the Q-network (blue line) is also lower than in the case without model error. However, if we observe the cumulative reward gathered over the course of 1 episode (28 steps) in Figure 9b, the best solution of the Q-network (blue line) is slightly better than the one produced by MIQP (green line). This is the case because MIQP no longer provides the globally optimal solution for the system due to the presence of the modeling error. The Q-network algorithm, on the other hand, learns directly from the system without prior information regarding the model, and thus its performance is not affected by the modeling error. We can observe this more clearly from Table 3, where the best reward produced by the Q-network (3.6554) is better than the rewards produced by Q-learning (3.3092) and MIQP (3.6293).
The simulation results of the third scenario, in the presence of both the modeling error and the measurement noise, are shown in Figure 10. In this scenario, the Q-learning algorithm fails to converge towards a solution. This is mainly caused by the fact that the measurement noise introduces a large variation in the state values, which consequently causes the Q-table to become too large. In contrast, the Q-network algorithm does not suffer from this problem because it uses a continuous function approximator in the form of a neural network. In Figure 10, we can observe that the Q-network algorithm in this case converges much faster (slightly after 10,000 iterations) than in the previous cases. This demonstrates that the Q-network algorithm actually works better in more complex scenarios, such as the one with modeling and measurement uncertainty. The best solution produced by the Q-network algorithm (blue line) also yields a cumulative reward that is very close to the solution provided by MIQP (green line) in Figure 10. In fact, the best cumulative reward of the Q-network (3.6685) is once again better than the one produced by MIQP (3.6667), as shown in Table 3. This demonstrates the ability of the Q-network algorithm to solve complex optimization problems in the presence of uncertainty without relying on prior information about the system's model. These features make the Q-network algorithm a very strong candidate for use in a real operational control problem of a DC microgrid system.

5.2. Complex Environment

In this section, we present the results for the complex environment consisting of three nodes. The configuration of the network is shown in Figure 6. The setup is almost identical to that of Section 5.1, differing only in system parameters. The system parameters are derived from the state equation matrices presented in (16)–(22). A resulting sample of the transient current transfer between nodes as an impact of the switching can be seen in Figure 11. First, for a system without modeling and measurement error, the cumulative reward per episode, consisting of 40 steps, as a function of training iteration number can be observed in Figure 12. We can observe that the Q-learning algorithm (red line) behaves as in Section 5.1, converging very quickly towards its best solution. However, as we can see from Figure 12, Q-learning achieves the lowest score, followed by the Q-network, while MIQP achieves the highest score. Furthermore, an increase in system complexity can be seen to have an impact on the model-free approaches, as the gap between both Q-learning (74.2707) and Q-network (74.4254) and the globally optimal solution of MIQP (75.0489) widens compared to the simple system in Section 5.1. This happens because the information gap between the model-free approaches and the fully known system dynamics parameters used by the MIQP solution also widens.
The simulation results of the second scenario, a complex system with a modeling error, are shown in Figure 13. From the plot of cumulative reward per episode as a function of training iteration number in Figure 13, we can observe that the Q-learning algorithm (red line) again converges rapidly to its final solution, even achieving the best score, followed by the Q-network. From Figure 13, we can see that over the course of 1 episode (40 steps), both Q-learning (72.4335) and the Q-network (72.3714) perform relatively well, even without prior knowledge of the model's dynamics, compared to MIQP (72.1596). Furthermore, the presence of modeling error greatly shifts the MIQP solution from the global optimum in the complex system scenario, showing the vulnerability of the MIQP approach under an inaccurate system model. Nevertheless, the Q-network requires many iterations to converge (around 500,000 iterations) compared to the first scenario, where no model error exists.
The simulation results of the third scenario, where there are both modeling error and measurement noise in the system, are shown in Figure 14. In this scenario, the Q-learning algorithm again fails to converge towards a solution, just like in the simple environment. In contrast, the Q-network algorithm is still capable of handling the noise using its neural network model. Furthermore, we can see from Figure 14 that the solution of the Q-network (72.3714) achieves a better score than the solution provided by MIQP (72.1596). This result shows that the Q-network algorithm is capable of handling both modeling and measurement uncertainty even in a complex scenario. These features once again solidify the Q-network algorithm as a very strong candidate for use in a real operational control problem of a DC microgrid system. Nevertheless, further research is needed to better understand the limits of system complexity and measurement error that the Q-network is able to handle in a feasible manner with acceptable results.
To investigate the impact of node quantity on the performance and training time of the Q-network, a series of experiments was conducted. The simulations involved placing clusters of nodes in a sequential arrangement, following the configuration of the complex network illustrated in Figure 6. The simulations used 50, 100, and 1000 copies of the three-node cluster, which can be equivalently represented as 150, 300, and 3000 nodes, respectively. The results of these experiments can be found in Table 4. The findings indicate that increasing the system size leads to a higher computational burden during training. However, the rate of increase in training time is lower than the rate of increase in the number of nodes. This suggests that the increase in computational time is primarily caused by the small-signal calculations, rather than by the Q-network algorithm itself.

6. Conclusions and Future Works

In this paper, we propose a reinforcement learning approach to solve the optimal operational control problem of a DC microgrid system. The proposed algorithm works without relying on prior information regarding the system's dynamics; rather, it learns the optimal policy by interacting directly with the system. We tested two algorithms, namely Q-learning and the Q-network, and compared their performance with the MIQP algorithm found in the literature as the baseline. The proposed algorithms are tested in two different environments, a simple and a complex environment, and in three scenarios, namely perfect information, modeling error, and measurement noise. In both the first and second environments, the Q-learning (3.5867 and 74.2707) and Q-network (3.5867 and 74.4254) algorithms are shown to produce solutions very close to the optimal solution of MIQP (3.5871 and 75.0489). These results are achieved by the RL methods without knowledge of the system's dynamics, whereas the MIQP solution requires an accurate system model. It is shown that, when an accurate system model does not exist, i.e., there is an error in the system's model, both Q-learning (3.3092 and 72.4335) and the Q-network (3.6554 and 72.3714) can perform better than MIQP (3.6293 and 72.1596), which requires an accurate system model. Moreover, with the introduction of measurement noise on top of the modeling error, the proposed Q-network algorithm is shown to produce a better solution than MIQP. These preliminary results demonstrate that the reinforcement learning algorithm is a strong candidate for solving the optimal operation control of a DC microgrid system, especially when the system's dynamics model is not available.
By showing the capabilities of reinforcement learning-based algorithms in solving the model-free DC microgrid distributed droop switching problem, this paper suggests that other off-policy algorithms that do not require an exact system model might also be able to solve the same problem. As a basic starting point for implementing a reinforcement learning algorithm in such environments, this paper also opens a challenge for newer and more complex reinforcement learning algorithms that can be explored to achieve better results. Other kinds of machine-learning algorithms that do not require an exact system model and can learn from experience are also interesting topics to be explored.
In the future, more complicated environments and scenarios, such as ones consisting of large numbers of sources or modes and more complicated constraints, can be used to further explore the limits of system complexity, mode choices, and uncertainties due to modeling and measurement error that the Q-network can handle. We will also explore the use of more complex neural network architectures to improve the performance of the Q-network algorithm. Finally, employing real data from a DC microgrid system in the training process will also be investigated.

Author Contributions

Conceptualization, R.I. and S.; methodology, R.I., H.R.A., L.M.P. and E.F.; software, A.A.A.R. and M.Y.; validation, R.I., L.M.P. and S.; formal analysis, R.I.; investigation, R.I., A.A.A.R. and M.Y.; resources, R.I. and S.; data curation, A.A.A.R. and M.Y.; writing—original draft preparation, A.A.A.R. and M.Y.; writing—review and editing, R.I., A.A.A.R. and M.Y.; visualization, A.A.A.R. and M.Y.; supervision, S.; project administration, L.M.P.; funding acquisition, S. All authors have read and agreed to the published version of the manuscript.

Funding

This research was supported by the Indonesian Ministry of Research and Technology/National Agency for Research and Innovation and the Indonesian Ministry of Education and Culture under the World Class University Program managed by Institut Teknologi Bandung.

Data Availability Statement

No new data were created or analyzed in this study. Data sharing is not applicable to this article.

Conflicts of Interest

The authors declare no conflict of interest.

Abbreviations

The following abbreviations are used in this manuscript:
AC      alternating current
ACMG    AC microgrids
ADP     adaptive dynamic programming
BESS    battery energy storage system
CPS     constant power source
CPL     constant power load
DC      direct current
DCMG    DC microgrids
DOD     depth of discharge
DVS     droop-controlled voltage source
ESS     energy storage system
GA      genetic algorithm
HVAC    high-voltage AC
KCL     Kirchhoff's current law
KVL     Kirchhoff's voltage law
LVDCMG  low-voltage DCMG
MDP     Markov decision process
MILP    mixed-integer linear programming
MIQP    mixed-integer quadratic programming
MVAC    medium-voltage AC
PMU     phasor measurement unit
PSO     particle swarm optimization
PV      photo-voltaic
RL      reinforcement learning
SC      supervisory control
SOC     state of charge
TS      tabu search
VVC     Volt/Var control

References

  1. Blaabjerg, F.; Teodorescu, R.; Liserre, M.; Timbus, A.V. Overview of control and grid synchronization for distributed power generation systems. IEEE Trans. Ind. Electron. 2006, 53, 1398–1409.
  2. Carrasco, J.M.; Franquelo, L.G.; Bialasiewicz, J.T.; Galván, E.; PortilloGuisado, R.C.; Prats, M.M.; León, J.I.; Moreno-Alfonso, N. Power-electronic systems for the grid integration of renewable energy sources: A survey. IEEE Trans. Ind. Electron. 2006, 53, 1002–1016.
  3. Hatziargyriou, N.; Asano, H.; Iravani, R.; Marnay, C. Microgrids. IEEE Power Energy Mag. 2007, 5, 78–94.
  4. Guerrero, J.M.; Vasquez, J.C.; Matas, J.; de Vicuna, L.G.; Castilla, M. Hierarchical Control of Droop-Controlled AC and DC Microgrids—A General Approach Toward Standardization. IEEE Trans. Ind. Electron. 2011, 58, 158–172.
  5. Hou, N.; Li, Y. Communication-Free Power Management Strategy for the Multiple DAB-Based Energy Storage System in Islanded DC Microgrid. IEEE Trans. Power Electron. 2021, 36, 4828–4838.
  6. Irnawan, R.; da Silva, F.F.; Bak, C.L.; Lindefelt, A.M.; Alefragkis, A. A droop line tracking control for multi-terminal VSC-HVDC transmission system. Electr. Power Syst. Res. 2020, 179, 106055.
  7. Peyghami, S.; Mokhtari, H.; Blaabjerg, F. Chapter 3—Hierarchical Power Sharing Control in DC Microgrids. In Microgrid; Mahmoud, M.S., Ed.; Butterworth-Heinemann: Oxford, UK, 2017; pp. 63–100.
  8. Shuai, Z.; Fang, J.; Ning, F.; Shen, Z.J. Hierarchical structure and bus voltage control of DC microgrid. Renew. Sustain. Energy Rev. 2018, 82, 3670–3682.
  9. Abhishek, A.; Ranjan, A.; Devassy, S.; Kumar Verma, B.; Ram, S.K.; Dhakar, A.K. Review of hierarchical control strategies for DC microgrid. IET Renew. Power Gener. 2020, 14, 1631–1640.
  10. Chouhan, S.; Tiwari, D.; Inan, H.; Khushalani-Solanki, S.; Feliachi, A. DER optimization to determine optimum BESS charge/discharge schedule using Linear Programming. In Proceedings of the 2016 IEEE Power and Energy Society General Meeting (PESGM), Boston, MA, USA, 17–21 July 2016; pp. 1–5.
  11. Maulik, A.; Das, D. Optimal operation of a droop-controlled DCMG with generation and load uncertainties. IET Gener. Transm. Distrib. 2018, 12, 2905–2917.
  12. Dragičević, T.; Guerrero, J.M.; Vasquez, J.C.; Škrlec, D. Supervisory Control of an Adaptive-Droop Regulated DC Microgrid With Battery Management Capability. IEEE Trans. Power Electron. 2014, 29, 695–706.
  13. Massenio, P.R.; Naso, D.; Lewis, F.L.; Davoudi, A. Assistive Power Buffer Control via Adaptive Dynamic Programming. IEEE Trans. Energy Convers. 2020, 35, 1534–1546.
  14. Massenio, P.R.; Naso, D.; Lewis, F.L.; Davoudi, A. Data-Driven Sparsity-Promoting Optimal Control of Power Buffers in DC Microgrids. IEEE Trans. Energy Convers. 2021, 36, 1919–1930.
  15. Ma, W.J.; Wang, J.; Lu, X.; Gupta, V. Optimal Operation Mode Selection for a DC Microgrid. IEEE Trans. Smart Grid 2016, 7, 2624–2632.
  16. Anand, S.; Fernandes, B.G. Reduced-Order Model and Stability Analysis of Low-Voltage DC Microgrid. IEEE Trans. Ind. Electron. 2013, 60, 5040–5049.
  17. Alizadeh, G.A.; Rahimi, T.; Babayi Nozadian, M.H.; Padmanaban, S.; Leonowicz, Z. Improving Microgrid Frequency Regulation Based on the Virtual Inertia Concept while Considering Communication System Delay. Energies 2019, 12, 2016.
  18. Sutton, R.; Barto, A. Reinforcement Learning: An Introduction. IEEE Trans. Neural Netw. 2005, 16, 285–286.
  19. Glavic, M. (Deep) Reinforcement learning for electric power system control and related problems: A short review and perspectives. Annu. Rev. Control 2019, 48, 22–35.
  20. Wang, W.; Yu, N.; Gao, Y.; Shi, J. Safe off-policy deep reinforcement learning algorithm for volt-var control in power distribution systems. IEEE Trans. Smart Grid 2019, 11, 3008–3018.
  21. Hadidi, R.; Jeyasurya, B. Reinforcement learning based real-time wide-area stabilizing control agents to enhance power system stability. IEEE Trans. Smart Grid 2013, 4, 489–497.
  22. Yan, Z.; Xu, Y. A Multi-Agent Deep Reinforcement Learning Method for Cooperative Load Frequency Control of a Multi-Area Power System. IEEE Trans. Power Syst. 2020, 35, 4599–4608.
  23. Bellman, R.E.; Dreyfus, S.E. Chapter XI. Markovian Decision Processes. In Applied Dynamic Programming; Princeton University Press: Princeton, NJ, USA, 1962; pp. 297–321.
  24. Goldwaser, A.; Thielscher, M. Deep Reinforcement Learning for General Game Playing. Proc. AAAI Conf. Artif. Intell. 2020, 34, 1701–1708.
  25. Ibarz, J.; Tan, J.; Finn, C.; Kalakrishnan, M.; Pastor, P.; Levine, S. How to Train Your Robot with Deep Reinforcement Learning; Lessons We've Learned. arXiv 2021, arXiv:2102.02915.
  26. Graesser, L.; Keng, W. Foundations of Deep Reinforcement Learning: Theory and Practice in Python; Addison-Wesley Data & Analytics Series; Pearson Education: London, UK, 2019.
  27. Goodfellow, I.; Bengio, Y.; Courville, A. Deep Learning; Adaptive Computation and Machine Learning Series; MIT Press: Cambridge, MA, USA, 2016.
  28. Mnih, V.; Kavukcuoglu, K.; Silver, D.; Rusu, A.A.; Veness, J.; Bellemare, M.G.; Graves, A.; Riedmiller, M.A.; Fidjeland, A.; Ostrovski, G.; et al. Human-level control through deep reinforcement learning. Nature 2015, 518, 529–533.
Figure 1. The operation modes of a DC source represented by a single-slope active power ( P d c ) and DC voltage ( U d c ) relationship: (a) droop-controlled voltage, (b) constant power, and (c) constant voltage mode [6]. The pre-disturbance operating point of the converter is indicated by the red dot, while the blue dot represents the post-disturbance operating point.
Figure 2. Each node of the DC microgrid systems can be modeled as a droop-controlled voltage source, a constant power source, or a constant power load.
Figure 3. Reinforcement learning agents works by interacting with an environment, such as DC microgrid systems, by performing an action and gathering observation and reward to learn the most optimal policy.
Figure 4. Q-learning works by building Q-table to estimate the values Q ( s , a ) of each action a in a particular state s .
Figure 5. Instead of a table, neural network can serve as an action-value approximator in Q-learning.
Figure 6. The complex environment consists of DC microgrid systems with 3 nodes and 2 bus connections.
Figure 7. (a) The action given to the simple DC microgrid system produced by Q-learning or Q-network and (b) the produced state dynamics.
Figure 8. (a) Comparison of cumulative reward per episode as function of training iteration produced by Q-learning (red) and Q-network (blue) in comparison to the ideal solution provided by MIQP (green) and (b) comparison of cumulative reward gathered in 1 episode (28 steps) produced by Q-learning (red), Q-network (blue), and MIQP (green) in the case of normal DC microgrid system without model error. Note that the red and blue lines are on top of each other.
Figure 9. (a) Comparison of cumulative reward per episode as function of training iteration produced by Q-learning (red) and Q-network (blue) in comparison to the ideal solution provided by MIQP (green) and (b) comparison of cumulative reward gathered in 1 episode (28 steps) produced by Q-learning (red), Q-network (blue), and MIQP (green) in the case of normal DC microgrid system with model error.
Figure 10. (a) Comparison of cumulative reward per episode as function of training iteration produced by Q-network (blue) in comparison to the ideal solution provided by MIQP (green) and (b) comparison of cumulative reward gathered in 1 episode (28 steps) produced by Q-network (blue) and MIQP (green) in the case of normal DC microgrid system with model error and measurement noise. Q-learning solution is not shown due to non-convergence.
Figure 11. Sample current transient from mode 1 to mode 2 in the case of a complex DC microgrid system.
Figure 12. (a) Comparison of cumulative reward per episode as a function of training iteration produced by Q-learning (red) and Q-network (blue) in comparison to the ideal solution provided by MIQP (green) and (b) comparison of cumulative reward gathered in 1 episode (40 steps) produced by Q-learning (red), Q-network (blue), and MIQP (green) in the case of complex DC microgrid system without model error.
Figure 13. (a) Comparison of cumulative reward per episode as function of training iteration produced by Q-learning (red) and Q-network (blue) in comparison to the ideal solution provided by MIQP (green) and (b) comparison of cumulative reward gathered in 1 episode (40 steps) produced by Q-learning (red), Q-network (blue), and MIQP (green) in the case of complex DC microgrid system with model error.
Figure 14. (a) Comparison of cumulative reward per episode as function of training iteration produced by Q-network (blue) in comparison to the ideal solution provided by MIQP (green) and (b) comparison of cumulative reward gathered in 1 episode (40 steps) produced by Q-network (blue) and MIQP (green) in the case of complex DC microgrid system with model error and measurement noise. Q-learning solution is not shown due to non-convergence.
Table 1. Comparison of Related Works.

Paper | Systems | Algorithm | Objective | Unknown
[11] | DCMG | PSO | Economic, Environment | Droop
[12] | DCMG | SC | Stability | Mode
[15] | DCMG | MILP | Economic, Stability | Mode
[13] | DCMG | ADP | Stability | Policy
[14] | DCMG | RL, TS | Topology, Stability | Edge, Policy
[17] | ACMG | GA | Stability, Frequency | Generation
[20] | AC | RL | Economic, Stability | Tap Changer
[21] | AC | RL | Stability | Control Signal
Ours | DCMG | RL | Economic, Stability | Mode

Paper | Voltage Level | Data Challenge: Delay | Noise | Error
[11] | LV | - | - | -
[12] | LV | - | - | -
[15] | LV | - | - | -
[13] | LV | - | - | -
[14] | LV | - | - | -
[17] | LV | ✓ | - | -
[20] | MV | - | - | -
[21] | HV | - | - | -
[22] | HV | - | - | -
Ours | LV | - | ✓ | ✓
Table 2. List of Parameter Values.

Param | Value (Simple) | Value (Complex) | Units
γ (Q-learning) | 0.9 | 0.9 | -
α (Q-learning) | 0.2 | 0.2 | -
γ (Q-network) | 0.99 | 0.99 | -
α (Q-network) | 0.01 | 0.001 | -
c_s | 0.2 | 2 | -
μ | 0 | 0 | ampere
υ | 0.01 | 0.1 | ampere
Δt | 0.1 | 0.0001 | second
ΔA, ΔB | 30 | 10 | %
Q | 1 | 0.00001 | -
γ_σ | 0.1 | 1 | -
N_t | 28 | 40 | -
L_b | 30,000 | 50,000 | -
N_ε | 10,000 | 250,000 | -
ε_min | 0.02 | 0.02 | -
Table 3. Summary of Results.

System | Model Error | Noise | Best Reward per Episode: Q-Learning | Q-Network | MIQP
Simple | x | x | 3.5867 | 3.5867 | 3.5871
Simple | ✓ | x | 3.3092 | 3.6554 | 3.6293
Simple | ✓ | ✓ | - | 3.6685 | 3.6667
Complex | x | x | 74.2707 | 74.4254 | 75.0489
Complex | ✓ | x | 72.4335 | 72.3714 | 72.1596
Complex | ✓ | ✓ | - | 72.3714 | 72.1596
Table 4. Impact of System Size.

System Multiplier | Number of Nodes | Training Time (s)
1 | 3 | 13.9498
100 | 300 | 28.1907
1000 | 3000 | 125.2556