Article

Model-Free Approach to DC Microgrid Optimal Operation under System Uncertainty Based on Reinforcement Learning

1 Department of Electrical Engineering and Information Technology, Universitas Gadjah Mada, Yogyakarta 55281, Indonesia
2 Center for Energy Studies, Universitas Gadjah Mada, Yogyakarta 55281, Indonesia
* Authors to whom correspondence should be addressed.
Energies 2023, 16(14), 5369; https://doi.org/10.3390/en16145369
Submission received: 12 May 2023 / Revised: 19 June 2023 / Accepted: 4 July 2023 / Published: 14 July 2023

Abstract

There has been tremendous interest in the development of DC microgrid systems, which consist of interconnected DC renewable energy sources. However, operating a DC microgrid system optimally by minimizing operational cost while ensuring stability remains a problem when the system's model is not available. In this paper, a novel model-free approach to the operation control of DC microgrids based on reinforcement learning algorithms, specifically Q-learning and Q-network, is proposed. This approach circumvents the need for an accurate model of the DC grid by exploiting interaction with the DC microgrid to learn the best policy, which leads to more optimal operation. The proposed approach has been compared with mixed-integer quadratic programming (MIQP) as the baseline deterministic method that requires an accurate system model. The results show that, in a system of three nodes, both Q-learning (74.2707) and Q-network (74.4254) are able to learn to make control decisions that are close to the MIQP solution (75.0489). With the introduction of both model uncertainty and noisy sensor measurements, the Q-network performs better (72.3714) than MIQP (72.1596), whereas Q-learning fails to learn.

1. Introduction

There has been tremendous interest in generating electric energy from renewable energy resources [1,2] owing to environmental concerns and current advances in power electronics. Unlike conventional power plants, which often come in large sizes, the capacity of renewable power plants strongly depends on local potential and ranges from a few kilowatts to several megawatts.
Currently, small-sized renewable energy power plants, such as micro hydro, photo-voltaic (PV), or wind turbines, are often connected to local loads at the distribution level to form autonomous alternating current (AC) microgrids [3]. These systems are expected to be able to work with and without a grid connection. However, the problems of reactive power compensation and frequency control in AC microgrids (ACMG) can be more challenging than in conventional AC systems due to the intermittent nature of wind and solar power sources [4].
As most renewable energy sources, especially type-4 wind turbines and PV, inherently produce direct current (DC) outputs, it is natural to interconnect them using DC voltage to form direct current microgrids (DCMG). Unlike AC microgrids, DC microgrids have no problems with reactive power and frequency control [4]. Moreover, in a DC microgrid system, the use of an energy storage system (ESS), notably a battery energy storage system (BESS), becomes indispensable to balance the intermittency. This has been demonstrated in [5], where a method to coordinate several ESSs has been proposed.
DC voltage in a DC grid can be considered the power balance indicator, i.e., similar to frequency in an AC grid [4,6,7,8,9]. By controlling the DC voltage level within the grid, a certain power flow can be achieved. A DC source connected to a DC grid can be operated in three different modes, i.e., droop-controlled voltage, constant power, or constant voltage mode. The response of these modes when subjected to a disturbance is illustrated in Figure 1.
Wind or PV power plants are usually operated at their maximum power point; hence, these DC sources are often operated in active power control, i.e., injecting active power into the DC microgrid system regardless of the DC voltage condition of the DC microgrid. This also means that the intermittent operation of these power plants is reflected in DC voltage fluctuations. In order to counter the changes in the DC voltage, a BESS can be operated in the DC voltage mode to maintain a constant DC voltage. This means that the BESS can absorb or inject power whenever there is a surplus or deficiency of power in the DC microgrid. However, the operation of a BESS is a complex problem, as it has special constraints such as depth of discharge (DOD) requirements, state of charge (SOC) limitations, and charge and discharge rates, among others [10]. Therefore, it might happen that the BESS switches to the DC voltage droop mode to limit its contribution to maintaining the DC voltage within the DC microgrid, or even operates in the constant power mode when reaching its charge or discharge limit.

1.1. Related Work

In order to operate DC microgrids in an optimal way, a method to minimize operating costs has been proposed in [11] using a particle swarm optimization (PSO) algorithm. A similar idea of minimizing operating cost while improving stability has been proposed in [12] using a switched-system approach called supervisory control (SC), which appropriately selects the operating mode to optimize the control objectives. In other studies using adaptive dynamic programming (ADP) [13] and tabu search (TS) [14], researchers found that a central control system coordinating distributed control devices across multiple nodes in the microgrid can help optimize a common objective. The switching problem of voltage sources across multiple nodes has been further studied in [15] to guarantee the analytic performance of the proposed controller. The authors in [15] formulated the switching problem of multi-node voltage sources as a mixed-integer linear programming (MILP) problem. The stability study used in [15] is based on the earlier small-signal model for low-voltage DCMG (LVDCMG) in [16].
Although the proposed methods in the literature have demonstrated satisfactory performance in the scenarios presented in the corresponding papers, they necessitate an exact model of the DC grid. This assumption is very restrictive for several reasons. Firstly, from the practical aspect, DC microgrids consist of several components coming from different vendors. Each vendor is normally unwilling to share the details of its components, e.g., the transfer functions of the components and their controllers, to safeguard its intellectual property. Secondly, even if accurate models are available, some components experience wear and tear after a certain time, which leads to model uncertainty. The case of an incomplete state model is explored in [17] by considering communication delay. The authors overcame the lack of an exact system model by implementing a heuristic-based approach, namely a genetic algorithm (GA), because a GA does not require an exact model to obtain a sufficiently good approximate solution. Hence, the natural remedy is to use a method that does not rely on an exact model of the DC grid.
In the current power system hierarchy, large amounts of data are generated by, for example, smart meters and phasor measurement units (PMUs). These data contain very rich information regarding the operating status of the system at each time instant. On the other hand, recent progress in applied mathematics and data science makes it possible to extract useful information for power system operation without prior supervisory knowledge of how to interpret the data. An example of this is the reinforcement learning (RL) approach [18], a method in machine learning that allows an agent to take a sequence of appropriate actions to maximize a certain objective. This method has recently been adopted in power systems [19], for instance to solve the Volt/Var control (VVC) problem in medium-voltage AC (MVAC) distribution systems [20], real-time wide-area stabilizing control in a high-voltage AC (HVAC) system [21], and data-driven load frequency control in a multi-area HVAC system [22].
The purpose of this study is to propose a novel method to optimally solve the DC microgrid switching problem, given limited information about the exact model of the DC grid, based on a reinforcement learning algorithm. The limitation is modeled as noise and error in the system matrix model, simulating noise in sensor measurements and an imprecise power system model. Table 1 summarizes the differences between our study and related works. The data challenges shown in the table refer to communication delay, measurement noise, and model error.

1.2. Contributions

Due to the fact that accurate models of DC grids are not always readily available, this paper proposes a novel model-free approach to the operation control of DC microgrids based on reinforcement learning. Variants of the reinforcement learning algorithm are proposed in this paper because of their ability to learn efficiently from past state and action data, such as the current and operation cost data corresponding to each mode of operation in the context of a DC microgrid system. In addition, we show that our proposed approach can cope with model uncertainty due to noise in sensor measurements and an imprecise power system model, while producing a near-optimal policy that takes both stability and operational cost into consideration.
The contributions of this paper can be summarized as follows:
  • Propose a novel model-free approach for solving the LVDCMG optimal switching problem
  • Demonstrate the ability of a reinforcement learning algorithm to solve the LVDCMG optimal switching problem under measurement noise and an imprecise power system model
  • Provide a minimal working example, including parameter settings, for applying reinforcement learning to the LVDCMG optimal switching problem

2. Operation Control of DC Microgrids

2.1. Models of DC Microgrids

Each power source in a distributed droop-controlled LVDCMG can be operated in different modes, that is, as a droop-controlled voltage source (DVS), a constant power source (CPS), or a constant power load (CPL) [15,16], as can be seen in Figure 2. For instance, a battery can be operated as a CPS when discharging, a CPL when charging, or a DVS during both charging and discharging.
For each droop-controlled voltage source $j \in \mathcal{N}_d$,
$$V_j = V_{j0} - d_j P_{Dj},$$
where $V_j$ is the droop controller output voltage, $V_{j0}$ is the droop controller nominal voltage value, $d_j$ is the droop controller gain, and $P_{Dj}$ is the droop controller output power. The small-signal representation of the droop-controlled voltage source can be modeled as
$$v_j = -\left( \frac{d_j \bar{V}_j}{1 + d_j \bar{I}_j} \right) i_{Dj},$$
where $v_j$ is the small-signal perturbation of the voltage, $\bar{V}_j$ and $\bar{I}_j$ are the DC voltage and current at the steady-state operating point, respectively, and $i_{Dj}$ is the corresponding small-signal perturbation of the current. The individual small-signal models for each node are aggregated, which leads to
$$\mathbf{v} = -\tilde{\mathbf{d}}\, \mathbf{i}_D,$$
where $\mathbf{v} := [v_1, v_2, \ldots, v_{N_d}]$, $\mathbf{i}_D := [i_1, i_2, \ldots, i_{N_d}]$, and $\tilde{\mathbf{d}}$ is a diagonal matrix with diagonal elements $d_j \bar{V}_j / (1 + d_j \bar{I}_j)$, $j = 1, 2, \ldots, N_d$.
Suppose that the microgrid consists of $N_{CPS}$ CPSs out of $N_s$ sources, where the set of CPSs is given by $\mathcal{N}_{CPS} := \{1, 2, \ldots, N_{CPS}\}$. By modeling a CPS as a current source in parallel with a conductance, for each CPS $j \in \mathcal{N}_{CPS}$ we have the following model
$$i_{CPSj} = g_{CPSj} v_j + \tilde{i}_{CPSj},$$
where $i_{CPSj}$ and $v_j$ are the corresponding CPS small-signal output current and voltage. The values of the corresponding conductance $g_{CPSj}$ and current source $I_{CPSj}$ are determined by the power and voltage at the operating point as follows
$$g_{CPSj} = \frac{P_{CPSj}}{V_{DCj}^2},$$
$$I_{CPSj} = \frac{2 P_{CPSj}}{V_{DCj}},$$
where $P_{CPSj}$ and $V_{DCj}$ are the output power and nominal voltage at the operating point of the CPS. It should be noted that $I_{CPSj}$ is the nominal current source value, while $\tilde{i}_{CPSj}$ is the small-signal perturbation of the current source. All the individual small-signal models can be aggregated into the following equation
$$\mathbf{i}_{CPS} = g_{CPS}\, \mathbf{v}_{CPS} + \tilde{\mathbf{i}}_{CPS},$$
where $\mathbf{i}_{CPS} := [i_{CPS1}, i_{CPS2}, \ldots, i_{CPS N_{CPS}}]$, $\mathbf{v}_{CPS} = [v_1, v_2, \ldots, v_{N_{CPS}}]$, $\tilde{\mathbf{i}}_{CPS} = [\tilde{i}_{CPS1}, \tilde{i}_{CPS2}, \ldots, \tilde{i}_{CPS N_{CPS}}]$, and $g_{CPS} = \mathrm{diag}(g_{CPS1}, \ldots, g_{CPS N_{CPS}})$.
Similar to a CPS, a CPL is modeled as a negative conductance in parallel with a current sink. Suppose that the set of CPLs is given by $\mathcal{N}_{CPL} := \{1, 2, \ldots, N_{CPL}\}$, where $N_{CPL}$ refers to the number of CPLs in the microgrid. Each CPL $j \in \mathcal{N}_{CPL}$ can be modeled as follows
$$i_{CPLj} = g_{CPLj} v_j + \tilde{i}_{CPLj},$$
where $i_{CPLj}$ and $v_j$ are the corresponding small-signal output current and voltage of the CPL, respectively. The values of the corresponding conductance $g_{CPLj}$ and current sink $I_{CPLj}$ are determined by the power and voltage at the operating point as follows:
$$g_{CPLj} = -\frac{P_{CPLj}}{V_{DCj}^2},$$
$$I_{CPLj} = \frac{2 P_{CPLj}}{V_{DCj}},$$
where $P_{CPLj}$ and $V_{DCj}$ are the output power and nominal DC voltage at the operating point of the CPL. The small-signal models can be aggregated into
$$\mathbf{i}_{CPL} = g_{CPL}\, \mathbf{v}_{CPL} + \tilde{\mathbf{i}}_{CPL},$$
where $\mathbf{i}_{CPL} := [i_{CPL1}, i_{CPL2}, \ldots, i_{CPL N_{CPL}}]$, $\mathbf{v}_{CPL} = [v_1, v_2, \ldots, v_{N_{CPL}}]$, $\tilde{\mathbf{i}}_{CPL} = [\tilde{i}_{CPL1}, \tilde{i}_{CPL2}, \ldots, \tilde{i}_{CPL N_{CPL}}]$, and $g_{CPL} = \mathrm{diag}(g_{CPL1}, \ldots, g_{CPL N_{CPL}})$.
Suppose that the set of $N_b$ power lines connecting adjacent vertices (sources and loads) in the microgrid is given by $\mathcal{N}_b := \{1, 2, \ldots, N_b\}$. Each power line $j \in \mathcal{N}_b$ can be modeled as follows
$$v_{bj} = l_{bj} \frac{d i_{bj}}{dt} + r_{bj} i_{bj},$$
where $v_{bj}$ and $i_{bj}$ are the power line voltage and current, respectively, and $l_{bj}$ and $r_{bj}$ are the inductance and resistance of line $j$, respectively. The aggregate of all individual small-signal models can be expressed as
$$\mathbf{v}_b = l_b \frac{d \mathbf{i}_b}{dt} + r_b \mathbf{i}_b,$$
where $\mathbf{v}_b := [v_{b1}, v_{b2}, \ldots, v_{b N_b}]$, $\mathbf{i}_b = [i_{b1}, i_{b2}, \ldots, i_{b N_b}]$, $l_b = \mathrm{diag}(l_{b1}, \ldots, l_{b N_b})$, and $r_b = \mathrm{diag}(r_{b1}, \ldots, r_{b N_b})$.
Finally, the DC microgrid obeys Kirchhoff's voltage law (KVL) and Kirchhoff's current law (KCL):
$$\mathbf{v}_b = M \mathbf{v},$$
$$\mathbf{i}_D + \mathbf{i}_{CP} = M^T \mathbf{i}_b,$$
where $M$ is the line-to-node incidence matrix, and $g_{CP}$ and $\tilde{\mathbf{i}}_{CP}$ below collect the CPS and CPL conductances and current sources defined above. By combining (3), (7), (11) and (13)–(15), the dynamics of the DC microgrid can be summarized in the following state equation:
$$\dot{x} = A x + B,$$
where $x = \mathbf{i}_b \in \mathbb{R}^{N_x}$ and
$$A = -l_b^{-1} \left( M \left( \tilde{\mathbf{d}}^{-1} - g_{CP} \right)^{-1} M^T + r_b \right),$$
$$B = l_b^{-1} M \left( \tilde{\mathbf{d}}^{-1} - g_{CP} \right)^{-1} \tilde{\mathbf{i}}_{CP}.$$
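To make the aggregation concrete, the following minimal NumPy sketch assembles $A$ and $B$ for a hypothetical three-node, two-line system. All parameter values and the incidence matrix are illustrative assumptions, and the signs follow the reconstruction of (17) and (18) above rather than a specific implementation.

```python
import numpy as np

# Hypothetical 3-node, 2-line system (values for illustration only).
M = np.array([[1.0, -1.0, 0.0],          # line-to-node incidence matrix (rows = lines)
              [0.0, 1.0, -1.0]])
d_tilde = np.diag([0.05, 0.04, 0.06])    # droop terms d_j*Vbar_j / (1 + d_j*Ibar_j)
g_cp    = np.diag([-0.02, 0.01, -0.03])  # aggregated CPS/CPL conductances
i_cp    = np.array([0.1, -0.2, 0.05])    # aggregated CPS/CPL current sources
l_b     = np.diag([1e-3, 1.2e-3])        # line inductances
r_b     = np.diag([0.1, 0.12])           # line resistances

# Equations (17)-(18): node impedance seen by the lines, then A and B.
Z_node = np.linalg.inv(np.linalg.inv(d_tilde) - g_cp)
A = -np.linalg.inv(l_b) @ (M @ Z_node @ M.T + r_b)
B = np.linalg.inv(l_b) @ M @ Z_node @ i_cp
print(A.shape, B.shape)   # (2, 2) (2,)
```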
Any operational mode (i.e., DVS, CPS, or CPL) follows the linear state equation described in (16), albeit with different values of $A$ and $B$. Therefore, the dynamics can be slightly modified as follows
$$\dot{x} = A_\sigma x + B_\sigma,$$
where $A_\sigma$ and $B_\sigma$ stand for the matrices $A$ and $B$, respectively, for mode $\sigma \in \{1, \ldots, N\}$, where $N$ refers to the total number of modes under consideration.
The DC microgrid's dynamics can be reformulated in discrete form as follows
$$x(k+1) = x(k) + \dot{x}(k)\, \Delta t,$$
where $\Delta t$ is a time-integration constant. Substituting (19) into (20), the complete dynamics in discrete form can be rewritten as
$$x(k+1) = A_k^\sigma x(k) + B_k^\sigma,$$
where $A_k^\sigma = A_\sigma \Delta t + I$ and $B_k^\sigma = B_\sigma \Delta t$. Therefore, the decision variable that we can give to the system is $u(k) = [u_1(k)\; u_2(k)\; \cdots\; u_N(k)]^T$, where $u_\sigma \in \{0, 1\}$ refers to the status of mode $\sigma$ of the DC microgrid and is given by
$$u_\sigma = \begin{cases} 1 & \text{if mode } \sigma \text{ is active} \\ 0 & \text{if mode } \sigma \text{ is inactive.} \end{cases}$$
Note that only one mode can be active at a time.
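As an illustration of the discrete switched dynamics in (21), the short Python sketch below performs the forward-Euler discretization and rolls out an arbitrary mode schedule; the matrices and mode sequence are placeholder values, not the case-study parameters.

```python
import numpy as np

def discretize(A, B, dt):
    """Forward-Euler discretization: A_k = I + A*dt, B_k = B*dt, cf. Eq. (21)."""
    n = A.shape[0]
    return np.eye(n) + A * dt, B * dt

def simulate(modes, A_list, B_list, x0, dt):
    """Roll out x(k+1) = A_k^sigma x(k) + B_k^sigma for a given mode sequence."""
    x = np.array(x0, dtype=float)
    traj = [x.copy()]
    for sigma in modes:                      # sigma indexes the active mode at step k
        A_k, B_k = discretize(A_list[sigma], B_list[sigma], dt)
        x = A_k @ x + B_k
        traj.append(x.copy())
    return np.array(traj)

# Hypothetical two-mode, two-line example (placeholder numbers).
A_list = [np.array([[-1.0, 0.2], [0.2, -1.5]]), np.array([[-3.0, 0.1], [0.1, -2.0]])]
B_list = [np.array([0.5, 0.0]), np.array([0.0, 0.3])]
traj = simulate(modes=[0, 0, 1, 1, 0], A_list=A_list, B_list=B_list, x0=[1.0, -0.5], dt=0.1)
print(traj.shape)  # (6, 2): initial state plus five steps
```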

2.2. Problem Statement

In this paper, we try to find the optimal operational control for the DC microgrids modeled in Section 2.1. The problem can be formulated as follows:
Problem 1
(Optimal operation control). Given a microgrid system following a certain dynamics $x(k+1) = f(x(k), u(k))$ which produces an operational cost $c(x(k), u(k))$, what is the input signal $u^*(k) = [u_1^*(k)\; u_2^*(k)\; \cdots\; u_N^*(k)]^T$ which will minimize the total cost $C$ defined as
$$C = \sum_{k=0}^{M} c(x(k), u(k)).$$
Here, $N$ denotes the number of available modes of operation, $u_i \in \{0, 1\}$ for every $i \in \{1, \ldots, N\}$, and $M$ denotes the number of steps.
Suppose that the cost function at iteration $k$ is given by
$$c(k) = \frac{1}{2} x(k)^T Q x(k) + \gamma_\sigma u_\sigma(k),$$
where $Q \in \mathbb{R}^{N_x \times N_x}$ is a positive definite matrix and $\gamma_\sigma$ is the cost of operating the currently active mode $\sigma$. Here, the first term corresponds to the stability of the system (i.e., an unstable system causing large values of the state $x$ will be penalized), while the second term corresponds to the operating cost of the mode being employed. If the system dynamics $f(x(k), u(k))$ are fully known and linear and the cost function $c(x(k), u(k))$ is linear or quadratic, one way to solve the presented optimization problem is to employ linear/quadratic programming. In this case, the system dynamics in (19) can be written as a constraint of the optimization problem. One can then employ an algorithm such as mixed-integer quadratic programming (MIQP) to solve the optimization problem, as reported in [15].
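For intuition, the stage cost in (24) and the total cost in (23) can be evaluated directly for small horizons; the sketch below exhaustively enumerates mode schedules for a toy two-mode, one-state system. It is an illustration only, not the MIQP formulation of [15], and all numerical values are placeholders.

```python
import numpy as np
from itertools import product

def stage_cost(x, sigma, Q, gamma):
    """Quadratic stability penalty plus mode operating cost, Eq. (24)."""
    return 0.5 * x @ Q @ x + gamma[sigma]

def total_cost(modes, A_k, B_k, x0, Q, gamma):
    """Accumulate c(k) along the trajectory produced by a mode schedule, Eq. (23)."""
    x, C = np.array(x0, float), 0.0
    for sigma in modes:
        C += stage_cost(x, sigma, Q, gamma)
        x = A_k[sigma] @ x + B_k[sigma]
    return C

# Toy two-mode, one-state system (placeholder discretized matrices).
A_k = [np.array([[0.9]]), np.array([[0.5]])]
B_k = [np.array([0.1]), np.array([0.1])]
Q, gamma = np.array([[1.0]]), [0.1, 0.3]

M = 6  # short horizon so exhaustive search (2^M schedules) stays cheap
best = min(product(range(2), repeat=M),
           key=lambda m: total_cost(m, A_k, B_k, [1.0], Q, gamma))
print(best, total_cost(best, A_k, B_k, [1.0], Q, gamma))
```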
However, the requirement to fully know the system dynamics accurately might be difficult to realize in many practical situations. Some parameters in a DC microgrid system might not be known beforehand due to communication delay [17]. Meanwhile, system identification to acquire the model parameters could be complicated, especially if the system contains a large number of components.

3. Reinforcement-Learning-Based Operation Control

3.1. DC Microgrids as a Markov Decision Process

To solve the optimization problem in Problem 1, we can consider the DC microgrid with unknown system dynamics as a Markov Decision Process (MDP) [23]. By doing so, we assume that the DC microgrid system fulfills the following criteria:
  • the future state $s(k+1)$ depends only on the current state $s(k)$, not on the previous state history,
  • the system accepts a finite set of actions $a(k)$ at every step,
  • the system provides state information $s(k)$ and a reward $r(k)$ at every step.
The reward function $r(k)$ denotes how good the performance of the state $s(k)$ at iteration $k$ is. Therefore, the optimization problem described in Problem 1, where the cost function $C$ needs to be minimized, can be transformed into an optimization problem of finding a policy $\pi = \{u^*(k)\}$ which maximizes the total reward $R = \sum_{k=0}^{M} r(k)$ over $M$ steps.
The discrete dynamics equation in (21) can be transformed into an MDP where the state $s$ is given by $s = [x\; c]^T$ and the action is $a = u$. We assume that the cost $c(k)$ follows Equation (24). The reward function $r(k)$ can be derived from the cost function $c(k)$ as follows:
$$r(k) = \begin{cases} c_s - c(k) & \text{if } c_s > c(k) \\ 0 & \text{if } c_s \le c(k), \end{cases}$$
where $c_s$ denotes a positive constant. Thus, the objective function used to evaluate each model is
$$\max_u \sum_{k=0}^{M} r(k).$$
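A minimal environment wrapper that exposes the switched dynamics as an MDP with the reward shaping of (25) is sketched below; the class name, interface, and numerical values are illustrative assumptions rather than the authors' implementation.

```python
import numpy as np

class DCMicrogridEnv:
    """Toy MDP view of the switched dynamics: state s = [x, c], action = mode index."""
    def __init__(self, A_k, B_k, Q, gamma, c_s, x0, horizon):
        self.A_k, self.B_k, self.Q, self.gamma = A_k, B_k, Q, gamma
        self.c_s, self.x0, self.horizon = c_s, np.array(x0, float), horizon

    def reset(self):
        self.k, self.x = 0, self.x0.copy()
        return np.append(self.x, 0.0)                               # s = [x, c]

    def step(self, sigma):
        cost = 0.5 * self.x @ self.Q @ self.x + self.gamma[sigma]   # Eq. (24)
        reward = max(self.c_s - cost, 0.0)                          # Eq. (25)
        self.x = self.A_k[sigma] @ self.x + self.B_k[sigma]         # Eq. (21)
        self.k += 1
        done = self.k >= self.horizon
        return np.append(self.x, cost), reward, done

# Example instantiation with placeholder values.
env = DCMicrogridEnv(A_k=[np.array([[0.9]]), np.array([[0.5]])],
                     B_k=[np.array([0.1]), np.array([0.1])],
                     Q=np.array([[1.0]]), gamma=[0.1, 0.3], c_s=0.2,
                     x0=[1.0], horizon=28)
s = env.reset()
```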

3.2. Q-Learning for Near-Optimal Operation Control

To solve the optimization problem for the MDP modeling the dynamics of the DC microgrids, a reinforcement learning approach is employed. This approach allows an agent to interact with an environment (i.e., the system) without knowledge of the system's dynamics by applying an action $a$ and receiving state information $s$ as well as a reward $r$, as shown in Figure 3. The agent then learns from this information to decide the optimal policy $\pi = \{a(k)\}$, where $k \in \{0, \ldots, M\}$, which maximizes the cumulative reward $R = \sum_{k=0}^{M} r(k)$ over $M$ steps.
The reinforcement learning algorithm employed in this paper falls under the class of algorithms called Q-learning [18], which has been successfully employed for various engineering problems, such as gaming [24] and robotics [25]. The algorithm works by predicting an action-value function $Q(s, a)$ for every state $s$ and action $a$, which reflects the expected final reward when choosing action $a$ in state $s$. Once $Q(s, a)$ is established, the optimal policy $\pi(s)$ for every state $s$ selects the action with the highest action value (i.e., the one that will yield the most reward in the long term), as given by [18],
$$\pi(s) = \arg\max_a Q(s, a).$$
For a discrete number of states, the way to build an estimate of the action-value function $Q(s, a)$ is by constructing a table which maps a state $s$ and action $a$ into a value $Q(s, a)$, as shown in Figure 4. This algorithm, called tabular Q-learning, starts with an empty table. The table is updated upon interaction with the environment by taking into account the current state $s$, the given action $a$, the current reward $r(s)$, and the next state $s' = s(k+1)$. The update equation is based on Bellman's optimality equation and is given by [18],
$$Q(s, a) \leftarrow Q(s, a) + \alpha \left( r(s) + \gamma \max_{a'} Q(s', a') - Q(s, a) \right).$$
Here, $\alpha$ stands for a learning rate constant, which determines how much we allow the value of $Q(s, a)$ to change at every iteration, and $\gamma$ stands for the reward discount factor, which determines how much we value the contribution of future rewards to the value of the current state. The iteration process stops once the cumulative reward $R$ over $M$ steps reaches a certain threshold $R_s$.
One of the main reasons for choosing Q-learning to solve the optimization problem in Problem 1 is that it is an off-policy algorithm, i.e., it is able to learn from and reuse any past experience regardless of the policy used to gather this experience. In the context of DC microgrids, Q-learning is able to learn from past data of state and action pairs, such as the current and operation cost data corresponding to any mode of operation. Due to this fact, Q-learning has better data sampling efficiency than online reinforcement learning approaches such as policy gradient methods [26]. Another reason is that it is designed specifically for problems with a discrete action space, which is characteristic of the operation control problem, i.e., determining which mode of operation is optimal among a discrete number of possible modes.
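The tabular update in (28) with an epsilon-greedy behavior policy can be written compactly as below; the sketch reuses the illustrative environment from Section 3.1, discretizes the continuous state by rounding (an assumption made only for this example), and all hyperparameter values are placeholders.

```python
import numpy as np
from collections import defaultdict

def q_learning(env, n_modes, episodes=2000, alpha=0.2, gamma=0.9, eps=0.1):
    """Tabular Q-learning with an epsilon-greedy behavior policy, Eq. (28)."""
    Q = defaultdict(lambda: np.zeros(n_modes))
    key = lambda s: tuple(np.round(s, 2))          # coarse state discretization
    for _ in range(episodes):
        s, done = env.reset(), False
        while not done:
            a = (np.random.randint(n_modes) if np.random.rand() < eps
                 else int(np.argmax(Q[key(s)])))
            s_next, r, done = env.step(a)
            target = r + (0.0 if done else gamma * np.max(Q[key(s_next)]))
            Q[key(s)][a] += alpha * (target - Q[key(s)][a])   # Bellman update
            s = s_next
    return Q

# Greedy policy extraction, Eq. (27): pi(s) = argmax_a Q(s, a).
# Q = q_learning(env, n_modes=2); a_star = int(np.argmax(Q[tuple(np.round(s, 2))]))
```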

3.3. Q-Network for Operation Control under Uncertainty

The tabular Q-learning approach has been found to be extremely effective in various applications when the number of states $s$ is discrete. However, this is mostly not the case for DC microgrid systems, which can have a large number of possible current values and operational costs as states. The presence of uncertainty in the measurement of the state value adds further complications. To handle a system with a large number of states, or even a continuous state, the Q-table can be replaced with a function approximator which directly maps state $s$ and action $a$ into $Q(s, a)$. One such function approximator which has gained prominence in recent years is the neural network, due to its ability to approximate highly non-linear functions solely from raw data [27]. A recent study employing neural networks in Q-learning as a function approximator of the action-value function $Q(s, a)$ was carried out in [28]. This approach, called Q-network, works essentially the same as Q-learning, except that the Q-table is replaced with a neural network and the update equation in (28) is replaced with training steps of the neural network.
The neural network $\phi(s)$, as depicted in Figure 5, accepts a state $s$ as an input and returns the value $Q(s, a)$ of each possible discrete action $a$. The target output of the network is derived from Bellman's equation, similar to the term in Equation (28), as follows:
$$Q_t(s, a) = r(s) + \gamma \max_{a'} Q(s', a').$$
The loss, given by $L = Q(s, a) - Q_t(s, a)$, is used to update the network's parameters using the gradient descent algorithm.
To facilitate better training of the neural network (which ideally requires uncorrelated training data), a strategy called an experience buffer is employed. Upon interacting with the environment, the agent does not update the neural network's parameters solely based on the current information, but saves a tuple $\tau$ defined as $\tau = (s, a, s', r)$ into a buffer of length $L_b$. In every iteration, a total of $N_t < L_b$ tuples are randomly sampled from the experience buffer and used as training data. The training process stops once the average cumulative reward over $M$ steps, $R_{avg}$, in the last $N_E$ completed episodes reaches a certain threshold $R_s$. To obtain diverse data at the initial stage of training, the agent initially starts with purely random actions, i.e., the probability $\epsilon$ of choosing a random action is set to 1. This parameter $\epsilon$ is then linearly decreased at every iteration until it reaches a certain minimum value $\epsilon_{min}$ after $N_\epsilon$ iterations. Afterwards, the agent has a probability of $\epsilon_{min}$ of choosing a random action and a probability of $1 - \epsilon_{min}$ of choosing the optimal action in (27) from the output $Q(s, a)$ provided by the neural network.
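A compact PyTorch sketch of the Q-network with an experience buffer is given below; it mirrors the one-hidden-layer, four-node architecture mentioned in Section 4, but the buffer size, batch size, optimizer, and learning rate are placeholder assumptions rather than the authors' settings.

```python
import random
from collections import deque
import numpy as np
import torch
import torch.nn as nn

class QNetwork(nn.Module):
    """State -> Q(s, a) for every discrete action; one hidden layer of 4 units."""
    def __init__(self, state_dim, n_actions, hidden=4):
        super().__init__()
        self.net = nn.Sequential(nn.Linear(state_dim, hidden), nn.ReLU(),
                                 nn.Linear(hidden, n_actions))

    def forward(self, s):
        return self.net(s)

def train_step(qnet, optimizer, buffer, batch_size=32, gamma=0.99):
    """Sample uncorrelated transitions and regress Q(s,a) onto the Bellman target Q_t(s,a)."""
    if len(buffer) < batch_size:
        return
    s, a, s_next, r = map(np.array, zip(*random.sample(buffer, batch_size)))
    s = torch.as_tensor(s, dtype=torch.float32)
    s_next = torch.as_tensor(s_next, dtype=torch.float32)
    a = torch.as_tensor(a, dtype=torch.int64)
    r = torch.as_tensor(r, dtype=torch.float32)
    with torch.no_grad():
        target = r + gamma * qnet(s_next).max(dim=1).values   # Bellman target
    q_sa = qnet(s).gather(1, a.unsqueeze(1)).squeeze(1)
    loss = nn.functional.mse_loss(q_sa, target)               # L = Q(s,a) - Q_t(s,a), squared
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()

# Usage outline: interact with the environment, push (s, a, s', r) tuples into the
# buffer, decay epsilon linearly, and call train_step at every iteration.
buffer = deque(maxlen=50_000)
qnet = QNetwork(state_dim=2, n_actions=2)
optimizer = torch.optim.Adam(qnet.parameters(), lr=1e-3)
```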

4. Methodologies

To evaluate the performance of the algorithms, simulations in the following two environments are used:
  • A simple environment: a simplified version of a DC microgrid system consisting of a single source and a single load connected via a transmission line ($N_x = 1$). The system can be operated in 2 different modes ($N = 2$). The number of update steps in a single episode of operation is set to $M = 28$. For simplicity, the state equation matrices are assumed to be $A = B = 1$ for mode $\sigma = 1$ and $A = 10$, $B = 1$ for mode $\sigma = 2$.
  • A complex environment: a more realistic DC microgrid system consisting of 3 nodes. Each node consists of a source and/or a load. Two transmission lines ($N_x = 2$) connect Node 1 to Node 2 and Node 2 to Node 3, as shown in Figure 6. The system can be operated in 2 different modes ($N = 2$). The number of update steps in a single episode of operation is set to $M = 40$. In this environment, it is assumed that the system's mode of operation can only be updated once every 1 s. The state equation matrices used in this environment are derived from (16)–(22).
The two environments are chosen because of their simplicity, while still being representative of a DCMG power system model formulated as an MIQP problem. The first environment is the simplest possible representation of the system model, while the second is a basic building block for any DCMG system topology. For future research, this second environment can be used to build larger and more complex DCMG system topologies.
To evaluate the performance of the algorithms, we design 3 different scenarios. In the first, the model is assumed to be perfectly known to the MIQP algorithm and the current measurement is perfectly accurate, i.e., there is no modeling or measurement uncertainty. In the second, there is a modeling error, i.e., the model information used by the MIQP algorithm does not perfectly match the real model of the system. In this scenario, the real transition matrix $A_\sigma$ and input vector $B_\sigma$ for each mode $\sigma$ have offsets of $\Delta A$ and $\Delta B$, respectively, with respect to the model information known to the MIQP algorithm. Finally, in the third scenario, Gaussian noise with mean $\mu$ and standard deviation $\upsilon$ is added to the current measurement $x$; thus, the third scenario contains both the Gaussian noise and the modeling error at the same time.
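The modeling-error and measurement-noise scenarios can be emulated with a small helper such as the sketch below; the relative offset and noise parameters here are illustrative placeholders and only loosely mirror the values in Table 2.

```python
import numpy as np

rng = np.random.default_rng(0)

def perturb_model(A, B, rel_offset=0.30):
    """Scenario 2: the real plant differs from the nominal model by a relative offset."""
    return A * (1.0 + rel_offset), B * (1.0 + rel_offset)

def noisy_measurement(x, mu=0.0, sigma=0.01):
    """Scenario 3: Gaussian noise added to the current measurement x."""
    return x + rng.normal(mu, sigma, size=np.shape(x))

A_nom, B_nom = np.array([[-1.0]]), np.array([1.0])
A_real, B_real = perturb_model(A_nom, B_nom)         # used by the simulator only
x_meas = noisy_measurement(np.array([0.8]))          # what the agent/MIQP observes
```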
Apart from implementing Q-learning and the Q-network, we also implement an optimization technique based on mixed-integer quadratic programming (MIQP), as employed in [15]. The MIQP approach is used as the baseline and is simulated under the assumption that the dynamics model of the DC microgrid is known. We tested the Q-learning, Q-network, and MIQP algorithms to find the optimal action for each scenario in the given environment. Note that Q-learning and the Q-network do not know anything about the dynamics of the system, while MIQP requires information regarding the model's dynamics. The neural network employed in the Q-network algorithm contains 1 hidden layer with 4 nodes. The parameters used in the algorithms are listed in detail in Table 2.

5. Results

5.1. Simple Environment

First, we present the results of the simulation in the first scenario, where there is no modeling or measurement error. For this problem, the Q-learning and Q-network algorithms produce identical solutions, shown in Figure 7a,b. The best policy produced by both algorithms is shown in Figure 7a and the resulting state dynamics are shown in Figure 7b.
To evaluate the performance of both algorithms, we compare the cumulative reward per episode as a function of the training iteration number in Figure 8a. The red, blue, and green lines indicate the cumulative reward produced by the Q-learning, Q-network, and MIQP algorithms, respectively. Note that the cumulative reward of the Q-network is presented as an average cumulative reward over the last 100 episodes to take into account the stochastic nature of the Q-network algorithm. We can observe that the Q-learning algorithm (red line) converges very quickly towards its best solution. The produced cumulative reward is also very close to the optimal solution produced by MIQP (green line). The Q-network algorithm (blue line) produces a solution with a lower average cumulative reward than Q-learning and MIQP. However, as can be observed in Figure 7a, the best solution produced by the Q-network matches the one produced by Q-learning. Consequently, as shown in Figure 8b, the reward accumulated by the Q-network over 28 steps matches the reward accumulated by Q-learning. Overall, the total reward accumulated by both of these algorithms (red line for Q-learning and blue line for Q-network) over 28 steps is only slightly smaller than the optimal solution predicted by MIQP (green line), as shown in Figure 8b. This can also be concluded from Table 3, where the final reward of Q-learning and Q-network (3.5867) is shown to be very close to the reward of MIQP (3.5871). Note that this performance is achieved by Q-learning and the Q-network without prior information on the system's dynamics, contrary to MIQP.
The simulation results of the second scenario, where there is a modeling error in the system, are shown in Figure 9. From the plot of cumulative reward per episode as a function of training iteration number in Figure 9a, we can observe that the Q-learning algorithm (red line) once again converges rapidly to its final solution. However, we can observe that the solution reached by this algorithm is no longer close to the solution provided by MIQP (green line), even in the presence of model error. The average cumulative reward produced by the Q-network (blue line) is also lower than in the case without model error. However, if we observe the cumulative reward gathered over the course of 1 episode (28 steps) in Figure 9b, the best solution of the Q-network (blue line) is slightly better than the one produced by MIQP (green line). This is the case because MIQP no longer provides the globally optimal solution for the system due to the presence of the modeling error. The Q-network algorithm, on the other hand, learns directly from the system without prior information regarding the model, and thus its performance is not affected by the modeling error. We can observe this more clearly from Table 3, where the best reward produced by the Q-network (3.6554) is better than the rewards produced by Q-learning (3.3092) and MIQP (3.6293).
The simulation results of the third scenario, in the presence of both the modeling error and the measurement noise, are shown in Figure 10. In this scenario, the Q-learning algorithm fails to converge towards a solution. This is mainly caused by the fact that the measurement noise introduces a large variation in the state values, which consequently causes the Q-table to become too large. In contrast, the Q-network algorithm does not suffer from this problem because it uses a continuous function approximator in the form of a neural network. In Figure 10, we can observe that the Q-network algorithm in this case converges much faster (slightly after 10,000 iterations) than in the previous cases. This demonstrates that the Q-network algorithm actually works better in more complex scenarios, such as the one with modeling and measurement uncertainty. The best solution produced by the Q-network algorithm (blue line) also yields a cumulative reward that is very close to the solution provided by MIQP (green line) in Figure 10. In fact, the best cumulative reward of the Q-network (3.6685) is once again better than the one produced by MIQP (3.6667), as shown in Table 3. This demonstrates the ability of the Q-network algorithm to solve complex optimization problems in the presence of uncertainty without relying on prior information about the system's model. These features make the Q-network algorithm a very strong candidate for use in a real operational control problem of a DC microgrid system.

5.2. Complex Environment

In this section, we present the results for the complex environment consisting of three nodes. The configuration of the network is shown in Figure 6. The setup is almost identical to that of Section 5.1, differing only in system parameters. The system parameters are derived from the state equation matrices presented in (16)–(22). A resulting sample of the transient current transfer between nodes as an impact of the switching can be seen in Figure 11. First, for a system without modeling and measurement error, the cumulative reward per episode, consisting of 40 steps, as a function of training iteration number can be observed in Figure 12. We can observe that the Q-learning algorithm (red line) behaves as in Section 5.1, converging very quickly towards its best solution. However, as we can see from Figure 12, Q-learning achieves the lowest score, followed by the Q-network, while MIQP achieves the highest score. Furthermore, an increase in system complexity can be seen to have an impact on the model-free approaches, as the gap between both Q-learning (74.2707) and Q-network (74.4254) and the globally optimal solution of MIQP (75.0489) widens compared to the simple system in Section 5.1. This happens because the information gap between the model-free approaches and the fully known system dynamics parameters used by the MIQP solution also widens.
The simulation results of the second scenario, a complex system with a modeling error, are shown in Figure 13. From the plot of cumulative reward per episode as a function of training iteration number in Figure 13, we can observe that the Q-learning algorithm (red line) again converges rapidly to its final solution, even achieving the best score, followed by the Q-network. From Figure 13, we can see that over the course of 1 episode (40 steps), both Q-learning (72.4335) and the Q-network (72.3714) perform relatively well, even without prior knowledge of the model's dynamics, compared to MIQP (72.1596). Furthermore, the presence of modeling error greatly shifts the MIQP solution from the global optimum in the complex system scenario, showing the vulnerability of the MIQP approach under an inaccurate system model. Nevertheless, the Q-network requires many iterations to converge (around 500,000 iterations) compared to the first scenario, where no model error exists.
The simulation results of the third scenario, where there are both modeling error and measurement noise in the system, are shown in Figure 14. In this scenario, the Q-learning algorithm again fails to converge towards a solution, just like in the simple environment. In contrast, the Q-network algorithm is still capable of handling the noise using its neural network model. Furthermore, we can see from Figure 14 that the solution of the Q-network (72.3714) achieves a better score than the solution provided by MIQP (72.1596). This result shows that the Q-network algorithm is capable of handling both modeling and measurement uncertainty even in a complex scenario. These features once again solidify the Q-network algorithm as a very strong candidate for use in a real operational control problem of a DC microgrid system. Nevertheless, further research is needed to better understand the limits of system complexity and measurement error that the Q-network is able to handle in a feasible manner with acceptable results.
To investigate the impact of node quantity on the performance and training time of the Q-network, a series of experiments was conducted. The simulations involved placing clusters of nodes in a sequential arrangement, following the configuration of the complex network illustrated in Figure 6. The simulations used 50, 100, and 1000 copies of the three-node cluster, which can be equivalently represented as 150, 300, and 3000 nodes, respectively. The results of these experiments can be found in Table 4. The findings indicate that increasing the system size leads to a higher computational burden during training. However, the rate of increase in training time is lower than the rate of increase in the number of nodes. This suggests that the increase in computational time is primarily caused by the small-signal calculations, rather than by the Q-network algorithm itself.

6. Conclusions and Future Works

In this paper, we propose a reinforcement learning approach to solve the optimal operational control problem of a DC microgrid system. The proposed algorithm works without relying on prior information regarding the system's dynamics; rather, it learns the optimal policy by interacting directly with the system. We tested two algorithms, namely Q-learning and the Q-network, and compared their performance with the MIQP algorithm found in the literature as the baseline. The proposed algorithms are tested in two different environments, a simple and a complex environment, and in three scenarios, namely perfect information, modeling error, and measurement noise. In both the first and second environments, the Q-learning (3.5867 and 74.2707) and Q-network (3.5867 and 74.4254) algorithms are shown to produce solutions very close to the optimal solution of MIQP (3.5871 and 75.0489). These results are achieved by the RL methods without knowledge of the system's dynamics, whereas the MIQP solution requires an accurate system model. It is shown that, when an accurate system model does not exist, i.e., there is an error in the system's model, both Q-learning (3.3092 and 72.4335) and the Q-network (3.6554 and 72.3714) can perform better than MIQP (3.6293 and 72.1596), which requires an accurate system model. Moreover, with the introduction of measurement noise on top of the modeling error, the proposed Q-network algorithm is shown to produce a better solution than MIQP. These preliminary results demonstrate that the reinforcement learning algorithm is a strong candidate for solving the optimal operation control of a DC microgrid system, especially when the system's dynamics model is not available.
By showing the capabilities of reinforcement learning-based algorithms in solving the model-free DC microgrid distributed droop switching problem, this paper suggests that other off-policy algorithms that do not require an exact system model might also be able to solve the same problem. As a basic starting point for implementing a reinforcement learning algorithm in such environments, this paper also opens a challenge for newer and more complex reinforcement learning algorithms that can be explored to achieve better results. Other kinds of machine-learning algorithms that do not require an exact system model and can learn from experience are also interesting topics to be explored.
In the future, more complicated environments and scenarios, such as ones consisting of large numbers of sources or modes and more complicated constraints, can be used to further explore the limits of system complexity, mode choices, and uncertainties due to modeling and measurement error that the Q-network can handle. We will also explore the use of more complex neural network architectures to improve the performance of the Q-network algorithm. Finally, employing real data from a DC microgrid system in the training process will also be investigated.

Author Contributions

Conceptualization, R.I. and S.; methodology, R.I., H.R.A., L.M.P. and E.F.; software, A.A.A.R. and M.Y.; validation, R.I., L.M.P. and S.; formal analysis, R.I.; investigation, R.I., A.A.A.R. and M.Y.; resources, R.I. and S.; data curation, A.A.A.R. and M.Y.; writing—original draft preparation, A.A.A.R. and M.Y.; writing—review and editing, R.I., A.A.A.R. and M.Y.; visualization, A.A.A.R. and M.Y.; supervision, S.; project administration, L.M.P.; funding acquisition, S. All authors have read and agreed to the published version of the manuscript.

Funding

This research was supported by the Indonesian Ministry of Research and Technology/National Agency for Research and Innovation and the Indonesian Ministry of Education and Culture under the World Class University Program managed by Institut Teknologi Bandung.

Data Availability Statement

No new data were created or analyzed in this study. Data sharing is not applicable to this article.

Conflicts of Interest

The authors declare no conflict of interest.

Abbreviations

The following abbreviations are used in this manuscript:
AC      alternating current
ACMG    AC microgrids
ADP     adaptive dynamic programming
BESS    battery energy storage system
CPS     constant power source
CPL     constant power load
DC      direct current
DCMG    DC microgrids
DOD     depth of discharge
DVS     droop-controlled voltage source
ESS     energy storage system
GA      genetic algorithm
HVAC    high-voltage AC
KCL     Kirchhoff's current law
KVL     Kirchhoff's voltage law
LVDCMG  low-voltage DCMG
MDP     Markov decision process
MILP    mixed-integer linear programming
MIQP    mixed-integer quadratic programming
MVAC    medium-voltage AC
PMU     phasor measurement unit
PSO     particle swarm optimization
PV      photo-voltaic
RL      reinforcement learning
SC      supervisory control
SOC     state of charge
TS      tabu search
VVC     Volt/Var control

References

  1. Blaabjerg, F.; Teodorescu, R.; Liserre, M.; Timbus, A.V. Overview of control and grid synchronization for distributed power generation systems. IEEE Trans. Ind. Electron. 2006, 53, 1398–1409.
  2. Carrasco, J.M.; Franquelo, L.G.; Bialasiewicz, J.T.; Galván, E.; PortilloGuisado, R.C.; Prats, M.M.; León, J.I.; Moreno-Alfonso, N. Power-electronic systems for the grid integration of renewable energy sources: A survey. IEEE Trans. Ind. Electron. 2006, 53, 1002–1016.
  3. Hatziargyriou, N.; Asano, H.; Iravani, R.; Marnay, C. Microgrids. IEEE Power Energy Mag. 2007, 5, 78–94.
  4. Guerrero, J.M.; Vasquez, J.C.; Matas, J.; de Vicuna, L.G.; Castilla, M. Hierarchical Control of Droop-Controlled AC and DC Microgrids—A General Approach Toward Standardization. IEEE Trans. Ind. Electron. 2011, 58, 158–172.
  5. Hou, N.; Li, Y. Communication-Free Power Management Strategy for the Multiple DAB-Based Energy Storage System in Islanded DC Microgrid. IEEE Trans. Power Electron. 2021, 36, 4828–4838.
  6. Irnawan, R.; da Silva, F.F.; Bak, C.L.; Lindefelt, A.M.; Alefragkis, A. A droop line tracking control for multi-terminal VSC-HVDC transmission system. Electr. Power Syst. Res. 2020, 179, 106055.
  7. Peyghami, S.; Mokhtari, H.; Blaabjerg, F. Chapter 3—Hierarchical Power Sharing Control in DC Microgrids. In Microgrid; Mahmoud, M.S., Ed.; Butterworth-Heinemann: Oxford, UK, 2017; pp. 63–100.
  8. Shuai, Z.; Fang, J.; Ning, F.; Shen, Z.J. Hierarchical structure and bus voltage control of DC microgrid. Renew. Sustain. Energy Rev. 2018, 82, 3670–3682.
  9. Abhishek, A.; Ranjan, A.; Devassy, S.; Kumar Verma, B.; Ram, S.K.; Dhakar, A.K. Review of hierarchical control strategies for DC microgrid. IET Renew. Power Gener. 2020, 14, 1631–1640.
  10. Chouhan, S.; Tiwari, D.; Inan, H.; Khushalani-Solanki, S.; Feliachi, A. DER optimization to determine optimum BESS charge/discharge schedule using Linear Programming. In Proceedings of the 2016 IEEE Power and Energy Society General Meeting (PESGM), Boston, MA, USA, 17–21 July 2016; pp. 1–5.
  11. Maulik, A.; Das, D. Optimal operation of a droop-controlled DCMG with generation and load uncertainties. IET Gener. Transm. Distrib. 2018, 12, 2905–2917.
  12. Dragičević, T.; Guerrero, J.M.; Vasquez, J.C.; Škrlec, D. Supervisory Control of an Adaptive-Droop Regulated DC Microgrid With Battery Management Capability. IEEE Trans. Power Electron. 2014, 29, 695–706.
  13. Massenio, P.R.; Naso, D.; Lewis, F.L.; Davoudi, A. Assistive Power Buffer Control via Adaptive Dynamic Programming. IEEE Trans. Energy Convers. 2020, 35, 1534–1546.
  14. Massenio, P.R.; Naso, D.; Lewis, F.L.; Davoudi, A. Data-Driven Sparsity-Promoting Optimal Control of Power Buffers in DC Microgrids. IEEE Trans. Energy Convers. 2021, 36, 1919–1930.
  15. Ma, W.J.; Wang, J.; Lu, X.; Gupta, V. Optimal Operation Mode Selection for a DC Microgrid. IEEE Trans. Smart Grid 2016, 7, 2624–2632.
  16. Anand, S.; Fernandes, B.G. Reduced-Order Model and Stability Analysis of Low-Voltage DC Microgrid. IEEE Trans. Ind. Electron. 2013, 60, 5040–5049.
  17. Alizadeh, G.A.; Rahimi, T.; Babayi Nozadian, M.H.; Padmanaban, S.; Leonowicz, Z. Improving Microgrid Frequency Regulation Based on the Virtual Inertia Concept while Considering Communication System Delay. Energies 2019, 12, 2016.
  18. Sutton, R.; Barto, A. Reinforcement Learning: An Introduction. IEEE Trans. Neural Netw. 2005, 16, 285–286.
  19. Glavic, M. (Deep) Reinforcement learning for electric power system control and related problems: A short review and perspectives. Annu. Rev. Control 2019, 48, 22–35.
  20. Wang, W.; Yu, N.; Gao, Y.; Shi, J. Safe off-policy deep reinforcement learning algorithm for volt-var control in power distribution systems. IEEE Trans. Smart Grid 2019, 11, 3008–3018.
  21. Hadidi, R.; Jeyasurya, B. Reinforcement learning based real-time wide-area stabilizing control agents to enhance power system stability. IEEE Trans. Smart Grid 2013, 4, 489–497.
  22. Yan, Z.; Xu, Y. A Multi-Agent Deep Reinforcement Learning Method for Cooperative Load Frequency Control of a Multi-Area Power System. IEEE Trans. Power Syst. 2020, 35, 4599–4608.
  23. Bellman, R.E.; Dreyfus, S.E. Chapter XI. Markovian Decision Processes. In Applied Dynamic Programming; Princeton University Press: Princeton, NJ, USA, 1962; pp. 297–321.
  24. Goldwaser, A.; Thielscher, M. Deep Reinforcement Learning for General Game Playing. Proc. AAAI Conf. Artif. Intell. 2020, 34, 1701–1708.
  25. Ibarz, J.; Tan, J.; Finn, C.; Kalakrishnan, M.; Pastor, P.; Levine, S. How to Train Your Robot with Deep Reinforcement Learning; Lessons We've Learned. arXiv 2021, arXiv:2102.02915.
  26. Graesser, L.; Keng, W. Foundations of Deep Reinforcement Learning: Theory and Practice in Python; Addison-Wesley Data & Analytics Series; Pearson Education: London, UK, 2019.
  27. Goodfellow, I.; Bengio, Y.; Courville, A. Deep Learning; Adaptive Computation and Machine Learning Series; MIT Press: Cambridge, MA, USA, 2016.
  28. Mnih, V.; Kavukcuoglu, K.; Silver, D.; Rusu, A.A.; Veness, J.; Bellemare, M.G.; Graves, A.; Riedmiller, M.A.; Fidjeland, A.; Ostrovski, G.; et al. Human-level control through deep reinforcement learning. Nature 2015, 518, 529–533.
Figure 1. The operation modes of a DC source represented by a single-slope active power ( P d c ) and DC voltage ( U d c ) relationship: (a) droop-controlled voltage, (b) constant power, and (c) constant voltage mode [6]. The pre-disturbance operating point of the converter is indicated by the red dot, while the blue dot represents the post-disturbance operating point.
Figure 2. Each node of the DC microgrid systems can be modeled as a droop-controlled voltage source, a constant power source, or a constant power load.
Figure 3. Reinforcement learning agents works by interacting with an environment, such as DC microgrid systems, by performing an action and gathering observation and reward to learn the most optimal policy.
Figure 4. Q-learning works by building Q-table to estimate the values Q ( s , a ) of each action a in a particular state s .
Figure 5. Instead of a table, neural network can serve as an action-value approximator in Q-learning.
Figure 6. The complex environment consists of DC microgrid systems with 3 nodes and 2 bus connections.
Figure 7. (a) The action given to the simple DC microgrid system produced by Q-learning or Q-network and (b) the produced state dynamics.
Figure 8. (a) Comparison of cumulative reward per episode as function of training iteration produced by Q-learning (red) and Q-network (blue) in comparison to the ideal solution provided by MIQP (green) and (b) comparison of cumulative reward gathered in 1 episode (28 steps) produced by Q-learning (red), Q-network (blue), and MIQP (green) in the case of normal DC microgrid system without model error. Note that the red and blue lines are on top of each other.
Figure 9. (a) Comparison of cumulative reward per episode as function of training iteration produced by Q-learning (red) and Q-network (blue) in comparison to the ideal solution provided by MIQP (green) and (b) comparison of cumulative reward gathered in 1 episode (28 steps) produced by Q-learning (red), Q-network (blue), and MIQP (green) in the case of normal DC microgrid system with model error.
Figure 10. (a) Comparison of cumulative reward per episode as function of training iteration produced by Q-network (blue) in comparison to the ideal solution provided by MIQP (green) and (b) comparison of cumulative reward gathered in 1 episode (28 steps) produced by Q-network (blue) and MIQP (green) in the case of normal DC microgrid system with model error and measurement noise. Q-learning solution is not shown due to non-convergence.
Figure 11. Sample current transient from mode 1 to mode 2 in the case of a complex DC microgrid system.
Figure 12. (a) Comparison of cumulative reward per episode as a function of training iteration produced by Q-learning (red) and Q-network (blue) in comparison to the ideal solution provided by MIQP (green) and (b) comparison of cumulative reward gathered in 1 episode (40 steps) produced by Q-learning (red), Q-network (blue), and MIQP (green) in the case of complex DC microgrid system without model error.
Figure 13. (a) Comparison of cumulative reward per episode as function of training iteration produced by Q-learning (red) and Q-network (blue) in comparison to the ideal solution provided by MIQP (green) and (b) comparison of cumulative reward gathered in 1 episode (40 steps) produced by Q-learning (red), Q-network (blue), and MIQP (green) in the case of complex DC microgrid system with model error.
Figure 14. (a) Comparison of cumulative reward per episode as function of training iteration produced by Q-network (blue) in comparison to the ideal solution provided by MIQP (green) and (b) comparison of cumulative reward gathered in 1 episode (40 steps) produced by Q-network (blue) and MIQP (green) in the case of complex DC microgrid system with model error and measurement noise. Q-learning solution is not shown due to non-convergence.
Table 1. Comparison of Related Works.

Paper | Systems | Algorithm | Objective | Unknown
[11] | DCMG | PSO | Economic, Environment | Droop
[12] | DCMG | SC | Stability | Mode
[15] | DCMG | MILP | Economic, Stability | Mode
[13] | DCMG | ADP | Stability | Policy
[14] | DCMG | RL, TS | Topology, Stability | Edge, Policy
[17] | ACMG | GA | Stability, Frequency | Generation
[20] | AC | RL | Economic, Stability | Tap Changer
[21] | AC | RL | Stability | Control Signal
Ours | DCMG | RL | Economic, Stability | Mode

Paper | Voltage Level | Data Challenge: Delay | Noise | Error
[11] | LV | - | - | -
[12] | LV | - | - | -
[15] | LV | - | - | -
[13] | LV | - | - | -
[14] | LV | - | - | -
[17] | LV | ✓ | - | -
[20] | MV | - | - | -
[21] | HV | - | - | -
[22] | HV | - | - | -
Ours | LV | - | ✓ | ✓
Table 2. List of Parameter Values.

Param | Value (Simple) | Value (Complex) | Units
γ (Q-learning) | 0.9 | 0.9 | -
α (Q-learning) | 0.2 | 0.2 | -
γ (Q-network) | 0.99 | 0.99 | -
α (Q-network) | 0.01 | 0.001 | -
c_s | 0.2 | 2 | -
μ | 0 | 0 | ampere
υ | 0.01 | 0.1 | ampere
Δt | 0.1 | 0.0001 | second
ΔA, ΔB | 30 | 10 | %
Q | 1 | 0.00001 | -
γ_σ | 0.1 | 1 | -
N_t | 28 | 40 | -
L_b | 30,000 | 50,000 | -
N_ε | 10,000 | 250,000 | -
ε_min | 0.02 | 0.02 | -
Table 3. Summary of Results.

System | Model Error | Noise | Best Reward per Episode: Q-Learning | Q-Network | MIQP
Simple | x | x | 3.5867 | 3.5867 | 3.5871
Simple | ✓ | x | 3.3092 | 3.6554 | 3.6293
Simple | ✓ | ✓ | - | 3.6685 | 3.6667
Complex | x | x | 74.2707 | 74.4254 | 75.0489
Complex | ✓ | x | 72.4335 | 72.3714 | 72.1596
Complex | ✓ | ✓ | - | 72.3714 | 72.1596
Table 4. Impact of System Size.

System Multiplier | Number of Nodes | Training Time (s)
1 | 3 | 13.9498
100 | 300 | 28.1907
1000 | 3000 | 125.2556