1. Introduction
The modernization of the electric grid [1] has profoundly changed energy usage by integrating sustainable resources, improving efficiency of use, and strengthening security of supply. Smart-grid technologies [2,3] that enable two-way communication between the utility and its customers, together with advanced sensing along the transmission lines [4], play a crucial role in this modernization process. Among these technologies, the microgrid (MG) is viewed as a key component. In a MG, distributed generators (DGs), renewable energy sources (RES), and energy storage systems (ESS) are integrated into a distribution grid to supply local consumers [5]. A MG can operate in parallel with the main grid to fully exploit distributed energy resources, or in islanded mode to guarantee reliable local service when there is a failure in the main utility grid [6]. It is expected that multiple autonomous MGs collaborating with one another will become a dominant mode of operation in the future smart grid [7,8].
Nevertheless, the integration of distributed energy resources poses major challenges for the stable and economic operation of a MG. Distributed renewable energy, such as solar and wind, can be highly intermittent and stochastic. These uncertain resources, combined with uncertain load, cause random variations on both the supply and the demand sides, which make it difficult to plan accurate generation schedules. Although an ESS [9] can buffer the effects of this uncertainty, smart control strategies and an efficient energy management system (EMS) are required to operate the ESS and the DGs in a cooperative and efficient way.
Traditionally, a model-based paradigm is adopted for MG energy management. In general, model-based approaches use an explicit model to formulate the MG dynamics, a predictor to estimate the uncertainty, and an optimizer to compute the best schedules [10,11,12,13]. For example, rolling horizon optimization, or model predictive control (MPC), is one of the most popular model-based approaches. The main advantage of MPC is that it allows self-correction of the forecast of the model uncertainty and self-adjustment of the control sequence. It achieves this by repeatedly optimizing over the predictive model across a rolling time window. Many successful examples of its application can be found in the literature. Mario et al. [14] applied the MPC approach to optimize the generation scheduling of a renewable hydrogen-based microgrid. Emily et al. [15] proposed a robust optimization framework for microgrid operations; in particular, a rolling horizon optimization scheme that assimilates ensemble weather forecasts is adopted for real-time implementation of the proposed method. Zhongwen et al. [16] developed a strategy combining two-stage stochastic programming with MPC for MG energy management, considering the uncertainty of load demand, renewable energy generation, and electricity prices. In [17], Thomas et al. proposed a convex MPC strategy for dynamic optimal power flow control between multiple distributed ESSs in an AC MG.
Despite these advantages and the successful applications in the aforementioned works, model-based approaches rely heavily on domain expertise to construct appropriate MG models and parameters. The implementation of model-based approaches may therefore increase development and maintenance costs. Over time, the architecture, scale, and capacity of a MG may vary, and the distribution of the uncertainty in RES and load demand may change accordingly. Once they change, the model, the predictor, and the solver of a model-based controller must be re-designed correspondingly, which is neither cost-effective nor easy to maintain. In addition, the performance of a model-based controller may deteriorate if accurate models or appropriate parameter estimates are unavailable.
In recent years, learning-based schemes have been proposed to address MG energy management. Learning-based approaches relax the requirement for an explicit system model and a predictor to handle the uncertainty. They treat the MG as a black box and find a near-optimal strategy from interactions with it. For instance, Brida et al. [18] developed a battery energy management strategy for a MG using batch reinforcement learning (RL). Sunyong et al. [19] proposed an RL-based EMS for a MG-like smart building to reduce the operating cost. Ganesh et al. [20] proposed an evolutionary adaptive dynamic programming and RL framework for dynamic energy management of a smart MG. Elham et al. [21] designed a multiagent-based RL system for optimal distributed energy management in a MG. However, most learning-based approaches adopted in the aforementioned works suffer from the curse of dimensionality and have difficulty handling MGs with high-dimensional state variables and uncertainties.
To address these problems, deep reinforcement learning (DRL) approaches were proposed a few years ago in the machine learning community. DRL techniques overcome the challenge of learning from high-dimensional state inputs by exploiting the end-to-end learning capability of deep neural networks, and have achieved great success in the field of games [22,23]. Motivated by these successes, several works applying DRL to MG energy management have recently been reported in the literature. In [24], for instance, François et al. applied DRL to efficiently operate the storage devices in a MG, considering the uncertainty of future electricity consumption and PV production. Specifically, a deep learning architecture based on a convolutional neural network (CNN) was designed to extract knowledge from past time series of energy consumption and PV production. However, this work did not consider the uncertainty of electricity prices. In a real-time electricity market, the electricity prices, or locational marginal prices (LMPs), are generally uncertain and have an important impact on the management of MGs. In [25], Zeng et al. proposed an approximate dynamic programming (ADP) approach to MG energy management considering the uncertainty of load demand, renewable generation, and real-time electricity prices, as well as the power flow constraints. A recurrent neural network (RNN) is designed to make one-step-ahead state estimations and approximate the optimal value function. The MG model formulated in this work is elaborate; however, the proposed RL solution was model-based. It required explicit MG models and a one-step-ahead predictor of the uncertainty to solve the Bellman equation.
In this paper, we apply a specific DRL algorithm, the deep Q-network (DQN), to the optimal energy management of MGs under uncertainty. The objective is to find the most cost-effective generation schedules for the MG by taking full advantage of the ESS. To handle the uncertainty of load demand, RES production, and LMPs, the proposed approach uses their past observations as inputs and directly outputs the real-time dispatch of the DGs and the ESS. Thus, the proposed approach requires neither an explicit model nor a predictor.
Compared to prior studies, the major contributions of this paper are summarized as follows: (1) Considering the uncertainty of load demand, RES production, and LMPs, we formulate the problem of MG energy management as a Markov decision process (MDP) with unknown transition probabilities. Specifically, the state, action, reward, and objective function of the problem are defined; (2) To obtain a cost-effective scheduling strategy for a MG, a DRL approach that does not require an explicit model of the uncertainty is applied to the problem. The proposed DRL approach uses a deep feedforward neural network to approximate the optimal action-value function and learns to make real-time scheduling decisions in an end-to-end paradigm; (3) Case studies and numerical analysis using real power system data are conducted to verify the effectiveness of the proposed DRL approach.
The remainder of the paper is organized as follows: In Section 2, the MG system model is presented. In Section 3, the real-time energy management of a MG is formulated as an MDP. In Section 4, the proposed DRL approach is illustrated in detail. In Section 5, case studies are carried out. Finally, the conclusion is given in Section 6.
5. Case Studies
To validate the effectiveness of the proposed DRL approach, we perform simulation studies on the European benchmark low-voltage MG system [30]. The structure of the benchmark MG system is shown in Figure 2. The MG consists of a micro turbine (MT), a fuel cell (FC), a solar photovoltaic (PV) system, a wind turbine (WT), a battery ESS, and some local loads. The MT and the FC have maximum output powers of 30 kW and 40 kW, respectively. A quadratic cost function is used to model their generation cost; the corresponding cost coefficients for the MT and the FC are shown in Table 1. The capacity of the ESS is 200 kWh, and its minimum and maximum SOC are 0.15 and 1.0, respectively. The charging and discharging efficiencies are 0.98. The charging/discharging power of the ESS is uniformly discretized to 101 values in the interval [−50 kW, 50 kW]. The limit on the power exchanged at the PCC is 200 kW. The parameters of the ESS and the main grid are presented in Table 1. The maximum power production of the PV system and the WT is 20 kW and 10 kW, respectively. The time interval between two time steps is 1 h.
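To make the action space and storage dynamics above concrete, the following is a minimal sketch of the ESS action discretization and a one-step SOC update. The exact dynamics are given in Section 2 of the paper; the function names and the simple efficiency model here are illustrative assumptions, not the authors' equations.

```python
# Hypothetical sketch of the ESS action discretization and SOC update.
# The simple efficiency model below is an assumption for illustration.

ESS_CAPACITY_KWH = 200.0      # battery capacity
SOC_MIN, SOC_MAX = 0.15, 1.0  # admissible SOC range
EFFICIENCY = 0.98             # charging/discharging efficiency
P_MIN, P_MAX = -50.0, 50.0    # kW, discharging (-) / charging (+)
N_ACTIONS = 101               # uniform discretization of the power range
DT_H = 1.0                    # 1 h between two time steps

def action_to_power(action_index: int) -> float:
    """Map a discrete action index (0..100) to a charging power in kW."""
    step = (P_MAX - P_MIN) / (N_ACTIONS - 1)
    return P_MIN + action_index * step

def next_soc(soc: float, power_kw: float) -> float:
    """One-step SOC update under a simple efficiency model (assumption)."""
    if power_kw >= 0:  # charging: losses reduce the energy actually stored
        delta = EFFICIENCY * power_kw * DT_H / ESS_CAPACITY_KWH
    else:              # discharging: losses increase the energy drawn
        delta = power_kw * DT_H / (EFFICIENCY * ESS_CAPACITY_KWH)
    return min(SOC_MAX, max(SOC_MIN, soc + delta))
```

With 101 actions over [−50, 50] kW, the discretization step is 1 kW, which matches the 101 linear output neurons of the Q-network used in the experiments.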
We evaluate the proposed DRL approach in two experiments. In the first experiment, the proposed approach is tested in a deterministic scenario in which the WT generation, PV production, load demand, and LMPs over a period of one day are known, but the SOC of the ESS is initialized with different values. This shows that the proposed DRL method can learn to make effective schedules in a deterministic environment for any initial state of the ESS. In the second experiment, we apply one year of real power system data on wind generation, PV production, load demand, and LMPs from the CAISO [31] to the proposed approach. We use the first 21 days of each month as the training set and the remaining days of each month as the test set. This demonstrates that the proposed DRL method is adaptive to stochastic scenarios and able to generalize well to situations it has never seen.
In both experiments, the proposed DRL approach is implemented in TensorFlow 1.12, an open-source deep learning platform. The simulations are carried out in Python 3.6.8 on a personal computer with an Intel Core i5-6300U CPU (4 cores, 2.40 GHz) and 8 GB of RAM.
5.1. Experiment 1: Deterministic Scenario
In this experiment, the Q-network has 3 fully connected hidden layers, each with 200 ReLU neurons. The output layer is also a fully connected layer, with 101 linear neurons. Overall, the network has 600 hidden neurons. All the weights are initialized from a zero-mean Gaussian with a variance of 0.01. The capacity N of the replay memory is set to 5000, and the minibatch size is 240 samples for each gradient descent step.
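The experience replay mechanism referred to above can be sketched as follows. This is a generic illustration of the standard DQN replay buffer (capacity N, uniform minibatch sampling), not the authors' implementation; the class and method names are assumptions.

```python
# Minimal sketch of DQN experience replay: a bounded buffer of
# (state, action, reward, next_state) transitions, sampled uniformly
# to form a minibatch for each gradient descent step.
import random
from collections import deque

class ReplayMemory:
    def __init__(self, capacity: int = 5000):
        # deque with maxlen evicts the oldest transition once full
        self.buffer = deque(maxlen=capacity)

    def store(self, state, action, reward, next_state):
        self.buffer.append((state, action, reward, next_state))

    def sample(self, batch_size: int = 240):
        """Uniformly sample a minibatch of stored transitions."""
        return random.sample(self.buffer, batch_size)

    def __len__(self):
        return len(self.buffer)
```

In training, transitions are stored at every time step, and once the buffer holds at least one minibatch worth of samples, a batch of 240 is drawn for each update of the Q-network.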
We run the DQN algorithm (Algorithm 1) for 1000 training episodes in this experiment. In the first 100 episodes, actions are chosen uniformly at random to explore the state-action space as thoroughly as possible. Afterwards, the ε-greedy policy in Equation (26) is used to choose actions. From episode 101 to episode 900, the value of ε gradually decreases from 1.0 to 0.1 to balance exploration and exploitation. Then, the value of ε stays at 0.1 until the end of training.
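The exploration schedule just described can be written as a small function. The paper states only the endpoints of the decay (1.0 to 0.1 over episodes 101 to 900); the linear decay shape below is an assumption for illustration.

```python
# Sketch of the exploration schedule: fully random for the first 100
# episodes, epsilon decaying (here: linearly, an assumption) from 1.0
# to 0.1 between episodes 101 and 900, then held at 0.1.
def epsilon(episode: int) -> float:
    if episode <= 100:
        return 1.0  # pure random exploration
    if episode <= 900:
        # decay 1.0 -> 0.1 over episodes 101..900
        return 1.0 - 0.9 * (episode - 100) / 800
    return 0.1      # fixed until the end of training
```

At each step, the agent then takes a random action with probability ε and otherwise the greedy action maximizing the approximate action-value function.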
We evaluated the proposed approach periodically over the course of training by testing it without exploration, i.e., setting ε = 0 and choosing actions greedily to maximize the action-value function. We compare the performance of the proposed approach with that of the theoretical optimum strategy, which formulates the problem as a mixed-integer nonlinear program. The problem is then modeled using the YALMIP toolbox [32] and solved via the built-in solver "BMIBNB" to obtain the best generation schedules.
Figure 3 shows the performance curves of the proposed approach for different initial values of the SOC of the ESS. As shown in the figure, the proposed approach succeeds in learning to increase the rewards for different initial SOC states of the ESS. After about 400 episodes, all the performance curves reach their highest values and converge to a small region very close to the corresponding theoretical optima.
Table 2 compares the rewards obtained by the proposed DRL approach and the theoretical optimum strategy in detail. On average, the performance gap between the proposed DRL approach and the theoretical optimum is $2.16, accounting for only 2.2% of the total cost.
Figure 4a shows the hourly LMPs and net load over a period of one day. Figure 4b,c presents the scheduling results obtained by the DRL approach, with the initial SOC of the ESS at 0.5. As can be seen in Figure 4b, the ESS is charged during the low-LMP periods, from hour 2 to 6 and hour 9 to 15; correspondingly, the SOC level increases during these periods. During the high-LMP periods, from hour 6 to 8 and hour 16 to 22, the ESS is discharged to help supply the local demand or to sell electricity to the main grid. This pattern coincides with the curve of power exchanged with the main grid presented in Figure 4c. When the LMPs are relatively low, the MG purchases electricity from the main grid to supply the local demand and charge the ESS. In addition, the power outputs of the MT and the FC are reduced if the LMPs are lower than their generating costs. When the LMPs are high, however, the MG imports less electricity, and the FC and the MT are scheduled to generate power because they are less costly. These simulation results demonstrate that the proposed DRL approach can learn a cost-effective scheduling strategy under different initial conditions of the ESS.
5.2. Experiment 2: Stochastic Scenario
In this experiment, we consider the MG in a more realistic setting where the load demand, RES production, and LMPs are stochastic. The proposed DRL approach is evaluated using real power system data from the CAISO for 2016. We use the first 21 days of each month as the training set and the remaining days of 2016 as the test set; in total, there are 252 days of hourly data in the training set and 114 days of hourly data in the test set. The data used in the experiment are presented in Figure 5 and Figure 6. The Q-network consists of 3 fully connected hidden layers, each with 500 ReLU neurons. The output layer is a fully connected layer with 101 linear neurons. Overall, there are 575,000 connection weights and 1500 hidden neurons. All the weights are initialized from a zero-mean Gaussian with a variance of 0.01. The capacity N of the replay memory is set to 20,000, and the minibatch size is 240 samples for each gradient descent step. We run the DQN algorithm (Algorithm 1) for 15,000 training episodes. In the first 1000 episodes, actions are chosen uniformly at random to explore the state-action space. Then, the ε-greedy policy is used to choose actions with a decaying ε. From episode 1001 to episode 9000, the value of ε gradually decreases from 1.0 to 0.1 to balance exploration and exploitation. After that, the value of ε stays at 0.1 until the end of training.
During the training process, we calculate the total rewards at each episode to monitor the learning performance of the proposed approach. Figure 7 presents the learning curve of the proposed approach. As shown in the figure, for the first 1000 episodes, when the agent selects actions at random, the rewards vary in the range from −7.4 to −7.3. From episode 1000 to 9000, the rewards gradually increase from −7.35 to −6.3. After 9000 episodes, the cumulative rewards converge to a small region around −6.3. This result demonstrates that the proposed approach succeeds in learning an effective and stable policy in the stochastic environment.
To evaluate the performance of the proposed approach on the test set, several benchmark solutions are applied for comparison: (1) the theoretical optimum; (2) standard Q-learning (SQL); (3) fitted Q-iteration (FQI); and (4) an uncontrolled strategy. For the theoretical optimum solution, we assume that the LMPs and net load of the MG are known in advance, and the problem is modeled as a mixed-integer nonlinear program. The built-in solver "BMIBNB" in the YALMIP toolbox [32] is employed to solve the model. Please note that the theoretical optimum solution provides the minimal daily operating cost of the MG, but it can never be reached in practice due to the existence of uncertainty. For the SQL solution, a neural network with one hidden layer of 1000 ReLU neurons is used to approximate the optimal action-value function. The standard Q-learning algorithm is employed to train the neural network, and the greedy policy is used to select actions when making real-time scheduling decisions. Through trial and error, we set the maximum number of training episodes to 5000 for the best performance. For the FQI solution, a linear approximator is used to approximate the action-value function, and the batch size for training the approximator is set to 15,000 for comparison. Similarly, the generation schedules are determined by the greedy policy that selects actions maximizing the approximate action-value function. For the uncontrolled strategy, the MG supplies its local demand solely by purchasing electricity from the main grid, regardless of the LMP. The uncontrolled strategy serves as a baseline for the performance evaluation on the test set.
The daily operating costs of the MG and the corresponding cumulative daily costs on the test days obtained by the proposed and benchmark solutions are presented in Figure 8. As can be seen in Figure 8a, the proposed approach obtains lower daily operating costs than the benchmarks on most of the test days. Although there are bad cases (marked by red circles) on several test days, where the proposed approach does not obtain results as good as the other benchmarks, its overall performance is better. As shown in Figure 8b, in terms of the cumulative daily costs, the proposed DRL approach outperforms the other two RL solutions and obtains a lower total operating cost over all test days. Compared with the uncontrolled strategy (blue dotted line), the proposed DRL approach (red dotted line) reduces the operating cost by 20.75%, whereas the SQL (orange dashed line) and FQI (green dash-dotted line) solutions reduce the operating cost by only 13.12% and 13.92%, respectively. Furthermore, the performance of the proposed approach is close to the theoretical optimum: its cost reduction is only 6.45% less than that of the theoretical optimum strategy (purple star). These results demonstrate the effectiveness of the proposed DRL approach for real-time energy management of a MG under uncertainty.
To further investigate the performance of the proposed approach, the generation schedules it yields over 7 consecutive days of the test results are presented in Figure 9. Figure 9a illustrates the hourly LMPs and net load over the 7 days. The charging/discharging power schedule and the SOC of the ESS are presented in Figure 9b. As can be observed, the ESS is charged when the LMPs and net load are off peak, and discharged when they are at peak. This means that the proposed approach can effectively manage the charging and discharging of the ESS: by taking advantage of its buffering effect, low-cost electricity is stored at off-peak hours and then discharged at peak hours to supply the local demand or to be sold to the main grid. Moreover, the SOC of the ESS at the end of each day is equal or close to its minimum admissible value, i.e., 0.15. This means that the proposed approach fully uses the energy of the ESS by the end of each day's scheduling to minimize the daily operating cost. For the utility grid, as shown in Figure 9c, the MG purchases less electricity from the main grid at peak hours to save cost, or sells extra electricity to it to earn revenue. This is because surplus power is purchased at low-LMP hours and stored in the ESS. A similar pattern can be observed in the schedules of the MT and FC in this figure: the MT or the FC generates electricity when the LMPs are higher than the corresponding generation cost and reduces its generation when the LMPs are lower. The simulation results demonstrate that the proposed approach can adaptively adjust its actions to the trends of the LMP and net load, and make cost-effective schedules for operating the MG in uncertain environments.
5.3. Analysis: Effect of Hyper-Parameters
To demonstrate how the hyper-parameters affect the performance of the proposed DRL approach, we train the DRL approach using different hyper-parameters and then test it in the stochastic scenario. Specifically, we apply three sets of different hyper-parameters to the DRL approach. (1) In the first set, the number of neurons in each hidden layer of the Q-network is reduced from 500 to 100, and the other hyper-parameters remain unchanged. We refer to the DRL approach using this set of hyper-parameters as DQN-100n-240b, where "100n" denotes 100 neurons in each hidden layer and "240b" denotes a batch size of 240 for training the Q-network; (2) In the second set, the number of neurons in each hidden layer of the Q-network is still 500, but the batch size for training the Q-network is reduced from 240 to 32, and the other hyper-parameters remain unchanged. We refer to the DRL approach using this set of hyper-parameters as DQN-500n-32b; (3) In the third set, the number of neurons in each hidden layer of the Q-network is changed from 500 to 100, and the batch size for training the Q-network is also changed from 240 to 32. We refer to the DRL approach using this set of hyper-parameters as DQN-100n-32b.
Figure 10 shows the learning curves of the proposed DRL approach and the DRL approaches with the three different sets of hyper-parameters mentioned above. Figure 11 shows the daily operating costs and the corresponding cumulative daily costs on the test set. In both Figure 10 and Figure 11, the proposed DRL approach is referred to as DQN-500n-240b and marked by "∗". As shown in these figures, with fewer hidden neurons in the Q-network and/or a smaller batch size for training, the performance of the DRL approach degrades in both the training process and the test results. Comparatively, reducing the batch size from 240 to 32 has a worse influence on performance than reducing the number of hidden neurons of the Q-network from 500 to 100. This could be because, when the batch size is reduced, fewer samples are used at each iteration for training the Q-network; we may then fail to take full advantage of the samples in the replay buffer, resulting in an under-trained Q-network.