1. Introduction
The microgrid (MG) is a subset of a power system with self-control capabilities, usually composed of distributed generators, loads, energy storage facilities, etc. [1]. Different from the distribution network, the MG can operate in islanded mode or grid-connected mode, and from the perspective of the upper-level system, the MG is an independent entity in the power system [2]. To deal with the problem of environmental pollution, renewable energy has developed rapidly in recent years. In an MG with a high proportion of renewable energy, distributed renewable power generation devices, e.g., rooftop photovoltaics (PV), are often installed on the demand side, which turns consumers into prosumers with dual attributes of demand and supply. The uncertainty of renewable energy output brings certain challenges to the efficient operation of the MG.
Energy storage (ES) is considered to be an effective means to deal with the fluctuation of renewable energy power generation, and its installed capacity is increasing rapidly around the world [3]. Since ES can be owned by microgrid operators (MGOs) or prosumers, mechanisms for ES capacity sharing have been proposed to improve the utilization efficiency of ES; they are mainly divided into two modes, i.e., centralized ES sharing and distributed ES sharing.
In the centralized mode, ES is invested in and operated by the MGO or independent ES operators, and prosumers purchase the required ES capacity. Stackelberg game theory is used to analyze the relationship among participants [4], and market frameworks are designed to maximize the revenue of the system [5]. An offline optimization approach for a single MG equipped with ES is proposed in [6], which minimizes the cost of the conventional energy drawn from the main grid. A two-layer energy management system for MGs is proposed in [7], in which ES is used to minimize the total operational cost and deal with the uncertainty of renewable energy. Other similar studies can be seen in [8,9,10,11].
In the distributed mode, ES is owned by each prosumer, and the capacity can be shared by prosumers through incentives or transactions [12,13,14]. Affected by changes in power supply and demand, the shared ES capacity required by the MGO varies across time slots. Excessive shared ES capacity wastes capacity resources, while insufficient shared ES capacity limits the MGO's ability to adjust. However, due to the influence of uncertainty, both the shared ES capacity required by the MGO and the ES capacity that prosumers can share fluctuate, and how to obtain an appropriate shared ES capacity in each time slot has not been well resolved in existing research.
Moreover, demand response (DR) is also recognized as an effective means to use the adjustable resources on the demand side to improve the flexibility of the MG, and it mainly includes two types: price-based DR (e.g., [15,16,17,18]) and incentive-based DR (e.g., [19,20,21,22,23]).
In the area of price-based DR, time-of-use (TOU) pricing is widely applied due to its stability, and social costs can be reduced by utilizing the temporal complementarity of end-users [15,16]. Real-time pricing, which offers higher flexibility, has also attracted many researchers, and desirable usage behaviors are elicited through appropriate mechanisms and online optimization approaches [17,18].
Incentive-based DR can provide flexible schedulable resources for the system operator, which is conducive to the collaborative optimization of DR and other flexible resources [19]. Incentive-based DR is usually implemented during peak load periods, and consumers are directly subsidized according to their response [20,21,22,23]. Since the load reduced in DR includes transferable loads such as electric vehicles and delayable loads such as temperature-controlled loads, DR causes load rebound in subsequent time periods, which affects the cumulative revenues of the whole day. Some studies have paid attention to this problem and considered the load rebound phenomenon in the optimization, e.g., [24,25,26,27], but the uncertainty of load rebound caused by prosumer behavior has been ignored in these studies.
Although numerous mechanisms and optimization strategies for DR and ES sharing have been proposed in existing studies, joint optimization considering their mutual influence and multiple uncertainties needs to be further investigated. On the one hand, under the power balance constraint, a change in the shared ES power will not only affect its own marginal cost but also change the reduced power and marginal cost in DR, and vice versa. On the other hand, the superposition of the uncertainties in intraday MG optimization and in day-ahead ES sharing poses a great challenge to maintaining power balance and maximizing MGO's revenues.
To solve optimization problems in complex environments, deep reinforcement learning (DRL), with its strong learning ability, can be applied and has been proven efficient in several studies [28]. Considering that the behavior of prosumers in DR and ES capacity sharing is difficult to model accurately, a model-free DRL algorithm should be applied; and in order to improve the utilization efficiency of historical datasets, the off-policy DRL algorithm deep deterministic policy gradient (DDPG) is selected in this paper. Based on the replay buffer, the DDPG algorithm can reuse historical data, and it has achieved good results in the optimization and control of MGs, such as battery charging control, motor control, voltage control, etc. [29,30,31].
In the scenario of this paper, MGO obtains the appropriate shared ES capacity through incentives, and the incentives need to be formulated on the previous day. However, the required ES capacity is affected by intraday optimization, which cannot be known in advance, i.e., MGO has to formulate incentives with incomplete information. Meanwhile, both the required ES capacity and the ES capacity that can be shared by prosumers are affected by uncertainty, so the DDPG algorithm cannot be directly applied to solve the optimization problem in this paper.
In order to improve the utilization efficiency of distributed ES, a two-stage ES sharing mechanism based on incentives is proposed, in which MGOs can obtain the required ES capacity to reduce operating costs, while prosumers can earn revenue from sharing idle ES capacity. Then, a two-layer semi-coupled optimization strategy based on DDPG is proposed to solve the decision-making problem with incomplete information, and Monte Carlo sampling is applied to deal with the influence of uncertainty. The main contributions are summarized as follows.
(1) A two-stage optimization framework is proposed to realize the cooperation of DR and ES sharing. Compared with existing studies that focus only on DR, such as [20,21,22,23], or only on ES sharing, such as [12,13,14], the joint optimization more fully releases the adjustable potential of demand-side resources, so as to improve the revenues of MGO and the local consumption of renewable energy.
(2) Since the required ES capacity in the day-ahead ES sharing is determined by real-time optimization and cannot be known in advance, a two-layer semi-coupled optimization strategy based on DDPG is proposed to realize asynchronous optimization of coupled decision-making problems that are distributed in different time slots.
(3) Multiple uncertainties caused by prediction errors, prosumer behavior, etc., are considered as fully as possible. Unlike existing studies, such as [23,24,25,26,27], which ignore the uncertainty of load rebound in DR, Bayesian transition probability is introduced to describe the uncertainty caused by prosumer behavior.
(4) To deal with the impact of multiple uncertainties on ES capacity sharing, Monte Carlo sampling is applied in the network training of the proposed algorithm. Compared with existing research that ignores the impact of uncertainty on ES sharing, such as [32,33,34,35], the proposed optimization strategy ensures that sufficient shared ES capacity for real-time optimization is always obtained at the lowest cost in any scenario.
The rest of this paper is organized as follows: Section 2 illustrates the framework of the system and introduces the sharing mechanism of distributed ES. The modeling of MGO and prosumers is presented in Section 3. Section 4 proposes the two-layer semi-coupled optimization strategy based on DDPG, and numerical simulation is given in Section 5. Finally, conclusions are drawn in Section 6.
4. Problem Formulation and Solution
In this section, we use the Markov decision process (MDP) to describe the decision-making actions of MGO, then introduce the proposed two-layer semi-coupled optimization strategy, as well as the method of applying Monte Carlo sampling in the reverse training of the neural networks to deal with the influence of uncertainty.
4.1. Construction of Decision Problems Based on MDP
A standard Markov decision process is described by a 5-tuple (S, A, T, R, γ), where S is the state space, A is the action space, T is the transition probability of the state after an action is executed, R is the reward for the action, and γ is the discount factor.
The actions performed by MGO include the incentive price for ES sharing, the charging/discharging power of the shared ES, and the incentive price for DR. Since the sharing of ES capacity is completed on the previous day, while DR and the power control of the shared ES are completed intraday, the action space can be divided into two subspaces: the day-ahead action and the intraday actions.
The purpose of the day-ahead action is to obtain sufficient capacity according to the ES capacity required in each time slot, so the state is set to be the target ES capacity. With the goal of maximizing revenues, the target ES capacity is determined by many factors, e.g., RTP, PV output, etc., and is affected by multiple uncertainties, so it cannot be obtained directly. Through the reverse training of the two-layer semi-coupled network proposed in this paper and Monte Carlo sampling, its value can be obtained, which is discussed in detail in the next part. In order to satisfy the power balance constraint, the interactive powers purchased from and sold to the main grid are set to be passive variables, and their values are calculated according to the power balance constraint.
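As a minimal illustration of this power-balance bookkeeping (with hypothetical variable names, not the paper's notation), the passive grid-interaction variables can be computed once the controllable actions are fixed:

```python
# Minimal sketch, assuming a per-slot balance: load + ES charging =
# PV + ES discharging + grid purchase - grid sale. Names are illustrative.
def grid_interaction(load_kw, pv_kw, es_charge_kw, es_discharge_kw):
    """Return (purchase, sale) so that the power balance holds in one slot."""
    net = load_kw + es_charge_kw - pv_kw - es_discharge_kw
    p_buy = max(net, 0.0)    # power drawn from the main grid
    p_sell = max(-net, 0.0)  # surplus power sold to the main grid
    return p_buy, p_sell
```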
In the intraday optimization, MGO needs to comprehensively consider the RTP, load demand, PV output power, shared ES capacity, etc., to maximize the cumulative revenues throughout the day, and the state space for the intraday actions is composed of these quantities.
Although the goal of MGO's optimization is to maximize the revenues throughout the day, the sub-goals of the day-ahead and intraday actions are different. On the previous day, MGO's goal is to minimize the cost for ES sharing while ensuring that the shared ES capacity is not less than the target ES capacity. Therefore, the day-ahead reward is composed of the cost for ES sharing and a penalty on the shortfall between the target ES capacity and the shared ES capacity traded on the previous day.
The reward consists of two parts. The first is the cost for ES sharing, and the second is the penalty for the shortage of shared ES capacity. When the shared ES capacity is insufficient, the operating efficiency of the MG is reduced, and the power balance constraint may even be violated. Therefore, the penalty ensures that there is always sufficient shared ES capacity for real-time adjustment. Meanwhile, in order to ensure that the value of the penalty does not grow too fast, the difference between the target and the traded ES capacity is squared, and a correction coefficient is added to make its value match that of the first term.
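A minimal sketch of this reward structure is given below; the variable names and the one-sided shortage term are assumptions for illustration, not the exact form of the paper's equation.

```python
# Hedged sketch of the day-ahead reward: sharing cost plus a squared penalty
# on the capacity shortage, scaled by a correction coefficient k_pen.
def day_ahead_reward(sharing_cost, target_capacity, traded_capacity, k_pen):
    shortage = max(target_capacity - traded_capacity, 0.0)  # penalize shortfall only
    return -(sharing_cost + k_pen * shortage ** 2)
```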
In the real-time adjustment, the goal of the actions is to maximize the revenue under the premise of satisfying the constraints, so the reward of the intraday actions is set as the revenue penalized by two terms for violations of constraints (7) and (12), respectively. Then, the cumulative rewards are obtained by summing the rewards of all time slots, with the rewards of the two layers weighted by their respective discount factors.
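For reference, the discounted cumulative reward of one layer can be computed as in the following sketch; the per-slot rewards and the discount factor are placeholders.

```python
# Minimal sketch: discounted sum of per-slot rewards for one action layer.
def cumulative_reward(rewards, gamma):
    """rewards: sequence r_1..r_T; gamma: discount factor of this layer."""
    return sum((gamma ** t) * r for t, r in enumerate(rewards))
```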
4.2. Two-Layer Semi-Coupled Optimization Strategy Based on DDPG
MGO needs to perform actions on the previous day and within the day, respectively. However, the target ES capacity is unknown when the day-ahead action is taken, so the reward caused by the action cannot be evaluated directly. Inspired by the learning and memory capabilities of neural networks, a semi-coupled two-layer network based on DDPG is proposed to solve this problem, in which actor–critic networks are established for the day-ahead action and the intraday actions, respectively, and Monte Carlo sampling is introduced in the training process to deal with the influence of multiple uncertainties, as shown in Figure 3.
Since DDPG is an off-policy learning algorithm, although the actions are executed in time sequence in practice, the intraday networks for DR and ES control can be trained first. The shared ES capacity is unknown in advance, so we first make the following assumption, which is guaranteed to hold by the day-ahead action.
Assumption 1. The required ES capacity in each time slot can always be satisfied.
Under Assumption 1, the intraday networks can be trained first, and the required ES capacity obtained from them is then used to train the networks for ES capacity sharing. Since the actor–critic networks for the day-ahead action and the intraday actions are both based on DDPG, they have similar training processes, which are described in the following analysis. For the sake of simplicity, some subscripts are omitted.
To measure the performance of an action, the value function is defined based on the Bellman equation, where the expectation is taken under the policy of the action.
Let the parameters of the critic network and the actor network, and the parameters of the target critic network and the target actor network, be defined for each layer. To train the critic network, a loss function is defined as the mean squared error between the critic's output and a target value, where H is the number of samples drawn from the replay buffer and the target value is calculated using the target networks.
Then, the actor network is trained using the deterministic policy gradient, which is formed from the gradient of the critic with respect to the action and the gradient of the actor with respect to its parameters.
Moreover, since DDPG learns a deterministic policy, random noise is added to the actions when exploring the environment.
The parameters of the target networks are updated by soft copying, in which an update rate coefficient controls how closely the parameter vectors of the target critic and target actor networks track those of the critic and actor networks.
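The following PyTorch sketch summarizes one DDPG update step as described above (critic loss against the target networks, deterministic policy gradient, and soft target update). The function and tensor names are illustrative assumptions, not the paper's implementation.

```python
import torch
import torch.nn.functional as F

def ddpg_update(actor, critic, actor_tgt, critic_tgt,
                actor_opt, critic_opt, batch, gamma=0.99, tau=0.005):
    s, a, r, s_next = batch  # minibatch of H transitions from the replay buffer

    # Critic update: minimize the mean squared error against the target value
    with torch.no_grad():
        y = r + gamma * critic_tgt(s_next, actor_tgt(s_next))
    critic_loss = F.mse_loss(critic(s, a), y)
    critic_opt.zero_grad()
    critic_loss.backward()
    critic_opt.step()

    # Actor update: deterministic policy gradient (ascend Q w.r.t. the action)
    actor_loss = -critic(s, actor(s)).mean()
    actor_opt.zero_grad()
    actor_loss.backward()
    actor_opt.step()

    # Soft update of the target networks with rate coefficient tau
    with torch.no_grad():
        for p, p_t in zip(critic.parameters(), critic_tgt.parameters()):
            p_t.mul_(1.0 - tau).add_(tau * p)
        for p, p_t in zip(actor.parameters(), actor_tgt.parameters()):
            p_t.mul_(1.0 - tau).add_(tau * p)
```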
The goal of the day-ahead optimization is to provide sufficient ES capacity for the intraday optimization at the minimum cost. The optimization problem can be divided into two sub-problems, i.e., the calculation of the aggregated target ES capacity of each time slot, and the formulation of the incentive price for ES sharing.
The uncertainties of PV output, prosumers' load, and prosumers' response all affect the required ES capacity, so Monte Carlo sampling is applied to estimate it. In each training step, the required ES capacity is sampled a number of times, and the maximum value over the samples is used as the target ES capacity, where each sample is the ES capacity required under one realization of the uncertainties.
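The worst-case selection can be sketched as follows; sample_required_capacity is a hypothetical function that draws one realization of the uncertainties and returns the resulting required ES capacity.

```python
# Minimal sketch: target ES capacity as the maximum over N Monte Carlo samples.
def target_es_capacity(sample_required_capacity, n_samples=5000):
    return max(sample_required_capacity() for _ in range(n_samples))
```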
The ES capacity shared by prosumers under a given incentive is also uncertain. In order to make the shared ES capacity always meet the needs of intraday adjustment, Monte Carlo sampling is also applied when determining the incentive price for ES sharing. In the network training, for a given action, i.e., the ES incentive price, the shared ES capacity of each prosumer is sampled a number of times, and the minimum value over the samples is used to calculate the reward, so that Assumption 1 holds in the worst case. The reward for training is then calculated using (19), and the network is trained accordingly.
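A corresponding sketch for the shared capacity is given below; the per-prosumer minimum summed over prosumers is one plausible reading of the description above, and sample_shared_capacity is a hypothetical sampling function.

```python
# Minimal sketch: worst-case shared ES capacity under a candidate incentive price.
def worst_case_shared_capacity(price, prosumers, sample_shared_capacity, n_samples=5000):
    total = 0.0
    for i in prosumers:
        total += min(sample_shared_capacity(i, price) for _ in range(n_samples))
    return total
```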
The detailed reverse training process of the two-layer semi-coupled network is presented in Algorithm 1.
Algorithm 1. The detailed reverse training process of the two-layer semi-coupled network.

Randomly initialize the critic and actor networks for intraday DR and ES control with their weights, and the critic and actor networks for day-ahead ES capacity sharing with their weights
Initialize the target critic and actor networks of both layers with the same weights
Initialize the replay buffers
for episode = 1, M do (training of the intraday networks for DR and ES control)
  Receive the initial observation state
  for t = 1, T do
    Select the action according to (28) and (29)
    Execute the action and observe the reward and the next state
    Store the transition in the replay buffer
    Sample a random minibatch of H transitions from the replay buffer
    Update the critic by minimizing (25)
    Update the actor policy using (27)
    Update the target networks according to (30) and (31)
  end for
end for
for episode = 1, M do (training of the day-ahead networks for ES capacity sharing)
  Receive the initial observation state according to (32)
  for t = 1, T do
    Select the action according to (28) and (29)
    Execute the action and observe the reward according to (33)
    Store the transition in the replay buffer
    Sample a random minibatch of H transitions from the replay buffer
    Update the critic by minimizing (25)
    Update the actor policy using (27)
    Update the target networks according to (30) and (31)
  end for
end for
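The two-stage structure of Algorithm 1 can be outlined in Python as follows; the agent and environment interfaces are hypothetical and only indicate the order in which the two layers are trained.

```python
def train_two_layer(intraday_agent, dayahead_agent, env, episodes, horizon):
    # Stage 1: intraday networks for DR and ES control, trained under
    # Assumption 1 (the required shared ES capacity is always available).
    for _ in range(episodes):
        s = env.reset_intraday()
        for _ in range(horizon):
            a = intraday_agent.act(s, explore=True)            # (28), (29)
            s_next, r = env.step_intraday(a)
            intraday_agent.store_and_update(s, a, r, s_next)   # (25), (27), (30), (31)
            s = s_next
    # Stage 2: day-ahead networks for ES capacity sharing; the target capacity
    # comes from Monte Carlo sampling of the trained intraday policy (32),
    # and the reward uses the worst-case shared capacity (33).
    for _ in range(episodes):
        s = env.reset_dayahead(intraday_agent)
        for _ in range(horizon):
            a = dayahead_agent.act(s, explore=True)
            s_next, r = env.step_dayahead(a)
            dayahead_agent.store_and_update(s, a, r, s_next)
            s = s_next
```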
5. Simulation Experiment
5.1. Settings of Simulation Environment
In order to verify the performance of the proposed algorithm, 180 days of data are used to perform the simulation, with 140 days as the training set, 20 days as the validation set, and 20 days as the test set. Although the dataset contains data from different seasons, seasonal differences are not treated specially, because the prosumers' load data and PV output data contain seasonal characteristics that can be learned by the algorithm. The RTP is taken from the Pennsylvania–New Jersey–Maryland (PJM) electricity market. The power demand and PV output power are based on real data from the PJM electricity market [42], scaled down proportionally. In this paper, the MG participates in the electricity market as an independent entity. In order to encourage local consumption of renewable energy, the price of surplus PV sold to the main grid is lower than the RTP [43], and the price coefficient is set to 0.5. The TOU tariff in Table 1 is set according to existing research results in [44].
The prosumers' response functions for the ES capacity sharing incentive and the DR incentive are both assumed to be quadratic functions of the incentive price, with the values of all parameters shown in Table 2. The uncertainties in RTP, PV output, prosumers' load demand, prosumers' response to the DR incentive, and prosumers' response to the ES capacity sharing incentive are all assumed to follow a normal distribution with mean 0 and standard deviation 0.03.
In Table 2, the expectation and standard deviation of each uncertain variable are listed, and all uncertain values are calculated by perturbing the original value (without uncertainty) with a random term drawn from a normal distribution with the corresponding expectation and standard deviation.
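As an illustration of these simulation assumptions, the following sketch combines a quadratic response function with a multiplicative normal perturbation; the coefficient names and the multiplicative form are assumptions, since the exact formula is given by the paper's equation.

```python
import numpy as np

rng = np.random.default_rng()

def quadratic_response(price, a, b, c):
    # Hypothetical quadratic prosumer response to an incentive price.
    return a * price ** 2 + b * price + c

def with_uncertainty(value, mu=0.0, sigma=0.03):
    # Perturb a nominal value with relative normal noise (mu, sigma from Table 2).
    return value * (1.0 + rng.normal(mu, sigma))
```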
The load rebound is assumed to be affected by the DR of the past six time slots. Due to the influence of uncertainty, the load rebound coefficient follows a normal distribution with a standard deviation of 0.01, and its expected value for each time slot is shown in Table 3.
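A sketch of this rebound model is shown below; the coefficient values are placeholders for the Table 3 entries, and the additive noise on each coefficient is an interpretation of the description above.

```python
import numpy as np

rng = np.random.default_rng()
REBOUND_COEFFS = [0.0] * 6  # placeholders for the expected values in Table 3

def rebound_profile(reduced_load_kw, coeffs=REBOUND_COEFFS, sigma=0.01):
    """Extra load appearing in the six time slots after a DR reduction."""
    return [reduced_load_kw * (c + rng.normal(0.0, sigma)) for c in coeffs]
```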
In order to verify the advantages of the proposed method, four comparative cases are set up in the simulation experiment.
Case 1: The sharing and adjustment of distributed ES capacity are used to improve the revenues of MGO, without considering DR; see, e.g., [12,13,14].
Case 2: DR is used to improve the revenues of MGO, but the idle capacity of prosumers' ES is not utilized; see, e.g., [20,21,22,23].
Case 3: The shared ES capacity of prosumers is fixed, i.e., the ES capacity aggregated by MGO in each time slot is the same; see, e.g., [10,11].
Case 4: The impact of multiple uncertainties is ignored in the sharing of ES capacity; see, e.g., [30,31,32,33,34].
All networks in the DRL adopt a fully connected architecture with three hidden layers of 256 neurons each. The learning rates of the actor network and the critic network of both sub-agents are set to 0.0001 and 0.001, respectively. The algorithm is implemented using PyTorch 1.8.1 in Python 3.7.7. The case studies were performed on a laptop with an Intel(R) Core(TM) i7-9750H processor and a single NVIDIA GeForce GTX 1660 Ti GPU.
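For reference, a PyTorch sketch of this architecture is given below; the input/output dimensions and the output activation are illustrative assumptions.

```python
import torch
import torch.nn as nn

class Actor(nn.Module):
    # Fully connected actor: three hidden layers of 256 neurons.
    def __init__(self, state_dim, action_dim):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(state_dim, 256), nn.ReLU(),
            nn.Linear(256, 256), nn.ReLU(),
            nn.Linear(256, 256), nn.ReLU(),
            nn.Linear(256, action_dim), nn.Tanh(),  # actions scaled to [-1, 1]
        )

    def forward(self, state):
        return self.net(state)

class Critic(nn.Module):
    # Fully connected critic: takes the state-action pair, outputs a scalar Q-value.
    def __init__(self, state_dim, action_dim):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(state_dim + action_dim, 256), nn.ReLU(),
            nn.Linear(256, 256), nn.ReLU(),
            nn.Linear(256, 256), nn.ReLU(),
            nn.Linear(256, 1),
        )

    def forward(self, state, action):
        return self.net(torch.cat([state, action], dim=-1))

# Reported learning rates: 1e-4 for the actor, 1e-3 for the critic.
# actor_opt = torch.optim.Adam(actor.parameters(), lr=1e-4)
# critic_opt = torch.optim.Adam(critic.parameters(), lr=1e-3)
```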
5.2. Performance Analysis of Intra-Day Joint Optimization
One day is selected from the 20 days of the test set for display. The RTP, TOU, load demand and PV output of that day are shown in Figure 4. Since the PV power, prosumers' load, and RTP are all predicted values when the actions are performed, there is an error between them and the actual values. The solid line in Figure 4 is the actual value of each parameter, and the shaded area is the fluctuation range of each parameter in 5000 Monte Carlo samples. It should be pointed out that the load fluctuations on the selected day are not large, but the dataset contains many days with large load fluctuations.
The power demand of the MG is analyzed first, and the results are shown in Figure 5. The blue line is the PV power in the MG, the purple line is the original load demand in the MG, and the green, red and brown lines are the adjusted load demand in this paper, Case 1 and Case 2, respectively, including the prosumers' load demand and the charging/discharging power of the shared ES. The original load demand fluctuates slightly throughout the day, but the PV output power varies greatly due to the influence of light intensity. Therefore, the PV output power in the MG from 10 to 17 o'clock is higher than the load demand, while there are power shortages in the MG in the other time slots.
In the optimization of this paper and case 1, the sharing of ES capacity is considered, and MGO can use the shared ES capacity to store excess PV output to increase the local consumption of PV. The total output of PV throughout the day is 4307.4 kWh. Before optimization, 73.6% can be consumed locally. After optimization using shared ES in this paper and case 1, the proportion of local consumption of PV output is increased to 86.5%. In case 2 where ES sharing is not considered, the local consumption rate of PV output is the same as that without optimization, indicating that the increase in the local consumption rate of PV output is mainly contributed by ES sharing.
Based on Figure 4 and Figure 5, it can be seen that DR is mainly implemented in the time slots with higher RTP to reduce the deficit of MGO in those time slots, so as to improve the cumulative revenues of MGO throughout the day. Therefore, the revenues in each time slot are analyzed as shown in Figure 6.
Since the TOU price is higher than the RTP in most time slots, MGO can obtain revenues from the power supply, while in the other time slots with higher RTP, the revenue of MGO is negative, i.e., it has to bear the deficit to satisfy the energy demand of prosumers. Both the adjustment of ES in Case 1 and DR in Case 2 are effective means to reduce the deficit and increase the total revenues of MGO. The total revenues of MGO throughout the day are USD 29.2 without optimization, while the total revenues increase by 74.0% to USD 50.8 in Case 1 and by 69.9% to USD 49.6 in Case 2. In the algorithm proposed in this paper, both DR and shared ES are considered, and the revenues of MGO increase by 113.4% to USD 62.3. The results verify that the revenues of MGO can be further improved through the cooperation of DR and ES sharing compared to either method alone. Since the two methods affect the revenues of MGO in different time slots of the day, the revenue improvement of each method compared with that without optimization is analyzed as follows:
The difference in the revenue of MGO mainly appears in time slot 20 due to the extremely high RTP. Although the RTP in time slot 11 is also very high, the PV output can satisfy the local load demand, so the change in RTP does not have a great impact on the revenues of MGO in that slot. Affected by the load rebound effect, the load demand in the subsequent time slots increases, thereby changing the revenues of MGO. After DR in time slot 20, the load demand increases in time slot 21, and since the RTP is still higher than the TOU price in time slot 21, the revenues of MGO are reduced in Case 2. In comparison, ES sharing mainly affects the revenues of MGO in earlier time slots. In the algorithm of this paper and in Case 1, the revenues of MGO are reduced in the time slots with high PV output, because MGO needs to pay for the shared ES capacity to store the excess PV power. The stored power is then used to satisfy the load demand of prosumers in the time slots with high RTP, thereby improving the total revenues throughout the day.
In order to verify the stability of the optimization effect of the algorithm, the results over 10 consecutive days of the test set are evaluated, and the total revenue of these 10 days is shown in Table 4. The total revenue of MGO in these 10 days reaches USD 59.37, which is 29.30% and 9.18% higher than that of Case 1 and Case 2, respectively, indicating that the proposed algorithm performs stably in continuous operation.
5.3. Performance Analysis of Day-Ahead ES Capacity Sharing
MGO needs to pay for the shared ES capacity, so the change in the required ES capacity and the corresponding power in each time slot are shown in Figure 7.
The unmarked solid line in Figure 7 is the value of each variable without the influence of uncertainty, the shaded area is the fluctuation range of each variable in 5000 Monte Carlo samples, and the red solid line with the diamond marker is the upper limit of the required ES capacity, i.e., the target ES capacity in the day-ahead action.
As shown in Figure 8, in the time slots with high PV output, the excess PV output is stored in the shared ES, and the stored electricity is then used to satisfy the load demand of prosumers in the time slots with high RTP. As can be seen in Figure 5, the excess PV output is not completely stored, because the unit cost for ES sharing is increasing, and excessive shared ES capacity leads to a decline in revenues. In addition to controlling the shared ES to absorb excess PV power, the algorithm in this paper can also track changes in the RTP, storing electricity when the RTP is low, e.g., in time slots 8 and 9, and supplying power to prosumers in subsequent time slots, so as to further enhance the revenues of MGO.
The shared ES capacity in each time slot is also affected by uncertainty, and the results of ES sharing are shown in Figure 9. The solid line in Figure 9 represents the expected value of the shared ES capacity obtained by MGO, and the shaded part is the distribution interval of the ES capacity obtained by MGO in 5000 Monte Carlo samples. In Case 3, MGO predicts the maximum ES capacity required for the next day and then aggregates shared ES based on this value. MGO does not need to make a decision for each time slot, which reduces the difficulty of ES sharing, but the shared ES capacity is not fully utilized, thereby reducing the revenues of MGO. In Case 4, the uncertainty in ES sharing is not fully considered; although its cost for shared ES capacity is the lowest, there may be insufficient ES capacity for intraday optimization. The Monte Carlo sampling process added in the proposed algorithm ensures that the shared ES capacity can always meet the needs of intraday optimization while minimizing the cost for shared ES capacity.
Prosumers can obtain revenues by sharing ES capacity, and the revenues in each time slot throughout the day are shown in Figure 10. The revenues obtained through ES capacity sharing are considerable and are positively related to the shared ES capacity.
5.4. Performance Analysis of the Proposed Algorithm
The performance of Monte Carlo sampling determines whether the ES capacity required for intraday optimization can be met in the worst case. There are two independent Monte Carlo sampling processes in the proposed algorithm. Because more uncertain factors are involved when determining the required ES capacity, this sampling process is selected to analyze the impact of different numbers of samples on the determination of the required ES capacity, as shown in Figure 11.
In order to deal with the impact of multiple uncertainties and make the shared ES capacity satisfy the requirements of intraday optimization in the worst case, Monte Carlo sampling is applied to find the required ES capacity in the worst case. Too many Monte Carlo samples consume substantial computing resources, while too few may fail to accurately reflect the impact of uncertainty. Therefore, 10, 100, 1000 and 5000 Monte Carlo samples are used to verify the impact of the number of samples on the required ES capacity. Each type of sampling is run 10 times, and the results are shown in Figure 11.
It can be seen from Figure 11a that, due to the small number of samples, the boundary of the required ES capacity obtained by each group of Monte Carlo sampling differs considerably. Moreover, compared with the groups with more samples, the upper boundary of the required ES capacity determined by the group with 10 samples is lower, which does not reflect the worst case. As the number of samples increases, the upper boundary of the required ES capacity determined by each group of samples gradually stabilizes. In Figure 11c,d, the upper boundaries determined by the groups are relatively close, and the error is less than 35 kWh while the maximum required ES capacity is around 1000 kWh, indicating that the upper boundary of the ES capacity required in the worst case can be found stably with a sufficient number of samples.
The convergence and stability of the algorithm are important factors for evaluating its performance. The loss convergence and reward changes of the networks for DR and ES control are shown in Figure 12.
After about ten minutes of training, the network loss and reward of the algorithm tend to be stable, indicating that the algorithm has high training efficiency. It can be seen that the loss changes of the actor network and the critic network are relatively stable, and the reward also converges smoothly after fluctuating at the beginning of training. Several training runs were performed, and most of them show similar convergence characteristics, indicating that the algorithm has high stability. Increasing the learning rate can improve the convergence speed, but an excessive learning rate may lead to non-convergence. Therefore, in practical applications, the learning rate should be kept within an appropriate range. In addition, the depth and width of the network also affect the performance of the algorithm. An overly large network is difficult to train, while an overly small network cannot meet the requirements of the optimization. In practice, it is necessary to build an appropriate network according to the complexity of the optimization task.