Article

Operation of Distributed Battery Considering Demand Response Using Deep Reinforcement Learning in Grid Edge Control

Sichuan Energy Internet Research Institute, Tsinghua University, Chengdu 610042, China
* Author to whom correspondence should be addressed.
Energies 2021, 14(22), 7749; https://doi.org/10.3390/en14227749
Submission received: 13 October 2021 / Revised: 7 November 2021 / Accepted: 16 November 2021 / Published: 18 November 2021

Abstract

Battery energy storage systems (BESSs) are able to facilitate economical operation of the grid through demand response (DR), and are regarded as the most significant DR resource. Among them, distributed BESSs integrating home photovoltaics (PV) have developed rapidly, and account for nearly 40% of newly installed capacity. However, the use scenarios and use efficiency of distributed BESSs are far from sufficient to exploit the potential loads and overcome the uncertainties caused by disorderly operation. In this paper, the low-voltage transformer-powered area (LVTPA) is first defined, and then a DR grid edge controller is implemented based on deep reinforcement learning to maximize the total DR benefits and promote three-phase balance in the LVTPA. The proposed DR problem is formulated as a Markov decision process (MDP). In addition, the deep deterministic policy gradient (DDPG) algorithm is applied to train the controller in order to learn the optimal DR strategy. Additionally, a life cycle cost model of the BESS is established and implemented in the DR scheme to measure the income. The numerical results, compared to deep Q learning and model-based methods, demonstrate the effectiveness and validity of the proposed method.

1. Introduction

In the past ten years, incentive-based DR has developed rapidly in the form of load shedding and load transfer, which can greatly improve the flexibility of the grid. According to the National Energy Administration (NEA), BESS capacity will exceed 100 GW·h by 2030 [1]. On the one hand, residential and commercial BESSs are among the best DR resources, and are increasing sharply with the prosperity of distributed PV, accounting for more than 50% of the newly installed capacity of PV [2]. On the other hand, the daily operation of BESS is designed only for storing photovoltaic power and saving electricity costs, without consideration of idle uses such as DR programs. Thus, it is necessary to investigate an optimal operation method for BESS that takes DR into consideration.
However, demand response mechanisms at home and abroad are not yet mature, and demand response resources are still scarce compared to peak loads. Grid edge control technology provides the potential for exciting transformations in the power industry, creating more choices, higher efficiency, and more comprehensive and effective decarbonization for customers, as well as better economic benefits for stakeholders in the value chain [3]. Grid edge control is at the critical point of the adoption curve: residents, industry, and regulators are preparing to connect to digitally distributed resources. Therefore, grid edge control technology could be a feasible way to expand resources in a manner that corresponds to demand.
By reducing peak load demand and transferring load demand to low-price and off-peak times, price-based DR is conducive to reducing customer bills [4]. Recent studies have mainly focused on arranging home power consumption using price-based DR. In [5], Nilsson et al. controlled smart home appliances using a home energy management system (HEMS) and tested the energy-saving potential of smart homes in Sweden. The results showed that the impact of different households on energy consumption varies greatly, indicating that households have a high degree of independence in responding to demand. In incentive-based demand response, each customer receives compensation for adjusting power demand according to the unified price determined by the utility company [6]. For example, in [7], the authors aimed to maximize the comprehensive benefits of demand response service providers and users, improved the demand response model for power users, and solved the strategy using a deep Q network. In fact, due to the different sensitivities of consumers to prices, a unified price is not able to stimulate all demand response resources. Therefore, ref. [8] established a Stackelberg game model in which a single electricity retailer and multiple users responded based on incentive demand. In this model, when the spot market electricity price was higher than its selling price, the electricity retailer developed a demand response subsidy strategy to reduce the loss from electricity sales. In the corresponding period, the user determined the response power according to the subsidy price set by the electricity retailer to obtain additional profits. Incentive-based DR was thus proposed as an important supplement to price-based DR. However, these papers did not integrate the two DR programs. In addition, in the existing literature, DR has always been regarded as an independent procedure, making decisions without information about the operation of the low-voltage distribution network. In fact, load reduction or load transfer inevitably changes the variables of the station power system, such as active power, increasing its uncertainty. Therefore, it is necessary to integrate information regarding the power operation of the station area in order to reduce the uncertainty in demand response decision making. For example, in an LVTPA, changes in load cause changes in the three-phase unbalance, which affects the power loss and power quality of the transformer [9].
However, applying conventional algorithms to schedule the appliances of several consumers with mixed-integer variables is difficult due to increased computational complexity and high-dimensional data. For instance, [10] scheduled large-scale residential appliances for DR using a computational algorithm with the cutting-plane method, but did not consider the convergence or calculation time. Recently, a massive number of studies on residential DR have illustrated that reinforcement learning (RL) can tackle such mixed integer programming [11]. Deep reinforcement learning (DRL) is an active area of research, because it is able to find the best implementation strategy by means of “trial and error” in the external environment. Refs. [12,13] implemented batch RL to optimize the operation strategy of heating, ventilation and air conditioning, and found that an open-loop schedule was required to participate in the day-ahead market. Batch RL can be used to collect empirical data to train the RL agent, and it is popular and critical in areas where it is difficult to build a simulator, or where simulation costs are very high. In addition, the SARSA algorithm is an on-policy RL approach, and was applied to thermal comfort control for vehicle cabins based on an MDP in [14]. In contrast, Q-learning is an off-policy RL approach that saves the learned information in the Q matrix. In [15], the authors focused on the DR of residential and small commercial buildings and studied optimal energy consumption and storage using Q-learning, the most popular RL method. Despite its superior convergence and robustness, Q-learning still suffers from limitations with continuous states and actions [16,17]. Furthermore, building on deep Q-networks (DQNs), which combine deep networks and Q-learning by means of the value function [18], the deep deterministic policy gradient (DDPG) algorithm, which is based primarily on the actor–critic method [19], is able to overcome the weakness of DQN in a huge action space. In [20], DDPG was implemented in the state of charge (SOC) management of multiple electrical energy storage systems, and enabled continuous action and smoother SOC control. However, when the application scenario includes many variables, DDPG performs worse than the multi-agent deep deterministic policy gradient (MADDPG), in which each agent represents a demand unit.
In conclusion, the gaps not covered by the current research can be listed as follows.
(1)
The BESS, which is regarded as the best load for DR, is not used fully in the existing DR scheme.
(2)
DR mechanisms and sources are not fully developed, thus limiting the performance of DR programs.
(3)
The current DR models do not consider the relationship between DR and LVTPA.
(4)
There is still a lack of algorithms to solve large-scale DR problems efficiently.
Therefore, inspired by the previous works, this paper proposes a novel DRL-based grid edge controller for BESS, aiming to increase the operation benefits of BESS and reduce three-phase unbalance (TU). The main contributions are summarized as follows:
(1)
Compared to the current DR management, a novel autonomous DR method that considers the BESS within LVTPA is proposed, significantly enhancing the capability of DR.
(2)
The proposed method takes the idle utilization of BESS into consideration, rather than only normal utilization.
(3)
The proposed method enables the LVTPA to access information about the local BESS in order to perform global optimization, significantly improving the efficiency of power consumption.
(4)
The proposed DRL-based grid edge controller combines incentive-based edge control with price-based edge control and achieves a better performance compared to conventional methods.
The remainder of this paper is organized as follows. Section 2 details the general structure of the DRL-based grid edge controller and describes the LVTPA. Section 3 formulates the operation problem as a Markov Game, and the optimal strategy is learned by applying the multi-agent deep deterministic policy gradient (MADDPG) algorithm. The numerical study is given in Section 4. Section 5 concludes the paper.

2. Typical Structure of LVTPA for Implementing BESS Demand Response

This section presents the structure of the proposed DRL-based DR controller for residential loads in the LVTPA. As shown in Figure 1, a typical LVTPA mainly comprises one low-voltage transformer, one grid edge controller, and several home energy management systems. In a three-phase power system, phases A, B, and C carry different loads; when the load distribution is unbalanced, this easily leads to unbalanced three-phase operation in the LVTPA.

2.1. Modelling of the BESS

2.1.1. BESS States

The modeling of BESS is shown in Figure 2 and is mathematically expressed by Expressions (1) and (2) [16].
$$\begin{cases} SOC(t_{n+1}) = SOC(t_n) + \dfrac{\eta_{\mathrm{char}}\, P_{\mathrm{BESS}}(t_n)\, \sigma(t_n)\, \Delta t}{B_r}, & \sigma(t_n) = 0,\ 1 \\[4pt] SOC(t_{n+1}) = SOC(t_n) + \dfrac{P_{\mathrm{BESS}}(t_n)\, \sigma(t_n)\, \Delta t}{B_r\, \eta}, & \sigma(t_n) = -1 \end{cases} \tag{1}$$
where P_rated is the maximum charging and discharging active power of the BESS, P_BESS is the current BESS active power, and SOC_max and SOC_min are the maximum and minimum values of the SOC, respectively. SOC(t_n) is the state of the battery at t_n; P_BESS(t_n) is the charging and discharging power at t_n. Please note that the charging and discharging operations cannot be performed at the same time. η is the charging and discharging efficiency; B_r is the rated capacity of the battery; σ(t_n) represents the state at t_n, where 1 represents the charging state, 0 represents disconnection from the power grid, and −1 represents the discharging state.
$$\sigma(t_{n+1}) = \begin{cases} 0, & SOC(t_{n+1}) > \lambda_{\mathrm{emg}},\ h(t_{n+1}) = 0 \\ 1, & SOC(t_{n+1}) \le \lambda_{\mathrm{emg}},\ h(t_{n+1}) = 1 \\ -1, & SOC(t_{n+1}) > \max(\lambda_{\mathrm{emg}}, \lambda_{\mathrm{dw}}),\ h(t_{n+1}) = -1 \\ \sigma(t_n), & \sigma(t_n) \ne 0,\ (t_{n+1} - t_{\mathrm{ini}}) < t_{\mathrm{len}} \end{cases} \tag{2}$$
In Expression (2), h(t_n) is a decision variable indicating the state of the electric vehicle, where 1 represents the charging state, −1 represents the discharging state, and 0 represents the disconnected state. h is the set of decision variables whose elements are h(t_n). λ_dw is the SOC threshold for the discharging operation. λ_emg is an SOC threshold indicating whether a BESS needs to be charged: if SOC(t_n) ≤ λ_emg, the charging operation must be carried out immediately, and the operational strategy of the battery should ensure that the SOC of the BESS always remains above this threshold. On the basis of the user’s daily and nightly travel data, the frequency of a user’s personal car usage is classified as high-frequency (λ_emg of 0.9–1.0), intermediate-frequency (λ_emg of 0.6–0.9), or low-frequency (λ_emg of 0.25–0.6).
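To make the dynamics of Expressions (1) and (2) concrete, the sketch below steps a battery model forward in time. The efficiency, capacity, threshold values, and branch ordering are illustrative assumptions, not the authors' settings.

```python
# Minimal sketch of the SOC update in Expression (1) and the state-transition
# rule in Expression (2). All parameter values and the branch priority are
# assumptions made for illustration only.

def soc_update(soc, p_bess, sigma, dt_h=0.25, eta=0.95, b_r=10.0):
    """One SOC step for a b_r kWh battery over dt_h hours (Expression (1)).

    sigma = 1: charging, sigma = -1: discharging, sigma = 0: disconnected.
    """
    if sigma >= 0:
        # charging (or idle): conversion losses applied on the way in
        return soc + eta * p_bess * sigma * dt_h / b_r
    # discharging: conversion losses applied on the way out
    return soc + p_bess * sigma * dt_h / (b_r * eta)

def next_sigma(soc_next, h_next, sigma, t_next, t_ini, t_len,
               lam_emg=0.6, lam_dw=0.3):
    """State-transition logic of Expression (2) (threshold values assumed)."""
    if sigma != 0 and (t_next - t_ini) < t_len:
        return sigma                              # finish the current operation first
    if soc_next <= lam_emg and h_next == 1:
        return 1                                  # SOC too low: charge immediately
    if soc_next > max(lam_emg, lam_dw) and h_next == -1:
        return -1                                 # enough energy: allowed to discharge
    return 0                                      # otherwise stay disconnected

soc = soc_update(0.5, p_bess=3.0, sigma=1)        # charge at 3 kW for 15 min
print(round(soc, 4), next_sigma(soc, h_next=-1, sigma=0, t_next=2, t_ini=0, t_len=4))
```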

2.1.2. Energy Storage Life Cycle Costs and Benefits

The life cycle costs of energy storage on the user side mainly include the one-time fixed investment cost Cinv of the energy storage, and the total operation and maintenance cost Cope; the benefits include the recovery value Brec at the end of the energy storage life cycle, and the peak and valley arbitrage BTOC of the installation of energy storage during the full life cycle. Demand response total revenue is BDR. F is the full life cycle benefit of energy storage.
$$F = B_{\mathrm{TOC}} + B_{\mathrm{DR}} + B_{\mathrm{rec}} - C_{\mathrm{inv}} - C_{\mathrm{ope}} \tag{3}$$
Equation (3) gives the full life cycle benefit of the energy storage; Equations (4) and (5) give the fixed investment cost and the total operation and maintenance cost, respectively.
$$C_{\mathrm{inv}} = c_e E_{\max} + c_p P_{\mathrm{rated}} \tag{4}$$
$$C_{\mathrm{ope}} = c_{\mathrm{om}} E_{\max} \tag{5}$$
In Equations (4) and (5), c_e is the unit capacity cost; c_p is the unit power cost; c_om is the annual operation and maintenance cost coefficient per unit capacity; E_max is the rated maximum capacity of the energy storage; P_rated is the rated power of the energy storage.
The benefits of the entire life cycle of energy storage include energy storage recovery value, full cycle peak and valley arbitrage of energy storage, and demand response benefits, as shown in Equations (6)–(8), respectively.
$$B_{\mathrm{rec}} = \theta\, C_{\mathrm{inv}} \tag{6}$$
$$B_{\mathrm{TOC}} = \sum_{t=1}^{T} \sum_{i=1}^{96} c_i \left( p_{d,i,t} - p_{c,i,t} \right) \Delta t_i \tag{7}$$
$$B_{\mathrm{DR}} = \sum_{l=0}^{l_{\mathrm{DR}}} g\, p_{\mathrm{DR},i} \tag{8}$$
In Equations (6)–(8), θ is the recovery rate of the energy storage; c_i is the electricity price at time i; p_{c,i,t} and p_{d,i,t} are the charging and discharging power of BESS i at time t, respectively; the sampling interval Δt_i is 15 min; T is the total number of days in the full life cycle of the energy storage; p_{DR,i} is the reported demand response power of BESS i; g is the response speed coefficient; and l_DR is the total number of times that the energy storage participates in demand response during its life cycle.
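As a numerical illustration of how the terms of Equations (3)–(8) combine, the toy calculation below evaluates F for an assumed battery and a charge/discharge profile built from the time-of-use prices of Table 2. All cost coefficients and the profile itself are made-up numbers, not the paper's data.

```python
# Toy evaluation of the life-cycle benefit F in Equations (3)-(8).
import numpy as np

def life_cycle_benefit(price, p_charge, p_discharge, days, dt_h=0.25,
                       e_max=10.0, p_rated=3.0, c_e=1500.0, c_p=300.0,
                       c_om=20.0, theta=0.05, g=1.0, p_dr=2.0, l_dr=100):
    c_inv = c_e * e_max + c_p * p_rated                        # Eq. (4)
    c_ope = c_om * e_max                                       # Eq. (5), as written
    b_rec = theta * c_inv                                      # Eq. (6)
    daily = np.sum(price * (p_discharge - p_charge) * dt_h)    # Eq. (7), one day
    b_toc = days * daily                                       # summed over the life cycle
    b_dr = l_dr * g * p_dr                                     # Eq. (8)
    return b_toc + b_dr + b_rec - c_inv - c_ope                # Eq. (3)

# 96 quarter-hour slots with the Table 2 price pattern.
slots = np.arange(96)
price = np.full(96, 0.7054)                                    # flat
price[(slots >= 92) | (slots < 32)] = 0.3351                   # valley 23:00-08:00
price[((slots >= 32) & (slots < 48)) | ((slots >= 80) & (slots < 92))] = 1.1751  # peak
p_c = np.where(slots < 12, 3.0, 0.0)                           # charge 00:00-03:00
p_d = np.where((slots >= 32) & (slots < 44), 3.0, 0.0)         # discharge 08:00-11:00
print(round(life_cycle_benefit(price, p_c, p_d, days=3650), 2))
```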

2.2. Model for the Operation of BESS

2.2.1. Energy Storage Operation Constraints

The life loss of energy storage batteries is closely related to throughput, and reducing throughput can prolong service life. In order to make more reasonable use of energy storage, Ref. [21] compared actual user load data with the peak and valley electricity prices and limited the daily throughput of the energy storage battery, which not only reduces the throughput of the energy storage but also greatly limits the number of charge–discharge state transitions within a day.
$$\begin{aligned} \sum_{i=j+1}^{j+96} p_{c,i}\, \Delta t_i &\le m\, E_{\max} \left( SOC_{\max} - SOC_{\min} \right), \quad j = 0, 1, 2, \ldots, n_{\mathrm{day}} \\ \sum_{i=j+1}^{j+96} p_{d,i}\, \Delta t_i &\le m\, E_{\max} \left( SOC_{\max} - SOC_{\min} \right), \quad j = 0, 1, 2, \ldots, n_{\mathrm{day}} \end{aligned} \tag{9}$$
where m is the equivalent number of charge and discharge cycles of the energy storage; E_max is the rated maximum capacity of the energy storage; SOC_max and SOC_min are the maximum and minimum values of the state of charge of the energy storage, respectively, set to 0.9 and 0.1 in this paper; and n_day is the number of days.
The BESS operates in order to obtain benefits. As shown in Figure 3, these benefits exist in three respects. Firstly, by storing electricity at low-price times and discharging at high-price times, consumers receive the benefit of saving electricity costs. Secondly, by storing electricity at low-price times and discharging to the power grid at high-price times, consumers receive the benefit of the price spread. Thirdly, by storing electricity when photovoltaic power is sufficient and discharging when there is no photovoltaic power, consumers receive the benefit of saving electricity costs.

2.2.2. Physical Constraints of Energy Storage Batteries

$$SOC_{\min} \le SOC_i \le SOC_{\max} \tag{10}$$
$$SOC_{i+1} = SOC_i + \frac{p_{c,i}\, \Delta t_i\, \eta_{\mathrm{ch}}}{E_{\max}} - \frac{p_{d,i}\, \Delta t_i}{\eta_{\mathrm{dis}}\, E_{\max}} \tag{11}$$
SOCi is the state of charge of energy storage at time i; ηch and ηdis are the charging and discharging efficiencies of energy storage, respectively.
$$0 \le sw_{c,i} + sw_{d,i} \le 1 \tag{12}$$
$$0 \le p_{c,i} \le sw_{c,i}\, P_{\mathrm{rated}} \tag{13}$$
$$0 \le p_{d,i} \le sw_{d,i}\, P_{\mathrm{rated}} \tag{14}$$
sw_{c,i} and sw_{d,i} are 0–1 variables that indicate the charging and discharging state of the energy storage; P_rated is the rated power of the energy storage. Equations (12)–(14) ensure that the energy storage is not in the charging and discharging states at the same time, and that the charging and discharging power do not exceed the rated power.
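A simple way to verify a candidate charge/discharge schedule against Equations (9)–(14) is the feasibility check sketched below. The SOC bounds follow the values in the text (0.1 and 0.9); the other limits and the example schedule are assumed numbers.

```python
# Sketch of a feasibility check for the storage constraints in Equations (9)-(14).
import numpy as np

def check_storage_schedule(p_c, p_d, sw_c, sw_d, soc0=0.5, dt_h=0.25,
                           e_max=10.0, p_rated=3.0, eta_ch=0.95, eta_dis=0.95,
                           soc_min=0.1, soc_max=0.9, m=1.0):
    p_c, p_d = np.asarray(p_c, float), np.asarray(p_d, float)
    checks = []
    # Eq. (12): never charge and discharge in the same interval
    checks.append(np.all(sw_c + sw_d <= 1))
    # Eqs. (13)-(14): power limited by the rated power when the switch is on
    checks.append(np.all((p_c >= 0) & (p_c <= sw_c * p_rated)))
    checks.append(np.all((p_d >= 0) & (p_d <= sw_d * p_rated)))
    # Eq. (11): propagate the SOC; Eq. (10): keep it inside [soc_min, soc_max]
    soc = soc0 + np.cumsum(p_c * dt_h * eta_ch / e_max - p_d * dt_h / (eta_dis * e_max))
    checks.append(np.all((soc >= soc_min) & (soc <= soc_max)))
    # Eq. (9): daily charged/discharged energy bounded by m equivalent cycles
    cap = m * e_max * (soc_max - soc_min)
    checks.append(np.sum(p_c * dt_h) <= cap)
    checks.append(np.sum(p_d * dt_h) <= cap)
    return all(checks)

sw_c = np.zeros(96); sw_c[:4] = 1           # charge 00:00-01:00
sw_d = np.zeros(96); sw_d[32:36] = 1        # discharge 08:00-09:00
print(check_storage_schedule(3.0 * sw_c, 3.0 * sw_d, sw_c, sw_d))
```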

2.2.3. Demand Response Constraints

If energy storage participates in demand response on day t, it must meet the conditions for effective demand response. Equations (15)–(17) restrict the load situation after energy storage has participated in demand response.
$$\max\left( \mathrm{Load}_k + p_{c,k} - p_{d,k} \right) \le \max\left( \mathrm{Load}_j + p_{c,j} - p_{d,j} \right) \tag{15}$$
$$\mathrm{mean}\left( \mathrm{Load}_j + p_{c,j} - p_{d,j} \right) - \mathrm{mean}\left( \mathrm{Load}_k + p_{c,k} - p_{d,k} \right) \ge 0.8\, p_{\mathrm{DR}} \tag{16}$$
$$0.05\, \mathrm{Load}_{\max}^{\mathrm{pre}} \le p_{\mathrm{DSM}} \le 0.2\, \mathrm{Load}_{\max}^{\mathrm{pre}} \tag{17}$$
In Formulas (15)–(17), k is the response time on the demand response day; j is the corresponding time of the baseline; p_{c,k} and p_{c,j} are the charging power of the energy storage in the corresponding periods; p_{d,k} and p_{d,j} are the discharging power of the energy storage in the corresponding periods; Load_k is the load during the demand response period; Load_j is the load at the corresponding time 5 days before the response date; p_DSM is the optimal response power reported by the user; and Load_max^pre is the maximum peak load of the user in the previous year. Equation (15) indicates that the maximum load during the response period does not exceed the baseline maximum load, Equation (16) is the average load constraint during the response period, and Equation (17) restricts the range of the agreed response power.
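The three conditions of Equations (15)–(17) can be tested jointly, as in the sketch below. The baseline and response-day profiles are invented flat profiles used only to show how the test is applied.

```python
# Sketch of the demand-response validity test in Equations (15)-(17).
import numpy as np

def dr_is_valid(load_k, p_c_k, p_d_k, load_j, p_c_j, p_d_j,
                p_dr, p_dsm, load_max_pre):
    net_k = load_k + p_c_k - p_d_k          # net load during the response period
    net_j = load_j + p_c_j - p_d_j          # net load during the baseline period
    peak_ok = np.max(net_k) <= np.max(net_j)                        # Eq. (15)
    mean_ok = np.mean(net_j) - np.mean(net_k) >= 0.8 * p_dr         # Eq. (16)
    range_ok = 0.05 * load_max_pre <= p_dsm <= 0.2 * load_max_pre   # Eq. (17)
    return peak_ok and mean_ok and range_ok

# Response period 12:00-14:00 (8 quarter-hour slots): the BESS discharges 2 kW.
load = np.full(8, 4.0)
print(dr_is_valid(load, 0.0, 2.0, load, 0.0, 0.0,
                  p_dr=2.0, p_dsm=2.0, load_max_pre=12.0))
```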

2.2.4. Power Consumption Constraints of Consumers

In order to satisfy the daily demand of consumers using the stored power, the operation of the BESS must consider the following constraints.
$$SOC_{i,t} \ge SOC_{\mathrm{con}} \tag{18}$$
$$SOC_{i,t} \ge SOC_{\mathrm{str}} \tag{19}$$
where SOCcon is the power consumption demand without photovoltaic generation. SOCstr is the power storage demand under photovoltaic generation.

3. Methodology

In this section, a Markov Game is designed for DR and is played by the grid edge controller. The MADDPG algorithm is applied to train the grid edge controller to learn the optimal control strategy. Detailed descriptions are provided in Figure 4.

3.1. Markov Decision Process for DR

A discrete time MDP consisting of three elements, combined with the model presented in this paper, is described below.

3.1.1. State

In this paper, two types of state are defined for the grid edge controller: day-ahead scheduling and real-time load reduction. For day-ahead scheduling, the state is defined as follows.
For price-based DR, the state s_{i,j} is as shown in Formula (20), where T represents the daily time slots, and p_{i,j} is the charge or discharge power at every time slot j of user i. The target state is the satisfaction of all power consumption demand during one day.
$$s_{i,j}(t) = SOC_{i,j}, \quad j \in T \tag{20}$$
For incentive-based DR, the state s_{m,j} is as shown in Formula (21), where T_DR represents the time slots of DR, M is the number of consumers participating in DR, m denotes the m-th consumer, and l represents the time slot. The target state is to achieve the load management value during one DR period.
$$s_{m,j}(t_{\mathrm{DR}}) = \sum_{l=0}^{t_{\mathrm{DR}}} p_{m,j}(l), \quad l \in T_{\mathrm{DR}} \tag{21}$$
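The two state definitions can be illustrated as follows: the price-based state in Formula (20) is the SOC at the current daily time slot, while the incentive-based state in Formula (21) is the cumulative response power up to the current DR slot. The SOC trajectory and response profile below are toy values, not measured data.

```python
# Sketch of the state definitions in Formulas (20)-(21).
import numpy as np

def price_based_state(soc_profile, t):
    """Formula (20): s_{i,j}(t) = SOC_{i,j} at daily time slot t."""
    return soc_profile[t]

def incentive_based_state(p_mj, t_dr):
    """Formula (21): cumulative response power from the start of DR up to t_dr."""
    return float(np.sum(p_mj[: t_dr + 1]))

soc_profile = np.linspace(0.3, 0.9, 96)        # assumed SOC trajectory over one day
p_mj = np.array([0.0, 1.5, 2.0, 2.0])          # assumed response power per DR slot (kW)
print(price_based_state(soc_profile, 40), incentive_based_state(p_mj, 2))
```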

3.1.2. Action

The grid edge controller also has two types of action, corresponding to the two states mentioned above. For day-ahead scheduling, the action space is defined as follows.
For price-based DR, the action space is defined as shown in Formula (22), where p_{i,j} represents the active power of agent (i, j).
$$a_{i,j}(t) = p_{i,j}(t) \tag{22}$$
For incentive-based DR, the action space is defined as
$$a_{m,j}(t_{\mathrm{DR}}) = p_{m,j}(t_{\mathrm{DR}}) \tag{23}$$
From the above definition of Formula (20), we can derive:
$$s_{i,j}(t_2) - s_{i,j}(t_1) = p_{i,j}(t_1) = a_{i,j}(t_1) \tag{24}$$
For any positive integer z, when $0 \le t_0 < t_1 < \cdots < t_z$ is satisfied, we have:
$$s_{i,j}(t_1) - s_{i,j}(t_0),\ s_{i,j}(t_2) - s_{i,j}(t_1),\ \ldots,\ s_{i,j}(t_z) - s_{i,j}(t_{z-1}) \tag{25}$$
Hence, the day-ahead load scheduling is an independent increment process that satisfies the definition of an MDP [22]. In the same way, the state and action definitions of incentive-based DR can be proved to satisfy the MDP.

3.1.3. Reward

To guide the grid edge controller to learn the optimal DR strategy, the reward functions for day-ahead scheduling and real-time load reduction are defined as follows.
For price-based DR, the reward r_{s,k}(t) is defined as shown in Formula (26), where ρ(t) is the electricity price in time slot t, t_{m,i} represents the working time of shiftable appliance m of consumer i, and δ_{s,TU} is the TU index.
$$r_{s,k}(t) = \rho(t)\, t_{m,i}\, a_{i,j}(t) - \left( \delta_{s,k+1,\mathrm{TU}} - \delta_{s,k,\mathrm{TU}} \right) \tag{26}$$
δ_{s,TU} is defined as shown in Formula (27), where I_{A,t} is the current of phase A in time slot t. In addition, Formula (28) defines the average current of the three phases.
$$\delta_{s,\mathrm{TU}} = \frac{\max\left( I_{A,t}, I_{B,t}, I_{C,t} \right) - I_{\mathrm{ave}}}{I_{\mathrm{ave}}} \tag{27}$$
$$I_{\mathrm{ave}} = \frac{I_{A,k} + I_{B,k} + I_{C,k}}{3} \tag{28}$$
For incentive-based DR, the reward r_{r,k} is defined as shown in Formula (29), where κ_m is the incentive coefficient of the m-th consumer, and ζ is an adjustable parameter that represents the importance of TU.
$$r_{r,k} = p_{m,j}\, \kappa_m + \left( \delta_{s,k+1,\mathrm{TU}} - \delta_{s,k,\mathrm{TU}} \right) \zeta \tag{29}$$
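The sketch below evaluates the TU index of Formulas (27)–(28) and the two reward terms of Formulas (26) and (29) for one step. The phase currents, the incentive coefficient, and the weight ζ are example values, not the paper's settings.

```python
# Sketch of the reward terms in Formulas (26)-(29).
def tu_degree(i_a, i_b, i_c):
    """Three-phase unbalance index of Formulas (27)-(28)."""
    i_ave = (i_a + i_b + i_c) / 3.0
    return (max(i_a, i_b, i_c) - i_ave) / i_ave

def price_based_reward(rho_t, t_mi, a_ijt, tu_next, tu_now):
    """Formula (26): energy value of the action minus the change in unbalance."""
    return rho_t * t_mi * a_ijt - (tu_next - tu_now)

def incentive_based_reward(p_mj, kappa_m, tu_next, tu_now, zeta=0.5):
    """Formula (29): DR subsidy term plus the weighted change in unbalance."""
    return p_mj * kappa_m + (tu_next - tu_now) * zeta

tu_before = tu_degree(30.0, 24.0, 21.0)     # unbalanced phase currents (A)
tu_after = tu_degree(27.0, 25.0, 23.0)      # after shifting BESS power across phases
print(round(price_based_reward(1.1751, 1.0, 2.0, tu_after, tu_before), 4))
print(round(incentive_based_reward(2.0, 8.0, tu_after, tu_before), 4))
```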

3.1.4. Objective Function

This paper presents two objectives: day-ahead scheduling and real-time load reduction.
For price-based DR, the objective function of the controller is defined as shown in Formula (30) to maximize the energy benefits and reduce the TU of the LVTPA.
$$\max \sum_{k \in T} \sum_{m \in M} \sum_{i \in N} \left( \rho_k\, t_{m,i,k}\, a_{m,i,k} - \left( \delta_{s,k+1,\mathrm{TU}} - \delta_{s,k,\mathrm{TU}} \right) \right) \tag{30}$$
For incentive-based DR, the objective function of the controller is defined as shown in Formula (31) to maximize the DR benefits and reduce the TU of the LVTPA.
$$\max \sum_{l \in L} \sum_{m \in M} \sum_{i \in N} \left( p_{m,j,l}\, \kappa_{m,l} + \left( \delta_{s,l+1,\mathrm{TU}} - \delta_{s,l,\mathrm{TU}} \right) \zeta \right) \tag{31}$$

3.2. Multi-Agent Deep Deterministic Policy Gradients

In the context of DRL, a computerized agent learns to take actions at a discrete time step t to maximize a numerical reward from the environment, as shown in Figure 5.
In the intelligent DR system designed in this paper with the station area as the core, the energy management system is a single agent that manages the electricity consumption arrangements of every BESS, and communicates with others to obtain information regarding its own actions. The improved features of the MADDPG algorithm are as follows:
(1)
Each agent has its own goals and behavior constraints.
(2)
Each agent can interact with the environment and change the state of the environment.
(3)
Each agent can optimize itself when information is incomplete. The calculation of the whole system is asynchronous and concurrent.
As shown in Figure 6, MADDPG is able to use centralized training and distributed applications to fully improve the optimization efficiency of multiple agents. The detailed pseudocode is as follows in Algorithm 1:
Algorithm 1. MADDPG for grid edge control.
1. Randomly initialize the parameters of the critic networks $Q_{i,j}(s, a_{i,j}|\Phi_{i,j})$ and the actor networks $\mu_{i,j}(o_{i,j}|\theta_{i,j})$ with weights $\Phi_{i,j}$ and $\theta_{i,j}$ for each agent (i, j).
2. Initialize each agent's target network parameters from the critic and actor networks: $\Phi'_{i,j} \leftarrow \Phi_{i,j}$, $\theta'_{i,j} \leftarrow \theta_{i,j}$.
3. Clear the replay buffer D.
4. For episode = 1 to N do
5.  Initialize a random process $\chi$ for exploration in this episode.
6.  Observe the initial state s.
7.  For h = 1 to H do
8.   Select action $a_{i,j} = \mu_{i,j}(o_{i,j}|\theta_{i,j}) + \chi$ for each agent (i, j).
9.   Execute the actions $a_{i,j}$, and obtain the rewards $r_{i,j}$ and the next state $s'$.
10.   Store the transition $(s, a_{i,j}, r_{i,j}, s')$ in the replay buffer D.
11.   Update the state observation $s \leftarrow s'$.
12.   For agent (i, j), i = 1 to I, j = 1 to J do
13.    Sample a random minibatch of K transitions $(s^k, a_{i,j}^k, r_{i,j}^k, s'^k)$ from D.
14.    Set the target value $y_{i,j}^k = r_{i,j}^k + \gamma\, Q_{i,j}^{\mu'}(s'^k, a_{i,j}'^k|\Phi'_{i,j})\big|_{a_{i,j}'^k = \mu'_{i,j}(o_{i,j}^k|\theta'_{i,j})}$.
15.    Update the critic network by minimizing the loss $\zeta_{i,j}(\Phi_{i,j}) = \frac{1}{K} \sum_k \left[ y_{i,j}^k - Q_{i,j}^{\mu}(s^k, a_{i,j}^k|\Phi_{i,j}) \right]^2$.
16.    Update the actor network using the sampled policy gradient: $\nabla_{\theta_{i,j}} J(\psi_{i,j}) \approx \frac{1}{K} \sum_k \nabla_{\theta_{i,j}} \mu_{i,j}(o_{i,j}|\theta_{i,j})\, \nabla_{a_{i,j}} Q_{i,j}^{\mu}(s^k, a_{i,j}^k|\Phi_{i,j})$
17.    End for
18.   Softly update the target network parameters for each agent (i, j): $\Phi'_{i,j} \leftarrow \tau \Phi_{i,j} + (1-\tau) \Phi'_{i,j}$, $\theta'_{i,j} \leftarrow \tau \theta_{i,j} + (1-\tau) \theta'_{i,j}$
19.  End for
20. End for
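As a concrete illustration of steps 13–18 of Algorithm 1, the sketch below performs one critic update, one actor update, and the soft target update for a single agent in TensorFlow 2 (the framework used in Section 4). The network sizes, the toy dimensions, and the simplification that the critic sees only this agent's action are assumptions for illustration; a full MADDPG critic would also take the joint action of the other agents.

```python
# Minimal single-agent sketch of the MADDPG update steps (Algorithm 1, steps 13-18).
import tensorflow as tf

OBS_DIM, ACT_DIM, STATE_DIM = 4, 1, 8                 # assumed toy dimensions
GAMMA, TAU, LR_A, LR_C = 0.97, 0.001, 0.0005, 0.001   # values from Table 1

def mlp(in_dim, out_dim, out_act=None):
    return tf.keras.Sequential([
        tf.keras.layers.Dense(64, activation="relu", input_shape=(in_dim,)),
        tf.keras.layers.Dense(out_dim, activation=out_act),
    ])

# Actor maps the local observation to an action; critic scores the global state
# together with the action (centralized training, decentralized execution).
actor, actor_targ = mlp(OBS_DIM, ACT_DIM, "tanh"), mlp(OBS_DIM, ACT_DIM, "tanh")
critic, critic_targ = mlp(STATE_DIM + ACT_DIM, 1), mlp(STATE_DIM + ACT_DIM, 1)
actor_targ.set_weights(actor.get_weights())
critic_targ.set_weights(critic.get_weights())
opt_a, opt_c = tf.keras.optimizers.Adam(LR_A), tf.keras.optimizers.Adam(LR_C)

def soft_update(target, source, tau=TAU):
    # phi' <- tau * phi + (1 - tau) * phi'   (step 18)
    for t, s in zip(target.variables, source.variables):
        t.assign(tau * s + (1.0 - tau) * t)

def train_step(obs, state, act, rew, next_obs, next_state):
    # Critic update (steps 14-15): regress Q towards r + gamma * Q'(s', mu'(o')).
    next_act = actor_targ(next_obs)
    y = rew + GAMMA * critic_targ(tf.concat([next_state, next_act], axis=-1))
    with tf.GradientTape() as tape:
        q = critic(tf.concat([state, act], axis=-1))
        critic_loss = tf.reduce_mean(tf.square(y - q))
    opt_c.apply_gradients(zip(tape.gradient(critic_loss, critic.trainable_variables),
                              critic.trainable_variables))
    # Actor update (step 16): ascend the sampled deterministic policy gradient.
    with tf.GradientTape() as tape:
        a = actor(obs)
        actor_loss = -tf.reduce_mean(critic(tf.concat([state, a], axis=-1)))
    opt_a.apply_gradients(zip(tape.gradient(actor_loss, actor.trainable_variables),
                              actor.trainable_variables))
    soft_update(critic_targ, critic)
    soft_update(actor_targ, actor)
    return float(critic_loss), float(actor_loss)

# One update on a random minibatch of K = 32 transitions.
K = 32
print(train_step(tf.random.normal((K, OBS_DIM)), tf.random.normal((K, STATE_DIM)),
                 tf.random.normal((K, ACT_DIM)), tf.random.normal((K, 1)),
                 tf.random.normal((K, OBS_DIM)), tf.random.normal((K, STATE_DIM))))
```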

4. Case Study

4.1. Simulation Environment

In this paper, a simulation is carried out on the basis of two years of operation data from a three-phase power system including six households that take part in the DR program. The agents were programmed on a personal computer with a six-core, 12-thread Intel CPU and 16 GB of memory. The maximum number of training episodes was set to 20,000. Python 3.6.2 and TensorFlow 2.1 were adopted to program the MADDPG algorithm. In addition, the Adam optimizer was applied to update the parameters of the critic network. The other parameters adopted in the MADDPG algorithm are shown in Table 1. The time-of-use prices are given in Table 2. The incentive-based DR subsidy is 8 CNY/kW·h. In addition, the PV feed-in tariff is 0.6 CNY/kW·h. Table 3 shows the other scenario data and benefits.
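One possible way to collect the training settings above and in Table 1 is a single configuration object; the grouping and field names below are illustrative and not from the paper.

```python
# Training configuration mirroring Table 1 and the setup described above.
MADDPG_CONFIG = {
    "epsilon": 0.01,        # probability of the epsilon-greedy policy
    "actor_lr": 0.0005,     # learning rate of the policy network
    "critic_lr": 0.001,     # learning rate of the Q-network (Adam optimizer)
    "gamma": 0.97,          # discount factor
    "tau": 0.001,           # soft target-update coefficient
    "batch_size": 1024,
    "max_episodes": 20_000,
    "n_agents": 6,          # six households in the case study
}
```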

4.2. Numerical Results Discussion

The following section analyzes the case from three perspectives: benefits, three-phase unbalance, and algorithm performance.

4.2.1. Analysis of Cost

This paper compares three types of BESS in Table 4: the lithium ion battery is the cheapest, but it also has the fewest cycle charge and discharge times; as a result, its unit energy call cost is the highest. Conversely, lithium iron titanate batteries have the longest service life and the lowest unit energy call cost. However, in this work, the lithium iron phosphate battery was tested in the case study due to its popularity.

4.2.2. Analysis of Benefits

Due to space limitations, three of the six consumers are compared in this paper. The scheduling of consumer I is shown in Figure 7. It is noticeable that the BESS charges in the valley time at 3 kW, and can thus save CNY 2.46 per hour. The BESS charges at the low price during 23:00–03:00 and discharges during the periods 07:00–08:00 and 20:00–23:00. During peak times, the BESS is fully charged by the PV in the 08:00–12:00 period. When the electricity price increases during the 20:00–23:00 period, the BESS is arranged to supply the household power consumption.
In this LVTPA, the power consumption and the PV power values of the consumers were different. Among them, consumer I did not participate in incentive-based DR. As shown in Figure 8, consumer I had a relatively small photovoltaic capacity compared to its power consumption; therefore, it had no capacity for incentive-based DR. Without optimization by the MADDPG-based grid controller, consumers I, II, and III only considered storing the remaining electrical energy in the BESS and using it in the absence of photovoltaic power generation. With optimization using the proposed method, the BESS was arranged to charge at 00:00–02:00 during the valley price and discharge at 08:00–12:00 and 20:00–23:00 during the peak price. The PV still fed power to the grid when photovoltaic power was adequate, and the BESS still stored the remaining electricity. Similarly, consumers II and III, as shown in Figure 9 and Figure 10, charged at the low price and discharged at the high price at 2 kW and 3 kW, respectively. The BESSs of consumers II and III were able to obtain a subsidy during the incentive-based DR period (12:00–14:00).
As shown in Figure 11, every consumer only acquires a small benefit from electricity cost savings and PV electricity earnings alone. Using MADDPG in grid edge control, the electricity benefit increases significantly. For example, Figure 11 shows that, for consumer I, the earnings increased from CNY −4.32 to CNY 2.23. Moreover, participation in incentive-based DR is valuable for BESS owners, and can bring at least CNY 48 per day.

4.2.3. Analysis of Three-Phase Unbalance

As shown in Figure 12, the MADDPG-based grid controller can adjust the power of the BESS to reduce the TU: the degree of TU is decreased by 10.3% and the power loss is reduced by 15.1%.

4.2.4. Analysis of Algorithm Performance

As shown in Figure 13, the average reward of each agent keeps increasing throughout the training. After 20,000 rounds of training, the rewards that each agent receives are concentrated around a single converged value, which means that the agents have learned to hold the power at an arbitrarily predefined value by adjusting the output of the BESS under randomly generated scenarios. This proves that the proposed method has good convergence stability.
Figure 14 shows the performance comparison of the three main reinforcement learning algorithms, i.e., MADDPG (adopted in this paper), DDPG, and DQN. MADDPG outperforms the other two reinforcement learning algorithms, with faster and more precise convergence.
Figure 15 and Figure 16 show a performance comparison with the model-based control method in strong light and low light. The model-based method applies convex optimization techniques to calculate the strategy. Both methods successfully maintain good power scheduling. The performance gap is small, but the average computational time of the proposed method is much shorter than that of the conventional model-based method. Especially in large-scale power systems, the proposed method is more efficient than the conventional model-based method.

5. Conclusions

This work proposed a multi-agent grid edge controller based on MADDPG for the DR management of distributed BESS in an LVTPA. The controller was integrated in an automatic grid edge control system designed to improve utilization and increase the DR benefits of consumers. The case study demonstrated that the proposed method is able to reasonably arrange power consumption, increasing the total electricity benefits by 40% and reducing TU by 10.3%. In addition, compared to other RL algorithms, the proposed method has better convergence performance. Compared to the model-based method, MADDPG possesses the same accuracy and a superior calculation time under the two different light conditions. These results demonstrate that the proposed DR program has great potential to motivate residential consumers to participate in DR. In future work, the proposed DR program can be extended to multiple appliances, including air conditioners.

Author Contributions

Conceptualization, W.L. and M.T.; methodology, W.L. and M.T.; software, X.Z.; validation, W.L., M.T. and D.G.; formal analysis, W.L. and M.T.; investigation, W.L. and M.T.; resources, W.L. and M.T.; data curation, X.Z.; writing—original draft preparation, M.T.; writing—review and editing, W.L. and M.T.; visualization, X.Z.; supervision, J.W.; project administration, J.W.; funding acquisition, J.W. All authors have read and agreed to the published version of the manuscript.

Funding

This research received no external funding.

Institutional Review Board Statement

Not applicable.

Informed Consent Statement

Not applicable.

Conflicts of Interest

The authors declare no conflict of interest.

Nomenclature

DR    Demand response
DRL  Deep reinforcement learning
MDP  Markov decision process
TU  Three-phase unbalance
BESS  Battery energy storage system
RL  Reinforcement learning
DDPG  Deep deterministic policy gradient
IoT  Internet of things
A  A set of actions of agent
MA  Multi-agent
HEMS  Home energy management system
LVS  Low-voltage substation
MPC  Model predictive control
ET  Environment temperature
i  Serial production-line branch index
j  Day-ahead or real time
h  Hour index
ψ i , j   The parameter of policy network
Qi,j  Individual Q-network
Q i , j μ   Critic network under policy
Q i , j μ   Target critic network under policy
SOC  State of charge
WM  Washing machine
ANN  Artificial Neural Network
χ  A Noise Process
O  A set of observations
Ri,j  Cumulative rewards of an agent
ai  Action of agent i
γ  Discount factor
μi,j  Policy of an agent
r  Reward
s  States
prated  Rated active power
ps  Active power of standby
R  Real-valued reward function
H  Time slot set {1,…, tT }
Di  Appliance set {1, ···, a···, m}, i ∈B
MILP  Mixed integer linear programming
DQN  Deep Q-network
D  Experience replay buffer
θ i , j   The parameter of policy network
Φ i , j   Target parameter of Q-network
ψ i , j   Target parameter of policy network
θ i , j   Target parameter of policy network
ζ i , j   Loss function of Q-network
oi,j  Observation of an agent
ε  Probability of ε-greedy policy
αA  Learning rate of policy network
αC  Learning rate of Q-network
TTU  Transformer terminal unit

References

  1. Guiding Opinions on Promoting the Development of Energy Storage Technology and Industry, NEA, Beijing. 2017. Available online: http://www.nea.gov.cn/2017-10/11/c_136672019.htm (accessed on 1 June 2021).
  2. Global Energy Storage Market Outlook for the First Half of 2021, BJX Energy Storage Website, Beijing. 2021. Available online: https://chuneng.bjx.com.cn/news/20210429/1150160.shtml (accessed on 1 June 2021).
  3. Report on Energy Supply and Demand in Canada: 2014 Preliminary, Canada’s Ministry Ind., Ottawa, ON, Canada, Tech. Rep. 57-003-X. 2016. Available online: http://www.statcan.gc.ca/pub/57-003-x/57-003-x2016002-eng.pdf (accessed on 1 June 2021).
  4. Chen, Z.; Wu, L.; Fu, Y. Real-time price-based demand response management for residential appliances via stochastic optimization and robust optimization. IEEE Trans. Smart Grid 2012, 3, 1822–1831. [Google Scholar] [CrossRef]
  5. Nilsson, A.; Wester, M.; Lazarevic, D. Smart homes, home energy management systems and real-time feedback: Lessons for influencing household energy consumption from a Swedish field study. Energy Build. 2018, 179, 15–25. [Google Scholar] [CrossRef]
  6. Hu, Q.; Li, F.; Fang, X.; Bai, L. A Framework of Residential Demand Aggregation with Financial Incentives. IEEE Trans. Smart Grid 2018, 9, 497–505. [Google Scholar] [CrossRef]
  7. Guo, K.; Gao, C.; Lin, G.; Lu, S.; Feng, X. Optimization Strategy of Incentive Based Demand Response for Electric Retailer in Spot Market Environment. Autom. Electr. Power Syst. 2020, 44, 28–35. [Google Scholar]
  8. Sun, Y.; Liu, D.; Cui, X.; Li, B.; Huo, M.; Xi, W. Equal Gradient Iterative Learning Incentive Strategy for Accurate Demand Response of Resident Users. Power Syst. Technol. 2019, 43, 3597–3605. [Google Scholar]
  9. Anum, M.; Yang, F.; Dong, J.; Luo, Z.; Yi, L.; Wu, T. Modeling and Load Flow Analysis for Three phase Unbalanced Distribution System. In Proceedings of the 2021 4th International Conference on Energy, Electrical and Power Engineering (CEEPE), Chongqing, China, 22–24 April 2021; pp. 44–48. [Google Scholar]
  10. Elghitani, F.; Zhuang, W. Aggregating a Large Number of Residential Appliances for Demand Response Applications. IEEE Trans. Smart Grid 2018, 9, 5092–5100. [Google Scholar] [CrossRef]
  11. Zoltan, V.N. Reinforcement learning for demand response: A review of algorithms and modeling techniques. Appl. Energy 2018, 235, 1072–1089. [Google Scholar]
  12. Ruelens, F.; Claessens, B.J.; Vandael, S.; de Schutter, B.; Babuška, R.; Belmans, R. Residential Demand Response of Thermostatically Controlled Loads Using Batch Reinforcement Learning. IEEE Trans. Smart Grid 2017, 8, 2149–2159. [Google Scholar] [CrossRef] [Green Version]
  13. Schmidt, M.; Moreno, M.V.; Schülke, A.; Macek, K.; Mařík, K.; Pastor, A.G. Optimizing legacy building operation: The evolution into data-driven predictive cyber-physical systems. Energy Build. 2017, 148, 257–336. [Google Scholar] [CrossRef]
  14. Brusey, J.; Hintea, D.; Gaura, E.; Beloe, N. Reinforcement learning-based thermal comfort control for vehicle cabins. Mechatronics 2018, 50, 413–421. [Google Scholar] [CrossRef] [Green Version]
  15. Mahapatra, C.; Moharana, A.K.; Leung, V.C.M. Energy Management in Smart Cities Based on Internet of Things: Peak Demand Reduction and Energy Savings. Sensors 2017, 17, 2812. [Google Scholar] [CrossRef] [PubMed] [Green Version]
  16. Shao, M.M.; Liu, Y.B. Reinforcement learning driven self-optimizing operation for distributed electrical storage system. Power Syst. Technol. 2020, 44, 1696–1705. [Google Scholar]
  17. Lu, R.; Hong, S.H.; Yu, M. Demand Response for Home Energy Management Using Reinforcement Learning and Artificial Neural Network. IEEE Trans. Smart Grid 2019, 10, 6629–6639. [Google Scholar] [CrossRef]
  18. Lu, R.; Hong, S.H. Incentive-based demand response for smart grid with reinforcement learning and deep neural network. Appl. Energy 2019, 236, 937–949. [Google Scholar] [CrossRef]
  19. David, S.; Guy, L.; Nicolas, H.; Thomas, D. Deterministic Policy Gradient Algorithms. In Proceedings of the 31st International Conference on Machine Learning, ICML, Beijing, China, 21–26 June 2014. [Google Scholar]
  20. Mocanu, E. On-Line Building Energy Optimization Using Deep Reinforcement Learning. IEEE Trans. Smart Grid 2019, 10, 3698–3708. [Google Scholar] [CrossRef] [Green Version]
  21. Nayak, C.; Kumar, N.; Ranjan, M. Optimal Design of Battery Energy Storage System for Peak Load Shaving and Time of Use Pricing. In Proceedings of the Second International Conference on Electrical, Computer and Communication Technologies, Coimbatore, Tamil Nadu, India, 22 February 2017. [Google Scholar]
  22. He, S. Stochastic Process; CA: Beijing, China, 2008; pp. 150–218. [Google Scholar]
Figure 1. The system structure in LVTPA.
Figure 2. BESS system model diagram.
Figure 3. Schematic diagram of BESS daily operation.
Figure 4. Grid edge controller schematic diagram.
Figure 5. Structure of a DRL agent trained with the DDPG algorithm.
Figure 6. Structure of training with the MADDPG algorithm.
Figure 7. Electricity consumption scheduling of consumer I.
Figure 8. Power consumption, PV power, and BESS power of consumer I.
Figure 9. Power consumption, PV power, and BESS power of consumer II.
Figure 10. Power consumption, PV power, and BESS power of consumer III.
Figure 11. Daily electricity benefits of consumers.
Figure 12. TU reduction of the whole LVTPA.
Figure 13. The training process of each DDPG agent.
Figure 14. Convergence of the cumulative mean rewards.
Figure 15. Comparison of consumption power between MADDPG and the model-based method in strong light.
Figure 16. Comparison of consumption power between MADDPG and the model-based method in low light.
Table 1. Main parameter configuration of MADDPG.
| Parameter | ε | α_A | γ | α_C | τ | Batch size |
| Value | 0.01 | 0.0005 | 0.97 | 0.001 | 0.001 | 1024 |
Table 2. Time-of-use prices of grid.
| Period | Time | Price (CNY/kW·h) |
| Peak | 08:00–12:00, 20:00–23:00 | 1.1751 |
| Flat | 12:00–20:00 | 0.7054 |
| Valley | 23:00–08:00 | 0.3351 |
Table 3. Scenario data (√: participates; -: does not participate).
| Consumer | Phase | Price-Based DR | Incentive-Based DR | Capacity E_max (kW·h) | Charging Power (kW) | Electricity Benefit (CNY) | Optimized Benefit (CNY) | Subsidy (CNY) |
| I | A | √ | - | 4.5 | 2 | −4.32 | 2.23 | - |
| II | A | √ | √ | 7.0 | 3 | 7.88 | 12.21 | 48 |
| III | A | √ | √ | 8.0 | 4 | 8.16 | 15.16 | 64 |
| IV | B | √ | √ | 9.0 | 3 | 10.26 | 16.11 | 48 |
| V | B | √ | √ | 6.0 | 3 | 6.89 | 13.88 | 48 |
| VI | C | √ | √ | 6.5 | 3 | 7.18 | 13.45 | 48 |
Table 4. Cost of BESS.
| BESS | Cycle Charge and Discharge Times | Installment Cost (CNY/kW·h) | Maintenance Cost (CNY/kW·h) | Salvage Value (CNY/kW·h) | Charge Efficiency | Unit Energy Call Cost (CNY) |
| Lithium ion | 800 | 1300 | 200 | 300 | 0.90 | 1.5 |
| Lithium iron phosphate | 3000 | 2300 | 200 | 400 | 0.93 | 0.7 |
| Lithium iron titanate | 10,000 | 4000 | 200 | 500 | 0.95 | 0.37 |
Publisher’s Note: MDPI stays neutral with regard to jurisdictional claims in published maps and institutional affiliations.
