1. Introduction
In the past ten years, incentive-based DR has developed rapidly in the form of load shedding and load shifting, which can greatly improve the flexibility of the grid. According to the National Energy Administration (NEA), BESS capacity will exceed 100 GW·h by 2030 [1]. On the one hand, residential and commercial BESS are among the best DR resources and are increasing sharply with the growth of distributed PV, accounting for more than 50% of newly installed PV capacity [2]. On the other hand, the daily operation of BESS is designed only for storing photovoltaic power and saving electricity costs, without considering idle use, such as participation in DR programs. Thus, it is necessary to investigate an optimal operation method for BESS that takes DR into consideration.
However, demand response mechanisms, both domestically and internationally, are not yet mature, and demand response resources remain scarce compared to peak loads. Grid edge control technology has the potential to transform the power industry, offering customers more choices, higher efficiency, and more comprehensive decarbonization, as well as better economic benefits for stakeholders across the value chain [3]. Grid edge control is at the critical point of the adoption curve: residents, industry, and regulators are preparing to connect to digitally distributed resources. Therefore, grid edge control technology could be a feasible way to expand DR resources in a manner that matches demand.
By reducing peak load demand and shifting load to low-price, off-peak periods, price-based DR helps reduce customer bills [4]. Recent studies have mainly focused on scheduling home power consumption using price-based DR. In [5], Nilsson et al. controlled smart home appliances using a home energy management system (HEMS) and tested the energy-saving potential of smart homes in Sweden. The results showed that energy consumption varies greatly across households, indicating that households respond to demand with a high degree of independence. In incentive-based demand response, each customer receives compensation for reducing power demand according to a uniform price determined by the utility company [6]. For example, in [7], the authors aimed to maximize the combined benefits of demand response service providers and users, improved the demand response model for power users, and solved the resulting strategy using a deep Q-network. In fact, because consumers differ in their price sensitivity, a uniform price cannot stimulate all demand response resources. Therefore, ref. [8] established a Stackelberg game model in which a single electricity retailer and multiple users responded to incentive-based demand. In this model, when the spot market electricity price was higher than the retailer's selling price, the retailer developed a demand response subsidy strategy to reduce its losses on electricity sales; in the corresponding period, each user determined the response power according to the subsidy price set by the retailer to obtain additional profit. Incentive-based DR was thus proposed as an important supplement to price-based DR. However, these papers did not integrate the two DR programs. In addition, in the existing literature, DR has always been treated as an independent procedure, with decisions made without any information about the operation of the low-voltage distribution network. In reality, load reduction or load shifting inevitably changes the operating variables of the station power system, such as active power, thereby increasing its uncertainty. It is therefore necessary to integrate information about the power operation of the station area in order to reduce the uncertainty in demand response decision making. For example, in an LVTPA, changes in load alter the three-phase unbalance, which affects the power loss and power quality of the transformer [9].
However, applying conventional algorithms to schedule the appliances of several consumers with mixed-integer variables is difficult due to the increased computational complexity and high-dimensional data. For instance, [10] scheduled large-scale residential appliances for DR using a computational algorithm based on the cutting-plane method, but did not consider convergence or computation time. Recently, a large number of studies on residential DR have shown that reinforcement learning (RL) can tackle such mixed-integer programming problems [11]. Deep reinforcement learning (DRL) is an active area of research because it can find the best operating strategy by means of trial and error in an external environment. Refs. [12,13] implemented batch RL to optimize the operation strategy of heating, ventilation, and air conditioning systems, and found that an open-loop schedule was required to participate in the day-ahead market. Batch RL collects empirical data to train the RL agent, and is popular and critical in areas where it is difficult to build a simulator or where simulation costs are very high. In addition, the SARSA algorithm is an on-policy RL approach, and was applied to thermal comfort control of vehicle cabins based on a Markov decision process (MDP) in [14]. In contrast, Q-learning is an off-policy RL approach that saves the learned information in the Q matrix. In [15], the authors focused on the DR of residential and small commercial buildings and studied optimal energy consumption and storage using Q-learning, the most popular RL method. Despite its superior convergence and robustness, Q-learning still suffers from limitations in handling continuous states and actions [16,17]. Furthermore, building on deep Q-networks (DQNs), which combine deep networks and Q-learning by means of the value function [18], the deep deterministic policy gradient (DDPG) algorithm, which is based primarily on the actor–critic method [19], is able to overcome the weakness of DQN in large action spaces. In [20], DDPG was implemented for the state of charge (SOC) management of multiple electrical energy storage systems, enabling continuous actions and smoother SOC control. However, when the application scenario includes many decision variables, DDPG performs worse than the multi-agent deep deterministic policy gradient (MADDPG) algorithm, in which each agent represents a demand unit.
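For reference, the distinction between the on-policy SARSA update and the off-policy Q-learning update mentioned above can be summarized by their standard tabular update rules (textbook formulations, not taken from the cited works):

\begin{align}
\text{SARSA:} \quad & Q(s_t,a_t) \leftarrow Q(s_t,a_t) + \alpha\left[r_t + \gamma\, Q(s_{t+1},a_{t+1}) - Q(s_t,a_t)\right],\\
\text{Q-learning:} \quad & Q(s_t,a_t) \leftarrow Q(s_t,a_t) + \alpha\left[r_t + \gamma \max_{a'} Q(s_{t+1},a') - Q(s_t,a_t)\right],
\end{align}

where $\alpha$ is the learning rate and $\gamma$ is the discount factor; SARSA bootstraps from the action $a_{t+1}$ actually taken by the current policy, whereas Q-learning bootstraps from the greedy action.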
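To illustrate the structural difference between DDPG and MADDPG referred to above, the following minimal sketch (an illustrative assumption, not the implementation used in this paper; network sizes, observation dimensions, and action dimensions are hypothetical) shows the MADDPG pattern of one decentralized actor per demand unit combined with a centralized critic that sees all agents' observations and actions during training:

```python
import torch
import torch.nn as nn

# Minimal MADDPG structure sketch (assumed, illustrative only):
# each demand unit has its own actor mapping a local observation to a
# bounded continuous action; a centralized critic scores the joint
# observation-action vector during training.

class Actor(nn.Module):
    def __init__(self, obs_dim: int, act_dim: int, hidden: int = 64):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(obs_dim, hidden), nn.ReLU(),
            nn.Linear(hidden, act_dim), nn.Tanh(),  # bounded continuous action
        )

    def forward(self, obs: torch.Tensor) -> torch.Tensor:
        return self.net(obs)

class CentralizedCritic(nn.Module):
    """Q(o_1..o_N, a_1..a_N): conditioned on all agents' observations and actions."""
    def __init__(self, total_obs_dim: int, total_act_dim: int, hidden: int = 64):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(total_obs_dim + total_act_dim, hidden), nn.ReLU(),
            nn.Linear(hidden, 1),
        )

    def forward(self, all_obs: torch.Tensor, all_act: torch.Tensor) -> torch.Tensor:
        return self.net(torch.cat([all_obs, all_act], dim=-1))

if __name__ == "__main__":
    n_agents, obs_dim, act_dim = 3, 8, 2   # e.g., 3 BESS demand units (assumed sizes)
    actors = [Actor(obs_dim, act_dim) for _ in range(n_agents)]
    critic = CentralizedCritic(n_agents * obs_dim, n_agents * act_dim)

    obs = torch.randn(1, n_agents, obs_dim)  # one joint observation sample
    acts = torch.cat([actors[i](obs[:, i]) for i in range(n_agents)], dim=-1)
    q_value = critic(obs.reshape(1, -1), acts)
    print(q_value.shape)  # torch.Size([1, 1])
```

In this pattern, the critics are only needed during training; at execution time each actor acts on its own local observation, which is what makes the approach scalable to many demand units.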
In summary, the gaps in the current research can be listed as follows.
- (1) The BESS, which is regarded as one of the best DR resources, is not fully utilized in existing DR schemes.
- (2) DR mechanisms and resources are not fully developed, which limits the performance of DR programs.
- (3) The current DR models do not consider the relationship between DR and the LVTPA.
- (4) There is still a lack of algorithms that can solve large-scale DR problems efficiently.
Therefore, inspired by previous works, this paper proposes a novel DRL-based grid edge controller for BESS, aiming to increase the operational benefits of BESS and reduce the three-phase unbalance (TU). The main contributions are summarized as follows:
- (1) Compared to current DR management, a novel autonomous DR method that considers the BESS within the LVTPA is proposed, significantly enhancing the capability of DR.
- (2) The proposed method takes the idle capacity of the BESS into consideration, rather than only its normal utilization.
- (3) The proposed method enables the LVTPA to access information about the local BESS in order to perform global optimization, significantly improving the efficiency of power consumption.
- (4) The proposed DRL-based grid edge controller combines incentive-based edge control with price-based edge control and achieves better performance than conventional methods.
The remainder of this paper is organized as follows. Section 2 details the general structure of the DRL-based grid edge controller and describes the LVTPA. Section 3 formulates the operation problem as a Markov game, and the optimal strategy is learned by applying the MADDPG algorithm. The numerical study is given in Section 4. Section 5 concludes the paper.