Article

System Frequency Control Method Driven by Deep Reinforcement Learning and Customer Satisfaction for Thermostatically Controlled Load

1 State Grid Hubei Electric Power Research Institute, Wuhan 430077, China
2 Department of Electrical Engineering, Shanghai University of Electric Power, Shanghai 201306, China
3 State Grid Hubei Electric Power Co., Ltd., Wuhan 430072, China
* Author to whom correspondence should be addressed.
Energies 2022, 15(21), 7866; https://doi.org/10.3390/en15217866
Submission received: 18 September 2022 / Revised: 16 October 2022 / Accepted: 18 October 2022 / Published: 24 October 2022
(This article belongs to the Topic Artificial Intelligence and Sustainable Energy Systems)

Abstract

The intermittence and fluctuation of renewable energy aggravate power fluctuations in the grid and pose a severe challenge to the frequency stability of the power system. Thermostatically controlled loads can participate in grid frequency regulation owing to their flexibility. To overcome the limited adjustment capability of traditional control methods while maintaining a positive experience for customers, a deep reinforcement learning control strategy based on the soft actor–critic framework is proposed that takes customer satisfaction into account. Firstly, the energy storage index and the discomfort index of different users are defined. Secondly, the fuzzy comprehensive evaluation method is applied to assess customer satisfaction. Then, multi-agent models of thermostatically controlled loads are established based on the soft actor–critic algorithm. The models are trained using the local information of the thermostatically controlled loads, the comprehensive evaluation index fed back by users and the frequency deviation. After training, each agent can coordinate the response of thermostatically controlled loads to the system frequency while relying only on local information. The simulation results show that the proposed strategy not only reduces frequency fluctuation but also improves customer satisfaction.

1. Introduction

With the increasing proportion of renewable energy in the power grid, the characteristics of intermittence and fluctuation will bring considerable challenges to the active power balance and frequency stability of the power grid [1]. The traditional power system maintains the balance of the system by adjusting the output of the generating side units. The regulation method is relatively simple and will generate additional economic and environmental costs [2]. In addition, with the increase in power load and the extensive access to renewable energy, the regulation capacity of the power generation side gradually decreases [3]. The power system with renewable energy as the main body can utilize advanced information technology to integrate and dispatch demand-side resources to provide a variety of auxiliary services [4,5]. Therefore, reasonable control of demand-side resources can supplement the traditional system frequency regulation, and thus enhance the stability of the power system [6].
Among demand-side resources, the thermostatically controlled load (TCL) is a kind of electric equipment controlled by a thermostat, which converts electricity to heating or cooling at an adjustable temperature; examples include heat pumps, electric storage water heaters (ESWHs), refrigerators, and heating, ventilation and air conditioning (HVAC) systems [7]. TCLs are suitable for providing frequency regulation services for three main reasons. Firstly, they are widely distributed in residential, commercial and industrial buildings, offering substantial adjustment potential. Secondly, they have sufficient thermal storage capacity and can be regarded as distributed energy storage equipment. Thirdly, their control is flexible, so they can respond to the power demand of the system in time [8]. Therefore, to fully exploit the frequency regulation potential of flexible demand-side resources and keep the grid frequency within a certain offset range, in-depth research on the control strategy of large-scale demand-side TCLs is necessary.
In the current research, there are three main control methods for TCLs participating in ancillary services: centralized control, decentralized control and hybrid control [9,10,11]. In centralized control, the control center sends control signals to all controlled loads, but a large number of communication channels must be built, leading to high control costs. Hu et al. [9] established a hierarchical centralized load tracking control framework that coordinates heterogeneous demand-side TCL aggregators and uses a state–space model for modeling. Decentralized control delegates the load control decision to the local control terminal, where procedures or thresholds are pre-set; when the demand-side device detects an important parameter change, the load acts according to the pre-set strategy. Because the decision is made locally, the communication demand is low and the response speed is fast; however, the control effect is largely influenced by user behavior and the error of the detection device. Delavari and Kamwa [10] applied a multi-objective optimization approach to optimize each load setting, reducing the amount of load response required and triggering loads based on the frequency response index under decentralized control. Hybrid control combines the features of centralized and decentralized control, establishing a framework of “centralized parameter setting–decentralized decision making” that coordinates large-scale users and grid control centers through load aggregators (LAs). Song et al. [11] built a two-stage control model based on hybrid control to participate in energy market trading. Also based on hybrid control, Wang et al. [12] used TCLs to mitigate PV and load variations in microgrid communities. These methods require a communication network between the control center and all aggregators, increasing the cost and difficulty of demand-side load control.
In the research on TCL participation in ancillary services, Ref. [13] built a dynamic model and verified, using direct load control, the performance of a variable-frequency heat pump in providing frequency regulation services. That work mainly studied the dynamic response of a single air-conditioning system, with less focus on the coordinated control of large-scale air-conditioning loads. Ref. [14] established a virtual energy storage model of variable-frequency air conditioning, shielded part of the model information through a hierarchical control framework and simplified the downlink control by using a unified broadcast signal; however, the adjustable capacity of the air-conditioning cluster is sacrificed to achieve this simplification. There are two main control modes for TCLs, namely direct switching and temperature setting [15]. Ref. [16] realized frequency adjustment based on direct load switching. The advantage of this method is that the tracking accuracy is high and the influence on user comfort is low within the range of the load regulation capability. The disadvantage is that when the indoor temperatures of the loads are concentrated near the temperature boundary, the equipment is switched on and off frequently, which not only prevents completion of the adjustment task but also reduces the service life of the equipment [17]. Temperature setting avoids these disadvantages, but its limitation is that the power tracking performance depends on the designed controller, examples of which include a minimum variance controller [18], a sliding mode controller [19] and an internal model controller [20]. In addition, the temperature varies over a wide range, which affects user comfort [21]. Pallonetto et al. [22] established a residential building energy management system (EMS) based on a combination of optimization techniques and machine learning models; this EMS reduces energy consumption while maintaining thermal comfort. It is therefore important to consider consumer satisfaction in load response control in order to mobilize users’ motivation to participate in demand response.
The above studies all adopt a single control method to schedule and control TCLs, but a single method often fails to meet the application requirements, not only because of its inherent defects but also because of the differing requirements of users. Direct switching is suitable for loads with strict temperature requirements; here, the user expects the start–stop frequency of the equipment to be kept low to extend its service life. Loads with loose temperature requirements are suitable for temperature setting; here, the user expects the temperature change to stay within a certain range to preserve comfort. Therefore, combining the two control modes is of practical significance. Ref. [23] proposed a hybrid control strategy based on a parallel structure that improves the tracking accuracy of the system and reduces the number of equipment switching events; however, the temperature varies widely, which reduces user comfort. In recent years, deep reinforcement learning has provided a new solution to the frequency control problem of power systems.
With its strong search and learning ability, deep reinforcement learning has the potential for online optimization of decision making when facing complex nonlinear frequency control problems. In [24,25], the Q-learning algorithm is used to realize cooperative control of distributed generation units, thus eliminating the frequency deviation of the system. Ref. [26] combined electric water heater buffer models with domain randomization to reduce the initialization time of Q-learning in demand response control. Ruelens et al. [27] applied batch reinforcement learning to coordinate the power consumption of users with thermostatically controlled loads. However, Q-learning can only select discrete control actions from low-dimensional action domains, so it cannot deal with problems involving continuous variables [28]. Ref. [29] proposed a deep reinforcement learning algorithm that acts on a continuous action domain, thus realizing adaptive load frequency control. An energy management scheme for air-conditioning control based on the deep deterministic policy gradient (DDPG) algorithm is proposed in [30]; however, this algorithm is only suitable for the optimal control of a single generator set, not for the control of large-scale thermostatically controlled loads. In [31], a distributed soft actor–critic (DSAC)-based data-driven frequency control method is proposed; the DSAC model estimates the distribution of the value function over returns instead of only its mean. Based on entropy regularization, the DSAC method learns faster than traditional expectation-based reinforcement learning methods.
In view of the above problems, we take the thermostatically controlled load as the frequency control object and, based on deep reinforcement learning, propose a frequency stability control method with TCL participation that considers customer satisfaction. Firstly, considering the operating characteristics of TCLs with different control types, the energy storage index and the discomfort index are established, and the fuzzy comprehensive evaluation method is used to assess customer satisfaction. Then, to realize cooperative frequency control of large-scale TCLs, a multi-agent control model is established based on the soft actor–critic algorithm to enable continuous-action control of the TCLs. Through the multi-agent reinforcement learning model considering customer satisfaction, the frequency response control of each TCL cluster can be coordinated online. The main contributions of this paper are as follows:
  • The influence of consumer satisfaction on the frequency response of TCLs is considered, and indicators reflecting satisfaction for TCLs with different control modes are established.
  • For the problem of frequency response control of large-scale TCLs, this paper proposes a deep reinforcement learning control method based on the SAC algorithm. The method relies only on local load and frequency data to achieve real-time cooperative control of large-scale TCLs, which reduces the communication burden in the scheduling process.
The remainder of this paper is organized as follows. In Section 2, the TCL dynamic model and control methods are formulated. Section 3 is concerned with the comprehensive control index of TCLs considering customer satisfaction. In Section 4, the frequency response control of TCLs based on SAC deep reinforcement learning is modeled. Case studies are provided in Section 5. Finally, conclusions are summarized in Section 6.

2. TCL Dynamic Model and Control Methods

2.1. TCL Dynamic Model

The first-order ordinary differential equation model considering the indoor environment, outdoor environment and building characteristics has high accuracy and simple calculation, and is widely used in practice [32,33]. The state variable $T_i$ and the virtual switch variable $s_i$ are introduced into the model. The operating characteristics of the i-th TCL in cooling mode can be expressed as:
$$\frac{dT_i(k)}{dk}=\frac{1}{C_i R_i}\left(T_\infty(k)-T_i(k)-s_i(k)R_iP_i\right) \qquad (1)$$
where $s_i(k)$ evolves according to:
$$s_i(k+\Delta k)=\begin{cases}0, & s_i(k)=1\ \text{and}\ T_i(k)\le T_i^{\min}\\ 1, & s_i(k)=0\ \text{and}\ T_i(k)\ge T_i^{\max}\\ s_i(k), & \text{otherwise}\end{cases} \qquad (2)$$
$$T_i^{\min}=T_i^{set}-\frac{\delta}{2},\qquad T_i^{\max}=T_i^{set}+\frac{\delta}{2} \qquad (3)$$
where $T_\infty(k)$ and $T_i(k)$ are the outdoor and indoor temperatures, respectively; $C_i$, $R_i$ and $P_i$ are the equivalent heat capacity, equivalent heat resistance and energy transfer rate of the i-th TCL, respectively; $s_i(k)$ indicates the load switch state, with $s_i(k)=1$ for the on state and $s_i(k)=0$ for the off state; $T_i^{\max}$ and $T_i^{\min}$ are the upper and lower temperature limits during load operation; $T_i^{set}$ is the temperature setting value; $\delta$ is the constant temperature dead zone interval; and $k$ and $\Delta k$ are the operation time and the control period, respectively. Solving Equation (1) gives:
$$T_i(k)=T_\infty(k)-s_i(k)R_iP_i-\left(T_\infty(k)-s_i(k)R_iP_i-T_i(0)\right)e^{-\frac{k}{C_iR_i}} \qquad (4)$$
where $T_i(0)$ is the initial indoor temperature. For a load cluster composed of N TCLs, the aggregate power consumption is the sum of the rated powers of all switched-on loads:
$$P_{total}(k)=\sum_{i=1}^{N}P_i^{n}s_i(k) \qquad (5)$$
$$P_i^{n}=\frac{P_i}{\eta_i} \qquad (6)$$
where $P_i^n$ is the rated power of the i-th TCL and $\eta_i$ is its energy conversion efficiency coefficient.
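To make the dynamics concrete, the following minimal sketch simulates a cluster of hysteretic TCLs under Equations (1)–(6). The parameter distributions follow Table 1; the 1 s time step, the simulation horizon and the random seed are assumptions for illustration only.

```python
import numpy as np

# Minimal sketch: N hysteretic TCLs in cooling mode, per Equations (1)-(6).
# Parameter distributions follow Table 1; dt and the horizon are assumptions.
rng = np.random.default_rng(0)
N, dt, steps = 2000, 1.0, 3600
C = rng.normal(2.0, 0.1, N)          # equivalent heat capacity C_i
R = rng.normal(2.0, 0.1, N)          # equivalent thermal resistance R_i
P = rng.normal(14.0, 0.1, N)         # energy transfer rate P_i
eta, T_out, T_set, delta = 2.5, 32.0, 20.0, 1.2
T_lo, T_hi = T_set - delta / 2, T_set + delta / 2

T = rng.uniform(T_lo, T_hi, N)       # initial temperatures uniform in the dead zone
s = rng.integers(0, 2, N).astype(float)

P_total = np.empty(steps)
for k in range(steps):
    # Equation (4): exact one-step solution of the first-order dynamics
    T_ss = T_out - s * R * P                       # steady state for the current switch state
    T = T_ss - (T_ss - T) * np.exp(-dt / (C * R))
    # Equation (2): hysteretic switching at the dead-zone boundaries
    s = np.where((s == 1) & (T <= T_lo), 0.0, s)
    s = np.where((s == 0) & (T >= T_hi), 1.0, s)
    # Equations (5)-(6): aggregate electrical power of switched-on loads
    P_total[k] = np.sum((P / eta) * s)
```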
Figure 1 shows the frequency response model of the power system with TCLs participating in frequency regulation, where $T_{Ga}$ and $T_{Gb}$ are the time constants of the governor and the turbine, respectively. The governor and the turbine form the instantaneous characteristic compensation link, expressed by a lead–lag transfer function with time constants $T_1$ and $T_2$. $T_R$ is the TCL response delay time constant, $T_c$ is the communication delay time constant and $R_{eq}$ is the unit adjustment rate. $\Delta P_G$, $\Delta P_L$, $H$ and $D$ represent the total output power, the disturbance power, the inertia time constant and the load damping coefficient of the system, respectively. $\Delta f$ is the frequency deviation.

2.2. TCL Control Methods

The control methods of TCL are mainly divided into direct switch control and temperature setting control. The TCL operation characteristics of the direct switch are shown in Figure 2a. The temperature setting value of the load remains unchanged, and the dispatching command directly acts on the equipment switch during the operation time k = k 0 . The advantage of this method is that it can accurately track the power within the adjustable temperature range and has little impact on the user’s comfort. However, when the indoor temperature approaches the temperature boundary, it will cause frequent switching, thus reducing the service life of the equipment. The TCL operating characteristics of the temperature setting are shown in Figure 2b. The dispatching command increases the temperature setting value at time k = k 0 . Since the temperature dead zone of the load is unchanged, its operating range will change, thus indirectly changing the switching state of the equipment. Since the indoor temperature of the load is always uniformly distributed in the temperature dead zone, the indoor temperature will not approach the temperature boundary, but it has a notable impact on the user’s comfort, and its tracking effect depends on the designed controller.
In practical application, the reasonable distribution of power regulation can make load clusters with different control modes cooperate with each other, which can not only realize accurate tracking of power, but also avoid their limitations and meet the different needs of users. Many factors need to be considered when allocating the power regulation amount. On the one hand, it is necessary to meet the requirements of the power system for frequency regulation and ensure the tracking accuracy within a certain range. On the other hand, for the load with direct switch, the number of switches should be reduced as much as possible to prolong the service life of the equipment. For the load with temperature setting, the temperature change should be reduced as much as possible to improve the user’s comfort. There is often a highly nonlinear relationship between these factors and power distribution, and the frequency regulation requires high real-time performance. Therefore, the conventional optimization method cannot obtain the optimal power allocation.

3. Comprehensive Control of TCL Considering Customer Satisfaction

3.1. Calculation of TCL Load Regulation Index

The temperature setting value of a direct-switch load is almost constant, so the influence on user comfort is negligible; the control acts directly on the equipment switch. When the indoor temperature is close to the temperature boundary, the load regulation capacity decreases, the start–stop frequency of the equipment rises and the service life is shortened. When the indoor temperature is close to the temperature setting value, the load regulation capacity rises, the start–stop frequency is low and the service life is extended. To characterize the regulation ability of TCLs and provide a basis for evaluating equipment service life, we borrow the definition of battery state of charge and define the energy storage index $C_s$ for a TCL cluster under direct switch control in cooling mode:
$$C_s=\frac{2}{\delta}\left|\sum_{i=1}^{N}\frac{T^{\max}-T_i}{N}-\frac{\delta}{2}\right|,\qquad T_i\in\left(T_i^{\min},T_i^{\max}\right) \qquad (7)$$
According to this definition, the closer $C_s$ is to 0, the closer the indoor temperatures are to the temperature set value: the temperature distribution of the TCLs is relatively uniform, the adjustable potential is large and switching is infrequent. The closer $C_s$ is to 1, the closer the indoor temperatures are to the upper and lower temperature limits: the temperature distribution is more concentrated, the adjustable potential is small and switching is more frequent. Therefore, when the agent outputs a control command, it should keep $C_s$ as close to 0 as possible to reduce the start–stop frequency of the equipment.
For temperature-setting control, the tracking performance depends on the designed controller, and because the setting value changes, the temperature varies over a wide range, which reduces user comfort. To characterize user comfort, the discomfort index $C_u$ is defined for a TCL cluster under temperature setting control as:
$$C_u=\left|T_i^{set}-T_i^{set}(0)\right| \qquad (8)$$
where $T_i^{set}(0)$ is the initial temperature setting value. According to this definition, the more the temperature setting value deviates from its initial value, the higher the user's discomfort. Therefore, when the agent outputs a control command, it should keep $C_u$ as close to 0 as possible to reduce user discomfort.
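As a quick illustration, the two indices can be computed directly from cluster measurements. The sketch below assumes a homogeneous cluster (a single scalar $T^{\max}$ and dead zone $\delta$), which is how Equation (7) reads; the function names are ours.

```python
import numpy as np

def energy_storage_index(T, T_max, delta):
    """Equation (7): C_s for a direct-switch TCL cluster in cooling mode.
    0 -> indoor temperatures spread evenly about the set point (large potential);
    1 -> temperatures bunched at a dead-zone boundary (frequent switching)."""
    return (2.0 / delta) * abs(np.mean(T_max - np.asarray(T)) - delta / 2.0)

def discomfort_index(T_set, T_set0):
    """Equation (8): C_u for a temperature-setting TCL cluster."""
    return abs(T_set - T_set0)

# Example: temperatures uniform in [19.4, 20.6] with delta = 1.2 give C_s near 0.
# C_s = energy_storage_index(T, T_max=20.6, delta=1.2)
```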

3.2. Customer Satisfaction Assessment Method Based on Regulation Index

According to the above analysis, users’ needs are different under different control methods. In order to comprehensively evaluate customer satisfaction, the fuzzy comprehensive evaluation (FCE) method is adopted for evaluation. The specific operations are as follows.
(1)
Construct the factor set of customer satisfaction $U=\{C_s, C_u\}$.
(2)
Build a customer satisfaction evaluation set V = {satisfied, more satisfied, general, less satisfied, dissatisfied}.
(3)
Determine the weight of each factor. Since the factor set consists of the two factors $C_s$ and $C_u$, which are equally important to users, the weight vector is $A=[a_1, a_2]=[0.5, 0.5]$.
(4)
The fuzzy evaluation matrix is then established. First, the degree to which each factor belongs to each comment is evaluated. Since the factors can reasonably be treated as normally distributed, a Gaussian membership function is selected:
$$r_{sp}(y_s)=e^{-\left(\frac{y_s-u_{sp}}{\sigma_{sp}}\right)^2} \qquad (9)$$
where $y_s$ is the input of the s-th factor ($C_s$ or $C_u$), and $u_{sp}$ and $\sigma_{sp}$ are the mean and standard deviation associated with the s-th factor and the p-th comment, respectively.
The fuzzy evaluation matrix R is then:
$$R=\begin{bmatrix}r_{11} & r_{12} & r_{13} & r_{14} & r_{15}\\ r_{21} & r_{22} & r_{23} & r_{24} & r_{25}\end{bmatrix} \qquad (10)$$
(5)
Fuzzy comprehensive evaluation is carried out. The fuzzy evaluation set is:
$$B=A\circ R=\begin{bmatrix}b_1 & b_2 & b_3 & b_4 & b_5\end{bmatrix} \qquad (11)$$
where $\circ$ denotes the fuzzy matrix composition operation. Since the weighted-average fuzzy synthesis operator has an obvious weighting effect and a strong degree of synthesis, and can make full use of the information in R, the element $b_p$ is:
$$b_p=\min\left(1,\sum_{s=1}^{2}a_s r_{sp}\right) \qquad (12)$$
(6)
Evaluate customer satisfaction. To make the grade continuous and quantitative, the ranks corresponding to the elements of B are set to 1, 2, 3, 4 and 5, and the customer satisfaction m is defined as:
$$m=\frac{b_1+2b_2+3b_3+4b_4+5b_5}{b_1+b_2+b_3+b_4+b_5} \qquad (13)$$
According to the definition of m, the smaller m is, the higher the user’s satisfaction.
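A compact sketch of steps (1)–(6) follows; the function name is ours, and the comment means and common standard deviation are taken from Table 2.

```python
import numpy as np

def fce_satisfaction(Cs, Cu, u=(0, 0.25, 0.5, 0.75, 1.0), sigma=0.2,
                     weights=(0.5, 0.5)):
    """Fuzzy comprehensive evaluation of customer satisfaction, steps (1)-(6).
    Smaller return values mean higher satisfaction."""
    y = np.array([Cs, Cu])                 # factor set U = {C_s, C_u}
    u = np.asarray(u)                      # comment means u_sp (Table 2)
    # Equation (9): Gaussian membership -> 2 x 5 fuzzy evaluation matrix R
    R = np.exp(-((y[:, None] - u[None, :]) / sigma) ** 2)
    # Equation (12): weighted-average fuzzy synthesis operator, capped at 1
    b = np.minimum(1.0, np.asarray(weights) @ R)
    # Equation (13): continuous satisfaction grade m
    return float(np.arange(1, 6) @ b / b.sum())

# e.g. fce_satisfaction(Cs=0.1, Cu=0.2) -> a grade between 1 (best) and 5 (worst)
```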
In practical applications, the agent should not only consider customer satisfaction but also meet the power system's frequency regulation requirements, that is, ensure the tracking accuracy stays within a certain range. To evaluate the tracking performance of the system, the root mean square error index $E_{RMS}$ is defined as:
$$E_{RMS}=\frac{\sqrt{\sum_{\Delta k=1}^{N_s}e(\Delta k)^2/N_s}}{P_{target}^{\max}-P_{target}^{\min}}\times 100\% \qquad (14)$$
where $N_s$ is the number of control periods $\Delta k$, $e(\Delta k)$ is the error signal in a control period, and $P_{target}^{\max}$ and $P_{target}^{\min}$ are the maximum and minimum values of the tracking power signal, respectively.
According to this definition, the smaller $E_{RMS}$ is, the higher the tracking accuracy of the system. To comprehensively evaluate the regulation effect and provide a basis for optimizing the power distribution signal, the comprehensive evaluation index J is defined as:
$$J=(1-\lambda)E_{RMS}+\lambda m \qquad (15)$$
where $\lambda$ is the proportion of satisfaction.
In practice, priority should be given to ensuring the frequency stability of the power grid: customer satisfaction is considered only when the tracking accuracy is within a certain range. The relationship between $\lambda$ and $E_{RMS}$ is:
$$\lambda=\begin{cases}G_1, & 0<E_{RMS}\le F_1\\ G_2, & F_1<E_{RMS}\le F_2\\ G_3, & F_2<E_{RMS}\le F_3\\ 0, & E_{RMS}>F_3\end{cases} \qquad (16)$$
where $F_1$, $F_2$, $F_3$, $G_1$, $G_2$ and $G_3$ are constants that can be set according to actual operating conditions.
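The following sketch evaluates Equations (14)–(16) over one control period; the function name is ours, and the default thresholds and weights follow Table 2.

```python
import numpy as np

def comprehensive_index(e, P_target, m,
                        F=(0.02, 0.03, 0.05), G=(0.8, 0.5, 0.3)):
    """Equations (14)-(16): tracking error E_RMS, satisfaction weight lambda
    and comprehensive evaluation index J over one control period."""
    e, P_target = np.asarray(e, float), np.asarray(P_target, float)
    # Equation (14): normalized root-mean-square tracking error (as a fraction)
    E_rms = np.sqrt(np.mean(e ** 2)) / (P_target.max() - P_target.min())
    # Equation (16): piecewise satisfaction weight from Table 2
    if E_rms <= F[0]:
        lam = G[0]
    elif E_rms <= F[1]:
        lam = G[1]
    elif E_rms <= F[2]:
        lam = G[2]
    else:
        lam = 0.0
    # Equation (15): weighted combination of accuracy and satisfaction
    return (1 - lam) * E_rms + lam * m
```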

4. Frequency Response Control of TCL Based on SAC Deep Reinforcement Learning

4.1. A Deep Reinforcement Learning Model of Soft Actor–Critic

Reinforcement learning is adaptive learning through an agent's trial and error. The agent interacts with the environment continuously, observing the environmental state and taking actions that change it. The agent receives rewards or punishments that guide the update of the model parameters, so as to maximize the cumulative reward through continuous learning. Through this perceive–act–evaluate loop, the agent continuously acquires knowledge from the interaction, adjusts and improves its action strategy to adapt to the environment, and finally produces a better task execution policy. The interaction with the environment is generally described by a Markov decision process (MDP) composed of the five-tuple (S, A, P, r, $\gamma$): state space S, action space A, state transition probability P, reward function r and discount factor $\gamma$.
In this paper, deep reinforcement learning based on the soft actor–critic framework is used to control the frequency response of TCLs. The framework of the proposed control model is shown in Figure 3. In the iterative calculation at time t, the actor first generates the action $a_t$ through the policy network according to the observed operating state $s_t$ of the TCL cluster. The TCL cluster then performs its state transition under this control strategy and reaches the state $s_{t+1}$ at the next time step. Meanwhile, the system environment calculates the reward $r(s_t, a_t)$ at time t and feeds it back to the agent, which records $(s_t, a_t, r(s_t, a_t), s_{t+1})$ in the experience pool. The sampled action of the actor and the system state are then input to the critic, which outputs the action value function $Q(s_t, a_t)$ to evaluate the strategy. This process is carried out cyclically, and the actor and the critic update their neural network parameters through gradient descent, realizing adaptive learning of the model. During training, the accumulated return of the agent over the response period gradually increases and eventually stabilizes. By introducing a maximum entropy exploration incentive, the SAC algorithm improves robustness and accelerates training, enabling accurate and effective control decisions for large-scale thermostatically controlled loads in a complex power supply-and-demand environment.
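The interaction loop described above can be sketched as follows. The `env` and `agent` objects and their interfaces are assumptions standing in for the TCL frequency response environment and the SAC policy; only the experience-pool mechanics follow the text.

```python
import random
from collections import deque

# Experience pool holding (s_t, a_t, r_t, s_{t+1}) transitions, as described above.
buffer = deque(maxlen=100_000)

def collect_episode(env, agent, horizon=200):
    """One episode of actor-environment interaction; env/agent are assumed objects."""
    s = env.reset()
    for _ in range(horizon):
        a = agent.act(s)                  # policy network samples an action
        s_next, r = env.step(a)           # TCL cluster transitions; reward fed back
        buffer.append((s, a, r, s_next))  # record the transition in the pool
        s = s_next

def sample_batch(batch_size=256):
    """Uniformly sample a minibatch for the gradient updates of Section 4.2."""
    return random.sample(buffer, min(batch_size, len(buffer)))
```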

4.2. SAC Deep Reinforcement Learning Method

4.2.1. SAC Objective Function

The objective function of SAC requires the policy to maximize the policy entropy while maximizing the cumulative return, so as to avoid greedy sampling during learning and falling into local optima. Accordingly, the objective is constructed as:
$$\pi^{*}=\arg\max_{\pi}\sum_{t=1}^{T}\mathbb{E}_{(s_q,a_q)\sim\rho_\pi}\left[r(s_q,a_q)+\alpha H\left(\pi(\cdot|s_q)\right)\right] \qquad (17)$$
where $\mathbb{E}[\cdot]$ is the expectation, $\pi$ is a policy, $s_q$ is the state of the q-th agent, $a_q$ is the TCL action and $r(s_q,a_q)$ is the reward function of the q-th agent. $(s_q,a_q)\sim\rho_\pi$ denotes the state–action trajectory generated by the policy $\pi$. $\alpha$ is the temperature term, which determines the influence of the entropy on the reward. $H(\pi(\cdot|s_q))$ is the entropy of the policy in the given state, calculated as shown in Equation (18).
$$H\left(\pi(a_q|s_q)\right)=-\int_{a_q}\pi(a_q|s_q)\log\pi(a_q|s_q)\,da_q=\mathbb{E}_{a_q\sim\pi}\left[-\log\pi(a_q|s_q)\right] \qquad (18)$$

4.2.2. SAC Iteration Strategy

The value function used for policy evaluation, $Q(s_q,a_q)$, is shown in Equation (19). The Bellman backup operator used for policy updating is shown in Equation (20).
$$Q(s_q,a_q)=r(s_q,a_q)+\gamma\,\mathbb{E}_{s_{q+1}\sim p}\left[Q(s_{q+1},a_{q+1})\right] \qquad (19)$$
$$\mathcal{T}^{\pi}Q(s_q,a_q)\triangleq r(s_q,a_q)+\gamma\,\mathbb{E}_{s_{q+1}\sim p}\left[V(s_{q+1})\right] \qquad (20)$$
where $\mathbb{E}_{s_{q+1}\sim p}$ is the expectation over the state transition, $\mathcal{T}^{\pi}$ is the Bellman backup operator under the policy $\pi$ and $\gamma$ is the discount factor of the reward. $V(s_{q+1})$ is the soft value function of the state, calculated as shown in the following equation.
$$V(s_q)=\mathbb{E}_{a_q\sim\pi}\left[Q(s_q,a_q)-\log\pi(a_q|s_q)\right] \qquad (21)$$
The soft Q-iteration is then:
$$Q^{k+1}=\mathcal{T}^{\pi}Q^{k} \qquad (22)$$
where $Q^k$ is the value function at the k-th iteration.
Repeatedly applying Equations (20) and (22) yields the convergence result in Equation (23):
$$\lim_{k\to\infty}Q^{k}=\hat{Q} \qquad (23)$$
where $\hat{Q}$ is the soft Q-value.
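A toy numerical check of Equations (19)–(23): on a small random MDP with a fixed policy, repeated soft Bellman backups converge to a fixed point. The MDP itself is fabricated purely for illustration.

```python
import numpy as np

# Toy MDP: 4 states, 2 actions, fixed uniform policy; assumed for illustration.
rng = np.random.default_rng(1)
nS, nA, gamma = 4, 2, 0.9
r = rng.uniform(-1, 1, (nS, nA))                 # reward r(s, a)
P = rng.dirichlet(np.ones(nS), size=(nS, nA))    # transition p(s' | s, a)
pi = np.full((nS, nA), 1 / nA)                   # fixed policy pi(a | s)

Q = np.zeros((nS, nA))
for _ in range(500):
    # Equation (21): soft value V(s) = E_a[Q(s, a) - log pi(a | s)]
    V = np.sum(pi * (Q - np.log(pi)), axis=1)
    # Equations (20) and (22): soft Bellman backup Q <- T^pi Q
    Q = r + gamma * P @ V
# Q has now converged to the soft Q-value of Equation (23) for this policy.
```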

4.2.3. SAC Policy Update

The policy is updated during the calculation as shown in Equation (24):
$$\pi_{new}=\arg\min_{\pi'\in\Pi}D_{KL}\left(\pi'(\cdot|s_q)\,\Bigg\|\,\frac{\exp\left(\frac{1}{\alpha}Q^{\pi_{old}}(s_q,\cdot)\right)}{Z^{\pi_{old}}(s_q)}\right) \qquad (24)$$
where $D_{KL}(\cdot\|\cdot)$ is the Kullback–Leibler divergence, $\Pi$ is the policy set and $Q^{\pi_{old}}(\cdot)$ is the value function under the old policy $\pi_{old}$. $Z^{\pi_{old}}(s_q)$ is the partition function under the old policy, which normalizes the distribution.

4.2.4. Construction of SAC Algorithm

The SAC algorithm requires the construction of neural networks, namely a Q-value network and a policy network. The Q-value network outputs a single value through several neural network layers, and the policy network outputs a Gaussian distribution. Both networks are updated during training: the Q-value network parameters are updated according to Equation (25), and the policy network parameters according to Equation (26).
$$J_Q(\theta)=\mathbb{E}_{(s_q,a_q,s_{q+1})\sim\mathcal{D}}\left[\frac{1}{2}\left(Q_\theta(s_q,a_q)-\left(r(s_q,a_q)+\gamma V_{\bar{\theta}}(s_{q+1})\right)\right)^2\right] \qquad (25)$$
$$J_\pi(\phi)=\mathbb{E}_{s_q\sim\mathcal{D}}\left[D_{KL}\left(\pi_\phi(\cdot|s_q)\,\Big\|\,\exp\left(\frac{1}{\alpha}Q_\theta(s_q,\cdot)-\log Z(s_q)\right)\right)\right] \qquad (26)$$
where $\theta$ is the Q-value network parameter, $\phi$ is the policy network parameter, $\mathcal{D}$ is the experience pool, $V_{\bar{\theta}}$ is the target value function and $Q_\theta$ is the value function parameterized by the Q-value network. $Z(s_q)$ is the partition function of the state.
The temperature parameter $\alpha$ is important in maximizing the entropy, which drives exploration of the action space; a reasonable temperature setting helps the agent iteratively test all feasible actions. The temperature parameter is updated according to Equation (27):
$$J(\alpha)=\mathbb{E}_{a_q\sim\pi_q}\left[-\alpha\log\pi_q(a_q|s_q)-\alpha H_0\right] \qquad (27)$$
where $\pi_q$ is the control strategy of the q-th agent and $H_0$ is the target entropy.
Throughout the process, the Q-value network parameters, policy network parameters and temperature parameter are updated continuously by minimizing Expressions (25)–(27) with gradient descent, which allows the model to converge and the optimal policy to be solved.
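For concreteness, here is a minimal single-critic PyTorch sketch of one update step for Equations (25)–(27). Practical SAC uses twin critics and soft target-network updates, which are omitted here; `q_net`, `q_target` and `policy` are assumed user-defined modules, with `policy(s)` returning a reparameterized action and its log-probability.

```python
import torch
import torch.nn.functional as F

def sac_update(batch, q_net, q_target, policy, log_alpha,
               q_opt, pi_opt, a_opt, gamma=0.99, target_entropy=-1.0):
    """One gradient step on Equations (25)-(27); a sketch, not the authors' code."""
    s, a, r, s_next = batch
    alpha = log_alpha.exp()

    # Equation (25): soft Bellman residual for the Q-network
    with torch.no_grad():
        a_next, logp_next = policy(s_next)
        v_next = q_target(s_next, a_next) - alpha * logp_next  # soft V(s'), Eq. (21)
        y = r + gamma * v_next
    q_loss = F.mse_loss(q_net(s, a), y)
    q_opt.zero_grad(); q_loss.backward(); q_opt.step()

    # Equation (26): policy loss (the KL objective rewritten via reparameterization)
    a_new, logp = policy(s)
    pi_loss = (alpha.detach() * logp - q_net(s, a_new)).mean()
    pi_opt.zero_grad(); pi_loss.backward(); pi_opt.step()

    # Equation (27): temperature update toward the target entropy H_0
    a_loss = -(log_alpha * (logp.detach() + target_entropy)).mean()
    a_opt.zero_grad(); a_loss.backward(); a_opt.step()
```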

4.3. Design of SAC Deep Reinforcement Learning Model for TCL Frequency Response

In this paper, the SAC algorithm is used to solve the control strategy problem of large-scale thermostatically controlled loads participating in system frequency regulation. The structural model of the proposed control method is shown in Figure 3. The agent in the figure is based on a deep neural network. The environment observed by the controller comprises the frequency deviation $\Delta f$ of the power system, the derivative and integral of $\Delta f$, the baseline power signal $P_{base}$ of the aggregated TCLs, the aggregated power consumption $P_{total}$ and the automatic generation control signal $P_{AGC}$. The automatic generation control signal is a series of positive and negative power signals representing the active power deviation between supply and demand in the power system. The baseline power signal of the aggregated TCLs is the sum of the rated powers $P_{i.set}^{n}$ of the TCL cluster at a given temperature setting value:
$$P_{base}=\sum_{i=1}^{N}P_{i.set}^{n} \qquad (28)$$
$$P_{i.set}^{n}=\frac{T_\infty-T_i^{set}}{\eta_i R_i} \qquad (29)$$
The tracking power signal $P_{target}$ is generated by superimposing $P_{AGC}$ on $P_{base}$, and the tracking error signal e by subtracting $P_{total}$ from $P_{target}$, that is:
$$P_{target}=P_{AGC}+P_{base} \qquad (30)$$
$$e=P_{target}-P_{total} \qquad (31)$$
The agent obtains the optimized response power of the TCL cluster from the environmental information. The direct-switch control cluster and the temperature-setting control cluster then complete the adjustment task according to the control signals they receive. The method includes two stages: offline pre-learning and online application. In the offline pre-learning stage, the pre-learning process iteratively updates all parameters of the agent. During each self-learning iteration, the agent explores actions (i.e., generates different commands) to interact with the environment. After exploration, the parameters of the agent are updated according to the system frequency deviation and the reward function of the TCL controller. With an appropriate reward function R and under the environmental constraints, the gradient of the actor (i.e., the gradient of the control objective with respect to the agent's parameters) is calculated and used to update all the parameters of the agent. In the online application stage, each agent computes its action value (i.e., generates commands) for its control cluster from its own observations and learned parameters.
For the frequency response model of large-scale thermostatically controlled loads, the negative of the comprehensive evaluation index J is taken as the reward function of the agent, that is:
$$R=-J=-\left[(1-\lambda)E_{RMS}+\lambda m\right] \qquad (32)$$
By introducing the system frequency deviation and customer satisfaction into the reward function, the obtained control strategy can improve the tracking accuracy of the system, reduce the switching frequency of the equipment, reduce the temperature change and improve the customer satisfaction.
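Under these definitions, the per-step environment bookkeeping reduces to Equations (28)–(32). The sketch below reuses `comprehensive_index()` and `fce_satisfaction()` from the earlier sketches; the function name and argument layout are ours.

```python
import numpy as np

def tracking_signals(P_agc, P_total, T_out, T_set, eta, R_th):
    """Equations (28)-(31): baseline power, tracking target and error signal.
    T_set, eta and R_th are per-TCL arrays; P_agc and P_total are scalars."""
    P_base = np.sum((T_out - np.asarray(T_set)) / (np.asarray(eta) * np.asarray(R_th)))
    P_target = P_agc + P_base          # Equation (30)
    e = P_target - P_total             # Equation (31)
    return P_target, e

# Equation (32): the reward over one control period is R = -J, e.g.
#   m = fce_satisfaction(Cs, Cu)
#   R = -comprehensive_index(e_history, P_target_history, m)
# where e_history and P_target_history collect the signals over the period.
```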

5. Result Analysis

5.1. Example Introduction and Scenario Setting

To verify the effectiveness of the proposed method for large-scale thermostatically controlled loads participating in grid frequency control, we take a distribution network with large-scale thermostatically controlled loads as an example and set the disturbance power of the regional grid to the net load power, that is, the difference between the original load power and the power generated by new energy sources such as photovoltaics and wind power. The disturbance power over the simulation period is shown in Figure 4. As an important thermostatically controlled load, HVAC accounts for a large share of demand and is easy to control and manage; we therefore selected 2000 HVAC units for the simulation experiments. The load parameter settings are shown in Table 1. The initial indoor temperatures of the loads are evenly distributed in the temperature dead zone, which is set to 1.2 °C. $P_{AGC}$ is an actual frequency regulation signal from the PJM power market in the United States, which changes every four seconds.
The parameter settings for the customer satisfaction evaluation and the comprehensive evaluation index are shown in Table 2. When $E_{RMS}$ lies in (0, 2%], the frequency regulation effect of the power system is good and customer satisfaction is the main consideration. When $E_{RMS}$ lies in (2%, 3%], $\lambda$ is 0.5, indicating that the root mean square error index and customer satisfaction have equal impact on the system regulation. When $E_{RMS}$ lies in (3%, 5%], the frequency regulation effect is poor and the root mean square error index is the main consideration. When $E_{RMS}$ exceeds 5%, $\lambda$ is 0, indicating that the system regulation no longer considers customer satisfaction and focuses on improving tracking accuracy.

5.2. Frequency Control Effect Analysis Considering Customer Satisfaction

Figure 5 compares customer satisfaction before and after it is included in the agent's optimization process. As can be seen from Figure 5, the peak value of customer satisfaction increased from 3.5 before optimization to 4.7 after optimization, and within the simulation time range, customer satisfaction after optimization is consistently higher than before optimization. Figure 6 compares the frequency control effect of the proposed method before and after considering customer satisfaction. As can be seen from Figure 6, because the customer satisfaction index is added to the control objective, the algorithm's penalty weight on frequency deviation is relatively reduced, so the maximum frequency deviation of the system increases compared with the case without customer satisfaction. Considering customer satisfaction limits the number of TCLs participating in the frequency response, slightly degrading the system frequency control effect, but the impact is not significant. The proposed method can therefore balance frequency deviation control and customer satisfaction well.

5.3. Frequency Control Effect Analysis Based on SAC Deep Reinforcement Learning

To verify the effectiveness of the proposed SAC deep reinforcement learning algorithm in the cooperative control of large-scale thermostatically controlled loads against the traditional PID method, we conducted simulation experiments on thermostatically controlled loads under both direct switch control and temperature setting control using a traditional PID controller, a PID controller with parameters optimized by the particle swarm optimization (PSO) algorithm, and the algorithm proposed in this paper. The experiments again use 2000 HVAC units. The system frequency control effects of the three control methods are compared in Figure 7.
It can be seen from Figure 7 that under the conventional PID controller, a system frequency deviation of about 0.047 Hz occurs in the period when the disturbance power fluctuates violently. Compared with the traditional PID controller, the frequency control effect improves after PSO-based parameter optimization, but it is still not ideal: the system frequency deviation fluctuates within (−0.037 Hz, 0.038 Hz). The algorithm proposed in this paper keeps the system frequency deviation within (−0.02 Hz, 0.023 Hz) and can significantly improve the frequency stability of the power grid.

5.4. Comparative Analysis of the Algorithm and Traditional Deep Reinforcement Learning

This section shows the advantages of the proposed algorithm over the data-driven deep Q network (DQN) algorithm and the DDPG algorithm in terms of system frequency control effect and convergence speed. The proposed algorithm and DDPG are both deep reinforcement learning algorithms with continuous action spaces. As shown in Figure 8, the frequency control effect of these two algorithms is significantly better than that of the DQN algorithm, which is designed for a discrete action space. By integrating the advantages of model-driven and data-driven approaches, the proposed algorithm further improves the real-time frequency control effect in the continuous action domain compared with the fully trained DDPG algorithm.
Figure 9 compares the iterative convergence of the cumulative reward for the DQN algorithm, the DDPG algorithm and the proposed algorithm. The cumulative rewards of the DQN and DDPG algorithms stabilize after about 250 and 300 iteration cycles, respectively, and do not continue to increase. It is worth noting that each iteration cycle here is an empirical trajectory containing 200 iterations; that is, the two algorithms need 50,000 and 60,000 iterations, respectively, to converge. The proposed algorithm needs only about 150 iteration cycles (30,000 iterations) to complete the parameter training of the deep neural network, and its convergence curve shows the smallest oscillation amplitude.

6. Conclusions

In this paper, considering the support that large-scale demand-side thermostatically controlled loads can provide to grid frequency, a frequency cooperative control method for thermostatically controlled loads considering customer satisfaction is proposed based on soft actor–critic deep reinforcement learning, to solve the frequency control problem of a power system in which large-scale demand-side thermostatically controlled loads participate in frequency regulation. In the case study, a distribution network is taken as the research object, and the performance of different algorithms is compared and verified through time-domain simulation. The simulation results show that, compared with existing deep reinforcement learning methods, the proposed algorithm has clear advantages in system frequency control, customer satisfaction and algorithm training time.
The proposed algorithm mainly addresses the cooperative frequency control problem with large-scale thermostatically controlled loads participating in the distribution network, and does not consider the application of other demand-side flexible resources to system frequency regulation. Future work will explore the application of multi-agent deep reinforcement learning to the frequency response of demand-side flexible resources, based on their operating mechanisms and mathematical modeling.

Author Contributions

Conceptualization, R.C. and G.Y.; Data curation, C.L. and X.Y.; Investigation, H.L. and Y.Z.; Methodology, R.C.; Software, H.L.; Supervision, G.Y. and Y.Z.; Validation, G.Y.; Visualization, X.Y.; Writing—original draft, R.C. and H.L.; Writing—review and editing, C.L., X.Y. and Y.Z. All authors have read and agreed to the published version of the manuscript.

Funding

This work was supported by the Science and Technology Project of State Grid Hubei Electric Power Research Institute: “Research on frequency dynamic prediction and active control strategy of high proportion new energy power system under mutational weather” (project number B31532225680).

Data Availability Statement

The data in this paper are from a real distribution network and involve a confidentiality agreement. The dataset in this paper is not publicly available.

Conflicts of Interest

The authors declare no conflict of interest.

Abbreviations

TCL  thermostatically controlled load
ESWH  electric storage water heater
HVAC  heating, ventilation and air conditioning
LA  load aggregator
EMS  energy management system
DDPG  deep deterministic policy gradient
SAC  soft actor–critic
DQN  deep Q network
DSAC  distributed soft actor–critic
FCE  fuzzy comprehensive evaluation
MDP  Markov decision process
PSO  particle swarm optimization
$T_i$  the indoor temperature of the i-th TCL
$T_\infty$  the outdoor temperature
$C_i$  the equivalent heat capacity of the i-th TCL
$R_i$  the equivalent heat resistance of the i-th TCL
$P_i$  the energy transfer rate of the i-th TCL
$s_i(k)$  the switch state of the i-th TCL
$T_i^{\max}$  the upper limit of the temperature during load operation
$T_i^{\min}$  the lower limit of the temperature during load operation
$T_i^{set}$  the temperature setting value
$\delta$  the temperature dead zone interval
$P_i^n$  the rated power of the i-th TCL
$\eta_i$  the energy conversion efficiency coefficient
$C_s$  the energy storage index for a TCL cluster with direct switch control
$C_u$  the discomfort index for a TCL cluster with temperature setting control
$r_{sp}$  the membership function of the s-th factor and the p-th comment
$y_s$  the input of the s-th factor
$u_{sp}$  the mean of the s-th factor and the p-th comment
$\sigma_{sp}$  the standard deviation of the s-th factor and the p-th comment
$b_p$  the element of the fuzzy evaluation set B
$m$  the customer satisfaction index
$E_{RMS}$  the root mean square error index
$e(\Delta k)$  the error signal in a control period
$P_{target}^{\max}$  the maximum value of the tracking power signal
$P_{target}^{\min}$  the minimum value of the tracking power signal
$J$  the comprehensive evaluation index
$\lambda$  the proportion of satisfaction
$\Delta f$  the frequency deviation of the power system
$P_{base}$  the baseline power signal of the aggregated TCLs
$P_{total}$  the aggregated power consumption of TCLs
$P_{AGC}$  the automatic generation control signal

References

  1. Li, C.; Wu, Y.; Sun, Y.; Zhang, H.; Liu, Y.; Liu, Y.; Terzija, V. Continuous Under-Frequency Load Shedding Scheme for Power System Adaptive Frequency Control. IEEE Trans. Power Syst. 2020, 35, 950–961.
  2. Guan, M. Scheduled Power Control and Autonomous Energy Control of Grid-Connected Energy Storage System (ESS) with Virtual Synchronous Generator and Primary Frequency Regulation Capabilities. IEEE Trans. Power Syst. 2020, 37, 942–954.
  3. Hosseini, S.A.; Toulabi, M.; Dobakhshari, A.S.; Ashouri-Zadeh, A.; Ranjbar, A.M. Delay Compensation of Demand Response and Adaptive Disturbance Rejection Applied to Power System Frequency Control. IEEE Trans. Power Syst. 2020, 35, 2037–2046.
  4. Trovato, V.; Sanz, I.; Chaudhuri, B.; Strbac, G. Advanced control of thermostatic loads for rapid frequency response in Great Britain. IEEE Trans. Power Syst. 2017, 32, 2106–2117.
  5. Bao, Y.; Li, Y.; Hong, Y.; Wang, B. Design of a hybrid hierarchical demand response control scheme for the frequency control. IET Gener. Transm. Distrib. 2015, 9, 2303–2310.
  6. Mallada, E.; Zhao, C.; Low, S. Optimal load-side control for frequency regulation in smart grids. IEEE Trans. Autom. Control 2017, 62, 6294–6309.
  7. Yu, Y.; Quan, L.; Mi, Z.; Lu, J.; Chang, S.; Yuan, Y. Improved Model Predictive Control with Prescribed Performance for Aggregated Thermostatically Controlled Loads. J. Mod. Power Syst. Clean Energy 2022, 10, 430–439.
  8. Wang, J.; Shi, Y.; Zhou, Y. Intelligent demand response for industrial energy management considering thermostatically controlled loads and EVs. IEEE Trans. Ind. Inform. 2019, 15, 3432–3442.
  9. Hu, J.; Cao, J.; Chen, M.Z.Q.; Yu, J.; Yao, J.; Yang, S.; Yong, T. Load Following of Multiple Heterogeneous TCL Aggregators by Centralized Control. IEEE Trans. Power Syst. 2017, 32, 3157–3167.
  10. Delavari, A.; Kamwa, I. Improved Optimal Decentralized Load Modulation for Power System Primary Frequency Regulation. IEEE Trans. Power Syst. 2018, 33, 1013–1025.
  11. Song, M.; Sun, W.; Wang, Y.; Shahidehpour, M.; Li, Z.; Gao, C. Hierarchical Scheduling of Aggregated TCL Flexibility for Transactive Energy in Power Systems. IEEE Trans. Smart Grid 2020, 11, 2452–2463.
  12. Wang, Y.; Tang, Y.; Xu, Y.; Xu, Y. A Distributed Control Scheme of Thermostatically Controlled Loads for the Building-Microgrid Community. IEEE Trans. Sustain. Energy 2020, 11, 350–360.
  13. Kim, Y.J.; Fuentes, E.; Norford, L.K. Experimental Study of Grid Frequency Regulation Ancillary Service of a Variable Speed Heat Pump. IEEE Trans. Power Syst. 2016, 31, 3090–3099.
  14. Song, M.; Gao, C.; Yan, H.; Yang, J. Thermal Battery Modeling of Inverter Air Conditioning for Demand Response. IEEE Trans. Smart Grid 2018, 9, 5522–5534.
  15. Zhao, H.; Wu, Q.; Huang, S.; Zhang, H.; Liu, Y.; Xue, Y. Hierarchical Control of Thermostatically Controlled Loads for Primary Frequency Support. IEEE Trans. Smart Grid 2018, 9, 2986–2998.
  16. Luo, F.; Dong, Z.; Meng, K.; Wen, J.; Wang, H.; Zhao, J. An operational planning framework for large-scale thermostatically controlled load dispatch. IEEE Trans. Ind. Inform. 2017, 13, 217–227.
  17. Zhang, W.; Lian, J.; Chang, C.Y.; Kalsi, K. Aggregated Modeling and Control of Air Conditioning Loads for Demand Response. IEEE Trans. Power Syst. 2013, 28, 4655–4664.
  18. Zhang, R.; Chu, X.; Zhang, W.; Liu, Y. Active Participation of Air Conditioners in Power System Frequency Control Considering Users’ Thermal Comfort. Energies 2015, 8, 10818–10841.
  19. Bashash, S.; Fathy, H.K. Modeling and Control of Aggregate Air Conditioning Loads for Robust Renewable Power Management. IEEE Trans. Control Syst. Technol. 2013, 21, 1318–1327.
  20. Zhao, H.; Zhao, J.; Shu, T.; Pan, Z. Hybrid-Model-Based Deep Reinforcement Learning for Heating, Ventilation, and Air-Conditioning Control. Front. Energy Res. 2021, 8, 610518.
  21. Chen, Z.; Shi, J.; Song, Z.; Yang, W.; Zhang, Z. Genetic Algorithm Based Temperature-Queuing Method for Aggregated IAC Load Control. Energies 2022, 15, 535.
  22. Pallonetto, F.; de Rosa, M.; Milano, F.; Finn, D.P. Demand response algorithms for smart-grid ready residential buildings using machine learning models. Appl. Energy 2019, 239, 1265–1282.
  23. Tindemans, S.H.; Strbac, G. Low-Complexity Decentralized Algorithm for Aggregate Load Control of Thermostatic Loads. IEEE Trans. Ind. Appl. 2021, 57, 987–998.
  24. Ma, K.; Yuan, C.; Liu, Z.; Yang, J.; Guan, X. Hybrid control of aggregated thermostatically controlled loads: Step rule, parameter optimization, parallel and cascade structures. IET Gener. Transm. Distrib. 2016, 10, 4149–4157.
  25. Xi, L.; Li, H.; Zhu, J.; Li, Y.; Wang, S. A Novel Automatic Generation Control Method Based on the Large-Scale Electric Vehicles and Wind Power Integration Into the Grid. IEEE Trans. Neural Netw. Learn. Syst. 2022.
  26. Peirelinck, T.; Hermans, C.; Spiessens, F.; Deconinck, G. Domain Randomization for Demand Response of an Electric Water Heater. IEEE Trans. Smart Grid 2020, 12, 1370–1379.
  27. Ruelens, F.; Claessens, B.J.; Vandael, S.; de Schutter, B.; Babuška, R.; Belmans, R. Residential demand response of thermostatically controlled loads using batch reinforcement learning. IEEE Trans. Smart Grid 2017, 8, 2149–2159.
  28. Liu, H.; Xu, F.; Fan, P.; Liu, L.; Wen, H.; Qiu, Y.; Ke, S.; Li, Y. Load Frequency Control Strategy of Island Microgrid with Flexible Resources Based on DQN. In Proceedings of the 2021 IEEE Sustainable Power and Energy Conference (iSPEC), Nanjing, China, 23–25 December 2021; pp. 632–637.
  29. Yan, Z.; Xu, Y. A multi-agent deep reinforcement learning method for cooperative load frequency control of a multi-area power system. IEEE Trans. Power Syst. 2020, 35, 4599–4608.
  30. Yu, L.; Xie, W.; Xie, D.; Zou, Y.; Zhang, D.; Sun, Z.; Zhang, L.; Zhang, Y.; Jiang, T. Deep Reinforcement Learning for Smart Home Energy Management. IEEE Internet Things J. 2019, 7, 2751–2762.
  31. Xie, J.; Sun, W. Distributional Deep Reinforcement Learning-Based Emergency Frequency Control. IEEE Trans. Power Syst. 2021, 37, 2720–2730.
  32. Yan, Z.; Xu, Y. Data-driven load frequency control for stochastic power systems: A deep reinforcement learning method with continuous action search. IEEE Trans. Power Syst. 2019, 34, 1653–1656.
  33. Acharya, S.; Moursi, M.S.E.; Al-Hinai, A. Coordinated Frequency Control Strategy for an Islanded Microgrid with Demand Side Management Capability. IEEE Trans. Energy Convers. 2018, 33, 639–651.
Figure 1. Frequency modulation model of power system with TCL participation.
Figure 2. Operation characteristics of TCL in different control modes. (a) Direct switch control; (b) temperature setting control.
Figure 3. Frequency response control framework of large-scale TCLs based on deep reinforcement learning.
Figure 4. Variation of disturbance power within 2 h.
Figure 5. Comparison of customer satisfaction before and after optimization.
Figure 6. Frequency deviation before and after considering customer satisfaction.
Figure 7. Comparison of frequency control effect between the proposed algorithm and PID controller.
Figure 8. Comparison of frequency control effect between different algorithms.
Figure 9. Iterative convergence process of cumulative reward value: (a) DQN algorithm; (b) DDPG algorithm; (c) SAC algorithm in this paper.
Table 1. Setting of load parameters.

Parameters        Meaning                                     Value
$C_i$             equivalent heat capacity                    N(2, 0.01)
$R_i$             equivalent thermal resistance               N(2, 0.01)
$P_i$             energy transfer rate                        N(14, 0.01)
$T_\infty$        outdoor temperature                         32 °C
$T_i^{set}(0)$    initial temperature setting                 20 °C
$\delta$          temperature dead zone                       1.2 °C
$\eta_i$          energy conversion efficiency coefficient    2.5

Note: N(2, 0.01) denotes a normal distribution with mean 2 and variance 0.01; the other distributions are analogous.
Table 2. Setting of customer satisfaction evaluation and comprehensive evaluation parameters.

Parameters       Value    Parameters    Value
$u_{s1}$         0        $F_1$         2%
$u_{s2}$         0.25     $F_2$         3%
$u_{s3}$         0.5      $F_3$         5%
$u_{s4}$         0.75     $G_1$         0.8
$u_{s5}$         1        $G_2$         0.5
$\sigma_{sp}$    0.2      $G_3$         0.3
