1. Introduction
China has announced to the world its goal of achieving carbon neutrality by 2060, which reflects its responsibility as a major country and further underscores the important position of ecological civilization construction in the national strategy. The key to achieving China's carbon neutrality goal is building a clean, low-carbon, recycling economic system and a green, carbon-reducing, secure, and highly efficient energy system [1]. To continuously promote the energy revolution and achieve the carbon peaking and carbon neutrality goals, the National Energy Administration plans for the country to more than double its installed wind and solar power capacity over the next 10 years [2].
Wind and photovoltaic power generation are characterized by randomness, fluctuation, and intermittence. If such stochastic wind or photovoltaic power is connected directly to the grid, it brings great instability to the power grid. To address this problem, energy storage has been employed in renewable energy power stations and has proven to be an effective means of fluctuation smoothing [3] and peak cutting and valley filling [4]. In [5], a pumped storage power station was used to establish an optimization model that mitigates the risks associated with excessive power fluctuations in wind power generation. Ref. [6] equips energy communities with energy storage systems so that surplus energy can be sold to energy retailers when consumers' supply exceeds their demand, and energy can be bought from the storage system when it falls short. Forming a combined wind-power storage system that effectively compensates for the shortcomings of renewable energy generation is important for the stability of the power grid [7].
Currently, some scholars have conducted research on applications of the ESS in tracking new energy power generation schedules. In [8], an integrated control approach combining internal energy coordination control and multi-objective optimization control was adopted to realize the power tracking control of photovoltaic power stations. In [9], a charge/discharge control strategy for an ESS with five control parameters was established, and a method of optimizing the control coefficients in real time with the particle swarm optimization algorithm was proposed to track the power generation schedule. Ref. [10] proposed an optimal control technique for power flow control of hybrid renewable energy systems that combines the whale optimization algorithm and an artificial neural network; the simulation results show that the proposed technique successfully resolves the optimal power flow problem of the hybrid system. In [11], a fuzzy model predictive control scheme for the ESS was proposed, and simulations based on the historical operation data of a photovoltaic plant show that the method is flexible and adaptive. Although these methods can effectively solve the tracking control problem, they can only be used for a fixed power generation schedule, which makes it difficult to adapt dynamically to the random fluctuations of wind and solar output and to achieve online control. For the multi-time-scale scheduling problem, the above methods easily fall into local optima because of the curse of dimensionality.
Reinforcement learning is an adaptive, model-free machine learning method with a good ability to extract features from historical data; it can avoid the problems of uncertainty modeling and the curse of dimensionality [12]. Reinforcement learning has now been applied to the energy scheduling problem [13,14]. In [15], the Q-learning algorithm was used to minimize the generation cost of photovoltaic power installed in a microgrid, and the results indicate that Q-learning was superior to a rule-based heuristic algorithm. In [16], deep reinforcement learning theory was applied to integrated energy distribution; it can respond dynamically to uncertainties in the environment and improve the economy of system operation. In [17], an improved K-means algorithm was used to group energy storage units, and the multi-agent deep deterministic policy gradient (MADDPG) algorithm was then used to control the grouped multi-agent system. The experiments showed that the suggested scheduling strategy can suppress the fluctuations in the wind power output and improve the operational efficiency of the hybrid system. The application of reinforcement learning reduces costs and minimizes fluctuations in hybrid storage systems, and it provides a good way to track generation plans.
In this paper, a charging and discharging control strategy for an energy storage system based on the PPO algorithm is proposed. The major contributions of the paper are as follows:
(1) A charge and discharge power control method for the energy storage system is proposed based on deep reinforcement learning. It can adapt to different power generation plans.
(2) The K-means algorithm is used to address the fact that the control parameters differ under different weather conditions.
(3) The charge/discharge power limit of the energy storage system and the residual capacity limit of the energy storage system are considered.
This paper proposes an effective scheduling scheme based on reinforcement learning. The whole scheme is detailed in the following sections. First, the mathematical model, constraint conditions, and objective function of the hybrid system are established in Section 2; then, the state, action, and reward required by the PPO algorithm and the optimization process are introduced in Section 3; next, the proposed control method is applied to a photovoltaic power station in Section 4; finally, the whole paper is summarized in Section 5.
2. Problem Formulation
The photovoltaic and energy storage hybrid system includes a photovoltaic power generation system, a control center, and an ESS. The structure of the hybrid system is described in Figure 1. The photovoltaic power station delivers the day-ahead forecast power to the dispatching center as the power generation plan every day. If the actual power generation deviates too much from the power generation plan, it needs to be adjusted through the ESS.
Ideally, the power generation of a photovoltaic power station would equal the power generation plan. In practice, however, photovoltaic power generation is affected by solar radiation and other meteorological factors, so the actual generation and the plan are never exactly equal; there are always deviations, which is not conducive to the stability of the power grid. It is therefore necessary to connect an energy storage system to regulate the power generation of the photovoltaic station. To ensure stable operation of the power grid, a simple approach is to make the output of the hybrid system (photovoltaic generation plus energy storage) track the power generation plan as closely as possible. When the photovoltaic generation is smaller than the plan, the energy storage system discharges to provide the missing power; when it is larger than the plan, the energy storage system charges to absorb the excess power.
The mathematical model of the hybrid system can be established as (1):

$$P_{hyb,t} = P_{pv,t} + P_t, \qquad E_t = \begin{cases} (1-\delta)E_{t-1} - \eta_c P_t \Delta t, & P_t < 0 \\ (1-\delta)E_{t-1} - P_t \Delta t / \eta_d, & P_t \ge 0 \end{cases} \qquad (1)$$

where $E_t$ is the residual capacity of the ESS at the end of time $t$ (MW·h); $P_t$ is the charging/discharging power of the ESS at time $t$, with charging power negative and discharging power positive; $\delta$ is the self-discharge rate of the ESS; $\eta_c$ and $\eta_d$ are the charging and discharging efficiencies of the ESS; $\Delta t$ is the time interval; $P_{pv,t}$ is the photovoltaic power generation at time $t$; and $P_{hyb,t}$ denotes the total output of the hybrid system.
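The capacity transition in Equation (1) can be sketched as a short function. This is a minimal illustration under the paper's sign convention (charging power negative, discharging positive); the function name and all numeric parameter values (self-discharge rate, efficiencies, 15-min interval) are assumptions for the example, not values from the paper.

```python
def ess_update(E_prev, P, dt=0.25, delta=0.001, eta_c=0.95, eta_d=0.95):
    """Residual capacity of the ESS after one interval, per Equation (1).

    P < 0: charging  -> only a fraction eta_c of the absorbed energy is stored.
    P >= 0: discharging -> 1/eta_d of the delivered energy leaves the storage.
    """
    if P >= 0:
        return (1 - delta) * E_prev - P * dt / eta_d
    return (1 - delta) * E_prev - eta_c * P * dt
```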
To keep the ESS running healthily over the long term and to save costs, the design of the hybrid system should satisfy the following constraints.
- 1.
Residual capacity constraint of the ESS:

$$S_{\min} E_R \le E_t \le S_{\max} E_R \qquad (2)$$

where $S_{\min}$ and $S_{\max}$ are the lower and upper limits of the state of charge, and $E_R$ is the rated capacity of the ESS.
- 2.
Charge/discharge power constraint of the ESS:

$$-P_{\max} \le P_t \le P_{\max} \qquad (3)$$

where $P_{\max}$ is the maximum charge/discharge power value, and $P_t$ can take any real number in this range.
To make the residual capacity still satisfy the constraint after charging/discharging, $P_t$ should satisfy the following requirements:

$$P_t \ge \frac{(1-\delta)E_{t-1} - S_{\max} E_R}{\eta_c \Delta t} \qquad (4)$$

$$P_t \le \frac{\eta_d \left[ (1-\delta)E_{t-1} - S_{\min} E_R \right]}{\Delta t} \qquad (5)$$
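Combining the capacity constraint (2), the power constraint (3), and the post-transition requirements (4) and (5) gives a feasible power interval that the controller can clip its actions to. The sketch below assumes the capacity limits are expressed as fractions $S_{\min}$ and $S_{\max}$ of the rated capacity; the function name and all numeric values are illustrative, not from the paper.

```python
def feasible_power_bounds(E_prev, E_R, S_min=0.1, S_max=0.9, P_max=5.0,
                          dt=0.25, delta=0.001, eta_c=0.95, eta_d=0.95):
    """Feasible [P_lo, P_hi] for the next interval, per Equations (2)-(5)."""
    # Charging bound (4): E_t = (1-delta)*E_prev - eta_c*P*dt must stay <= S_max*E_R.
    p_lo = ((1 - delta) * E_prev - S_max * E_R) / (eta_c * dt)
    # Discharging bound (5): E_t = (1-delta)*E_prev - P*dt/eta_d must stay >= S_min*E_R.
    p_hi = eta_d * ((1 - delta) * E_prev - S_min * E_R) / dt
    # Intersect with the rated power constraint (3).
    return max(p_lo, -P_max), min(p_hi, P_max)
```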
During the tracking of the generation plan, the target power curve is the planned output curve (day-ahead forecast of photovoltaic power generation) issued by the dispatch center. Taking the time interval to be 15 min as an example, there are 96 time periods in a day, and each period corresponds to a planned output value. The objective function in this paper is composed of (1) the deviation between the power generation of the hybrid photovoltaic-storage system and the power generation plan, and (2) the deviation between the residual capacity and the ideal capacity, as shown in (6). The first part of the objective function describes the tracking effect of the hybrid system, and the second part describes the economics of energy storage.
$$F = \lambda_1 \sum_{t=1}^{96} \left( P_{pv,t} + P_t - P_{plan,t} \right)^2 + \lambda_2 \sum_{t=1}^{96} \left( E_t - E_{ideal} \right)^2 \qquad (6)$$

where $E_{ideal}$ is the ideal capacity of the ESS, $P_{plan,t}$ is the power generation plan for each time interval, $\lambda_1$ and $\lambda_2$ are the weight coefficients, and $\lambda_1 + \lambda_2 = 1$.
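A minimal sketch of evaluating the objective (6) over one day, assuming each argument is a NumPy array of 96 interval values; the weight values below are illustrative, chosen only to satisfy $\lambda_1 + \lambda_2 = 1$.

```python
import numpy as np

def objective(p_pv, p_ess, p_plan, E, E_ideal, lam1=0.7, lam2=0.3):
    """Objective of Equation (6): weighted tracking deviation plus
    weighted capacity deviation, summed over the 96 intervals."""
    tracking = np.sum((p_pv + p_ess - p_plan) ** 2)   # hybrid output vs. plan
    capacity = np.sum((E - E_ideal) ** 2)             # residual vs. ideal capacity
    return lam1 * tracking + lam2 * capacity
```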
This work aims to design a charge/discharge power control method for an ESS that satisfies two requirements simultaneously: (1) the power generation of the hybrid system follows the power generation plan as closely as possible; (2) the residual capacity of the ESS stays close to the ideal capacity while satisfying the constraint conditions.
3. Power Generation Control Strategy Based on the PPO Algorithm
The flow chart is shown in Figure 2. First, establish the mathematical model of the hybrid system, the constraint conditions of each variable, and the optimization objective function, all of which are defined in Section 2. Use the K-means algorithm to divide different days into k classes based on mean, standard deviation, and kurtosis; then, set the state, action, and reward required by the PPO algorithm. Next, the selected action is continually optimized by the PPO algorithm in different scenarios. Finally, the output power of the hybrid system can follow the power generation plan under the optimal regulation of the ESS.
3.1. Scenario Clustering Based on the K-Means Algorithm
As weather conditions vary, the generation power of the photovoltaic power station differs considerably from day to day, which complicates the control. Before designing the control algorithm, it is therefore essential to cluster the scenarios of different days.
Because there is no predefined classification for the power generation plans of different days, a clustering algorithm is chosen to group similar plans from the original set into one class. The K-means algorithm is widely used by researchers because of its simple principle and fast convergence, so it is chosen as the clustering algorithm for power generation plans in this study.
In this paper, the mean, standard deviation, and kurtosis of the power generation plan are taken as characteristics to divide the different power generation plans into different scenarios. All three metrics measure characteristics of the generation schedule curve. The mean represents the average of the generation schedule over the 96 time points per day (the photovoltaic power output is measured every 15 min, so there are 96 data points in a day). The standard deviation reflects the degree of dispersion of the generation schedule. The mean and the standard deviation are the two most important measures of central tendency and dispersion. Kurtosis measures the steepness of the probability distribution of the generation schedule. The K-means clustering algorithm is an iterative cluster analysis algorithm [18]. By calculating the distance between the characteristics of different generation plans, similar plans are clustered into one scenario. The clustering process is as follows:
- (i)
Determine the clustering features: The mean, standard deviation, and kurtosis of the power generation plan are taken as clustering features.
- (ii)
Select cluster center: Select k objects from the data as the initial clustering center.
- (iii)
Calculate the distances from each set of features to all cluster centers, and assign each sample to the cluster whose center is nearest.
- (iv)
The center of each cluster is the average of all the features in the corresponding cluster.
- (v)
Calculate the clustering cost function.
- (vi)
Stop the calculation if the cost function is below a certain threshold value or the improvement over the previous iteration is below a certain tolerance.
After the above steps, different daily power generation plans in the historical data are divided into k scenarios. Because different scenarios have different control parameters, a reinforcement learning algorithm is adopted to choose a more appropriate charging/discharging strategy in each scenario.
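The clustering steps (i)-(vi) above can be sketched with plain NumPy. The features follow the text (mean, standard deviation, and kurtosis over the 96 daily points); the function names, the Fisher kurtosis definition, and the random initialization details are assumptions of this sketch, not the paper's implementation.

```python
import numpy as np

def daily_features(plans):
    """One feature row (mean, std, kurtosis) per daily plan of 96 points."""
    mu = plans.mean(axis=1)
    sd = plans.std(axis=1)
    kurt = ((plans - mu[:, None]) ** 4).mean(axis=1) / sd ** 4 - 3.0
    return np.column_stack([mu, sd, kurt])

def kmeans(X, k, iters=100, seed=0):
    """Steps (ii)-(vi): iterate assignment and center updates until stable."""
    rng = np.random.default_rng(seed)
    centers = X[rng.choice(len(X), k, replace=False)]        # step (ii)
    for _ in range(iters):
        dist = np.linalg.norm(X[:, None] - centers[None], axis=2)
        labels = dist.argmin(axis=1)                          # step (iii)
        new = np.array([X[labels == j].mean(axis=0) if np.any(labels == j)
                        else centers[j] for j in range(k)])   # step (iv)
        if np.allclose(new, centers):                         # steps (v)-(vi)
            break
        centers = new
    return labels, centers
```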
3.2. Tracking Control Based on Reinforcement Learning
Reinforcement learning is a self-learning mechanism that establishes the mapping relationship between environmental states and actions by training agents to constantly interact with the environment [19], as shown in Figure 3.
Reinforcement learning regards learning as a process of exploration and exploitation. The agent chooses an action based on the environmental information. After the environment receives the action, the state changes accordingly and generates reward or punishment feedback for the agent. The agent then chooses the next action according to the reinforcement signal (reward) and the observed state of the current environment, and repeats this process until the last state or the end condition is reached.
During constant interactions with the environment, the agent constantly learns the optimal control strategy that maximizes the total reward value in the whole process. In reinforcement learning, policy-based approaches are more applicable for continuous state and action space problems than value-based approaches.
The PPO algorithm was proposed by Schulman et al. at OpenAI [20]. It can quickly learn the correct strategy in complex scenarios and handles continuous action and state spaces. Among the many reinforcement learning algorithms, PPO has the advantages of strong adaptability and stable training. Since the residual capacity of the energy storage system and the generation schedule are continuous variables, the PPO algorithm is suitable for this study. It is used to optimize the charging and discharging power decisions of the energy storage system, so that the output of the photovoltaic power system, under the regulation of the energy storage system, follows the power generation schedule as closely as possible.
As the PPO algorithm is adopted to make sequential decisions, the corresponding state space, action space, and reward must be set according to the problem to be solved. The mathematical model of the hybrid system is analyzed in Section 2. The corresponding state space, action space, and reward function are set as follows.
- (1)
State space
The definition of the state space is shown in (7):

$$s_t = \left[ E_{t-1},\; P_{f,t},\; P_{plan,t} \right] \qquad (7)$$

The residual capacity of the ESS $E_{t-1}$, the ultra-short-term forecast of the photovoltaic power generation $P_{f,t}$, and the power generation plan $P_{plan,t}$ (day-ahead forecast of photovoltaic power generation) are selected as the state space.
- (2)
Action space
The charging/discharging power of the ESS $P_t$ is selected as the action space, as shown in (8):

$$a_t = P_t \qquad (8)$$
- (3)
Reward function
The reward function in this paper is composed of the objective function. Since reinforcement learning aims to maximize the reward, the negative of the objective function is taken as the reward function, as shown in (9):

$$r_t = -\left[ \lambda_1 \left( P_{pv,t} + P_t - P_{plan,t} \right)^2 + \lambda_2 \left( E_t - E_{ideal} \right)^2 \right] \qquad (9)$$
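Assuming the reward is the negative of the per-interval terms of the objective (6), one step of Equation (9) can be sketched as follows; the weight values are illustrative:

```python
def reward(p_pv, p_ess, p_plan, E, E_ideal, lam1=0.7, lam2=0.3):
    """Per-step reward of Equation (9): maximizing it minimizes both the
    tracking deviation and the capacity deviation for the interval."""
    return -(lam1 * (p_pv + p_ess - p_plan) ** 2
             + lam2 * (E - E_ideal) ** 2)
```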
Algorithm 1 shows the process of the PPO algorithm, and the output power determination procedure based on the PPO algorithm is shown in Figure 4. The neural network is trained offline; after training is completed, the state is fed directly into the trained network to complete the online application.
Algorithm 1. PPO algorithm.
Train:
- 1. Initialize: policy parameter $\theta$, replay buffer Ɓ, and the number of iterations N
- 2. for i = 1 to N do
- 3.     Initialize the environmental information $s_1$
- 4.     for t = 1 to T do
- 5.         Sample action $a_t$ according to $\pi_\theta(a_t \mid s_t)$
- 6.         Calculate the reward $r_t$ and observe the next state $s_{t+1}$
- 7.         Store transition $(s_t, a_t, r_t, s_{t+1})$ in Ɓ
- 8.         Compute advantages $\hat{A}_t$ with Equation (10)
- 9.     end for
- 10.    for epoch = 1 to K do
- 11.        Sample mini-batches from Ɓ
- 12.        Update $\theta$ by the gradient method with Equations (11) and (12)
- 13.    end for
- 14.    Clear Ɓ
- 15. end for
- 16. Save the policy parameter $\theta$
Test:
- 17. Initialize $s_1$
- 18. for t = 1 to T do
- 19.    Sample $a_t$ according to $\pi_\theta(a_t \mid s_t)$
- 20.    Execute $a_t$, and update the environment state to $s_{t+1}$
- 21. end for
The advantage function $\hat{A}_t$ is used to evaluate the advantage of taking the current action in the current state over taking the average action, as shown in Equation (10):

$$\hat{A}_t = -V(s_t) + r_t + \gamma r_{t+1} + \cdots + \gamma^{T-t+1} r_{T-1} + \gamma^{T-t} V(s_T) \qquad (10)$$

where $T$ is the maximum length of a trajectory, $\gamma$ is a discount factor, and $V(s_t)$ denotes the value expectation of state $s_t$.
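Equation (10) can be computed directly from a stored trajectory. In this sketch, `rewards` holds $r_1 \ldots r_T$ for one trajectory and `values` holds the critic estimates for the visited states plus the final state, so the last entry provides the bootstrap term; the names and layout are assumptions of the sketch.

```python
def advantages(rewards, values, gamma=0.99):
    """Advantage of Equation (10):
    A_t = -V(s_t) + r_t + gamma*r_{t+1} + ... + gamma^(T-t)*V(s_T).
    `values` has one more entry than `rewards` (the final state value)."""
    T = len(rewards)
    out = []
    for t in range(T):
        ret = sum(gamma ** (k - t) * rewards[k] for k in range(t, T))
        ret += gamma ** (T - t) * values[T]   # bootstrap with the final state value
        out.append(ret - values[t])
    return out
```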
The parameter update formula of the PPO algorithm is as follows:

$$\theta_{k+1} = \arg\max_{\theta} \; \mathbb{E}_{s,a \sim \pi_{\theta_k}} \left[ L(s, a, \theta_k, \theta) \right] \qquad (11)$$

where $L$ is the objective function:

$$L(s, a, \theta_k, \theta) = \min\!\left( \frac{\pi_\theta(a \mid s)}{\pi_{\theta_k}(a \mid s)} \hat{A}_t,\; \operatorname{clip}\!\left( \frac{\pi_\theta(a \mid s)}{\pi_{\theta_k}(a \mid s)},\, 1-\varepsilon,\, 1+\varepsilon \right) \hat{A}_t \right) \qquad (12)$$

where $\varepsilon$ is the maximum allowed difference between the old and new probability ratios, and $\hat{A}_t$ is the advantage function. When $\hat{A}_t > 0$, if $\pi_\theta(a \mid s)/\pi_{\theta_k}(a \mid s) > 1+\varepsilon$, the upper limit value is $(1+\varepsilon)\hat{A}_t$; when $\hat{A}_t < 0$, if $\pi_\theta(a \mid s)/\pi_{\theta_k}(a \mid s) < 1-\varepsilon$, the lower limit value is $(1-\varepsilon)\hat{A}_t$.
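The clipped objective of Equation (12) is easy to verify numerically. The sketch below evaluates it for arrays of probability ratios and advantages; averaging over a mini-batch is an assumption about how the expectation in Equation (11) is estimated.

```python
import numpy as np

def clipped_surrogate(ratio, adv, eps=0.2):
    """PPO objective of Equation (12): min(r*A, clip(r, 1-eps, 1+eps)*A),
    averaged over the sampled (ratio, advantage) pairs."""
    clipped = np.clip(ratio, 1.0 - eps, 1.0 + eps) * adv
    return float(np.mean(np.minimum(ratio * adv, clipped)))
```

For a positive advantage the ratio's contribution is capped at $1+\varepsilon$, and for a negative advantage it is floored at $1-\varepsilon$, exactly as described above.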