1. Introduction
Operation mode calculation (OMC) provides the operation boundaries for a power system, which constitute the overall guidance scheme for ensuring its safe and stable operation; it is also a theoretical basis for dispatchers to evaluate the real-time operation status of a power system [1]. Currently, power system OMC is mainly carried out manually. Large power systems are usually divided into small partitions according to geographical regions. Dispatchers first adjust the operation mode of the electrical devices to reach a convergent target power flow state in each partition; then, according to the preset boundary conditions, all the partitions are combined to realize power flow convergence of the whole power system [2]. Specifically, the purpose of OMC is to balance the generation and load consumption of the whole power system. Factors such as seasonal changes, equipment maintenance, and emergencies lead to different load levels. Therefore, it is necessary to formulate operation modes under different load levels to guide generation. The goal of OMC is to ensure the safe and stable operation of the power system.
With the large-scale integration of new energy sources and long-distance, large-capacity transmission into power systems, the scale and complexity of power systems continue to increase, and power system OMC thus faces severe challenges [3]. Differences in the models and scopes of application among power system data make power flow convergence difficult and thus require extensive manual intervention. To solve such problems, Jabr [4] proposed a polyhedral formulation and loop elimination constraint method for piecewise linear programming. Vaishya et al. [5] proposed a method to find a feasible solution to a power flow equation based on piecewise linear programming. However, when the number of buses in a power system is too large, the piecewise linear programming algorithm leads to non-convergence of the power flow calculation. Farivar et al. [6] constructed a convex relaxation model suitable for power flow calculation. Fan et al. [7] extended the expression of the variable space to polynomial functions and established a linearized power flow model for the optimal selection of the variable space. With the large-scale construction of microgrids, it is difficult for traditional power flow calculation methods to account for the hierarchical control effects in microgrids [8]. Ren et al. [9] proposed a general power flow calculation model that can incorporate a hierarchical control scheme into power flow calculations. Guo et al. [10] proposed a method with strong adaptability to power flow nonlinearity in order to improve the calculation accuracy of data-driven power flows under a high penetration of renewable distributed generation; based on Koopman operator theory, the nonlinear relationships in power flow calculations were transformed into linear mappings in a high-dimensional state space, which significantly improved calculation accuracy. Huang et al. [11] proposed a multi-area dynamic optimal power flow model that considers the collaborative optimization of energy reserves, intrinsically unbalanced three-phase networks, and dual control time scales. Smadi et al. [12] modeled the power conversion process by setting up a state estimator and proposed an HVDC hybrid two-step state estimation model.
The above methods have made great contributions to OMC from different angles. However, they are all based on improvements to power flow calculation methods and traditional mechanism-based power system modeling. The operation mode of a large power grid involves the calculation of a high-dimensional nonlinear system, and a data-driven model for this task still needs to be built. In actual OMC, manual decision making is also needed to adjust the power flow, resulting in low efficiency, large errors, and heavy dependence on expert experience.
The emergence of deep reinforcement learning provides an opportunity to solve the above problems, and related studies have gradually applied it to power systems [13,14]. In microgrid power management, Zhang et al. [15] proposed a distributed consensus-based optimization method to train the policy function of supervised multi-agents, converted operating constraints into training gradients, and achieved optimal power management of connected microgrids in power distribution systems. Ye et al. [16] proposed an interior-point policy optimization (IPO) method in which spatio-temporal features were extracted from the operating state of a microgrid to further enhance the generalization ability of IPO. However, OMC needs to consider equipment maintenance, limit conditions, N-1 failures, and other such situations, as well as provide different grid scheduling schemes. Unlike in microgrids, directly computing the optimal decision is computationally prohibitive for large-scale power systems. Therefore, in the actual OMC of power grids, dispatchers often first produce many feasible solutions and then find the relatively optimal solution through an evaluation system. In addition, a great number of important and interesting research results have been published on photovoltaic power forecasting [17], power generation control [18,19,20], reactive power optimization [21], frequency control [22,23], load control [24], and economic dispatch [25]. However, the above studies were all based on the data of one or several operation modes for power system component control and decision making, and there has been little research on OMC itself. Slow training speeds, low decision-making efficiency, and poor adjustment accuracy limit the application of deep reinforcement learning methods to OMC.
The application of deep reinforcement learning to large-scale power system OMC needs to solve the following three problems. Firstly, the system state space and action space are discrete; the component parameters to be adjusted are all discrete variables with only a few adjustment states. Secondly, the power system has thousands of buses, so the state space and action space are high-dimensional. Finally, the model needs a certain generalization ability and applicability so that it can adapt to power systems with different numbers of buses. The deep Q-network (DQN) has shown good performance in solving decision-making problems with high-dimensional discrete state and action spaces. At the same time, DQN uses neural networks to fit the Q value function, so it adapts well to different hyperparameter settings and can readily address the above three problems. Another problem caused by a high-dimensional decision space is the dramatic increase in computational cost. Therefore, we improved the DQN by redesigning the system state space, action space, reward function, and action mapping strategy to reduce the computational burden and speed up the network training process.
The contributions of this paper are summarized as follows:
- (1) The power flow adjustment problem in OMC is expressed as a Markov decision process (MDP), in which the generator power adjustment and line switching in power systems are considered. The state space, action space, and reward function are designed to conform to the rules of a power system, and the minimum adjustment value of the generator output power is set as 5% of its upper limit;
- (2) An improved deep Q-network (improved DQN) method is proposed to solve the MDP problem; it improves the action mapping strategy for generator power adjustment to reduce the number of adjustments and speed up the DQN training process;
- (3) OMC experiments with a power system with eight basic load levels and six N-1 faults are designed; the simulation verification is carried out on an IEEE-118 bus system, and the robustness of the algorithm after generator fault disconnection is verified.
The rest of this paper is organized as follows: Section 2 presents the modeling process for the OMC problem; Section 3 presents the improved deep reinforcement learning algorithm and training process; Section 4 presents the simulation results and related discussions for an IEEE-118 bus system; and, finally, Section 5 presents the conclusions and future work.
3. The Improved DQN Model for OMC
3.1. Introduction of DQN
DQN is an improvement on Q-learning. Q-learning calculates the Q value of each action under policy $\pi$ to form a Q-value table and then selects the action with the largest Q value to execute. However, as the scale of the state space and action space increases, the size of the Q-value table grows exponentially, and calculating the Q value of each action separately leads to an excessively long running time. DQN combines Q-learning with neural networks and uses a neural network to estimate the Q value function, which is shown as
$$Q(s,a;w) \approx Q_{\pi}(s,a),$$
where $w$ is the weight of the neural network and the Q value function is estimated by function approximation. DQN adopts the $\varepsilon$-greedy strategy for action selection: it usually selects the action with the largest Q value, but with a certain probability $\varepsilon$ it selects a random action for environment exploration. DQN introduces an online Q-network and a target Q-network. The DQN structure diagram for OMC is shown in Figure 2.
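As a concrete illustration of the $\varepsilon$-greedy rule described above, the following is a minimal Python sketch (the paper's own implementation is in MATLAB; the function and variable names, and the $\varepsilon$ value shown, are illustrative only):

```python
import numpy as np

def epsilon_greedy(q_values, epsilon, rng=np.random.default_rng()):
    """Epsilon-greedy selection over the Q values produced by the online
    Q-network: explore with probability epsilon, otherwise exploit."""
    if rng.random() < epsilon:
        return int(rng.integers(len(q_values)))  # random exploratory action
    return int(np.argmax(q_values))              # greedy action with the largest Q

# q_values would be the network output Q(s, a; w) for every candidate action,
# e.g. one entry per adjustable generator in the OMC action space.
action = epsilon_greedy(np.array([0.1, 0.7, 0.3]), epsilon=0.1)
```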
At the beginning of training, the power system element parameters and target load level information are input into the online Q-network, and the parameters of the online Q-network and the target Q-network are set equal, i.e., $\theta$ and $\theta^{-}=\theta$, respectively. The online Q-network evaluates the agent state $s_t$ in real time and outputs action $a_t$; the target Q-network calculates the Q value of state $s_{t+1}$ at time $t+1$ after taking action $a_t$; the loss function in (7) is used to calculate the gradient between the online Q-network and the target Q-network; and, in order to stabilize the training, the online Q-network parameters are copied to the target Q-network every $C$ time steps. The replay memory stores the experience tuples $(s_t, a_t, r_t, s_{t+1})$ of the power system operation mode; when the replay memory is full, the oldest experience tuples are removed. Finally, the DQN outputs the trained online Q-network parameters $\theta$ and a series of adjustment actions $\{a_t\}$ to complete the OMC at the target load level.
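To make the online/target-network mechanism and the loss in (7) concrete, the following is a minimal PyTorch sketch; it is not the paper's MATLAB implementation, and the hidden-layer width and discount factor shown are illustrative assumptions, while the 242-dimensional input and 54-dimensional output follow Table 1:

```python
import torch
import torch.nn as nn

class QNet(nn.Module):
    """Small fully connected Q-network: state in, one Q value per action out."""
    def __init__(self, state_dim, n_actions):
        super().__init__()
        self.net = nn.Sequential(nn.Linear(state_dim, 128), nn.ReLU(),
                                 nn.Linear(128, n_actions))
    def forward(self, s):
        return self.net(s)

state_dim, n_actions, gamma = 242, 54, 0.9   # gamma is an illustrative value
online, target = QNet(state_dim, n_actions), QNet(state_dim, n_actions)
target.load_state_dict(online.state_dict())  # start with identical weights

def td_loss(batch):
    s, a, r, s_next = batch                  # tensors sampled from replay memory
    q_sa = online(s).gather(1, a.unsqueeze(1)).squeeze(1)   # Q(s_j, a_j; theta)
    with torch.no_grad():                    # the target network is not trained
        y = r + gamma * target(s_next).max(dim=1).values    # TD target y_j
    return nn.functional.mse_loss(q_sa, y)   # loss between online and target

# Every C steps, the online weights are copied into the target network:
# target.load_state_dict(online.state_dict())
```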
3.2. Improved Mapping Strategy
The usual action-execution mapping strategy is to map each action $a_t$ to the state space $S$ one by one. However, there are many generators in a power system, and each additional generator increases the state space exponentially. When the generator power and load are unbalanced, the power flow of the grid obviously will not converge, and all the preceding states are "useless"; we need to skip these states quickly and focus on the subsequent adjustments that lead to power flow convergence. As shown in Figure 3, in order to improve training efficiency, an improved mapping strategy was designed in this paper, which works as follows:
Let $P_\mathrm{G}$ be the sum of the current active power of all generators excluding the slack bus generator; $P_\mathrm{L}$ the sum of the current active power of all loads; $P_\mathrm{slack}^{\max}$ and $P_\mathrm{slack}^{\min}$ the maximum and minimum active power of the slack bus generator, respectively; $K$ the network power loss rate; $P_{\mathrm{g}i}$ the active power of generator $i$; $P_{\mathrm{g}i}^{\max}$ and $P_{\mathrm{g}i}^{\min}$ the maximum and minimum active power of generator $i$, respectively; and $\delta$ the minimum adjustment threshold.
Strategy 1: When the power system is operating normally, it is only necessary to adjust the active power of the generators to match the current load level. Scenario 1: when $P_\mathrm{G}+P_\mathrm{slack}^{\max}<P_\mathrm{L}(1+K)$, increase $P_{\mathrm{g}i}$ for the generators with $P_{\mathrm{g}i}<P_{\mathrm{g}i}^{\max}$ (by at least the threshold $\delta$) until $P_\mathrm{G}+P_\mathrm{slack}^{\max}\geq P_\mathrm{L}(1+K)$. In this scenario, the total active power of the generators is insufficient, and the active power of the generators needs to be increased to meet the requirements of power flow convergence. Scenario 2: when $P_\mathrm{G}+P_\mathrm{slack}^{\min}>P_\mathrm{L}(1+K)$, decrease $P_{\mathrm{g}i}$ for the generators with $P_{\mathrm{g}i}>P_{\mathrm{g}i}^{\min}$ until $P_\mathrm{G}+P_\mathrm{slack}^{\min}\leq P_\mathrm{L}(1+K)$. In this scenario, the total active power of the generators is too large, and the active power of the generators needs to be reduced. Scenario 3: in all cases other than Scenario 1 and Scenario 2, the active power of the generators matches the load level, and only the output of the slack bus generator needs to be adjusted.
Strategy 2: When an N-1 fault occurs on line $i$, the state of that line in the state space $S$ is set to disconnected. If a generator bus is disconnected by the faulted line, the number of the disconnected generator is deleted from the action space $A$, and then Strategy 1 is executed.
In the actual power flow adjustment process of a large power system, when the generator output is insufficient or in surplus, an estimate of the power difference is generally calculated, and the proportion of generators to start up in the corresponding area is selected accordingly. In this paper, in order to simplify the model, this estimate was taken as the active power imbalance between generation and load (including losses), and the remaining difference was then fine-tuned through power flow calculations, corresponding to the adjustment process described above.
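The following Python sketch illustrates how Strategy 1 could be realized. It is a simplified illustration rather than the paper's implementation: which generator to adjust is chosen here by a simple headroom heuristic, whereas in the proposed method that choice is made by the DQN agent, and the inequality $P_\mathrm{G}+P_\mathrm{slack}^{\max}\geq P_\mathrm{L}(1+K)$ is the reading of the convergence requirement assumed above.

```python
import numpy as np

def improved_mapping(p_gen, p_max, p_min, p_load, p_slack_max, p_slack_min,
                     K=0.05, delta=0.05):
    """Sketch of Strategy 1: raise or lower generator set-points until the
    non-slack generation band can cover the load plus assumed losses.
    p_gen, p_max, p_min are arrays for the non-slack generators."""
    p_gen = p_gen.copy()
    target = p_load * (1.0 + K)           # load plus assumed network losses
    # Scenario 1: total generation insufficient -> increase outputs stepwise
    while p_gen.sum() + p_slack_max < target:
        idx = np.argmax(p_max - p_gen)    # generator with the largest headroom
        if p_max[idx] - p_gen[idx] < 1e-6:
            break                         # every unit is already at its limit
        p_gen[idx] = min(p_gen[idx] + delta * p_max[idx], p_max[idx])
    # Scenario 2: total generation surplus -> decrease outputs stepwise
    while p_gen.sum() + p_slack_min > target:
        idx = np.argmax(p_gen - p_min)    # generator with the largest downward margin
        if p_gen[idx] - p_min[idx] < 1e-6:
            break
        p_gen[idx] = max(p_gen[idx] - delta * p_max[idx], p_min[idx])
    # Scenario 3: any remaining imbalance is left to the slack bus generator
    return p_gen
```

The step size `delta * p_max[idx]` mirrors the 5% of the upper limit used as the minimum adjustment value in this paper.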
3.3. Training of the Improved DQN
In this paper, an improved DQN method for intelligent power system OMC was established based on the original DQN [27]. At the beginning of training, the reinforcement learning parameters and the current power flow state are initialized, and the power system operation mode information and target load level information are input into the improved DQN. In line 7 of Algorithm 1, with probability $1-\varepsilon$, the algorithm inputs the state $s_t$ into the online Q-network and uses the improved mapping strategy to determine the action $a_t$. Line 8 calculates the power flow state after each iteration by calling MATPOWER [28], saves the action and reward values, and stores them in the replay memory $D$; MATPOWER is an m-file package based on MATLAB that is used here as the power flow calculation software. Lines 10–12 sample minibatch data from $D$ and use (7) to calculate the gradient between the target Q-network and the online Q-network. In lines 14–15, if both conditions are satisfied, the load level has been successfully adjusted; the power flow state is then reset and the target load level is randomly re-initialized. Finally, the online Q-network parameters are output to complete the intelligent power system OMC. See Algorithm 1.
Algorithm 1: Training of the improved DQN method |
Input: All target load levels. |
Output: Trained online Q-network parameters $\theta$. |
1 | Initialize replay memory $D$ to capacity $N$. |
2 | Initialize the online Q-network with random weights $\theta$; set the target Q-network weights $\theta^{-}=\theta$. |
3 | For episode = 1, M do |
4 | Initialize state $s_1$ of the power system with the generator, load, and bus data. |
5 | For t = 1, T do |
6 | With probability $\varepsilon$ select a random action $a_t$. |
7 | Otherwise select $a_t=\arg\max_a Q(s_t,a;\theta)$ via the improved mapping strategy. |
8 | Execute action $a_t$ in MATPOWER and observe reward $r_t$ and state $s_{t+1}$. |
9 | Store transition $(s_t,a_t,r_t,s_{t+1})$ in $D$. |
10 | Sample a random minibatch of transitions $(s_j,a_j,r_j,s_{j+1})$ from $D$. |
11 | Set $y_j=r_j+\gamma\max_{a'}Q(s_{j+1},a';\theta^{-})$. |
12 | Perform a gradient descent step on $(y_j-Q(s_j,a_j;\theta))^{2}$ with respect to the parameters $\theta$. |
13 | Every C steps, reset $\theta^{-}=\theta$. |
14 | If the power flow converges at the target load level |
15 | Reset the power flow state and randomly initialize the target load level. |
16 | End If |
17 | End For |
18 | End For |
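For readers who wish to reproduce the interaction in line 8, the following is a minimal sketch using PYPOWER, the Python port of MATPOWER (the paper itself calls MATPOWER 7.1 from MATLAB). The reward shaping shown is purely illustrative, since the actual reward function is defined in Section 2, and the state encoding here is a placeholder.

```python
from pypower.api import case118, runpf, ppoption
import numpy as np

def environment_step(ppc, gen_idx, delta=0.05):
    """One environment step (Algorithm 1, line 8): adjust one generator and
    rerun the AC power flow with the Newton-Raphson solver."""
    ppc['gen'][gen_idx, 1] = min(                                  # column 1 = PG (MW)
        ppc['gen'][gen_idx, 1] + delta * ppc['gen'][gen_idx, 8],   # column 8 = PMAX
        ppc['gen'][gen_idx, 8])
    opt = ppoption(PF_ALG=1, VERBOSE=0, OUT_ALL=0)                 # Newton-Raphson, silent
    result, success = runpf(ppc, opt)
    # Illustrative reward: penalize non-convergence, otherwise favour low losses.
    if not success:
        reward = -1.0
    else:
        loss = result['gen'][:, 1].sum() - result['bus'][:, 2].sum()
        reward = 1.0 - loss / result['bus'][:, 2].sum()
    # Placeholder state: generator outputs and branch statuses.
    next_state = np.concatenate([result['gen'][:, 1], result['branch'][:, 10]])
    return next_state, reward, success

ppc = case118()                       # IEEE-118 bus test case shipped with PYPOWER
state, reward, ok = environment_step(ppc, gen_idx=0)
```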
4. Experimental Verification and Analyses
4.1. Experimental Setup
In this paper, the proposed method was verified in an IEEE-118 bus system [29], which contains 54 generators (1 slack bus generator) and 186 lines. The IEEE-118 bus file (case118.m) in MATPOWER 7.1 (https://matpower.org/download/ (accessed on 10 October 2023)) was used as the initial data. Two experiments were considered.
Experiment 1 set eight load levels under normal line connections as the OMC target. The initial load level was 4242 MW, and the eight load levels were 1.0 (initial data), 0.6, 0.8, 1.2, 1.4, 1.6, 1.8, and 2.0 times the initial load level, respectively.
Experiment 2 simulated the situation in which N-1 faults occur on the section lines. Three lines near the section lines of the system (line 23–24, line 23–25, and line 24–70) were tripped off, one at a time. For each tripped line, 1.0 times (the initial load level) and 1.4 times the initial load level were set as the OMC targets.
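As an illustration of how such operating scenarios can be constructed from the case file, the following PYPOWER sketch scales all bus loads uniformly and trips a section line; uniform scaling of every bus load is an assumption here, as the paper does not state how the individual bus loads were scaled.

```python
from pypower.api import case118, runpf, ppoption

def set_load_level(ppc, factor):
    """Scale every bus load (active and reactive) by the given factor."""
    ppc['bus'][:, 2] *= factor        # column 2 = PD (MW)
    ppc['bus'][:, 3] *= factor        # column 3 = QD (MVAr)
    return ppc

def trip_line(ppc, bus_a, bus_b):
    """Set the branch status to 0 for the line between two buses (N-1 fault)."""
    br = ppc['branch']
    mask = ((br[:, 0] == bus_a) & (br[:, 1] == bus_b)) | \
           ((br[:, 0] == bus_b) & (br[:, 1] == bus_a))
    br[mask, 10] = 0                  # column 10 = BR_STATUS
    return ppc

ppc = set_load_level(case118(), 1.4)  # 1.4 times the initial load level
ppc = trip_line(ppc, 23, 24)          # Experiment 2: trip line 23-24
result, success = runpf(ppc, ppoption(VERBOSE=0, OUT_ALL=0))
print('converged:', bool(success))
```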
The serial number of the slack bus generator was 30. We defined the generator turn-on ratio $\eta_i$ to describe the output level of each generator. The formula for this is as follows:
$$\eta_i = \frac{P_{\mathrm{g}i}}{P_{\mathrm{g}i}^{\max}} \times 100\%,$$
where $P_{\mathrm{g}i}$ is the output active power of generator $i$ and $P_{\mathrm{g}i}^{\max}$ is the maximum active power of generator $i$.
The network structure settings of the online Q-network and target Q-network are shown in Table 1. The size of the input layer was 242 × 1 × 1 × minibatch. The size of the system state space S was 242, which was composed of 54 generators, 186 lines, and 8 load levels (binary coding). The output layer dimension was 54, representing the Q value corresponding to each possible generator action.
The hardware environment used for the experiments was an Intel(R) Core(TM) i7-10875H CPU @ 2.30 GHz with an RTX 2060 6G GPU (Legion Y7000P2020H laptop made in China). The software environment was MATLAB with the MATPOWER 7.1 package. The AC power flow model and the Newton–Raphson method were used in the experiments. The remaining hyperparameters of the improved DQN were set as follows: a network loss rate of K = 0.05, a greedy coefficient $\varepsilon$, a learning rate $\alpha$, a reward attenuation factor $\gamma$, a minibatch size of 64, a target-network update period C of 100, a replay memory capacity D of 5000, 3000 episodes, and T = 2000 steps per episode.
4.2. Training Process
The DQN method [27], the IDRL method [30], and the improved DQN method proposed in this paper were used for network training, and the average cumulative reward curves are shown in Figure 4.
Figure 4a shows the average cumulative reward curve for Experiment 1. The cumulative reward values of the DQN, IDRL, and improved DQN reached convergence at 600, 540, and 470 episodes, respectively. The DQN method randomly selects 1 of 21 states ([0, 0.05, …, 1.0]) for the state transition; therefore, its reward value converges the most slowly, its converged reward value is the smallest, and its training effect is the worst. The IDRL method selects only one of the two states [0, 1] for the state transition, which corresponds to fully switching a generator on or off; as such, its initial reward value accumulated the fastest, and its fluctuation after convergence was smaller than that of the DQN. The improved DQN method performs state transitions through the improved mapping strategy proposed in this paper; its training reward value converged the fastest, its average cumulative reward value after convergence was the largest, and its fluctuation was the smallest.
Figure 4b shows the average cumulative reward curve for Experiment 2. The convergence episodes of the DQN, IDRL, and improved DQN were 650, 630, and 540, respectively; the average cumulative reward value and training speed of the improved DQN were again the best. Combining the two experiments, it can be concluded that the proposed improved DQN has a better training effect.
Comparing Experiment 1 and Experiment 2, it can be seen that the three methods converged in fewer episodes and reached larger average reward values in Experiment 1 than in Experiment 2. Our explanation for this phenomenon is as follows: in Experiment 2, a key line of the IEEE-118 bus system section was tripped, which made it harder for the power flow convergence actions to achieve their original effect, thus slowing down network training. At the same time, the number of power flow convergences in an episode was reduced, thus making the average cumulative reward value smaller.
4.3. Analysis of Experimental Results
Figure 5 shows the OMC results of Experiment 1. As shown in Figure 5a, 19 generators (including the slack bus generator) were running in the initial state, and the values of $\eta_i$ were all less than 80%. There is no doubt that, if the generators remained in their initial state, the power flow would not converge for the other 7 load levels. Figure 5b–h show the generator adjustment results for power flow convergence under the remaining load levels. It can be seen that the number of operating generators rose from 17 at the 0.6 times load level to 53 at the 2.0 times load level. At the 0.6 times load level, $\eta_i$ was less than 50%; at the 2.0 times load level, $\eta_i$ was generally more than 50%, and 18 of the generators had reached their upper limits.
Figure 6 shows the OMC results of Experiment 2. It can be seen that, when the N-1 faults occurred on the three key lines, there was almost no difference in the start-up of the generators under the 1.0 times load level, thus indicating that the initial load level of the IEEE-118 bus system is robust and that a very large safety and stability margin was set. However, at the 1.4 times load level, after tripping off line 23–24, line 23–25, and line 24–70, the number of operating generators was 42, 40, and 38, respectively. The reason for this phenomenon is that line 23–24 is the key section tie line of the IEEE-118 bus system [31]. After line 23–24 was tripped off, the power flow convergence of the system deteriorated, and more generators needed to be switched on to compensate for the power demand. Line 23–25 and line 24–70 are key lines near the key section; they have less impact on the system power flow convergence than line 23–24.
From Figure 5 and Figure 6, it can be concluded that the model proposed in this paper can better realize power flow convergence under different load levels and can complete OMC under both normal conditions and N-1 faults.
Table 2 compares the sum of the generator output power (excluding the slack bus generator), the output power of the slack bus generator $P_\mathrm{slack}$, the number $m$ of operating generators, and the network loss rate $K$ for the 14 operating modes. It can be seen that the number of operating generators $m$ is positively correlated with the load level. The maximum $K$ value is 3.21% and the minimum is 1.97%, both of which are less than the $K=0.05$ setting used during improved DQN training; this shows that the hyperparameter setting was valid. The network loss rates of the 14 operating modes were all low, which shows that the proposed method adjusts the power flow well. Furthermore, $P_\mathrm{slack}$ approaches $P_\mathrm{slack}^{\min}$ at load levels below 1.0 and approaches $P_\mathrm{slack}^{\max}$ at load levels above 1.0. The reason for this is that, when the load level is greater than 1.0, decreasing $P_\mathrm{slack}$ requires additional actions to turn up the other generators, whereas at load levels below 1.0, increasing $P_\mathrm{slack}$ requires additional actions to turn down the other generators. According to the improved mapping strategy and reward mechanism proposed in this paper, the improved DQN always performs as few generator adjustment actions as possible in each episode to obtain a higher reward; thus, $P_\mathrm{slack}$ approaches the two limits as described above.
Figure 7a shows the operating states of the generators under the IDRL method at the 0.6 times load level, both under normal conditions and after the failure of the No. 37 generator. Under the IDRL method, a generator is either fully on or fully off; as a result, only 8 generators were turned on, and, except for the No. 30 slack bus generator, the output power of the other operating generators was at the upper limit. As shown in the red dotted box, when the No. 37 generator failed, its output power became 0. In order to compensate for this power, the $\eta_i$ of the slack bus generator reached 114.07%, which exceeded its upper limit; the power flow calculation did not converge, thus resulting in power system collapse.
Figure 7b shows the results of the improved DQN method proposed in this paper at the 0.6 times load level, under normal conditions and after the failure of the No. 37 generator. Under normal conditions, 17 generators were turned on, and the $\eta_i$ of all generators was less than 50%, which reduced the dependence on any single generator's output power. Therefore, when the No. 37 generator failed, the slack bus generator could compensate for its power particularly well, and its $\eta_i$ was only 62.35%. The power flow calculation still converged and the power system continued to run normally, which verifies the robustness of the proposed improved DQN method.