Article

Perspectives on Soft Actor–Critic (SAC)-Aided Operational Control Strategies for Modern Power Systems with Growing Stochastics and Dynamics

1 Department of Electrical Engineering, Tsinghua University, Beijing 100084, China
2 SGCC Zhejiang Electric Power Company, Hangzhou 310007, China
3 The Zhejiang University-University of Illinois Urbana-Champaign Institute, Zhejiang University, Haining 314400, China
* Author to whom correspondence should be addressed.
Appl. Sci. 2025, 15(2), 900; https://doi.org/10.3390/app15020900
Submission received: 3 November 2024 / Revised: 21 December 2024 / Accepted: 3 January 2025 / Published: 17 January 2025
(This article belongs to the Section Electrical, Electronics and Communications Engineering)

Abstract: The ever-growing penetration of renewable energy with substantial uncertainties and stochastic characteristics significantly affects the secure and economical operation of the modern power grid. Nevertheless, coordinating various types of resources to derive effective online control decisions for a large-scale power network remains a major challenge. To tackle the limitations of existing control approaches, which require full-system models with accurate parameters and extensive real-time sensitivity-based analyses to handle the growing uncertainties, this paper presents a novel data-driven control framework that uses reinforcement learning (RL) algorithms to train robust RL agents from high-fidelity grid simulations for providing immediate and effective controls in a real-time environment. A two-stage method, consisting of offline training and periodic updates, is proposed to train agents for robust control of voltage profiles, transmission losses, and line flows using a state-of-the-art RL algorithm, the soft actor–critic (SAC). The effectiveness of the proposed RL-based control framework is validated via comprehensive case studies conducted on the East China power system with actual operation scenarios.

1. Introduction

To reach the goals of carbon peaking and carbon neutrality, an ever-growing penetration level of renewable energy, including wind and solar, is being integrated with modern power systems. Due to the uncertainty and stochastic nature of renewable energy, power electronic devices, natural disasters, and malfunctions in control devices, significant challenges are being observed in operating the power grid to meet various security constraints at all times [1]. In case of severe disturbances, rapidly restoring the fluctuating voltage and frequency profiles and the overloaded line flows back to normal is of great importance to ensure the secure and economical operation of power systems.
Traditionally, voltage control is performed locally to maintain bus voltage profiles within secure ranges by adjusting generator terminal voltage settings and switching shunt VAr devices. Line flow control strategies are designed similarly to mitigate overloading issues before and after system disturbances. Their settings and coordination strategies are typically determined via large-scale offline simulation studies to prepare for the projected worst scenarios before being used in a real-time environment. Full and accurate system models need to be built and large-scale sensitivity analyses conducted to screen for effective controllers for dealing with voltage and line flow violations. Then, optimization approaches such as the interior-point method are adopted to adjust the selected controllers to provide voltage and line flow controls. Such methods work well for traditional power systems. However, given the grid's ever-increasing complexity and stochastic characteristics, those offline-determined operational rules and study assumptions can be violated, limiting the effectiveness of the resulting control decisions. Optimizing control strategies in real time remains a difficult task that requires sufficiently accurate full-system models and fast simulation techniques to quickly adapt to changing operating conditions, which is why the majority of today's controller settings are still adjusted manually. Therefore, deriving prompt and effective coordinated control strategies of active and reactive power for real-time applications becomes critical to optimize control performance [2].
Hierarchical automatic control systems, including automatic voltage control (AVC) and automatic generation control (AGC) with multiple-level coordination, have been developed and are widely deployed in today’s power systems. An AVC system typically consists of three levels of control (primary, secondary, and tertiary), e.g., in France [3], Italy [4,5], and China [6,7,8,9,10]. At the primary level, automatic voltage regulators maintain local voltage levels using excitation systems with a response time of several seconds. At the secondary level, the bus voltages of selected pilot buses in different control zones are regulated using available reactive power resources, with a response time of several minutes. At the tertiary level, the setpoints of zonal pilot buses are optimized to minimize inter-zonal transmission losses while respecting security constraints, with a response time of 15 min to several hours. An AGC system is designed to provide line flow control and secondary frequency regulation by adjusting the active power outputs of AGC generators. In practice, the control performance of today’s AVC and AGC systems can be limited for the following reasons: (1) They require accurate real-time full-system models to optimize and achieve the desired control performance, which depend upon high-quality EMS snapshots produced every few minutes. (2) Sensitivity-based methods for selecting the most effective control devices can be affected significantly by different operating conditions, significant topology changes, and severe contingencies. (3) Real-time coordination and optimization of all controllers in a high-dimensional space for a large-scale power network remains a challenging task. (4) Such approaches are mostly designed for single-system snapshots, making it challenging to coordinate control actions across multiple time steps while considering practical constraints, e.g., capacitors should not be switched on and off too often during one operating day.
To address the above issues, this paper presents key perspectives for an innovative control framework of training reinforcement learning-based agents to provide real-time data-driven control strategies. Quasi-steady-state voltage control, line flow control, and transmission loss control problems can be formulated as Markov decision processes (MDPs) so that they can take full advantage of flexible reinforcement learning (RL) algorithms that are proven to be effective in various real-world control problems in highly dynamic and stochastic environments.
Recent research efforts applied RL-based methods to solve different power grid control problems, including damping control under uncertainties [11], cyber–physical security assessment [12], load–frequency control [13], short-term load forecasting [14], dynamic economic dispatch [15], emergency control for transient behavior enhancement [16], real-time intelligent topology control considering practical constraints [17], voltage security control [18,19,20,21,22,23], frequency regulation [24], real-time optimal power flow control [25], multi-energy management in microgrids [26], optimal trading strategy [27], voltage management in distribution networks [28], MPPT control for PV systems [29], reactive power optimization of distribution networks [30], and cooperative control for hybrid electric vehicles [31]. Preliminary results on small-to-medium test systems demonstrate the feasibility of using RL agents to provide effective controls.
Inspired by these efforts, an RL-based multi-objective framework for deriving real-time voltage control and line flow control strategies is proposed in this paper, which adopts and extends the state-of-the-art reinforcement learning algorithm to achieve robust control performance considering practical constraints. The decision-making is purely data-driven, without the need for accurate real-time system models once AI agents are well trained. Thus, live supervisory control and data acquisition (SCADA) or phasor measurement unit (PMU) data streamed from a wide-area measurement system (WAMS) can be used to enable sub-second controls, which is valuable for scenarios with fast changes like renewable resource variations and significant disturbances.
The remainder of this paper is organized as follows. Section 2 introduces the principles of the Markov decision process and reinforcement learning. Section 3 provides the problem formulation and explains the architecture of the proposed RL-based control framework, including the implementation of two types of control, namely voltage control and line flow control. The case studies and a discussion are provided in Section 4. Finally, key findings and conclusions are drawn in Section 5, with directions for future work identified.

2. Principles of MDP and Reinforcement Learning

2.1. Markov Decision Process

A Markov decision process (MDP) represents a discrete-time sequential control process that provides a general framework for decision-making in a stochastic, time-varying, and dynamic control problem. The MDP formulation requires the control problem to have the “Markov property”, meaning the next state of the system depends only on the current state. For the problem of coordinated voltage control, the MDP can be formulated as a 4-tuple (S, A, Pa, Ra). S stands for a vector of system states, including voltage magnitudes, phase angles, and line flows across the system or areas of interest; A is a list of actions to be taken, e.g., generator terminal bus voltage setpoints, generator active power outputs, the status of shunts, and the tap ratios of regulating transformers; $P_a(s, s') = \Pr(s_{t+1} = s' \mid s_t = s, a_t = a)$ is the probability of transitioning from the current state $s_t = s$ at time $t$ to the new state $s_{t+1} = s'$ after action $a_t = a$ is taken; $R_a(s, s')$ is the reward value calculated after reaching state $s'$ to quantify the overall control performance. The MDP is solved to determine an optimal policy, $\pi(s)$, which specifies actions based on states so that the expected accumulated reward, typically modeled as a Q-value function $Q^{\pi}(s, a)$, is maximized in the long run, as follows:
$Q^{\pi}(s, a) = \mathbb{E}\left[ r_{t+1} + \gamma r_{t+2} + \gamma^{2} r_{t+3} + \cdots \mid s, a \right]$  (1)
Then, the maximum achievable value is given as
$Q^{*}(s, a) = \max_{\pi} Q^{\pi}(s, a) = Q^{\pi^{*}}(s, a)$  (2)
Once $Q^{*}$ is known, the agent can act optimally as
$\pi^{*}(s) = \arg\max_{a} Q^{*}(s, a)$  (3)
The MDP problem can be solved using effective RL algorithms.
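As a minimal illustration of Equations (1)–(3), the sketch below uses a small tabular Q-function (with hypothetical state and action counts) to show how the optimal policy is read off from $Q^{*}$ and how a one-step bootstrapped target for the expected discounted return is formed; it is not part of the authors' implementation.

```python
import numpy as np

n_states, n_actions = 10, 4            # hypothetical sizes, for illustration only
gamma = 0.99                           # discount factor used in Equation (1)
Q = np.zeros((n_states, n_actions))    # tabular stand-in for Q*(s, a)

def greedy_policy(state: int) -> int:
    """Equation (3): pi*(s) = argmax_a Q*(s, a)."""
    return int(np.argmax(Q[state]))

def td_target(reward: float, next_state: int) -> float:
    """One-step bootstrap of the expected discounted return in Equation (1)."""
    return reward + gamma * float(np.max(Q[next_state]))
```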

2.2. Principles of RL

RL provides a promising approach to solving the MDP problem and addresses the real-time decision-making/control problem in a complex, stochastic, and highly dynamic system environment. A general interaction process between the RL agent and the environment is presented in Figure 1. After receiving the current states from the environment, an RL agent generates a corresponding action using its policy; then, the environment provides the state at the next time step (s′) and calculates the corresponding reward (r′) after the control action is executed in the environment. Through such interactions, the RL agent optimizes its policy (typically modeled as neural networks) to maximize the accumulated rewards. In this way, the RL agent will gradually master the control problem after a certain training period. Successful applications of RL include DeepMind’s AlphaGo, AlphaStar, ATARI games, self-driving cars, and many others [22].
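The interaction loop in Figure 1 can be summarized by the following sketch; the env and agent objects and their method names (reset, step, act, remember) are hypothetical placeholders for a power grid simulation environment and an RL agent, not the interfaces used in this work.

```python
def run_episode(env, agent, max_steps=20):
    """Generic agent-environment interaction loop from Figure 1."""
    s = env.reset()                            # initial system states
    total_reward = 0.0
    for _ in range(max_steps):
        a = agent.act(s)                       # policy proposes a control action
        s_next, r, done = env.step(a)          # environment returns (s', r', done)
        agent.remember(s, a, r, s_next, done)  # store the transition for learning
        total_reward += r
        s = s_next
        if done:
            break
    return total_reward
```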
A soft actor–critic (SAC) RL algorithm that uses both the value function and the Q function is chosen in this work. Traditional RL algorithms are designed to maximize only the expected reward, $\sum_{t} \mathbb{E}_{(s_t, a_t) \sim \rho_{\pi}}\left[ R(s_t, a_t) \right]$; the SAC algorithm, in contrast, trains stochastic policies modeled as neural networks with entropy regularization. In other words, SAC maximizes both the expected reward and the entropy of the control policy [23]. SAC's optimal policy, denoted as $\pi^{*}$, is formulated in Equation (4):
$\pi^{*} = \arg\max_{\pi} \sum_{t} \mathbb{E}_{(s_t, a_t) \sim \rho_{\pi}}\left[ R(s_t, a_t) + \alpha H\left(\pi(\cdot \mid s_t)\right) \right]$  (4)
where $H(\pi(\cdot \mid s_t))$ is the entropy of the control policy at state $s_t$, and $\alpha$ is the temperature coefficient that balances exploration and exploitation during the training process so that the agent provides more effective control actions. The policy evaluation and updates of the SAC RL agent are achieved by training neural networks with stochastic gradient descent. The soft value function network is trained using Equation (5):
$J_V(\psi) = \mathbb{E}_{s_t \sim D}\left[ \left( V_{\psi}(s_t) - V_{\mathrm{soft}}(s_t) \right)^{2} \right]$  (5)
where
$V_{\mathrm{soft}}(s_t) = \mathbb{E}_{a_t \sim \pi}\left[ Q_{\mathrm{soft}}(s_t, a_t) - \alpha \log \pi(a_t \mid s_t) \right]$  (6)
The soft Q function is trained by minimizing Equation (7):
$J_Q(\theta) = \mathbb{E}_{(s_t, a_t) \sim D}\left[ \left( Q_{\theta}(s_t, a_t) - \hat{Q}(s_t, a_t) \right)^{2} \right]$  (7)
where
$\hat{Q}(s_t, a_t) = R(s_t, a_t) + \gamma\, \mathbb{E}_{s_{t+1} \sim p}\left[ V_{\bar{\psi}}(s_{t+1}) \right]$  (8)
$V_{\bar{\psi}}(s_{t+1})$ represents the target value network of SAC. The policy parameters are then updated by minimizing the expected Kullback–Leibler (KL) divergence, given in Equation (9):
$J_{\pi}(\phi) = D_{KL}\left( \pi(\cdot \mid s_t) \,\Big\|\, \exp\left( \tfrac{1}{\alpha} Q_{\theta}(s_t, \cdot) - \log Z(s_t) \right) \right) = \mathbb{E}_{s_t \sim D}\left[ \mathbb{E}_{a_t \sim \pi_{\phi}}\left[ \alpha \log \pi_{\phi}(a_t \mid s_t) - Q_{\theta}(s_t, a_t) \right] \right]$  (9)
Compared with other RL algorithms, SAC performs better in terms of sampling efficiency and stability. In the objective function of the control strategy, the temperature coefficient α determines the relative weight of the entropy and the reward value, thereby controlling the degree of random sampling in the optimal strategy. It is worth noting that if a fixed temperature coefficient α is used, the SAC agent can become unstable as the number of training samples increases. To solve this problem and improve the convergence speed of agent training, this paper adopts a method that automatically updates the temperature coefficient: α changes automatically as the control strategy is updated so that more feasible solutions can be explored. The implementation adds an average entropy constraint to the original objective function and allows the entropy value to vary across states during training. The algorithm for training SAC RL agents for power flow control is given in Algorithm 1 [23]. More details of the SAC algorithm can be found in [23] and are not repeated here due to space limitations.
Algorithm 1. Algorithm for training the soft actor–critic (SAC) agent for power flow control.
1. Initialize the neural network weights θ1 and θ2 for the two Q functions Q_θi(s, a), ϕ for the policy π_ϕ(s, a), and ψ for the value function V_ψ(s); initialize the target value network weights ψ̄ ← ψ; initialize the replay buffer D; set up the training environment, env
2. for k = 1, 2, … (k is the counter of episodes for training)
3.   for t = 1, 2, … (t stands for the control iteration)
4.     reset the environment: s ← env.reset()
5.     obtain the states and sample an action a_t ~ π(·|s_t)
6.     apply action a_t and obtain the next states s_{t+1}, the reward value r_t, and the termination signal done
7.     store the tuple <s_t, a_t, r_t, s_{t+1}, done> in D
8.     s_t ← s_{t+1}
9.     if the policy updating conditions are satisfied:
10.      for the required number of policy updates:
11.        randomly sample a batch <s_t, a_t, r_t, s_{t+1}, done> from D
12.        update the Q functions Q_θi(s, a): θ_i ← θ_i − λ_Q ∇J_Q(θ_i)
13.        update the value function V_ψ(s): ψ ← ψ − λ_V ∇J_V(ψ)
14.        update the policy network π_ϕ(s, a): ϕ ← ϕ − λ_π ∇J_π(ϕ)
15.        update the target network: ψ̄ ← τψ + (1 − τ)ψ̄
16.        update the temperature coefficient α
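As a concrete illustration of step 16 in Algorithm 1 (the automatic temperature update), the PyTorch sketch below adjusts α by gradient descent so that the policy entropy tracks a target entropy; the variable names, the −|A| target-entropy heuristic, and the learning rate are illustrative assumptions rather than the settings used by the authors, whose prototype was built on TensorFlow 1.14.

```python
import torch

action_dim = 10                                 # hypothetical number of controllable setpoints
target_entropy = -float(action_dim)             # common heuristic: target entropy = -|A|
log_alpha = torch.zeros(1, requires_grad=True)  # optimize log(alpha) to keep alpha > 0
alpha_optimizer = torch.optim.Adam([log_alpha], lr=3e-4)

def update_temperature(log_probs: torch.Tensor) -> float:
    """Adjust alpha so the policy entropy tracks the target entropy.

    log_probs: log pi(a_t | s_t) for a sampled batch of actions.
    """
    alpha_loss = -(log_alpha * (log_probs + target_entropy).detach()).mean()
    alpha_optimizer.zero_grad()
    alpha_loss.backward()
    alpha_optimizer.step()
    return log_alpha.exp().item()               # current temperature coefficient alpha
```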

3. Proposed RL-Based Real-Time Control Framework

3.1. Control Objectives and Constraints

The proposed RL-based real-time control framework is general and can be adapted to various control objectives while respecting different security constraints. Table 1 summarizes the typical objectives and constraints for voltage/var control and line flow control problems. It is worth mentioning that the formulation of such control problems as MDPs is flexible and can combine different objectives and constraints.

3.2. Overall Flowchart of Training RL Agents for Power System Operation

The main flowchart for training RL agents for real-time volt/var control and line flow control is shown in Figure 2 and consists of several major steps:
Step A: Large-scale historical operating data are collected from the SCADA system in a power grid. Today’s SCADA system can typically capture full-topology information from wide-area deployed measurement sensors before sending it to the EMS to provide the best guess of the current system operating conditions in a quasi-steady state. Next, the converged state estimator snapshots are saved in full-topology model format. The sampling rate is set to 5 min in this work to ensure comprehensive operational scenarios of bulk power systems are collected.
Step B: Control objectives and the corresponding settings are specified for training SAC agents. The control objectives can be volt/var control only, transmission loss reduction, or a combination of these two objectives. The thresholds of each control objective and every constraint also need to be specified by the power engineers or system operators for real-time application, including voltage security limit, line flow limit, etc.
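For illustration, the specifications gathered in Step B can be captured in a simple configuration object such as the one below; the field names and values are hypothetical placeholders for settings that power engineers or system operators would provide, not the deployed system's schema.

```python
# Hypothetical configuration for Step B; field names are illustrative only.
control_config = {
    "objective": "volt_var_plus_loss",   # alternatives: "volt_var_only", "line_flow"
    "v_min_pu": 0.97,                    # lower voltage security limit (p.u.)
    "v_max_pu": 1.07,                    # upper voltage security limit (p.u.)
    "line_flow_limit_pct": 100.0,        # line flow limit, % of branch operating limit
    "max_control_iterations": 5,         # per-episode control action budget
}
```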
Step C: The volt/var control RL agent is trained using historical EMS snapshots and the high-fidelity simulation software. The RL agent weights are initialized and then the offline training process is entered. For each historical snapshot, i, the EMS case is loaded and AC power flow analysis is conducted. If divergence occurs, this episode is abandoned. Voltage magnitude violations and transmission losses are then calculated as reward values of the agents to reflect the health condition of the current EMS snapshot. A control loop is started by forming state space from the power flow information, obtaining the best action from the RL agent, verifying control performance using an AC power flow solver, and computing the reward value. This control process iterates until a “Done” condition is met. This happens when the following occur:
(1)
The RL agent reaches the maximum control iteration;
(2)
Power flow diverges;
(3)
The RL agent’s action successfully meets the desired control performance goal.
After training the current episode, the RL agent’s weights in the neural networks are updated. This process continues until it exhausts all the available EMS episodes. The RL agent keeps optimizing and updating its policy parameters based on the accumulated knowledge during the large-scale interaction process with the environment.
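The per-snapshot training loop of Step C, including the three “Done” conditions listed above, can be sketched as follows; the environment and agent methods (load_snapshot_and_solve, apply_and_solve, remember, update_policy) are hypothetical wrappers around an AC power flow solver and the SAC agent, not the actual interfaces of this work.

```python
def train_on_snapshot(env, agent, max_iters=5):
    """One training episode built from a single EMS snapshot (Step C)."""
    s, converged = env.load_snapshot_and_solve()   # load EMS case, run AC power flow
    if not converged:
        return                                     # divergent snapshot: episode abandoned
    for _ in range(max_iters):                     # Done condition (1): iteration budget
        a = agent.act(s)                           # best action from the RL agent
        s_next, r, converged, cleared = env.apply_and_solve(a)
        done = (not converged) or cleared
        agent.remember(s, a, r, s_next, done)      # store transition in the replay buffer
        s = s_next
        if not converged:                          # Done condition (2): power flow diverges
            break
        if cleared:                                # Done condition (3): control goal met
            break
    agent.update_policy()                          # update neural network weights
```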
Step D: After several offline-trained RL agents for voltage control are obtained through hyperparameter tuning, a periodic online update procedure is launched for the best-performing agent: live EMS snapshots are fed into the training process to update the weights of the RL agent used for real-time application. The actions of the RL agent are verified using a real-time AC power flow solver against the most current system operating condition before the suggested controls are actually implemented.
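One possible shape of the real-time cycle in Step D is sketched below: the trained agent acts on the latest EMS snapshot, and the suggested controls are dispatched only after an AC power flow check confirms that the resulting voltages stay within limits. All method names here are hypothetical.

```python
def realtime_control_cycle(env, agent, v_min=0.97, v_max=1.07):
    """Verify the RL agent's suggested controls before implementation (Step D)."""
    s = env.latest_ems_state()                # most current system operating condition
    a = agent.act(s, deterministic=True)      # trained policy, no exploration noise
    converged, v_pu = env.simulate(a)         # AC power flow verification of the action
    if converged and v_pu.min() >= v_min and v_pu.max() <= v_max:
        env.dispatch(a)                       # implement the verified controls
    else:
        env.log("RL action rejected by power flow verification")
```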
Step E: A similar procedure is adopted for training RL agents for line flow control, shown as the right branch of Figure 2. The key difference lies in the control objectives and constraints used for obtaining effective and reliable line flow control actions. Offline training and periodic online updates of the agents are needed to ensure long-term control performance. For the current operating condition, the environment (i.e., the high-fidelity power flow solver) solves the power flow and checks the control performance of the RL agent at each control iteration. Reward values are calculated correspondingly to update the RL agent.

3.3. Design of Episode, Reward, State Space, and Action Space

Without loss of generality, this paper trains effective RL agents to provide prompt corrective measures once voltage or line flow violations are detected. It is worth mentioning that users can adjust voltage and line flow limits. Constraints include full AC power flow equations, generation limits, voltage limits, and line flow limits.

3.3.1. Episode

An episode can start from any quasi-steady-state system operating condition that EMS snapshots can capture. When fixing security violations is the only control objective, the corrective control agent takes no actions while no violations are present. However, once voltage or line flow violations occur due to variations in system loads, renewable generation, or unexpected contingencies, the RL agent starts taking actions selected from a feasible action space to fix the security issues. For each iteration of applied control actions, the control performance is evaluated using reward values. To train effective agents, extensive representative operating conditions need to be collected or created, including random load changes, variations in renewable generation, generation dispatch schedules, and significant topology changes due to maintenance and contingencies. In this work, the actual system operating conditions captured by the EMS system every few minutes are taken as episodes for training RL agents.

3.3.2. Reward

The reward function for each control iteration can be calculated by penalizing the total degree of voltage or line flow violations as well as the total amount of control actions. In this work, typical security ranges are defined as follows:
$V_i^{\max} = 1.07$ p.u. (for 200 kV and above buses)
$V_i^{\min} = 0.97$ p.u. (for 200 kV and above buses)
$S_{ij}^{\max} = 100\%$ of the operating limit of branches
A higher reward indicates more effective control strategies. A DRL agent is motivated to regulate system voltages and line flows within the desired typical zones and reduce transmission losses by maximizing the total reward for the episode.
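A minimal reward sketch consistent with the security ranges above is given below; it penalizes the total degree of voltage and line flow violations plus the total amount of control actions, with illustrative weights (the exact reward terms used in this work are those listed in Table 2).

```python
import numpy as np

V_MIN, V_MAX = 0.97, 1.07     # p.u. voltage security range defined above
FLOW_LIMIT = 1.00             # 100% of the branch operating limit

def reward(v_pu, flow_pct, control_effort, w_v=1000.0, w_f=10.0, w_u=1.0):
    """Penalize voltage violations, line flow violations, and control effort."""
    v = np.asarray(v_pu, dtype=float)
    f = np.asarray(flow_pct, dtype=float)
    vio_voltage = np.sum(np.maximum(v - V_MAX, 0.0) + np.maximum(V_MIN - v, 0.0))
    vio_flow = np.sum(np.maximum(f - FLOW_LIMIT, 0.0))
    return -(w_v * vio_voltage + w_f * vio_flow + w_u * control_effort)
```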

3.3.3. State Space

For voltage and line flow controls, states are defined as vectors of voltage magnitudes, phase angles, and active and reactive power flows on branches and the status of controllers that EMS systems can directly capture. To improve the convergence speed and performance of control when training DRL agents, the batch normalization technique is applied.
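For illustration, the state vector can be assembled from the EMS quantities listed above and standardized before being fed to the agent; the per-sample standardization below is only a simple stand-in for the batch normalization layers applied during training, and the function signature is hypothetical.

```python
import numpy as np

def build_state(v_mag, v_ang, p_flow, q_flow, ctrl_status):
    """Concatenate EMS quantities into one state vector and standardize it."""
    x = np.concatenate([v_mag, v_ang, p_flow, q_flow, ctrl_status]).astype(float)
    return (x - x.mean()) / (x.std() + 1e-8)   # simple stand-in for batch normalization
```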

3.3.4. Action Space

Effective control actions for regulating voltages and line flows can be changing voltage setpoints of generator terminal buses, adjusting transformer tap ratios, switching shunt capacitors/reactors, and adjusting active power outputs. Table 2 summarizes the state space, action space, and reward function for different types of control objectives.
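Because the listed controls (voltage setpoints, tap ratios, shunt switching, active power outputs) have different physical ranges, a continuous SAC action in [−1, 1] is typically mapped onto those ranges before being applied; the helper below is an illustrative sketch of such a mapping, not the authors' implementation.

```python
import numpy as np

def scale_action(raw_action, lower, upper):
    """Map a tanh-squashed action in [-1, 1] onto physical device ranges."""
    raw = np.clip(np.asarray(raw_action, dtype=float), -1.0, 1.0)
    lo = np.asarray(lower, dtype=float)
    hi = np.asarray(upper, dtype=float)
    return lo + (raw + 1.0) * 0.5 * (hi - lo)
```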

3.4. Implementation of the Proposed RL-Based Control Framework

The proposed RL-based control framework for multi-objective real-time controls is implemented in Python 3.7, and the RL agents for volt/var, line flow, and transmission loss controls are trained with the soft actor–critic algorithm using the TensorFlow v1.14 framework. The developed RL-based control prototype is deployed on a Linux computing server with 64 cores and 128 GB of memory in the control center of a provincial power system, where it interacts with the live EMS system capturing system snapshots every 5 min. The specific voltage and line flow control models, with or without transmission loss, are described in Table 1 and Table 2, where AC power flow equations and the limits of physical devices are considered.

4. Case Studies

The proposed RL-based corrective voltage and line flow control framework is tested on the East China power system model with actual operating data. Large-scale operating scenarios are collected from the live EMS system with full-topology information for training RL agents, covering renewable generation variation, different load patterns, generation dispatch schedules, and maintenance. All the operating scenarios are saved in the node/breaker format in QS files. The commercial EMS software package D5000, developed by the NARI Group, is used as the environment to train RL agents and verify control performance. Several case studies are conducted to verify the effectiveness of the proposed RL-based control framework for various control objectives. To obtain the best achievable SAC agent models with the computing resources allocated in the provincial control center, the training process explored multiple sets of hyperparameters via random search before selecting the best-performing agent for real-time application in the control center.

4.1. Corrective Voltage Control

4.1.1. Case Study 1

A total of around 9000 valid EMS snapshots representing actual system operating conditions in East China were collected at 5 min intervals for training and testing RL agents to provide voltage control actions, among which around 6000 snapshots were used for RL agent training and the remainder for testing.
The first case study was designed to regulate voltage profiles of a provincial transmission power grid (220 kV+) only. The secure voltage level is defined as [0.97, 1.07] p.u. The control performance of the RL agent is shown in Figure 3. As can be observed from the reward accumulation, during the training phase, the RL agent was able to learn voltage control strategies effectively. The trained RL agent performed well during the testing phase with a maximum of two iterations to fix all voltage violations, indicating all voltage magnitudes of buses above 200 kV return to the secure range. This case study verifies the feasibility of the proposed control framework in regulating voltage in an actual power system considering various operating conditions.

4.1.2. Case Study 2

In the second case study, the control objectives were extended to regulate both the voltage profile and transmission losses so that the RL agent could mitigate voltage violations and reduce transmission losses simultaneously. The RL agent training and testing performance is shown in Figure 4. Compared to Case Study 1, the second RL agent takes longer to control both voltage profiles and transmission losses. In the testing phase, all voltage violations are fixed successfully. More importantly, the average transmission loss reduction is around 1%, demonstrating the proposed method's effectiveness when multiple objectives are considered simultaneously. The loss values of the policy (π), value (V), and two Q networks (Q1 and Q2) of the SAC agent during the training phase are plotted in Figure 5, where the y-axis represents the loss values and the x-axis the training episodes.

4.2. Corrective Line Flow Control

For transmission line flow control, we focus on controlling key transmission corridors with a high risk of overloading issues in East China. Based on historical observation, one provincial power grid in East China, which contains five key transmission corridors, was selected for verifying the effectiveness of the proposed RL-based control framework because of overloading concerns during the summer peak period over the past few years. To test the performance of the RL-based control in simultaneously regulating multiple line flows, a total of 51,891 valid EMS snapshots were collected that represented actual system operating conditions from January 2020 to July 2020 to form a dataset. The control objective is set to ensure the active power flows of the five selected transmission corridors are below 90% of their corresponding operating limits, also known as the pink line used in the control center. Moreover, an additional term representing the total amount of control actions is used to penalize more actions in the reward function to minimize the amount of controls for mitigating line overloading issues.
In this case study, the state space consists of line flows and the active power outputs of selected generators that can be dispatched in real time for emergency control. The action space consists of the active power outputs of the 150 selected generators. Among all the episodes collected in the above dataset, 67% of the cases are used for training the RL agent, while the remaining 33% are used for testing. The reward accumulation vs. episode is plotted in Figure 6 (upper). It can be observed that the RL agent learns to control line flows to within 90% of their limits using a maximum of three iterations. During the control process, the maximum generation adjustment is limited to 5%. This RL agent can control all five transmission line flows within the specified operating limits.
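For illustration, the per-step cap on generation adjustment described above can be enforced by clipping the agent's requested redispatch; the helper below assumes the 5% cap applies to each selected generator's current output, which is one plausible reading of the limit, and the function name is hypothetical.

```python
import numpy as np

def limit_redispatch(p_now, p_request, max_frac=0.05):
    """Clip each generator's requested active power change to +/- max_frac of its output."""
    p_now = np.asarray(p_now, dtype=float)
    delta = np.asarray(p_request, dtype=float) - p_now
    delta = np.clip(delta, -max_frac * np.abs(p_now), max_frac * np.abs(p_now))
    return p_now + delta
```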
A more comprehensive case study is conducted to regulate the line flows of all 200 kV and above transmission lines in the same provincial power grid. The control performance of this more general RL agent is shown in Figure 6 and Figure 7. Similarly, the RL agent learns effective control policies and fixes all line flow violations successfully. The maximum number of control iterations is four, and each selected generator's active power output adjustment is limited to 8%. Again, this test verifies the effectiveness of the proposed RL-based control framework in resolving line overloading concerns in real time.

4.3. Discussion

It is worth mentioning that the case studies were conducted in a real-world operational environment: the provincial control center. Because SAC is an off-policy reinforcement learning algorithm, it has better sample efficiency and requires less time to train models for real-time use than other well-known competitors such as proximal policy optimization (PPO). Thus, only the SAC algorithm is adopted in the real-world environment, given the limited computing resources and the time allowed for training the RL models.
Before the proposed SAC-based control system was deployed, the control center already had an automatic voltage control (AVC) system running 24/7 to regulate system voltage profiles and transmission losses. However, due to the complexity of obtaining precise full-system information and models, limitations are observed in deriving optimal control actions to mitigate operational risks. Thus, the demonstrated case studies reflect scenarios in which RL actions are added on top of the traditional AVC logic, achieving superior performance over the existing approach. To further verify the effectiveness of the SAC agents, a sensitivity analysis was conducted by training and testing the SAC agents under various conditions. Two examples of SAC agents completing control tasks with a 0.8% and a 1.0% relaxation of the voltage limits are shown in Figure 7, both demonstrating similar and satisfactory control performance in regulating voltage profiles.
The above case studies have demonstrated the feasibility and effectiveness of the proposed SAC-based method in a real operational environment: a provincial control center. In the design and testing phase, several key issues need to be addressed to achieve resilience and robustness in the control strategies, including the following:
(1)
Given the importance of valid samples when training good RL agents, it is essential to periodically include new samples in the SAC agent training, from either a real-time EMS system or planning cases, considering various types of disturbances that can capture major changes in the power system. From the authors’ experience, it is a good practice to update the model training daily.
(2)
Once the SAC agent is trained to satisfactory performance, obtaining control strategies from it in real time is fast, typically within tens of milliseconds. However, it is very important to ensure the quality of the input samples when constructing the state space of the SAC agent before obtaining control strategies.
(3)
The proposed method in this paper mainly tackles the regulation of system voltage violations, line overloading, and system losses. When extending this approach towards more complicated control tasks in daily operation with different, sometimes conflicting, objectives, using multiple RL agents is a promising research direction for handling the tradeoff among control objectives.

5. Conclusions and Future Work

To tackle the growing uncertainties of today's power systems with increased penetration of renewable generation and new types of load behavior, this paper presents an innovative control framework using SAC RL agents to provide fast voltage control, line flow control, and transmission loss reduction. Compared to traditional OPF-based approaches that require full-system information and accurate parameters to conduct extensive sensitivity analyses for effective controls, the developed RL-based control framework demonstrates the merits of fast learning from past experience and the capability to adapt to unforeseen conditions in real time. In comprehensive case studies conducted in the control center, both the voltage control and line flow control agents provided effective actions for mitigating security issues, verified in the live operational environment of the East China power system. It is worth mentioning that the SAC agent-based control actions were tested on top of the closed-loop actions from the existing automatic voltage control (AVC) and automatic generation control (AGC) systems, further demonstrating the advantages of the proposed method in dealing with the complex operational scenarios of bulk power systems. The developed method and system offer an effective solution by providing prompt controls for voltage and line flow violations in a regional power system caused by the growing uncertainties of generation and load, especially when traditional control systems fail to provide effective controls or are under maintenance. When designing RL-based agents for specific control tasks in power system operation, several important factors need to be considered:
(1)
It is important to have high-fidelity power system simulators to accurately capture the system behavior before and after each control action, providing a reliable environment for training the agents.
(2)
It is also important to include large-scale representative operating conditions in the power system in the form of samples that cover the feature space more evenly so that RL agents can learn directly from interacting with these samples, especially those with operational risks.
(3)
Design of reward functions plays an important role in the effectiveness and efficiency of RL agent training in performing specific control tasks. Hyperparameter tuning is always a good practice to ensure better control performance of the agents.
The proposed framework will be further enhanced and tested in future work by combining both voltage and line flow control objectives. The goal is to provide autonomous, fast, and effective solutions in real time for mitigating various operating risks simultaneously, including line flow control, transmission loss reduction, and others. A multi-agent approach will also be investigated for coordinated control and decision-making among multiple agents, each focusing on its regions, zones, or objectives to achieve improved overall control performance.

Author Contributions

Conceptualization, J.L. and J.Z.; methodology, R.D. and Q.G.; software, J.L. and R.D.; validation, Q.G., J.Z. and G.X.; formal analysis, J.Z.; investigation, J.L.; resources, J.L. and J.Z.; data curation, R.D.; writing—original draft preparation, J.L. and R.D.; writing—review and editing, J.Z. and Q.G.; visualization, G.X.; supervision, R.D.; project administration, J.Z.; funding acquisition, R.D. All authors have read and agreed to the published version of the manuscript.

Funding

This work is supported by the National Natural Science Foundation of China (U22B6007) and the Fundamental Research Funds for the Central Universities (226-2024-00244).

Institutional Review Board Statement

Not applicable.

Informed Consent Statement

Not applicable.

Data Availability Statement

The data presented in this study are available on request from the corresponding author. The data are not publicly available because operational data cannot be shared in the public domain.

Conflicts of Interest

Author Jing Zhang was employed by the company SGCC Zhejiang Electric Power Company. The remaining authors declare that the research was conducted in the absence of any commercial or financial relationships that could be construed as a potential conflict of interest.

References

  1. Standard TPL-001-4; Transmission System Planning Performance Requirements. NERC: Atlanta, GA, USA, 2014.
  2. Kamwa, I.; Grondin, R.; Hebert, Y. Wide-area measurement based stabilizing control of large power systems-a decentralized/hierarchical approach. IEEE Trans. Power Syst. 2001, 16, 136–153. [Google Scholar] [CrossRef]
  3. Paul, J.P.; Leost, J.Y.; Tesseron, J.M. Survey of the secondary voltage control in France: Present realization and investigations. IEEE Trans. Power Syst. 1987, 2, 505–511. [Google Scholar] [CrossRef]
  4. Corsi, S.; Pozzi, M.; Sabelli, C.; Serrani, A. The coordinated automatic voltage control of the Italian transmission grid—Part I: Reasons of the choice and overview of the consolidated hierarchical system. IEEE Trans. Power Syst. 2004, 19, 1723–1732. [Google Scholar] [CrossRef]
  5. Corsi, S.; Pozzi, M.; Sforna, M.; Dell’Olio, G. The coordinated automatic voltage control of the Italian transmission grid—Part II: Control apparatuses and field performance of the consolidated hierarchical system. IEEE Trans. Power Syst. 2004, 19, 1733–1741. [Google Scholar] [CrossRef]
  6. Sun, H.; Guo, Q.; Zhang, B.; Wu, W.; Wang, B. An adaptive zone-division-based automatic voltage control system with applications in China. IEEE Trans. Power Syst. 2013, 28, 1816–1828. [Google Scholar] [CrossRef]
  7. Sun, H.; Zhang, B. A systematic analytical method for quasi-steady-state sensitivity. Electr. Power Syst. Res. 2002, 63, 141–147. [Google Scholar] [CrossRef]
  8. Sun, H.; Guo, Q.; Zhang, B.; Wu, W.; Tong, J. Development and applications of the system-wide automatic voltage control system in China. In Proceedings of the IEEE PES General Meeting, Calgary, AB, Canada, 26–30 July 2009. [Google Scholar]
  9. Guo, R.; Chiang, H.; Wu, H.; Li, K.; Deng, Y. A two-level system-wide automatic voltage control system. In Proceedings of the IEEE PES General Meeting, San Diego, CA, USA, 22–26 July 2012. [Google Scholar]
  10. Shi, B.; Wu, C.; Sun, W.; Bao, W.; Guo, R. A practical two-level automatic voltage control system: Design and field experience. In Proceedings of the International Conference on Power System Technologies, Guangzhou, China, 6–8 November 2018. [Google Scholar]
  11. Duan, J.; Xu, H.; Liu, W. Q-learning-based damping control of wide-area power systems under cyber uncertainties. IEEE Trans. Smart Grid 2018, 9, 6408–6418. [Google Scholar] [CrossRef]
  12. Liu, X.; Konstantinou, C. Reinforcement learning for cyber-physical security assessment of power systems. In Proceedings of the 2019 IEEE Milan PowerTech Conference, Milan, Italy, 23–27 June 2019. [Google Scholar]
  13. Yan, Z.; Xu, Y. Data-driven load frequency control for stochastic power systems: A deep reinforcement learning method with continuous action search. IEEE Trans. Power Syst. 2019, 34, 1653–1656. [Google Scholar] [CrossRef]
  14. Feng, C.; Zhang, J. Reinforcement learning based dynamic model selection for short-term load forecasting. In Proceedings of the 2019 IEEE PES ISGT Conference, Washington, DC, USA, 18–21 February 2019. [Google Scholar]
  15. Dai, P.; Yu, W.; Wen, G.; Baldi, S. Distributed reinforcement learning algorithm for dynamic economic dispatch with unknown generation cost functions. IEEE Trans. Ind. Inform. 2020, 16, 2256–2267. [Google Scholar] [CrossRef]
  16. Huang, Q.; Huang, R.; Hao, W.; Tan, J.; Fan, R.; Huang, Z. Adaptive power system emergency control using deep reinforcement learning. IEEE Trans. Smart Grid 2020, 11, 1171–1182. [Google Scholar] [CrossRef]
  17. Lan, T.; Duan, J.; Zhang, B.; Shi, D.; Wang, Z.; Diao, R.; Zhang, X. AI-based autonomous line flow control via topology adjustment for maximizing time-series ATCs. In Proceedings of the IEEE PES General Meeting, Montreal, QC, Canada, 2–6 August 2020. [Google Scholar]
  18. Diao, R.; Wang, Z.; Shi, D.; Chang, Q.; Duan, J.; Zhang, X. Autonomous voltage control for grid operation using deep reinforcement learning. In Proceedings of the IEEE PES General Meeting, Atlanta, GA, USA, 4–8 August 2019. [Google Scholar]
  19. Duan, J.; Shi, D.; Diao, R.; Li, H.; Wang, Z.; Zhang, B.; Bian, D.; Yi, Z. Deep-reinforcement-learning-based autonomous voltage control for power grid operations. IEEE Trans. Power Syst. 2020, 35, 814–817. [Google Scholar] [CrossRef]
  20. Zimmerman, R.D.; Sanchez, C.E.; Thomas, R.J. MATPOWER: Steady-state operations, planning, and analysis tools for power systems research and education. IEEE Trans. Power Syst. 2010, 26, 12–19. [Google Scholar] [CrossRef]
  21. Xu, T.; Birchfield, A.B.; Overbye, T.J. Modeling, tuning and validating system dynamics in synthetic electric grids. IEEE Trans. Power Syst. 2018, 33, 6501–6509. [Google Scholar] [CrossRef]
  22. Mnih, V.; Kavukcuoglu, K.; Silver, D.; Rusu, A.A.; Veness, J.; Bellemare, M.G.; Graves, A.; Riedmiller, M.; Fidjeland, A.K.; Ostrovski, G.; et al. Human-level control through deep reinforcement learning. Nature 2015, 518, 529. [Google Scholar] [CrossRef]
  23. Haarnoja, T.; Zhou, A.; Abbeel, P.; Levine, S. Soft actor-critic: Off-policy maximum entropy deep reinforcement learning with a stochastic actor. In Proceedings of the International Conference on Machine Learning, Stockholm, Sweden, 10–15 July 2018. [Google Scholar]
  24. Lee, W.; Kim, H. Deep Reinforcement Learning-Based Dynamic Droop Control Strategy for Real-Time Optimal Operation and Frequency Regulation. IEEE Trans. Sustain. Energy 2025, 16, 284–294. [Google Scholar] [CrossRef]
  25. Wu, Z.; Zhang, M.; Gao, S.; Wu, Z.G.; Guan, X. Physics-Informed Reinforcement Learning for Real-Time Optimal Power Flow with Renewable Energy Resources. IEEE Trans. Sustain. Energy 2025, 16, 216–226. [Google Scholar] [CrossRef]
  26. Hu, B.; Gong, Y.; Liang, X. Safe Deep Reinforcement Learning-Based Real-Time Multi-Energy Management in Combined Heat and Power Microgrids. IEEE Access 2024, 12, 193581–193593. [Google Scholar] [CrossRef]
  27. Belyakov, B.; Sizykh, D. Adaptive Algorithm for Selecting the Optimal Trading Strategy Based on Reinforcement Learning for Managing a Hedge Fund. IEEE Access 2024, 12, 189047–189063. [Google Scholar] [CrossRef]
  28. Hou, S.; Fu, A.; Duque, E.; Palensky, P.; Chen, Q.; Vergara, P.P. DistFlow Safe Reinforcement Learning Algorithm for Voltage Magnitude Regulation in Distribution Networks. J. Mod. Power Syst. Clean Energy 2024, 1–12. [Google Scholar]
  29. Vora, K.; Liu, S.; Dhulipati, H. Deep Reinforcement Learning Based MPPT Control for Grid Connected PV System. In Proceedings of the 2024 IEEE 7th International Conference on Industrial Cyber-Physical Systems (ICPS), St. Louis, MO, USA, 12–15 May 2024. [Google Scholar]
  30. Liao, J.; Lin, J. A Distributed Deep Reinforcement Learning Approach for Reactive Power Optimization of Distribution Networks. IEEE Access 2024, 12, 113898–113909. [Google Scholar] [CrossRef]
  31. Gan, J.; Li, S.; Lin, X.; Tang, X. Multi-Agent Deep Reinforcement Learning-Based Multi-Objective Cooperative Control Strategy for Hybrid Electric Vehicles. IEEE Trans. Veh. Technol. 2024, 73, 11123–11135. [Google Scholar] [CrossRef]
Figure 1. Interaction between RL agent and power system environment.
Figure 2. Flowchart of training RL agents for volt/var and line flow control.
Figure 3. Performance of RL agent for regulating voltage profiles.
Figure 4. Performance of RL agent for regulating voltage profiles and transmission losses.
Figure 5. Loss values of π, V, Q1, and Q2 of the SAC agent during the training phase.
Figure 6. RL performance (left: controlling 5 line flows; right: controlling all line flows).
Figure 7. SAC agents’ performance considering relaxation of the voltage limits (upper: 0.8% relaxation; lower: 1.0% relaxation).
Table 1. Control objectives and constraints of the proposed framework.

Control objectives (voltage control / line flow control):
- Corrective actions — voltage control: minimum reactive power control actions, $\min \sum_{k}^{K_{\max}} c_k,\ \forall k$; line flow control: minimum active power control actions, $\min \sum_{n}^{N_{\max}} p_i,\ \forall i$
- Loss minimization (as an objective function): $\min \sum_{i,j} P_{loss}(i,j),\ (i,j) \in \Omega_L \cup \Omega_T$

Constraints modeled:
- AC power flow constraints: $P_i^g - P_i^d - g_i V_i^2 = \sum_{j \in B_i} P_{ij}(y),\ i \in B$ and $Q_i^g - Q_i^d + b_i V_i^2 = \sum_{j \in B_i} Q_{ij}(y),\ i \in B$, where $y = [\theta\ \ V]^T$, $P_i^g = \sum_{n \in G_i} P_n^g$, $Q_i^g = \sum_{n \in G_i} Q_n^g$, $P_i^d = \sum_{m \in D_i} P_m^d$, $Q_i^d = \sum_{m \in D_i} Q_m^d$ for $i \in B$; $P_{ij}$ and $Q_{ij}$ are the active and reactive power flows on branches, respectively
- Generation limits: $P_n^{\min} \le P_n \le P_n^{\max}$, $Q_n^{\min} \le Q_n \le Q_n^{\max}$, $n \in G$
- Voltage limits: $V_i^{\min} \le V_i \le V_i^{\max}$, $i \in B$
- Transmission line limits: $\sqrt{P_{ij}^2 + Q_{ij}^2} \le S_{ij}^{\max}$, $(i,j) \in \Omega_L \cup \Omega_T$
Table 2. State space and action space for different control objectives.

For all four objectives below, the state space is the same: bus voltage magnitudes, phase angles, active power on lines, reactive power on lines, controller status, and controller settings.

- Voltage security. Action space: generator terminal settings, shunt elements, transformer tap changing, flexible alternating current transmission system (FACTS) devices. Reward: penalize voltage violations and/or the total amount of control,
$r_1 = -\mathrm{dev\_overflow} \times 10 - \mathrm{vio\_voltage} \times 1000$, where
$\mathrm{dev\_overflow} = \sum_{i \in N} (S_{ij} - S_{ij}^{\max})^2$ and
$\mathrm{vio\_voltage} = \sum_{j \in M} (V_{mj} - V_{\min})(V_{mj} - V_{\max})$
- Voltage security + loss reduction. Action space: generator terminal settings, shunt elements, transformer tap changing, FACTS devices. Reward: penalize voltage violations, transmission losses, and/or the total amount of control:
if delta_p_loss < 0: $r_2 = -50 \times \mathrm{delta\_p\_loss} \times 1000$
else if delta_p_loss ≥ 0.02: $r_2 = -100$
else: $r_2 = -1 - (\mathrm{p\_loss} - \mathrm{p\_loss\_pre}) \times 50$
where $\mathrm{delta\_p\_loss} = (\mathrm{p\_loss} - \mathrm{p\_loss\_pre})/\mathrm{p\_loss\_pre}$; p_loss is the present transmission loss value and p_loss_pre is the line loss at the base case
- Line flow. Action space: generator active power, controllable load. Reward: penalize line flow violations and/or the total amount of control, $r_3 = r_1 - \mu \sum_{g=1}^{G} P_{gen}(g)$, where $\mu$ is a weighting parameter and $P_{gen}(g)$ is the amount of control action of generator $g$
- Line flow + loss reduction. Action space: generator active power, controllable load. Reward: penalize line flow violations, transmission losses, and/or the total amount of control, $r_4 = r_2 - \mu \sum_{g=1}^{G} P_{gen}(g)$, where $\mu$ is a weighting parameter and $P_{gen}(g)$ is the amount of control action of generator $g$
