Article

A Multi-Agent Reinforcement Learning Method for Cooperative Secondary Voltage Control of Microgrids

1 Electric Power Research Institute, State Grid Tianjin Electric Power Company, No. 8, Haitai Huake 4th Road, Huayuan Industrial Zone, Binhai High Tech Zone, Tianjin 300384, China
2 School of Electrical and Information Engineering, Tianjin University, No. 92, Weijin Road, Nankai District, Tianjin 300072, China
3 State Grid Tianjin Electric Power Company, No. 39 Wujing, Guangfu Street, Hebei District, Tianjin 300010, China
* Author to whom correspondence should be addressed.
Energies 2023, 16(15), 5653; https://doi.org/10.3390/en16155653
Submission received: 28 June 2023 / Revised: 14 July 2023 / Accepted: 16 July 2023 / Published: 27 July 2023
(This article belongs to the Section A1: Smart Grids and Microgrids)

Abstract

This paper proposes a novel cooperative voltage control strategy for an isolated microgrid based on the multi-agent advantage actor-critic (MA2C) algorithm. The proposed method facilitates the collaborative operation of a distributed energy system (DES) by adopting an attention mechanism to adaptively boost information processing effectiveness through the assignment of importance scores. Additionally, the proposed algorithm, executed through a centralized training and decentralized execution framework, implements secondary control and effectively compensates for voltage deviations. The introduction of an attention mechanism also alleviates the burden of information transmission. Finally, we illustrate the effectiveness of the proposed method on a DES consisting of six energy nodes.

1. Introduction

Recently, the evolution of renewable energy sources has facilitated the transformation of the traditional power system. While distributed energy systems offer advantages such as low pollution and high energy efficiency, their large-scale integration into the traditional grid has posed new challenges for stable operation [1]. Wind and solar power exhibit fluctuations and intermittency, while the integration of power generation equipment through power electronics results in low immunity to disturbances and weak grid support, adversely affecting the reliability of the power supply. A microgrid is an effective means of utilizing distributed energy resources. It refers to a regional autonomous power system that comprises distributed energy sources (DES), energy storage devices, loads, and other components [2]. The control and coordination of microgrids are crucial for ensuring a high-quality power supply and enabling the integration of renewable energy sources. Microgrids are usually linked to the grid through points of common coupling and operate in grid-connected mode. However, when the main power grid experiences a failure, the microgrid disconnects and operates autonomously, which is called islanding mode.

In practice, phasor measurement units (PMUs) can collect real-time and accurate data from the various distributed energy sources and other equipment in the microgrid, including voltage distribution, power flow, and other key parameters [3]. Deploying synchronous micro-PMUs on the feeder enables better voltage stability control, because their precise synchronized measurements facilitate the implementation of advanced control algorithms. In addition, real-time and accurate measurement of the voltage amplitude across the entire network can enhance control performance and improve the robustness and stability of the system in the face of disturbances [4].

Usually, the microgrid adopts a hierarchical control architecture [5], enabling controllers at each level to work independently at different time scales [6]. The main function of the first-level control, also known as local control of the inverter, is to maintain system voltage and frequency stability while ensuring V-f stability and achieving automatic power sharing within a few seconds, which determines the stability of the microgrid [7,8]. The secondary control formulates minute-scale control strategies based on the measured error values [9,10]. Finally, the optimization management layer, or third-level control, is responsible for the hourly economic operation and energy management of the microgrid. Since the droop control commonly used in the first-level control is essentially proportional control, voltage values may deviate from the reference. Therefore, secondary control becomes necessary to maintain the system voltage within a predetermined range.
Traditional centralized control relies on a microgrid central controller (MGCC) and requires high communication bandwidth between agents. When the MGCC fails, the control process cannot continue, resulting in poor reliability. In contrast, decentralized control relies on local information and local communication to obtain neighbor information and formulate collaborative control strategies. It enables a plug-and-play microgrid without the need for a complex communication network [11]. Distributed cooperative control techniques for microgrids are largely based on consensus theory. Bidram and Lewis [12] proposed using input–output feedback linearization to linearize the dynamic characteristics of inverters, thereby transforming the secondary control problem into a consensus tracking problem of a linear system, in which each agent interacts with its neighbors and formulates its own strategy based on its local state and the observations of its neighbors. In addition, model-based control methods such as nonlinear control [13], optimal control [14], and model predictive control (MPC) [15] have been extensively studied. These voltage control methods fall under the framework of a multi-agent system (MAS) [16], where the average measured voltage value is shared through a distributed communication protocol, and the PI controller of each DG generates a control signal based on the error value to eliminate the relevant steady-state errors and reach consensus. However, the large-signal dynamic model of an inverter-based system is nonlinear, and simplifying it may render the control strategy inapplicable after a disturbance produces a large deviation.
Data-driven algorithms such as reinforcement learning (RL) have received widespread attention on account of their high scalability and flexibility [17]. RL proceeds through trial and error and experience accumulation [18], exploring optimal control strategies from a large number of samples provided by simulated data or historical measurement data. Such algorithms have therefore been widely applied in power systems [19,20]. The combination of deep neural networks with RL, known as deep reinforcement learning (DRL), has enabled the approximation of the inputs and outputs of complex systems. It has demonstrated significant benefits in addressing large-scale, high-dimensional, and highly complex problems. Reference [21] proposes a self-controlled method based on the soft actor-critic (SAC) algorithm to improve renewable energy consumption rates while performing real-time active power dispatch using observed states under stable operation. During system operation, RL can promote the integration of more renewable energy into the power system, ultimately supporting carbon neutrality goals.
Recently, multi-agent reinforcement learning (MARL) has been proposed for secondary control [22,23,24,25]. As agents interact with the environment and learn offline by simulating other agents’ strategies to cooperate, they can find the optimal strategy. After training, agents can adapt well to real-time decision making in unknown power grids or microgrids. Grid Mind [22] measures power flow data on smart grid transmission lines through DDPG and formulates a reward function to punish agents that deviate from voltage levels to learn optimal control strategies and change voltage set points. For the dynamic distribution network management system, Reference [23] uses the maximum entropy reinforcement learning framework for voltage control and improves resilience to controller failures by introducing a communication consensus strategy to reduce network losses. Reference [24] solves decentralized voltage control in an active distribution network through an attention-based MATD3 algorithm, improves communication efficiency, and can effectively handle drastic voltage fluctuations caused by rapid photovoltaic power generation while achieving fast decision making. Reference [25] uses the DDPG algorithm to jointly achieve reasonable distribution of reactive power and adjust bus voltage based on static var compensator (SVC), effectively reducing the number of transformer tap changes.
The deficiencies in the existing research, including policy scalability, cooperation and coordination, nonstationarity and uncertainty, and communication efficiency [26,27], motivate us to introduce MARL to solve the cooperative secondary voltage control problem of microgrids. The main contributions of this research are as follows. Firstly, we develop a multi-agent system (MAS) model of an inverter-based microgrid according to its dynamic characteristics, and the secondary voltage control problem is modeled as a networked partially observable Markov decision process (POMDP) observed by multiple agents. Then, we develop an efficient multi-agent advantage actor-critic (MA2C) algorithm by introducing attention mechanisms that deliberately concentrate on particular information, guiding the update of the action networks for better training results and improved communication efficiency. An action smoothing mechanism is employed to improve voltage stability. Finally, experiments are conducted on an IEEE 34-node system, demonstrating that the proposed control strategy effectively restores the terminal voltages of the microgrid to their reference values.
The rest of this paper is outlined as follows. In Section 2, we present the construction method of a multi-agent model for microgrids. In Section 3, we introduce an attention mechanism into an MA2C algorithm for voltage control, effectively improving the efficiency of communication. Moreover, we provide a specific solution process for an MA2C algorithm based on the attention mechanism. In Section 4, simulation based on the improved IEEE 34-node system is presented. Finally, we summarize the main research work of this article and introduce further research directions.

2. Problem Formulation

Each DES of the microgrid can be considered an agent, and therefore using MARL to maintain voltage stability can effectively improve the system’s dynamic performance. This section first introduces the basics of the POMDP in MARL and then establishes a multi-agent reinforcement learning model for the secondary voltage control problem of an islanded microgrid.

2.1. Preliminaries for POMDP

The learning process of RL is essentially a trial-and-error process, where agents interact with the environment, exploring unknown states. By capturing immediate rewards from the environment, the agent adjusts its action policy, aiming to maximize the objective function and find the optimal strategy. This way of learning can be formalized as a Markov decision process [28]. A decentralized framework is used to build the architecture of the inverter-based microgrid. We establish a decentralized partially observable Markov decision process (Dec-POMDP) [29] for the agent because the agent can only obtain information by observing its own state and communicating with neighbors. In other words, each distributed generator (DG) realizes voltage stability under the condition of incomplete information.
The system model is typically characterized as a tuple $\langle N, S, A, P, R, O, \gamma \rangle$. The topology can be abstracted into a multi-agent network $G = (V, \epsilon)$, where the N agents form the set V of vertices and $\epsilon$ represents the edges. S represents the global state space, and $A = [A_1, A_2, \dots, A_N]$ is the joint action space; in this paper the action values are the discrete secondary voltage set-points, which constitute the action space. The actual grid environment is abstracted into a transition model, and the agent operates in the grid by interacting with the environment. The transition model is described by a state transition function $P: S \times A \times S \to [0,1]$, and $O = [O_1, O_2, \dots, O_N]$ collects the observations of the agents. The global reward function is $R: O \times A \to \mathbb{R}$. Moreover, $\gamma \in [0,1]$ is the discount factor used to compute the cumulative reward to be maximized.
Figure 1 depicts the classic reinforcement learning training process. The agent obtains the observation value at each time step t and then computes the action value according to its strategy, which is applied to the environment. The agent then enters the next state according to the state transition probability and obtains an immediate reward. The agent needs to maximize the return $R_t^{\pi} = \sum_{\tau=t}^{\infty} \gamma^{\tau-t} r_\tau$, and under a given state and policy, the expected value of the return that an agent can obtain is $Q^{\pi}(s,a) = \mathbb{E}[R_t^{\pi} \mid s_t = s, a_t = a]$. The Q function denotes the expected cumulative return that an agent can obtain after taking a certain action in a given state. The optimal policy $\pi^*(a \mid s): a \in \arg\max_a Q^*(s,a)$ maximizes the cumulative expected return. Each interaction is referred to as a step, and an episode consists of a complete sequence of interaction steps.
Deep reinforcement learning (DRL) can learn control policies through interaction with the environment. In recent years, three general frameworks for DRL algorithms have been used. The first is the value iteration method, such as the Deep Q-Network (DQN) [30], which approximates the value function using deep neural networks (DNNs). The second is policy gradient algorithms, which directly learn parameterized policies and are suitable for continuous high-dimensional action spaces. Finally, the framework used in this paper is the advantage actor-critic (A2C) framework [31], which trains a value network with parameters $\omega$ to accurately estimate the advantage function $A^{\pi}(s,a) = Q^{\pi}(s,a) - V^{\pi}(s)$, which in turn guides the training of a policy network with parameters $\theta$. Training the value network is equivalent to policy evaluation, while training the policy network is equivalent to policy improvement.
The loss function for policy network update is
$$L(\theta) = -\frac{1}{|B|}\sum_{t \in B}\log\pi_{\theta}(a_t \mid s_t)\,A_t$$
Likewise, the loss function that guides the update of the value network is
$$L(\omega) = \frac{1}{2|B|}\sum_{t \in B}\big(\hat{R}_t - V_{\omega}(s_t)\big)^2$$
Each minibatch $B = \{(s_t, a_t, s_t', r_t)\}$ contains an experience trajectory, and each return is estimated as $\hat{R}_t = \sum_{\tau=t}^{t_B-1}\gamma^{\tau-t} r_\tau$, where $t_B$ is the final step processed in the minibatch. Experience replay involves using a replay buffer. At each time step, agents randomly select a minibatch from the experience replay buffer, which is used as the training data for this update. This trick allows past experience to be reused and overcomes the problems of correlated data and nonstationary distributions in the experience data.
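As an illustration of the return and advantage estimates defined above, the following minimal NumPy sketch (not taken from the authors' implementation; all variable names and sample numbers are illustrative) computes the n-step return for one minibatch trajectory and the corresponding advantage used by the policy-gradient loss.

```python
import numpy as np

def nstep_returns(rewards, bootstrap_value, gamma=0.99):
    """R_t = sum_{tau=t}^{t_B-1} gamma^(tau-t) r_tau, bootstrapped with the
    critic's estimate of the last state value (0 if the episode terminated)."""
    returns = np.zeros(len(rewards))
    running = bootstrap_value
    for t in reversed(range(len(rewards))):
        running = rewards[t] + gamma * running
        returns[t] = running
    return returns

def advantages(returns, values):
    """A_t = R_t - V(s_t): how much better the sampled action was than expected."""
    return returns - np.asarray(values)

# usage: one 4-step trajectory sampled from the replay buffer (numbers are arbitrary)
r = [0.20, 0.10, -0.30, 0.05]        # immediate rewards r_t
v = [0.50, 0.40, 0.30, 0.20]         # critic estimates V(s_t)
R = nstep_returns(r, bootstrap_value=0.25)
A = advantages(R, v)                 # feeds the policy loss -log pi(a_t|s_t) * A_t
```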

2.2. Microgrid Voltage Secondary Control

Figure 2 shows the large-signal model structure of the distributed generator. Distributed generators in microgrids typically require power electronic converters for grid integration, such as inverters that convert DC to AC based on the power invariant characteristic. In order to ensure voltage stability, voltage-controlled voltage source inverters (VCVSI) are commonly used, with a capacitor as an energy storage component on the DC side. Voltage source inverters usually adopt a double-loop control structure [32]. The outer loop is a functional loop, and the inner loop regulates the current, with a faster dynamic response than the outer loop.
According to the purpose of the outer loop control, common inverter control methods include PQ control, V-f control, and droop control. In this paper, we select the most widely used droop control [33], which does not require communication and automatically regulates voltage and frequency by emulating the frequency characteristics of a synchronous generator. Its basic principle is that, due to the low inertia of inverter-based DGs, the operating curve can be taken as linear; by emulating the behavior of synchronous generators, the voltage can then be adjusted by changing the operating point on the droop curve.
In a microgrid, the transmission lines are typically short, with low impedance and predominantly inductive characteristics. The output of a DES is primarily determined by the active and reactive power and the voltage amplitude, so decoupled control of reactive power and voltage is possible. Considering that the power angle and the frequency are directly related, and that the frequency is the first derivative of the voltage phase over time and is easy to detect, the angular frequency is used instead of the power angle, resulting in the common droop control formulas below.
$$\omega_i = \omega_{ni} - m_{Pi} P_i$$
$$v^{*}_{o,magi} = V_{ni} - n_{Qi} Q_i$$
where $v^{*}_{o,magi}$ is the inverter terminal voltage amplitude reference for the internal loop, $\omega_i$ denotes the angular frequency, and $\omega_{ni}$ and $V_{ni}$, respectively, refer to the primary control references. $m_{Pi}$ and $n_{Qi}$, respectively, signify the droop coefficients of active and reactive power. Subsequently, applying the Park transformation, the voltage and current in the original three-phase reference frame are converted into the $dq$ reference frame. To facilitate the calculation in the next step, the voltage amplitude is typically aligned with the d-axis of this common frame, resulting in $v^{*}_{odi} = V_{ni} - n_{Qi} Q_i$. The voltage compensation problem for a microgrid with N DGs can thus be transformed into a multi-agent reinforcement learning problem.
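To make the droop law concrete, the short sketch below evaluates the primary-control set-points for a single DG; it is only a schematic illustration, with the droop gains taken from Table 1 and the nominal references and power values chosen arbitrarily.

```python
import math

def droop_setpoints(P_i, Q_i, omega_n=2 * math.pi * 60.0, V_n=1.0,
                    m_P=5.64e-5, n_Q=5.2e-4):
    """Primary droop control: omega_i = omega_n - m_P * P_i and
    v*_odi = V_n - n_Q * Q_i (voltage reference aligned with the d-axis)."""
    omega_i = omega_n - m_P * P_i
    v_od_ref = V_n - n_Q * Q_i
    return omega_i, v_od_ref

# example call; P_i and Q_i must be expressed in units consistent with the droop gains
omega_i, v_od_ref = droop_setpoints(P_i=0.8, Q_i=0.3)
```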
It is important to acknowledge that droop control suffers from steady-state error. Load changes can lead to a deviation between the voltage and its reference value, and differences in the inverters’ parameters and connecting lines can cause unreasonable power distribution, even leading to issues such as circulating current and power reversal. In order to compensate for this deviation and improve power quality, we study a distributed secondary control strategy for the microgrid. First, it is essential to model and analyze the microgrid. Figure 2 shows how we abstract the physical structure of the VCVSI-based DG through large-signal mathematical analysis [34]; it consists of a DC power source, a three-phase inverter bridge, a power control loop, internal dual-loop controllers, an LC filter, and an output interface. Since the switching frequency of the inverter is quite fast, the dynamic characteristics of the three-phase inverter bridge can be ignored, and an ideal DC voltage source is usually used in theoretical analysis. Reference [12] provides a detailed introduction on how to build the impedance links in a large-signal model of a DG, which is used to construct the improved IEEE 34-bus test system platform.
The secondary voltage control aims to synchronize $v_{o,magi}$ to its reference $v_{ref}$ [35]. Based on the above analysis, it can be inferred that synchronization of the VCVSI voltage amplitude $v_{o,magi}$ can be achieved by adjusting $v_{odi}$. Consequently, the next problem we need to consider is how to select an appropriate control input $V_{ni}$ to synchronize $v_{odi}$ to $v_{ref}$.
Model-based methods have many limitations in practical applications. Renewable energy sources have the characteristic of uncertainty, making it difficult to perform detailed modeling of each element in a microgrid. The equipment in the microgrid usually belongs to different equipment agents. Agents need to efficiently collect and process different sensor data, and deploying multiple synchronized sensors in the system that can monitor system data such as frequency or power flow is essential for a multi-agent environment [36].
In order to maintain the market environment, the information of private users cannot be shared, and only some public data can be obtained and used for regulation. With a large number of new energy sources operating in the grid and with control methods and parameters changing adaptively, the dynamics of microgrids are becoming increasingly fast.
To address this issue, the data-driven MARL method has received widespread attention in the field of the smart grid because it avoids the problem of model inaccuracy. Microgrids containing DGs can be abstracted into a cyber–physical system, whose topology is described by a directed graph $G = (V, E)$, where $V = \{v_1, v_2, \dots, v_N\}$ is a nonempty set of N nodes, i.e., the set of DGs. In the isolated microgrid, each DG can be viewed as a node in the communication graph. Communication lines are represented by the set of edges $E$, with the adjacency matrix $A = [A_{ij}]$; $(v_j, v_i)$ represents an edge from node j to node i. We consider a time-invariant communication topology, so $A_{ij}$ is constant. $A_{ij}$ is the weight of edge $(v_j, v_i)$, with $A_{ij} > 0$ if $(v_j, v_i) \in E$ and $A_{ij} = 0$ otherwise.
Therefore, in this paper, based on local observations and limited communication, the DG secondary voltage control process is described as a multi-agent extension of the Markov decision process (MDP), namely a POMDP. Figure 3 depicts the schematic diagram of the secondary voltage control process of the microgrid system using the MARL framework, in which the microgrid comprises an information layer and a physical layer. Each DES in the information layer is regarded as an agent. It not only needs to obtain its own state value and reward value but also needs to exchange state values with its neighbors through the communication network. The obtained information is aggregated into the observation value of the agent through the attention mechanism. After processing, encoding, and decoding, the information is sent as input to the respective actor network and critic network, and the strategy and the approximation of the value function are then obtained. The final output action value realizes the secondary voltage control and further interacts with the environment, regarded as the physical layer, to generate the next state. The relevant elements of the proposed MARL algorithm are as follows.
(1) Action Space: We select the voltage set-point in the droop control as the action value of the agent, taking 10 evenly spaced values from 1.00 to 1.14 p.u. (see the illustrative sketch after this list). The joint action space is expressed as $A = A_1 \times A_2 \times \cdots \times A_N$, where each $A_i$ is the set of candidate set-points $v_{ni}$.
(2) State Space: In order to reduce the communication pressure, we select as few parameters as possible to describe the operating state of the system as state variables while ensuring sufficient information, and obtain a state space containing necessary information as follows
$$s_{i,t} = \big(\delta_i,\; P_i,\; Q_i,\; i_{odi},\; i_{oqi},\; i_{bdi},\; i_{bqi},\; v_{bdi},\; v_{bqi}\big)$$
where $\delta_i$ is the phase angle difference between the reference frame of each DES and the common reference frame; $P_i$ and $Q_i$ represent the active and reactive power, respectively; $i_{odi}$ (A) and $i_{oqi}$ (A) denote the d-axis and q-axis components of the output current after passing through the LC filter. Meanwhile, $i_{bdi}$ (A), $i_{bqi}$ (A), $v_{bdi}$ (kV), and $v_{bqi}$ (kV) represent the direct- and quadrature-axis components of the current and voltage on the line connected to the main grid. The total state space is represented as the Cartesian product of the individual agents' states, $S(t) = s_1 \times s_2 \times \cdots \times s_N$.
(3) Observation Space: Assuming that communication between agents is limited to neighbors, the observation of each agent $o_{i,t} = \{s_{i,t}, m_{i,t}\}$ includes the aggregated message from its neighbors $m_{i,t}$ in addition to its own state values. We elaborate on how to generate $m_{i,t}$ using attention mechanisms in Section 3.
(4) Transition Probability: The transition probability $T(s' \mid s, a)$ describes the dynamic behavior of the agents in the DES system. We follow the model proposed in [35] to build an environment for simulating microgrid operating conditions. The model parameters used in this process are not involved in the design of the MARL algorithm in the next section because our method is data-driven.
(5) Reward Function: A well-designed reward function provides optimal guidance for learning the best strategy. Reference [37] only uses $v_{odi}$ to calculate the reward function. Although the voltage of each DG can be adjusted, the bus voltage may still exceed the normal range. Therefore, we include the bus voltage value $v_{bus}$ in the reward function and define an average relative error $error = \frac{1}{n_{bus}}\sum_{i}\frac{|v_{nom} - v_{bus,i}|}{v_{nom}}$. We design a reward function that includes the error of the DGs to help the training process quickly converge to the reference value (e.g., 1 p.u.).
$$r_{i,t} = \begin{cases} -(error - 0.05)\times n_{agent}, & error \le 0.05 \\ -(error - 0.05)\times 4 \times n_{agent}, & 0.05 < error \le 0.25 \\ -10 \times n_{agent}, & \text{otherwise} \end{cases}$$
where $v_{nom}$ represents the nominal bus voltage of the VCVSI and $r_{i,t}$ denotes the immediate reward obtained by agent i. Three working regions are set according to the voltage level: normal operating conditions ($error \le 0.05$), the heavy load region ($0.05 < error \le 0.25$), and the divergence region ($error > 0.25$). DGs that lead to voltage divergence or no power flow solution receive larger penalties, while DGs whose voltage amplitude is closer to the set value of 1 p.u. receive higher rewards.
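A minimal sketch of how the discrete action grid and the reward described above could be computed from the power-flow result is given below. The evenly spaced set-points follow the action-space definition, while the exact slope of each reward segment is an assumption reconstructed from the piecewise description, not the authors' code.

```python
import numpy as np

# action space: 10 evenly spaced secondary voltage set-points from 1.00 to 1.14 p.u.
ACTIONS = np.linspace(1.00, 1.14, 10)

def reward(v_bus, n_agent, v_nom=1.0):
    """Reward built from the average relative bus-voltage error; the three
    branches mirror the normal / heavy-load / divergence regions."""
    error = np.mean(np.abs(v_nom - np.asarray(v_bus)) / v_nom)
    if error <= 0.05:
        return -(error - 0.05) * n_agent       # small positive reward near 1 p.u. (assumed slope)
    elif error <= 0.25:
        return -(error - 0.05) * 4 * n_agent   # growing penalty in the heavy-load region (assumed slope)
    return -10.0 * n_agent                     # divergence or no power-flow solution

# usage: six DG terminal voltages read from a power-flow solution (values are arbitrary)
r = reward([1.01, 0.99, 1.02, 0.98, 1.00, 1.03], n_agent=6)
```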

3. Methods

In the previous section, we described in detail the elements required to formulate the voltage control problem within a MARL framework. In this section, a multi-agent advantage actor-critic (MA2C) algorithm is proposed to solve this POMDP; it improves on the IA2C algorithm in two ways. Firstly, each agent can obtain the policies of adjacent agents, including observable states and fingerprints. Secondly, a spatial discount factor is introduced to attenuate the rewards of distant agents, enabling each agent to focus more on nearby conditions. With these improvements, the local rewards, value loss, and policy loss all account for the influence of neighboring agents. The proposed method is data-driven and combines centralized training, decentralized execution, and rule integration, meeting the requirements of smart grid operation. Finally, an attention mechanism is introduced to improve communication efficiency.

3.1. MA2C Algorithm

In this section, we first introduce the independent advantage actor-critic (IA2C) algorithm. In IA2C, there is no information exchange between agents, and each agent trains its own critic network $V_{\omega_i}$ and actor network $\pi_{\theta_i}$, both approximated by deep neural networks. We further supplement the POMDP definition given in Section 2. If the neighbors of agent i are $j \in N_i$, the local neighborhood set can be written as $V_i = N_i \cup \{i\}$. Furthermore, we define the corresponding local reward, policy loss, and value loss for each agent.
The A2C framework is the mainstream architecture in RL. The independent A2C (IA2C) algorithm built on it does not consider information interaction among multiple agents: under the assumption that the global reward and state can be shared between agents, each agent trains its own policy network and value network independently. The multi-agent A2C (MA2C) takes the information interaction between agents into account; that is, each agent can observe the action value, reward value, and state value of its neighboring agents, so the observability of each local agent improves, which facilitates multi-agent cooperation and coordination. The local reward of centralized A2C, which is an extension of IA2C, can be written as
$$R_{t,i} = \hat{R}_t + \gamma^{t_B - t}\, V_{\omega_i}\big(s_{t_B} \mid \pi_{\theta_{-i}}\big)$$
The environment we define is partially observable, so each agent can only communicate with agents in its own neighborhood. In other words, the inputs of the actor network and critic network should be replaced with $s_{t,V_i} = \{s_{t,j}\}_{j \in V_i}$ instead of $s_t$. Therefore, the loss function of the value network should be written as:
$$L(\omega_i) = \frac{1}{2|B|}\sum_{t \in B}\big(R_{t,i} - V_{\omega_i}(s_{t,V_i})\big)^2$$
Similarly, the advantage function should be rewritten as $A_{t,i} = R_{t,i} - V_{\omega_i}(s_{t,V_i})$. Therefore, the loss function of the policy network should be written as:
$$L(\theta_i) = -\frac{1}{|B|}\sum_{t \in B}\log\pi_{\theta_i}(a_{t,i} \mid s_{t,V_i})\,A_{t,i}$$
For such a multi-agent environment, using traditional reinforcement learning algorithms separately for each agent is not suitable because from each agent’s perspective, the environment is nonstationary. Therefore, we follow the actor-critic framework and propose two improvements to stabilize the convergence of IA2C and enhance its fitting ability.
Firstly, we incorporate neighboring policy information to enrich the observations of the agents. MA2C considers the information interaction among agents, allowing each agent to acquire the actor network parameters of its adjacent agents, thus enhancing the observability of each agent and facilitating better coordination among them. As the policy information in the A2C framework is explicitly parameterized in the policy function $\pi_{\theta_i}$, and considering that the power system state, and hence the policy, remains similar between the current and previous time steps, we only consider short-term policy sharing among neighbors. Thus, we use the current local neighborhood state $s_{t,V_i}$ and the latest policies sampled from neighboring agents $\pi_{t-1,N_i} = [\pi_{t-1,j}]_{j \in N_i}$ as inputs to the DNNs [38], and the locally sampled policy is represented as
$$\pi_{t,i} = \pi_{\theta_i}\big(\cdot \mid s_{t,V_i}, \pi_{t-1,N_i}\big)$$
Secondly, to adjust the degree of cooperation among agents and reduce the impact of state and reward from neighbors, a spatial discount factor  α  is introduced into the calculation of the global reward value
$$\tilde{r}_{t,i} = \sum_{d=0}^{D_i}\Big(\sum_{j \in V,\, d(i,j)=d}\alpha^{d}\, r_{t,j}\Big)$$
The spatial discount factor $\alpha$ is different from the time discount factor $\gamma$, as it attenuates signals according to spatial order rather than temporal order. Here, $D_i$ is the maximum distance to agent i and $d(i,j)$ denotes the distance between agents i and j. After the discount, the global reward of each agent is no longer identical but is adjusted according to the distance of its neighbors, so that the control strategy achieves a more appropriate balance between exploration and exploitation. Similarly, we discount the neighboring states using $\alpha$
$$\tilde{s}_{t,V_i} = [s_{t,i}] \cup \alpha\,[s_{t,j}]_{j \in N_i}$$
The discounted estimate of the local return is $\hat{R}_{t,i} = \sum_{\tau=t}^{t_B-1}\gamma^{\tau-t}\tilde{r}_{\tau,i}$; therefore, the discounted estimated local reward should be represented as
$$\tilde{R}_{t,i} = \hat{R}_{t,i} + \gamma^{t_B - t}\, V_{\omega_i}\big(\tilde{s}_{t_B,V_i}, \pi_{t_B-1,N_i} \mid \pi_{\theta_{-i}}\big)$$
The loss function of the value network could be modified as
$$L(\omega_i) = \frac{1}{2|B|}\sum_{t \in B}\big(\tilde{R}_{t,i} - V_{\omega_i}(\tilde{s}_{t,V_i}, \pi_{t-1,N_i})\big)^2$$
With the fingerprint information $\pi_{t-1,N_i}$ as input, the value network can capture the impact of the neighboring policies, and the spatially discounted rewards $\tilde{R}_{t,i}$ are more relevant to the local state, making the improved value network more stable. Considering the discounted advantage function $\tilde{A}_{t,i} = \tilde{R}_{t,i} - V_{\omega_i}(\tilde{s}_{t,V_i}, \pi_{t-1,N_i})$, the loss function of the policy network is written as follows
$$L(\theta_i) = -\frac{1}{|B|}\sum_{t \in B}\Big(\log\pi_{\theta_i}(a_{t,i} \mid \tilde{s}_{t,V_i}, \pi_{t-1,N_i})\,\tilde{A}_{t,i} - \beta\sum_{a_i \in A_i}\pi_{\theta_i}\log\pi_{\theta_i}(a_i \mid \tilde{s}_{t,V_i}, \pi_{t-1,N_i})\Big)$$
The second term is a regularization term we add: by introducing an entropy loss for the policy $\pi_{\theta_i}$, it increases exploration in the early stage of training. It also emphasizes the marginal effect of $\pi_{\theta_i}$ on the states of neighboring agents, thereby preventing the agent from converging to a suboptimal policy.
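The following sketch illustrates, under assumed data structures, how the spatial discount factor α could scale neighbors' rewards and states before the discounted return enters the MA2C losses. It is a schematic reconstruction of the formulas above rather than the authors' implementation; the distances and state vectors are placeholder inputs.

```python
import numpy as np

def spatially_discounted_reward(rewards, distances, alpha=0.9):
    """r~_{t,i} = sum over agents j of alpha^d(i,j) * r_{t,j};
    rewards[j] is r_{t,j} and distances[j] is d(i, j), with distances[i] = 0."""
    return sum(alpha ** d * r for r, d in zip(rewards, distances))

def discounted_neighborhood_state(own_state, neighbor_states, alpha=0.9):
    """s~_{t,Vi}: the agent's own state concatenated with alpha-scaled neighbor states."""
    return np.concatenate([own_state] + [alpha * np.asarray(s) for s in neighbor_states])

# usage for agent i with two one-hop neighbors (all numbers are arbitrary)
r_tilde = spatially_discounted_reward(rewards=[0.20, 0.10, -0.05], distances=[0, 1, 1])
s_tilde = discounted_neighborhood_state(np.array([1.0, 0.5]),
                                        [np.array([0.9, 0.4]), np.array([1.1, 0.6])])
```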

3.2. Attention Mechanisms

MARL algorithms are challenging to train because of the large number of agents, action spaces, and dimensions, leading to long training times and difficulty in achieving convergence [39]. To overcome this training complexity, a voltage control strategy for DGs based on MARL and an attention mechanism is proposed. The advantage of the soft attention mechanism is that higher rewards can be obtained by selectively focusing on the actions of other nodes. The proposed algorithm can effectively utilize neighbors' parameters to train its own network, and its convergence speed is faster than that of traditional RL.
The attention mechanism originates from researchers’ study of human visual observation and has been widely applied to computer vision tasks [40]. This mechanism can allocate a large attention coefficient to the parts that need to be focused on according to the criticality of the features, allowing the algorithm to adaptively calibrate the extracted feature information for effective training. The soft attention mechanism selectively ignores some information and aggregates the remaining information through weighted summation. This can separate important information and avoid interference from unimportant information, thus improving accuracy.
In this paper, an attention mechanism is introduced into the centralized MA2C. The computation process of the soft attention mechanism includes two steps. The first step is to calculate the weight coefficients based on the query and the key. The second step is to perform a weighted summation of the value based on the weight coefficients.
The input vector $(K, V) = [(k_1, v_1), (k_2, v_2), \dots, (k_N, v_N)]$ appears in the form of key-value pairs. First, the similarity or correlation between the query vector and each key vector is calculated. A scoring function $f(k_i, q)$ is defined to determine the importance score, and the softmax function is used to normalize the original scores as:
$$W_i = \mathrm{softmax}\big(f(k_i, q)\big) = \frac{\exp\big(f(k_i, q)\big)}{\sum_{j=1}^{N}\exp\big(f(k_j, q)\big)}$$
After obtaining the attention distribution based on the query vector and key-value pairs, the input information is weighted and summed up according to the attention weights as:
$$V = \sum_{i=1}^{N} W_i \cdot v_i$$
We embed the soft attention mechanism into the communication network, and the structure of the algorithm is shown in Figure 4. h is the hidden layer, $m_i$ is the observation of each agent, and the global information $M_i$ is obtained after encoding and decoding by the neural network. Since the neural network is a fully connected structure, it does not by itself distinguish the importance of different pieces of information. In order to improve communication efficiency, we introduce the attention mechanism into the action network to selectively process observations, which are finally input into the policy network to participate in the update and obtain new action values. The structure that handles observation messages is called a message coordinator, which contains a predefined importance score. The superscript q denotes the query feature space, and the superscript k denotes the key feature space. The following formulas show that the importance score of agent j to agent i is obtained from the query and key and then normalized with the softmax function; the importance scores of each agent relative to all its neighbors sum to 1.
$$W_{ij} = (m_i^{q})^{T}\,(m_j^{k}), \quad i \neq j$$
$$W_i = [W_{i1}, \dots, W_{ij}, \dots, W_{iN}] = \mathrm{softmax}\big(W_{i1}, \dots, W_{ij}, \dots, W_{iN}\big)$$
$$\sum_{j=1}^{N} W_{ij} = 1, \quad i \neq j$$
After calculating the importance scores, the global message of each agent can be obtained by taking the importance-score-weighted sum of all its neighbors' observations.
$$M_i = \sum_{j=1}^{N} W_{ij}\, m_j$$
where $W_{ij}$ is the weight of $m_j$ with respect to $m_i$. The soft attention mechanism enables the agent to selectively process information during decision making and to filter out information that does not affect the strategy by focusing on specific observations. The mechanism is fully differentiable, allowing end-to-end training of the model parameters, and is therefore suitable for gradient backpropagation in neural networks and for combination with MARL. In addition, the attention weights computed by the soft attention mechanism can be interpreted as a correlation measure, which gives the method strong interpretability and provides a reference for tuning parameters and inspecting data. All in all, the soft attention mechanism is suitable for MARL-based microgrid control due to its differentiability, interpretability, and robustness.
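As a concrete sketch of the message coordinator, the NumPy code below scores each neighbor message with a dot-product comparison between a query projection of the agent's own observation and key projections of the neighbors' observations, normalizes the scores with a softmax, and returns the weighted global message $M_i$. The random projection matrices stand in for the learned encoder and are purely illustrative.

```python
import numpy as np

rng = np.random.default_rng(0)
d_msg, d_feat = 8, 16
W_q = rng.normal(size=(d_feat, d_msg))    # query projection (placeholder for the learned encoder)
W_k = rng.normal(size=(d_feat, d_msg))    # key projection (placeholder for the learned encoder)

def coordinate_messages(m_i, neighbor_msgs):
    """Soft attention over neighbor observations: W_ij = (q_i)^T k_j,
    weights = softmax(W_ij), M_i = sum_j weights_j * m_j."""
    q_i = W_q @ m_i
    keys = np.stack([W_k @ m_j for m_j in neighbor_msgs])   # one key per neighbor
    scores = keys @ q_i                                     # dot-product importance scores
    weights = np.exp(scores - scores.max())
    weights /= weights.sum()                                # scores over neighbors sum to 1
    return (weights[:, None] * np.stack(neighbor_msgs)).sum(axis=0)

# usage: agent i aggregates the observations of three neighbors
m_i = rng.normal(size=d_msg)
M_i = coordinate_messages(m_i, [rng.normal(size=d_msg) for _ in range(3)])
```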
The complete MA2C algorithm embedded with the attention mechanism follows the procedure outlined in Algorithm 1. The hyperparameters include the spatial discount factor $\alpha$, the learning rate of the actor network $\eta_\theta$, the learning rate of the critic network $\eta_\omega$, the time discount factor $\gamma$, the number of time steps per episode L, and the total number of training episodes S. To clarify further, the agents interact with the environment for a substantial number of epochs (lines 2–25). In lines 4–7, agents share their state information and neighbor information through the message coordinator network, where importance scores are computed using the “query” and “key” feature spaces. Then each agent’s state and policy information is reassembled, encoded, processed with the attention mechanism, and sent as a message $m_{i,t}$ to its neighbors (line 11). Subsequently, the agents update their states and the parameters of the policy network (lines 9–10), and the state values are updated (lines 12–15). Afterwards, the actions are applied to the environment (line 16), and each agent transitions to a new state and immediately receives a reward (line 17), which is stored in the policy experience buffer (line 18). The policy gradient algorithm is employed to update the network parameters after each episode by randomly selecting trajectories from the policy experience buffer (lines 20–25). If an episode is completed or there is no power flow solution, a DONE signal is flagged. Once the DONE signal is received, all agents’ states are reset to their initial values, signifying the beginning of a new epoch.
The centralized training, decentralized execution approach is employed in our method [41]. In centralized training, the critic network evaluates and guides the update of the actor network by paying attention to valuable local observations in the environment, evaluating the data collected during decentralized execution, and using gradients to guide the actor network’s parameter updates, thus maximizing the overall expected return. During decentralized execution, the actor networks are controlled separately. Upon completion of the training process, the DNN parameters are set as constant, while the policy networks of agents are retained to facilitate real-time voltage regulation.
Algorithm 1: AM-MA2C algorithm
[The pseudocode of Algorithm 1 is provided as a figure in the original publication.]
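Since Algorithm 1 is reproduced only as a figure in the source, the skeleton below paraphrases the procedure described in the text: message exchange through the coordinator, decentralized action selection, the environment step with reward collection, buffer storage, and end-of-episode updates of the critic and actor. All class and method names (env, agents, buffer, etc.) are hypothetical placeholders, not an interface defined by the paper.

```python
def train_am_ma2c(env, agents, episodes=2000, steps_per_episode=20):
    """Schematic AM-MA2C loop: centralized training, decentralized execution."""
    for episode in range(episodes):
        states = env.reset()                                  # reset all agents' states
        for _ in range(steps_per_episode):
            # 1. each agent aggregates neighbor information via the attention-based coordinator
            messages = [ag.coordinate_messages(states, env.neighbors(i))
                        for i, ag in enumerate(agents)]
            # 2. decentralized execution: each actor picks a voltage set-point locally
            actions = [ag.act(states[i], messages[i]) for i, ag in enumerate(agents)]
            # 3. apply the secondary-control actions; observe rewards and next states
            next_states, rewards, done = env.step(actions)
            for i, ag in enumerate(agents):
                ag.buffer.store(states[i], messages[i], actions[i],
                                rewards[i], next_states[i])
            states = next_states
            if done:                                          # divergence or no power-flow solution
                break
        for ag in agents:                                     # centralized training after each episode
            ag.update_critic()                                # minimize the value loss L(omega_i)
            ag.update_actor()                                 # policy gradient step on L(theta_i)
```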

4. Simulation

We applied the proposed voltage control method to an improved IEEE 34-bus system (see Figure 5). The network consists of 6 DGs and the associated feeder topology. Table 1 presents the key characteristics of the DGs, lines, and loads. Using pandapower, the method was implemented in Python to enable power system modeling. The simulation platform was built with the line and load specifications detailed in Reference [42]. To simulate real-world operating conditions, the microgrid was subjected to random load adjustments, resulting in deviations of ±20% from the expected values. Moreover, we introduced perturbations of ±5% to each load to simulate load disturbances. The monitoring sampling time for all DGs was set to 0.05 s, and each DG was able to communicate with its adjacent DGs. The primary control at the lower level was implemented using the method proposed in Reference [43].
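As a hint of how the load randomization could be set up with pandapower, the sketch below builds a tiny stand-in feeder (the real platform follows the IEEE 34-bus data of Reference [42]), scales every load by a random ±20% deviation plus a ±5% per-step disturbance, and reads back the bus voltages that feed the MARL state; the network data here are illustrative only.

```python
import numpy as np
import pandapower as pp

def build_stub_feeder():
    """Tiny stand-in network; the actual platform follows the IEEE 34-bus data in Ref. [42]."""
    net = pp.create_empty_network()
    b0 = pp.create_bus(net, vn_kv=0.4)
    b1 = pp.create_bus(net, vn_kv=0.4)
    pp.create_ext_grid(net, b0)
    pp.create_line_from_parameters(net, b0, b1, length_km=0.1, r_ohm_per_km=0.3,
                                   x_ohm_per_km=0.1, c_nf_per_km=0.0, max_i_ka=0.4)
    pp.create_load(net, b1, p_mw=0.05, q_mvar=0.02)
    return net

def randomize_loads(net, rng, base_dev=0.20, step_dev=0.05):
    """Apply a random +/-20% load deviation plus a +/-5% per-step disturbance."""
    scale = (1.0 + rng.uniform(-base_dev, base_dev, len(net.load))
                 + rng.uniform(-step_dev, step_dev, len(net.load)))
    net.load["p_mw"] *= scale
    net.load["q_mvar"] *= scale

rng = np.random.default_rng(42)
net = build_stub_feeder()
randomize_loads(net, rng)
pp.runpp(net)                        # power flow; terminal voltages feed the MARL state
print(net.res_bus.vm_pu)
```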
The spatial discount factor is set to $\alpha = 0.9$, while the time discount factor $\gamma$ is set to 0.99. The actor learning rate is set to $\eta_\theta = 5\times 10^{-4}$, and the critic learning rate is set to $\eta_\omega = 2.5\times 10^{-4}$. Each model is trained for 2000 episodes, and each episode spans T = 20 steps. To ensure a fair comparison, a different random seed is generated for each episode, and consistent training environments are ensured across the different algorithms through the use of a shared random seed. We compare the proposed attention-based MA2C method with the standard MA2C method and the IA2C method to demonstrate its effectiveness.
Figure 6 shows the reward curves obtained by training the MARL algorithms on the improved IEEE 34-node system. It is evident that adding the communication network and an appropriate spatial discount factor enhances sample utilization and accelerates training. The attention-based MA2C algorithm converges to higher reward values. After 2000 training episodes, the obtained policies were evaluated for 20 episodes under various load disruptions, with each agent using the same random seed in each training episode.
To exhibit the performance of our strategy, we simulated the transient process of voltage control in a microgrid system. Assuming an initial voltage of 1.0 p.u. and a frequency of 60 Hz, the simulation time was set to 1.5 s. At 0.4 s, the voltage secondary control was introduced using the MA2C algorithm with attention mechanism, which was trained in advance. The results of the simulation are depicted in Figure 7.
As shown in the figure, from 0 s to 0.4 s the voltage dropped slightly below the set value because only droop control was applied. After the secondary control strategy was introduced, the terminal voltage amplitudes of the microgrid were synchronized to 1.0 p.u. after an adjustment time of 0.15 s.
We also compared the voltage control performance of the different algorithms under large load changes (±25%). The voltages of the six DGs under the different algorithms are shown in Figure 8. The horizontal axis (0 to 5) indexes the terminal buses of the six DGs, and the vertical axis represents the voltage in per unit. We aim to maintain the voltage of all DGs in a range around 1 p.u. In this context, “r” denotes the average reward over the past five control steps, and the blue dots indicate the magnitude of the voltage values. As shown in Figure 8, all algorithms are able to regulate the voltage within the acceptable range; however, the voltage amplitudes are closer to the reference with the proposed algorithm.
In order to assess the efficacy of the proposed methodology, we performed random testing on the trained models. Table 2 lists the results of experiments with different random seeds. The average reward of the improved MA2C algorithm with the attention mechanism is 0.23, which is higher than that of the MA2C (0.21) and IA2C (0.19) algorithms.

5. Discussion

In this paper, we formulate the secondary voltage control problem as a POMDP that can be solved through reinforcement learning algorithms. We develop a new collaborative MA2C algorithm with an attention mechanism by introducing a communication network and an appropriate spatial discount factor. Comprehensive experiments show that the algorithm is able to quickly restore the DG bus voltages to the reference, with stable voltage fluctuations within the acceptable range. The algorithm achieves better results in terms of convergence speed and voltage control performance. Moreover, the proposed control strategy only requires a communication network for exchanging state information between neighboring agents. Consequently, the proposed MARL algorithm can meet the demand for large-scale energy storage supporting the grid.
This indicates the huge potential of applying deep reinforcement learning in microgrid control, although many challenges remain in scalability, generalization, security, and stability. In the future, we will work on building a more complex environment based on real power system data and apply MARL algorithms to the more complex secondary frequency restoration problem. Microgrid systems are highly cost-sensitive, and incorrect decisions can cause voltage violations with serious consequences. Therefore, in the future, we will also consider applying safe reinforcement learning to microgrid control.

Author Contributions

Methodology, T.W. and S.M.; software, Z.T.; formal analysis, T.X.; investigation, Y.J.; writing—original draft, Z.T.; writing—review and editing, C.M. All authors have read and agreed to the published version of the manuscript.

Funding

This research was funded by the Science and Technology Project of the State Grid Corporation of China (KJ22-1-63: Research on collaborative optimization method of multi-agent reinforcement learning in distributed information energy system).

Data Availability Statement

No data are available for this study.

Conflicts of Interest

The authors declare no conflict of interest.

References

  1. Li, H.; Li, F.; Xu, Y.; Rizy, D.T.; Adhikari, S. Autonomous and adaptive voltage control using multiple distributed energy resources. IEEE Trans. Power Syst. 2012, 28, 718–730. [Google Scholar] [CrossRef]
  2. Olivares, D.E.; Mehrizi-Sani, A.; Etemadi, A.H.; Cañizares, C.A.; Iravani, R.; Kazerani, M.; Hajimiragha, A.H.; Gomis-Bellmunt, O.; Saeedifard, M.; Palma-Behnke, R.; et al. Trends in microgrid control. IEEE Trans. Smart Grid 2014, 5, 1905–1919. [Google Scholar] [CrossRef]
  3. Korres, G.N.; Manousakis, N.M.; Xygkis, T.C.; Löfberg, J. Optimal phasor measurement unit placement for numerical observability in the presence of conventional measurements using semi-definite programming. IET Gener. Transm. Distrib. 2015, 9, 2427–2436. [Google Scholar] [CrossRef]
  4. Su, H.Y.; Kang, F.M.; Liu, C.W. Transmission grid secondary voltage control method using PMU data. IEEE Trans. Smart Grid 2016, 9, 2908–2917. [Google Scholar] [CrossRef]
  5. Wu, D.; Tang, F.; Dragicevic, T.; Vasquez, J.C.; Guerrero, J.M. A control architecture to coordinate renewable energy sources and energy storage systems in islanded microgrids. IEEE Trans. Smart Grid 2014, 6, 1156–1166. [Google Scholar] [CrossRef] [Green Version]
  6. Bidram, A.; Davoudi, A. Hierarchical structure of microgrids control system. IEEE Trans. Smart Grid 2012, 3, 1963–1976. [Google Scholar] [CrossRef]
  7. She, B.; Li, F.; Cui, H.; Wang, J.; Min, L.; Oboreh-Snapps, O.; Bo, R. Decentralized and coordinated V-f control for islanded microgrids considering der inadequacy and demand control. IEEE Trans. Energy Convers. 2023, 1–13, early access. [Google Scholar] [CrossRef]
  8. Rokrok, E.; Shafie-Khah, M.; Catalão, J.P. Review of primary voltage and frequency control methods for inverter-based islanded microgrids with distributed generation. Renew. Sustain. Energy Rev. 2018, 82, 3225–3235. [Google Scholar] [CrossRef]
  9. Xie, P.; Guerrero, J.M.; Tan, S.; Bazmohammadi, N.; Vasquez, J.C.; Mehrzadi, M.; Al-Turki, Y. Optimization-based power and energy management system in shipboard microgrid: A review. IEEE Syst. J. 2022, 16, 578–590. [Google Scholar] [CrossRef]
  10. Singh, P.; Paliwal, P.; Arya, A. A review on challenges and techniques for secondary control of microgrid. IOP Conf. Ser. Mater. Sci. Eng. 2019, 561, 012075. [Google Scholar] [CrossRef] [Green Version]
  11. Babayomi, O.; Zhang, Z.; Dragicevic, T.; Heydari, R.; Li, Y.; Garcia, C.; Rodriguez, J.; Kennel, R. Advances and opportunities in the model predictive control of microgrids: Part II–Secondary and tertiary layers. Int. J. Electr. Power Energy Syst. 2022, 134, 107339. [Google Scholar] [CrossRef]
  12. Bidram, A.; Davoudi, A.; Lewis, F.L.; Guerrero, J.M. Distributed cooperative secondary control of microgrids using feedback linearization. IEEE Trans. Power Syst. 2013, 28, 3462–3470. [Google Scholar] [CrossRef] [Green Version]
  13. Ning, B.; Han, Q.L.; Ding, L. Distributed finite-time secondary frequency and voltage control for islanded microgrids with communication delays and switching topologies. IEEE Trans. Cybern. 2020, 51, 3988–3999. [Google Scholar] [CrossRef] [PubMed]
  14. Mu, C.; Zhang, Y.; Jia, H.; He, H. Energy-storage-based intelligent frequency control of microgrid with stochastic model uncertainties. IEEE Trans. Smart Grid 2019, 11, 1748–1758. [Google Scholar] [CrossRef]
  15. Moharm, K. State of the art in big data applications in microgrid: A review. Adv. Eng. Inform. 2019, 42, 100945. [Google Scholar] [CrossRef]
  16. Mohammadi, F.; Mohammadi-Ivatloo, B.; Gharehpetian, G.B.; Ali, M.H.; Wei, W.; Erdinç, O.; Shirkhani, M. Robust control strategies for microgrids: A review. IEEE Syst. J. 2022, 16, 2401–2412. [Google Scholar] [CrossRef]
  17. Mu, C.; Wang, K.; Ma, S.; Chong, Z.; Ni, Z. Adaptive composite frequency control of power systems using reinforcement learning. CAAI Trans. Intell. Technol. 2022, 7, 671–684. [Google Scholar] [CrossRef]
  18. Nikmehr, N.; Ravadanegh, S.N. Optimal power dispatch of multi-microgrids at future smart distribution grids. IEEE Trans. Smart Grid 2015, 6, 1648–1657. [Google Scholar] [CrossRef]
  19. Wei, F.; Wan, Z.; He, H. Cyber-attack recovery strategy for smart grid based on deep reinforcement learning. IEEE Trans. Smart Grid 2019, 11, 2476–2486. [Google Scholar] [CrossRef]
  20. Zhang, Q.; Dehghanpour, K.; Wang, Z.; Qiu, F.; Zhao, D. Multi-agent safe policy learning for power management of networked microgrids. IEEE Trans. Smart Grid 2020, 12, 1048–1062. [Google Scholar] [CrossRef]
  21. Han, X.; Mu, C.; Yan, J.; Niu, Z. An autonomous control technology based on deep reinforcement learning for optimal active power dispatch. Int. J. Electr. Power Energy Syst. 2023, 145, 108686. [Google Scholar] [CrossRef]
  22. Duan, J.; Shi, D.; Diao, R.; Li, H.; Wang, Z.; Zhang, B.; Bian, D.; Yi, Z. Deep-reinforcement-learning-based autonomous voltage control for power grid operations. IEEE Trans. Power Syst. 2019, 35, 814–817. [Google Scholar] [CrossRef]
  23. Gao, Y.; Wang, W.; Yu, N. Consensus multi-agent reinforcement learning for volt-var control in power distribution networks. IEEE Trans. Smart Grid 2021, 12, 3594–3604. [Google Scholar] [CrossRef]
  24. Cao, D.; Zhao, J.; Hu, W.; Ding, F.; Huang, Q.; Chen, Z. Attention enabled multi-agent DRL for decentralized volt-VAR control of active distribution system using PV inverters and SVCs. IEEE Trans. Sustain. Energy 2021, 12, 1582–1592. [Google Scholar] [CrossRef]
  25. Zhang, X.; Liu, Y.; Duan, J.; Qiu, G.; Liu, T.; Liu, J. DDPG-based multi-agent framework for SVC tuning in urban power grid with renewable energy resources. IEEE Trans. Power Syst. 2021, 36, 5465–5475. [Google Scholar] [CrossRef]
  26. Singh, V.P.; Kishor, N.; Samuel, P. Distributed multi-agent system-based load frequency control for multi-area power system in smart grid. IEEE Trans. Ind. Electron. 2017, 64, 5151–5160. [Google Scholar] [CrossRef]
  27. Yu, T.; Wang, H.; Zhou, B.; Chan, K.W.; Tang, J. Multi-agent correlated equilibrium Q(λ) learning for coordinated smart generation control of interconnected power grids. IEEE Trans. Power Syst. 2014, 30, 1669–1679. [Google Scholar] [CrossRef]
  28. Oroojlooy, A.; Hajinezhad, D. A review of cooperative multi-agent deep reinforcement learning. Appl. Intell. 2022, 53, 13677–13722. [Google Scholar] [CrossRef]
  29. Wang, J.; Xu, W.; Gu, Y.; Song, W.; Green, T.C. Multi-agent reinforcement learning for active voltage control on power distribution networks. Adv. Neural Inf. Process. Syst. 2021, 34, 3271–3284. [Google Scholar]
  30. Zhang, J.; Lu, C.; Si, J.; Song, J.; Su, Y. Deep reinforcement learning for short-term voltage control by dynamic load shedding in China Southern Power Grid. In Proceedings of the 2018 International Joint Conference on Neural Networks (IJCNN), Rio de Janeiro, Brazil, 8–13 July 2018; IEEE: Manhattan, NY, USA, 2018; pp. 1–8. [Google Scholar]
  31. Wei, Y.; Yu, F.R.; Song, M.; Han, Z. User scheduling and resource allocation in HetNets with hybrid energy supply: An actor-critic reinforcement learning approach. IEEE Trans. Wirel. Commun. 2017, 17, 680–692. [Google Scholar] [CrossRef]
  32. Bidram, A.; Lewis, F.L.; Davoudi, A.; Qu, Z. Frequency control of electric power microgrids using distributed cooperative control of multi-agent systems. In Proceedings of the 2013 IEEE International Conference on Cyber Technology in Automation, Control and Intelligent Systems, Nanjing, China, 26–29 May 2013; IEEE: Manhattan, NY, USA, 2013; pp. 223–228. [Google Scholar]
  33. Qu, Z.; Peng, J.C.H.; Yang, H.; Srinivasan, D. Modeling and analysis of inner controls effects on damping and synchronizing torque components in VSG-controlled converter. IEEE Trans. Energy Convers. 2020, 36, 488–499. [Google Scholar] [CrossRef]
  34. Guo, F.; Wen, C.; Mao, J.; Song, Y.D. Distributed secondary voltage and frequency restoration control of droop-controlled inverter-based microgrids. IEEE Trans. Ind. Electron. 2014, 62, 4355–4364. [Google Scholar] [CrossRef]
  35. Bidram, A.; Davoudi, A.; Lewis, F.L. A multiobjective distributed control framework for islanded AC microgrids. IEEE Trans. Ind. Inform. 2014, 10, 1785–1798. [Google Scholar] [CrossRef]
  36. Bakakeu, J.; Baer, S.; Klos, H.H.; Peschke, J.; Brossog, M.; Franke, J. Multi-agent reinforcement learning for the energy optimization of cyber-physical production systems. In Proceedings of the 2020 IEEE Canadian Conference on Electrical and Computer Engineering (CCECE), London, ON, Canada, 30 August–2 September 2020; pp. 143–163. [Google Scholar]
  37. Tomin, N.; Voropai, N.; Kurbatsky, V.; Rehtanz, C. Management of voltage flexibility from inverter-based distributed generation using multi-agent reinforcement learning. Energies 2021, 14, 8270. [Google Scholar] [CrossRef]
  38. Chu, T.; Wang, J.; Codecà, L.; Li, Z. Multi-agent deep reinforcement learning for large-scale traffic signal control. IEEE Trans. Intell. Transp. Syst. 2019, 21, 1086–1095. [Google Scholar] [CrossRef] [Green Version]
  39. Zhou, Y.; Zhang, B.; Xu, C.; Lan, T.; Diao, R.; Shi, D.; Wang, Z.; Lee, W.J. A data-driven method for fast ac optimal power flow solutions via deep reinforcement learning. J. Mod. Power Syst. Clean Energy 2020, 8, 1128–1139. [Google Scholar] [CrossRef]
  40. Guo, M.H.; Xu, T.X.; Liu, J.J.; Liu, Z.N.; Jiang, P.T.; Mu, T.J.; Zhang, S.H.; Martin, R.R.; Cheng, M.M.; Hu, S.M. Attention mechanisms in computer vision: A survey. Comput. Vis. Media 2022, 8, 331–368. [Google Scholar] [CrossRef]
  41. Cao, D.; Zhao, J.; Hu, W.; Ding, F.; Huang, Q.; Chen, Z.; Blaabjerg, F. Data-driven multi-agent deep reinforcement learning for distribution system decentralized voltage control with high penetration of PVs. IEEE Trans. Smart Grid 2021, 12, 4137–4150. [Google Scholar] [CrossRef]
  42. Mwakabuta, N.; Sekar, A. Comparative study of the IEEE 34 node test feeder under practical simplifications. In Proceedings of the 2007 39th North American Power Symposium, Manhattan, NY, USA, 30 September–2 October 2007; pp. 484–491. [Google Scholar]
  43. Shafiee, Q.; Guerrero, J.M.; Vasquez, J.C. Distributed secondary control for islanded microgrids-A novel approach. IEEE Trans. Power Electron. 2013, 29, 1018–1031. [Google Scholar] [CrossRef] [Green Version]
Figure 1. Schematic diagram of agent training and interacting with the environment.
Figure 2. Large-signal model structure of distributed generators.
Figure 3. The cyber–physical architecture of microgrid voltage control based on MARL.
Figure 4. Schematic diagram of the information transmission process of the MA2C algorithm embedded with attention mechanism.
Figure 5. The IEEE 34-node system with 6 DGs for simulation.
Figure 6. The reward value curve obtained by the MARL training process under different algorithms.
Figure 7. The voltage value of agents within 1.5 s in the IEEE 34-node system.
Figure 8. The voltage distributions under different algorithms.
Table 1. Key configuration parameters of DG and load of the IEEE 34-node system.

Parameter    DG1, DG2, DG3, DG4    DG5, DG6
m_P          5.64 × 10^-5          5.64 × 10^-5
n_Q          5.2 × 10^-4           6 × 10^-4
R_c          0.03 Ω                0.03 Ω
L_c          0.35 mH               0.35 mH
ω_c          31.41                 31.41
k_p          4                     4
k_i          40                    40

Parameter    Load1      Load2      Load3     Load4
             1.5 Ω      0.5 Ω      1 Ω       0.8 Ω
             0.03 Ω     0.017 Ω    0.05 Ω    0.02 Ω
Table 2. Average reward value under random test.

Random Seeds      AM-MA2C    MA2C    IA2C
Case 1            0.24       0.18    0.19
Case 2            0.23       0.21    0.18
Case 3            0.22       0.21    0.19
Case 4            0.24       0.22    0.19
Case 5            0.22       0.21    0.20
Average reward    0.23       0.21    0.19
