Article

Assessment of Low-Carbon Flexibility in Self-Organized Virtual Power Plants Using Multi-Agent Reinforcement Learning

Gengsheng He, Yu Huang, Guori Huang, Xi Liu, Pei Li and Yan Zhang

1 Energy Development Research Institute, China Southern Power Grid, Guangzhou 510663, China
2 Electric Power Research Institute of Guizhou Power Grid Co., Ltd., Guiyang 550002, China
* Author to whom correspondence should be addressed.
Energies 2024, 17(15), 3688; https://doi.org/10.3390/en17153688
Submission received: 11 June 2024 / Revised: 12 July 2024 / Accepted: 17 July 2024 / Published: 26 July 2024
(This article belongs to the Special Issue Zero Carbon Emissions, Green Environment and Sustainable Energy)

Abstract

Virtual power plants (VPPs) aggregate large numbers of distributed energy resources (DERs) through IoT technology to provide flexibility to the grid. They are an effective means of promoting the utilization of renewable energy and enabling carbon neutrality in future power systems. This paper addresses the evaluation of DERs' low-carbon benefits and proposes a flexibility assessment model for self-organized VPPs to quantify the low-carbon value of DERs' response behavior in different time periods. First, we introduce the definition of a zero-carbon index based on the simultaneity rate between the renewable energy output curve and the load demand curve. Then, we establish a multi-level self-organized aggregation method for VPPs, define the basic rules of DER interaction, and characterize self-organized aggregation as a Markov game process. Moreover, we use QMIX to achieve a bottom-up, hierarchical construction of the VPP from simple to complex. Experimental results show that when users track the zero-carbon curve, they can achieve zero carbon emissions without reducing the overall load, significantly enhancing the grid's regulation capabilities and the consumption of renewable energy. Additionally, the self-organized algorithm can optimize the combinations of DERs to improve the coordination efficiency of VPPs in complex environments.

1. Introduction

The virtual power plant (VPP) essentially aggregates a vast number of distributed energy resources (DERs) [1]. One of its greatest advantages is the ability to leverage digital technologies to activate idle DERs, thereby unlocking their potential flexibility. This promotes the utilization of renewable energy and facilitates the development of carbon-neutral power systems [2].
VPP is a typical complex system, with its “complexity” arising from the massive, heterogeneous, and time-varying DERs that adapt to the environment, interact, and co-evolve [3]. It is crucial to use low-carbon signals to guide the aggregation of numerous DERs, integrating smaller, lower-level units into larger, higher-level aggregators. This approach aims to achieve optimal configuration and flexibility of resources, as well as low-carbon benefits on the demand side [4].
The carbon emissions generated by the power system originate entirely from the supply side. However, the way users consume electricity can significantly influence the level of renewable energy utilization. Consequently, users need to bear the equivalent indirect carbon emission responsibility. Several studies have explored the evaluation methods for user equivalent carbon emissions and carbon emission reduction strategies. For example, Reference [5] explores cost-effective and low-carbon energy provision solutions for individual small-cell mobile networks, presenting two potential frameworks: centralized and distributed energy provision. Its numerical simulation results demonstrate that the proposed centralized renewable energy generation strategy for nearby small cells maximizes both the cost and energy efficiencies of the network. Reference [6] proposes a normalized low-carbon operation scheme for parks based on the Green Power Index (GPI). Its numerical simulation results show that the GPI can effectively promote renewable energy consumption as well as normalization and sustainable carbon reduction in the park. Reference [7] introduces a new power system carbon emission reduction mechanism that encourages users to actively respond, thereby reducing power system carbon emissions. This low-carbon demand response mechanism, supported by empirical analysis, exhibits significant carbon-reduction potential from both the system and user perspectives.
In the area of low-carbon demand response, several research achievements have emerged in recent years. For example, Reference [8] considers the dual incentives of electricity and carbon, proposing a logistic sparrow search algorithm–back propagation neural network (LSSA–BPNN)-based demand response potential and carbon-reduction-potential-forecasting model for load aggregators. The effectiveness and superiority of the proposed model are verified using a real dataset from Austin. Reference [9] reveals the real impact of enterprises’ digital transformation on carbon emissions and provides guidance for enterprises to implement digital transformation and carbon emission reduction. Reference [10] proposes a distribution system planning method with a specific emission reduction target, where the emission target is embedded in the objective function via the exterior point method rather than being considered as a constraint. This planning method is tested on modified IEEE 33-bus and 123-bus benchmark systems, demonstrating that the proposed model and method can achieve a reduction in emissions of more than 30% with only a 5% increase in system costs. Reference [11] designs a joint electricity–carbon-trading framework to reduce carbon emissions through trading and demand response. Simulation studies reveal that, compared to a market without carbon trading and users without delayed settlement, the proposed mechanism achieves carbon emission reductions of 40.7% and 12.7%, respectively. Reference [12] utilizes a modified optimal power flow formulation that considers carbon costs in addition to the fuel costs of generation. Quantitative results demonstrate that incorporating E-LMPs and the active participation of flexible loads can help reduce carbon emissions, particularly with renewable resources integrated into the system.
The aforementioned research generally assumes that the entire system has a clear structure with complete information, and that the optimization process will eventually converge to a certain “steady-state” condition. In cases of system imbalance, a centralized control or coordination unit is typically used to guide the load to adjust its state until stability is achieved. However, with large-scale flexible loads, centralized control or coordination is often difficult to implement. Additionally, the operating environments of these loads can vary significantly, and their states are time-varying and subject to human intervention. Therefore, obtaining an optimal solution for large-scale load coordination control within a limited time frame is challenging.
In this regard, this paper proposes a self-organized aggregation method from a bottom-up perspective, using a multi-agent reinforcement-learning algorithm to accelerate aggregation and promote the utilization of renewable energy. The main contributions are summarized as follows:
(1) A novel zero-carbon index for low-carbon benefit evaluation is proposed. Unlike traditional methods, this index does not rely on the amount of electricity consumed but on the load behavior, namely the degree of temporal coincidence between the grid's renewable energy generation curve and the load curve.
(2) A multi-level self-organized aggregation model is established. Unlike top-down centralized aggregation methods, this model defines a basic set of rules for interactions between individual agents and achieves dynamic, multi-level aggregation from simple to complex based on the evolution of these rule combinations.
(3) An implementation method for agent self-organized aggregation based on the QMIX algorithm is realized. This algorithm optimizes aggregation using only environmental information, without relying on complete information, which accelerates convergence in complex and time-varying environments.
The remainder of this paper is organized as follows: Section 2 defines the "zero-carbon index" assessment criterion for low-carbon benefits and establishes the tracking objectives for flexible loads through self-organized aggregation. Section 3 outlines the basic rules for the bottom-up self-organized aggregation of a large number of flexible loads. The specific implementation methods are described in Section 4. Section 5 presents the results of numerical tests. Finally, the conclusions are drawn in Section 6.

2. Low-Carbon Benefit Evaluation for Load Flexibility via Zero-Carbon Index

Accurately assessing the low-carbon value of load flexibility is fundamental to the large-scale self-organized aggregation of flexible loads. Although electricity is a secondary energy source and the direct carbon emissions of the power system all originate from power plants, the stringent requirement for power balance means that the operating mode of the load significantly affects the utilization of renewable energy on the supply side [13,14]. Consequently, quantifying the low-carbon benefits of loads is crucial, and it hinges on how the adjustment of loads enhances the utilization of renewable energy sources.

2.1. Definition of Zero-Carbon Index

To evaluate the contribution of flexible loads to the utilization of renewable energy, it is essential to statistically analyze the temporal distribution of renewable energy within the system. When the system load exactly matches the output characteristics of renewable energy, there is no need for conventional units to mitigate imbalance fluctuations. In such a scenario, the original output characteristics of renewable energy can be considered the zero-carbon index [15], as illustrated in Equation (1):
$$
\begin{aligned}
E_{\mathrm{res}}(T) &= \left[ E_{\mathrm{res}}(t_1), E_{\mathrm{res}}(t_2), \ldots, E_{\mathrm{res}}(t_N) \right] \\
E_{\mathrm{tot}}(T) &= \left[ E_{\mathrm{tot}}(t_1), E_{\mathrm{tot}}(t_2), \ldots, E_{\mathrm{tot}}(t_N) \right] \\
\eta_{\mathrm{GI}}(t_i) &= E_{\mathrm{res}}(t_i) / E_{\mathrm{tot}}(t_i), \quad i = 1, 2, \ldots, N
\end{aligned}
$$
where $E_{\mathrm{res}}(T)$ represents the temporal series of the power generation of all renewable energy units during the research period T, with a total of N discrete sampling points; $E_{\mathrm{res}}(t_i)$ represents the power generation of all renewable energy units in the i-th period; $E_{\mathrm{tot}}(T)$ represents the temporal distribution of the total power generation of all units in the research period, including both renewable energy units and conventional thermal power units; and $\eta_{\mathrm{GI}}(t)$ represents the proportion of renewable energy generation in the total generation for each period within T. It reflects the temporal distribution of the natural resources behind renewable energy at the current stage and is defined as the zero-carbon index of the system.
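For illustration, the following minimal Python sketch computes the zero-carbon index of Equation (1) from per-period generation series. The array contents and variable names are our own illustrative assumptions, not data from this paper:

```python
import numpy as np

def zero_carbon_index(e_res: np.ndarray, e_tot: np.ndarray) -> np.ndarray:
    """Zero-carbon index eta_GI of Equation (1): the share of renewable
    generation in total generation at each sampling point t_1..t_N."""
    if e_res.shape != e_tot.shape:
        raise ValueError("series must cover the same sampling points")
    # Guard against periods in which no generation is recorded.
    return np.divide(e_res, e_tot, out=np.zeros_like(e_res), where=e_tot > 0)

# Hypothetical hourly example: a solar-dominated renewable fleet.
hours = np.arange(24)
e_res = np.clip(np.sin((hours - 6) / 12 * np.pi), 0, None) * 800.0  # MWh
e_tot = e_res + 1200.0                                              # MWh, adds thermal base
print(np.round(zero_carbon_index(e_res, e_tot), 3))
```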

2.2. Low-Carbon Benefits of Load Flexibility Using Zero-Carbon Index

The calculation method for the zero-carbon index of load flexibility must consider the presence of local renewable energy, such as distributed photovoltaics, which should primarily be consumed locally. Additionally, it is essential to evaluate its contribution to the utilization of renewable energy in the larger power grid, as represented in Equation (2):
$$
\begin{aligned}
E_{\mathrm{user}}^{\mathrm{out}}(T) &= \left[ E_{\mathrm{user}}^{\mathrm{out}}(t_1), E_{\mathrm{user}}^{\mathrm{out}}(t_2), \ldots, E_{\mathrm{user}}^{\mathrm{out}}(t_N) \right] \\
E_{\mathrm{user}}^{\mathrm{all}}(T) &= \left[ E_{\mathrm{user}}^{\mathrm{all}}(t_1), E_{\mathrm{user}}^{\mathrm{all}}(t_2), \ldots, E_{\mathrm{user}}^{\mathrm{all}}(t_N) \right]
\end{aligned}
$$
where $E_{\mathrm{user}}^{\mathrm{out}}(T)$ represents the time series of electricity consumption recorded by the user's smart meter during the research period T, and $E_{\mathrm{user}}^{\mathrm{all}}(T)$ represents the actual load of the user during that period. If there is no local renewable energy, then $E_{\mathrm{user}}^{\mathrm{out}}(T) = E_{\mathrm{user}}^{\mathrm{all}}(T)$. For a given system zero-carbon index and any user load, the user's zero-carbon index $\eta_{\mathrm{RI}}$ is calculated as Equation (3):
$$
\begin{aligned}
\eta_{\mathrm{RI}} &= \omega_{\mathrm{out}}(t) + \omega_{\mathrm{in}}(t) \\
\omega_{\mathrm{out}}(t) &= \frac{\mathrm{sum}\left(E_{\mathrm{user}}^{\mathrm{out}}(T)\right)}{\mathrm{sum}\left(E_{\mathrm{user}}^{\mathrm{all}}(T)\right)}\, e^{-\lambda d} \\
d &= \left( l(t) - g(t) \right)^2 \\
\omega_{\mathrm{in}}(t) &= 1 - \mathrm{sum}\left(E_{\mathrm{user}}^{\mathrm{out}}(T)\right) / \mathrm{sum}\left(E_{\mathrm{user}}^{\mathrm{all}}(T)\right)
\end{aligned}
$$
where $\omega_{\mathrm{in}}(t)$ and $\omega_{\mathrm{out}}(t)$ are the internal and external zero-carbon indices, respectively. The external zero-carbon index depends on the similarity between the load curve and the system's renewable energy curve, while the internal zero-carbon index depends on the proportion of local renewable electricity. $\mathrm{sum}(\cdot)$ is the vector sum function and $\lambda$ is the similarity coefficient; $l(t)$ and $g(t)$ are the normalized first-order differentials of the user load and the system's renewable energy generation time series, calculated as $l(t) = \frac{dE_{\mathrm{user}}^{\mathrm{out}}(t)}{dt \cdot P_{\mathrm{user}}^{\max}}$ and $g(t) = \frac{dE_{\mathrm{res}}(t)}{dt \cdot P_{G}^{\max}}$, respectively. By using d to characterize the deviation between the renewable energy output and the user load across different time periods, the similarity of their temporal patterns can be assessed.
The purpose of the load zero-carbon index is to quantitatively assess the proportion of renewable energy in the load. According to Equation (3), when there is no local renewable energy, $\omega_{\mathrm{in}}(t) = 0$ and the zero-carbon index is entirely determined by the load's contribution to the consumption of renewable energy in the power grid; $\omega_{\mathrm{out}}$ then depends on the similarity between the shapes of the load curve and the renewable energy curve of the power system. $\omega_{\mathrm{out}}$ ranges from 0 to 1, with the extreme value of 1 indicating a complete match between the load pattern and the renewable energy output pattern of the power system. Conversely, when the load is supplied entirely by local renewable energy, $\omega_{\mathrm{out}} = 0$ and $\omega_{\mathrm{in}}(t) = 1$. In general, the external renewable-consumption index depends on the shape of the load curve, and the overall zero-carbon index of the load is the algebraic sum of the local and external renewable-consumption indices.
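The user-side index can be sketched the same way. The version below follows our reading of Equation (3): l(t) and g(t) are first-order differences normalized by the respective maximum powers, and d aggregates the squared deviations over the whole period; the function and parameter names are illustrative, not from the paper:

```python
import numpy as np

def user_zero_carbon_index(e_user_out, e_user_all, e_res,
                           p_user_max, p_g_max, lam=1.0):
    """User zero-carbon index eta_RI of Equation (3).

    e_user_out: metered (externally supplied) consumption series over T;
    e_user_all: actual user load series over T;
    e_res: system renewable generation series over T;
    lam: similarity coefficient lambda.
    """
    share_out = np.sum(e_user_out) / np.sum(e_user_all)
    # Normalized first-order differentials of load and renewable output.
    l = np.diff(e_user_out) / p_user_max
    g = np.diff(e_res) / p_g_max
    d = np.sum((l - g) ** 2)                  # temporal-pattern deviation
    omega_out = share_out * np.exp(-lam * d)  # external consumption index
    omega_in = 1.0 - share_out                # internal (local RE) index
    return omega_out + omega_in
```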

3. Multi-Level Self-Organized Aggregation Model for DERs

Using a bottom-up aggregation mode, each DER can be considered as an agent. DER agents cannot obtain information from other agents, nor can they influence the decisions of others. Whether they cooperate depends entirely on their adaptive behaviors. However, this behavior cannot be completely customized or designed and requires the constraints of basic rules to ensure that the actions of the lower-level individuals align with the system’s target. This section leverages the concept of zero-carbon index, as proposed earlier, to establish a model for individual environment adaptation. Subsequently, we define aggregation rules and a multi-level self-organized aggregation VPP model.

3.1. Low-Carbon Fitness Measure Function

The advancement of an agent is not determined by the number or capacity of the aggregation; rather, it is reflected mainly in greater flexibility, which enables better adaptation to environmental changes. This adaptability evolves dynamically over time, influenced by the adaptive strategy adopted by the individual and the current state of the environment. To quantify this adaptability, this paper proposes a measurement function for the environmental adaptability of DERs. Assuming the agent is represented by A, the adaptive strategy by $\pi$, and the environment by $\xi$, the measure function of the agent's adaptability to the environment can be expressed as Equation (4):
$$
\xi \xrightarrow{\ \mathrm{inf}\ } A : \mu_A^{\pi}(t)
$$
where inf represents the information perceived by the agent A from the environment $\xi$, and $\mu_A^{\pi}(t)$ is the fitness obtained by A when adopting the adaptation strategy $\pi$ in the current environment. This paper defines the agent adaptability measurement function based on the zero-carbon index, as expressed in Equation (5):
$$
\mu_A^{\pi}(t) = \omega_{zi}^{t} = \gamma^{t} f_{zi}\left( s_t, a_t \right)
$$
where $\gamma^{t}$ is the zero-carbon index intensity of the power grid during a given period; its value is related to the proportion of renewable energy in the preceding period. $\omega_{zi}^{t}$ is the zero-carbon index of the DER, with a value range of [0,1]: a value of 0 indicates that all the electricity of the DER comes from thermal power plants, while a value of 1 indicates that all of its electricity comes from renewable power plants. $f_{zi}$ is the load characteristic when the DER's state is $s_t$ and its action is $a_t$ during period t. The meaning of the above equation is that the fitness of a DER is related to the similarity between the zero-carbon index and its load curve.
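As a sketch only: the paper does not fix the exact form of $f_{zi}$, so the version below assumes it is the curve-similarity term of Section 2, mapping the deviation between the DER's normalized load shape and the zero-carbon index into [0, 1]; all names are ours:

```python
import numpy as np

def der_fitness(gamma_t: float, eta_gi: np.ndarray, load: np.ndarray,
                lam: float = 1.0) -> float:
    """Fitness mu_A^pi(t) of Equation (5): signal intensity gamma_t times a
    shape term f_zi in [0, 1] that rewards similarity between the DER's
    load profile and the zero-carbon index over the period."""
    load_norm = load / max(load.max(), 1e-9)   # normalize the load shape
    d = np.sum((load_norm - eta_gi) ** 2)      # deviation from the index
    f_zi = np.exp(-lam * d)                    # similarity score in [0, 1]
    return gamma_t * f_zi
```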

3.2. Analysis of Evolutionary Factors

The reasons affecting an agent’s change in its own state include internal operational factors and external factors. Among them, the internal factors include the following:
(1) Working condition: For example, changes in temperature and humidity can alter the operation mode of air conditioning and humidification (or dehumidification) equipment [16]. The adaptive strategy in this context involves adjusting and optimizing the working state of the agent across various time periods.
(2) Human intervention: For instance, when a user leaves home, the lights are turned off, or an electric car is charging [17]. These are typically triggered by time or events to execute established procedures, thereby adjusting the agent’s state to meet the user’s expectations.
The external factors mainly include the following:
(1) Mutations in the states of other agents: Changes in the states of other agents within the lower-level DER aggregator can result in a mismatch of the zero-carbon index.
(2) Changes in zero-carbon index signals: Prediction errors in the system’s renewable energy can lead to updates in the zero-carbon index during the aggregation process [18].

3.3. Evolution Model

The environment represents the collective effect and is not an abstract entity independent of the agents. Instead, it is a comprehensive effect resulting from the interaction of numerous agents. Therefore, the adaptive behavior and evolutionary process of agents cannot be viewed in isolation; they must be considered as a synergistic evolution linked to the environment. The schematic diagram illustrating this concept is shown in Figure 1.
The system evolution is described as occurring over discrete time periods with a step size of h. Throughout the entire research cycle T, the evolution process of the agents is represented by Equation (6):
$$
\begin{bmatrix}
\mu_1(E(t_1)) & \mu_1(E(t_2)) & \cdots & \mu_1(E(t_n)) \\
\mu_2(E(t_1)) & \mu_2(E(t_2)) & \cdots & \mu_2(E(t_n)) \\
\vdots & \vdots & \ddots & \vdots \\
\mu_n(E(t_1)) & \mu_n(E(t_2)) & \cdots & \mu_n(E(t_n))
\end{bmatrix}
$$
where $\mu_i(E(t_j))$ represents the environmental fitness of the i-th agent in the j-th time period. Each row of the matrix represents the evolution process of a specific agent during the research period, and each column represents the collaborative evolution status of all agents at a given time. The evolutionary goal of an agent is to optimize its strategy for each time period so as to maximize its cumulative fitness over the research cycle, which is calculated as Equation (7):
$$
F = \max\left\{ \sum_{i=0}^{n} \mu_{\xi}^{\pi}(t_i) \right\}
$$

3.4. Multi-Level Dynamic Self-Organized Aggregation Rules

To define the basic rules for the self-organized aggregation of agents, we take two agents as an example and define the following:

Rule 1 ($Rule_1$): minimum fitness aggregation, expressed as inequality (8):

$$
\min\{\mu_A, \mu_B\} < \min\{\mu_A^{A,B}, \mu_B^{A,B}\}
$$

where $\mu_A$ and $\mu_B$ represent the environmental fitness of agents A and B before aggregation, respectively, and $\mu_A^{A,B}$ and $\mu_B^{A,B}$ represent their environmental fitness after aggregation. The rule states that the minimum fitness of A and B after aggregation is greater than the minimum before aggregation; that is, aggregation improves the minimum environmental fitness.

Rule 2 ($Rule_2$): maximum fitness aggregation, expressed as inequality (9):

$$
\max\{\mu_A, \mu_B\} < \max\{\mu_A^{A,B}, \mu_B^{A,B}\}
$$

where $Rule_2$ indicates an improvement in the maximum environmental fitness after aggregation.

Rule 3 ($Rule_3$): average fitness aggregation, expressed as inequality (10):

$$
\mathrm{avg}\{\mu_A, \mu_B\} < \mathrm{avg}\{\mu_A^{A,B}, \mu_B^{A,B}\}
$$

where $Rule_3$ indicates that the overall average fitness is improved after aggregation.

Rule 4 ($Rule_4$): custom fitness aggregation, expressed as inequality (11):

$$
f_{\mu}\{\mu_A, \mu_B\} < f_{\mu}\{\mu_A^{A,B}, \mu_B^{A,B}\}
$$

where $f_{\mu}$ is a custom fitness function that steers the aggregation of agents towards a predetermined direction.
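A minimal sketch of a pairwise aggregation decision under Rules 1-3 follows (Rule 4 would plug in a caller-supplied $f_{\mu}$); fitness values are assumed to come from the measure function of Section 3.1, and the function name is ours:

```python
def should_aggregate(mu_a: float, mu_b: float,
                     mu_a_ab: float, mu_b_ab: float, rule: str = "min") -> bool:
    """Pairwise aggregation test for Rules 1-3 (inequalities (8)-(10)).

    mu_a, mu_b: fitness of agents A and B before aggregation.
    mu_a_ab, mu_b_ab: their fitness after a trial aggregation.
    """
    before, after = (mu_a, mu_b), (mu_a_ab, mu_b_ab)
    if rule == "min":                      # Rule 1: lift the weakest agent
        return min(before) < min(after)
    if rule == "max":                      # Rule 2: lift the strongest agent
        return max(before) < max(after)
    if rule == "avg":                      # Rule 3: lift the average
        return sum(before) / 2 < sum(after) / 2
    raise ValueError("Rule 4 needs a caller-supplied custom function f_mu")

# Aggregation lifts the weaker agent from 0.42 to 0.61, so Rule 1 accepts it.
assert should_aggregate(0.42, 0.75, 0.61, 0.74, rule="min")
```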
Based on the above four rules, agents can aggregate from simple individuals into complex aggregators, known as intermediaries; the intermediaries will continue to aggregate to form larger entities in subsequent decisions, thereby achieving a hierarchical structure of bottom-up, step-by-step aggregation (as shown in Figure 2).
Assuming VPP is an m-layer structure formed by DER agents, the structure can be shown as Equation (12):
$$
\begin{aligned}
L(\mathrm{vpp}) &= \left[ L^{(0)}, L^{(1)}, \ldots, L^{(m)} \right] \\
\{x \mid x \in L^{(i-1)}\} &\subseteq \{x \mid x \in L^{(i)}\}
\end{aligned}
$$
where $L^{(i)}$ represents the i-th level of the hierarchy, an aggregation formed from the agents at the level below, $L^{(i-1)}$, according to the aforementioned rules; x represents a specific agent within the hierarchy, and the set of agents at a lower level is a subset of the set of agents at the level above. Different levels denote varying degrees of abstraction and interaction complexity, and consequently the aggregation rules differ for each level.
When DERs aggregate in a bottom-up, self-organized manner, the VPP can be viewed as an agent formed by multi-level rules, and the hierarchy and combination method are dynamically adjusted in response to environmental changes [19]. Consequently, the VPP can track the system's zero-carbon index by merely modifying environmental parameters, without needing to send commands to each individual DER agent. For an aggregator with m layers, the final convergence condition is shown as Equation (13):
$$
\left| P_{\mathrm{vpp}}^{h} - P_{\mathrm{vpp}}^{h-1} \right| \le \varepsilon_{\mathrm{cvg}}, \quad h = 1, 2, \ldots, h_m, \quad h_m \ge \mu_{\mathrm{cvg}}
$$
where $P_{\mathrm{vpp}}^{h}$ represents the overall load of the aggregator formed after h iterations, $\varepsilon_{\mathrm{cvg}}$ is a very small positive real number, and $\mu_{\mathrm{cvg}}$ is a positive integer. The meaning of the above equation is that, if the load change of the aggregator is less than the threshold $\varepsilon_{\mathrm{cvg}}$ for $h_m$ consecutive iterations, the overall load of the aggregator has stabilized and the self-organized aggregation is considered to have converged. The time taken from the start of the iteration to convergence is referred to as the convergence time. A sketch of this convergence test is given below.
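This is a minimal sketch of Equation (13), assuming the aggregator's total load is recorded after every iteration; names and default thresholds are illustrative:

```python
def has_converged(p_vpp_history: list[float],
                  eps_cvg: float = 1e-3, mu_cvg: int = 5) -> bool:
    """Equation (13): the aggregator's total load must change by less than
    eps_cvg for at least mu_cvg consecutive iterations."""
    consecutive = 0
    for prev, curr in zip(p_vpp_history, p_vpp_history[1:]):
        consecutive = consecutive + 1 if abs(curr - prev) <= eps_cvg else 0
        if consecutive >= mu_cvg:
            return True
    return False
```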
Thus, this paper transforms the optimization problem of VPP control over DERs into a problem of rule design and agent evolution in a multi-agent system. By observing the evolution process of DERs, effective control over a large number of DERs can be achieved.

4. QMIX-Based Self-Organized Aggregation Algorithm

The multi-agent evolution model and the multi-level dynamic self-organized aggregation method provide the foundational model and underlying mechanism for the collaborative control of agents. This section proposes a self-organized aggregation implementation method based on multi-agent reinforcement learning. First, the self-organized process is characterized as a Markov game. Next, the objectives of multi-agent reinforcement learning are outlined. Finally, the QMIX algorithm is employed to solve the action-value function and determine the optimal joint behavior in the DER aggregator.

4.1. Self-Organized Aggregation Based on Markov Game Theory

The state changes in DERs depend solely on the current state and behavior at a given time period, making the evolution of the agent a Markov process [20]. Additionally, the decision to “aggregate” or “split” between any two agents is a decision problem. The formation of the entire VPP hierarchical structure is a multi-stage decision process, where the state of the lower-level structure determines the specific form of the upper-level structure.
Considering the Markov process and game problem, we use the Markov game [21] to characterize the self-organized aggregation of the agents. A Markov game, also known as stochastic game, combines Markov decision processes and game theory. In this framework, the Markov processes describe the state of the agent, and the game describes the interactions between agents. The following 5-tuple is defined as Equation (14):
$$
\langle N, S, A, T, R \rangle
$$
where $N = \{1, 2, \ldots, n\}$ represents the set of n agents, S represents the joint state space of the agent combination, A represents the action space of the agents, T represents the state transition matrix under joint behavior, and R represents the reward of the agents.

4.2. Objectives of Multi-Agent Reinforcement Learning

Multi-agent reinforcement learning [22] is a Markov game process that integrates the Nash strategies of each stage–game state into a comprehensive strategy for an agent operating in a dynamic environment. This approach involves continuous interaction with the environment to update the value function (game reward) of each stage–game state. The goal of multi-agent reinforcement learning can be expressed as Equation (15):
$$
\begin{aligned}
Q_i^*(s, a_1, \ldots, a_n)\,\pi_1^*(s, a_1)\cdots\pi_n^*(s, a_n) &\ge Q_i(s, a_1, \ldots, a_n)\,\pi_1(s, a_1)\cdots\pi_n(s, a_n) \\
V_i^*(s) &= \sum_{(a_1, \ldots, a_n) \in A_1 \times \cdots \times A_n} Q_i^*(s, a_1, \ldots, a_n)\,\pi_1^*(s, a_1)\cdots\pi_n^*(s, a_n) \\
Q_i^*(s, a_1, \ldots, a_n) &= \sum_{s' \in S} T_r(s, a_1, \ldots, a_n, s')\left[ R_i(s, a_1, \ldots, a_n, s') + \gamma V_i^*(s') \right]
\end{aligned}
$$
where $s \in S$ represents a specific state combination after the aggregation of agents, and $\pi_i(s, a_i)$ represents the probability of the i-th agent taking action $a_i$ under policy $\pi_i$ when the state is s. $V_i(s)$ is the state value function of the i-th agent in that state, and $Q_i(s, \cdot)$ is the corresponding action value function. In the self-organized aggregation problem of DERs, the value of Q is the algebraic sum of the individual fitness values in the organization, denoted by $\sum_{i=1}^{n} \mu_i(E)$. The superscript * denotes the theoretical optimum, and $\gamma$ is the discount factor. The meaning of this equation is that, for a given combination $N = \{1, 2, \ldots, n\}$, if no Q value obtained by traversing all policy combinations $\{\pi_1, \ldots, \pi_n\}$ exceeds the Q value under $\{\pi_1^*, \ldots, \pi_n^*\}$, then the joint policy $\{\pi_1^*, \ldots, \pi_n^*\}$ is a Nash equilibrium and $\{a_1^*, \ldots, a_n^*\}$ is the optimal joint behavior.

4.3. Training Process Based on QMIX Algorithm

Unlike single-agent reinforcement learning, multi-agent reinforcement learning must consider not only feedback from the environment but also the strategies adopted by other agents. The core challenge is how to quickly fit the joint value function and extract optimal distributed joint behavior through this function. Currently, widely used algorithms fall into three main categories:
(1) Decentralized training methods are represented by IQL (Independent Q-learning) [23], where each agent independently executes a Q-learning algorithm. This method fails to capture interactions with other agents and cannot adapt to dynamic changes in the environment, making convergence difficult.
(2) Fully centralized training methods are represented by COMA (Counterfactual Multi-Agent Policy Gradients) [24], which uses a neural network to calculate the value function of all actions of all agents simultaneously. While effective, this algorithm is less efficient when there are many agents.
(3) Value function decomposition methods are represented by VDN (Value Decomposition Network) [25], which directly sums the value functions of each agent to obtain the joint value function. However, this simple additive form limits the expressiveness of the joint Q value.
The QMIX algorithm, proposed by Rashid et al., is an efficient value function decomposition algorithm. Building upon the VDN algorithm [26], QMIX uses a mixing network to merge the local value functions of the agents and incorporates global state information during training to enhance performance. Essentially, the QMIX algorithm improves overall efficiency by optimizing the mapping between the individual agent value functions and the joint value function.
The training process based on the QMIX algorithm, as shown in Figure 3, primarily involves training the agent’s proxy network using the Deep Recurrent Q-Network (DRQN) and global training utilizing the hybrid network.
(1) Training of the agent’s proxy network based on DRQN
A single agent cannot obtain the complete global state, as it operates within a partially observable Markov decision process (POMDP). The Deep Recurrent Q Network (DRQN) is employed to address the decision-making and valuation of the agent under partially observable conditions. The fundamental workings of the algorithm can be expressed as Equation (16):
$$
\left( o_t^i, a_{t-1}^i \right) \rightarrow Q_i\left( \tau^i, a_t^i \right)
$$
This equation indicates that the agent determines the current action $a_t^i$ and its Q value from the current observation $o_t^i$ (which reflects the actions taken by other agents) and the agent's own action $a_{t-1}^i$ at the previous time step, and then records the sample. Here, $\tau^i = (a_0^i, o_1^i, \ldots, a_{t-1}^i, o_t^i)$ represents the action-observation history of the i-th agent from the initial state.
The DRQN retains the structure of the Deep Q Network (DQN) but replaces the fully connected layer after the last convolutional layer with a gated recurrent unit (GRU), a simplified variant of the LSTM, and records the hidden-layer state at time t as $h_t$.
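A minimal PyTorch sketch of such a per-agent network follows, under the description above: a GRU cell carries the hidden state $h_t$ that summarizes the action-observation history $\tau^i$ under partial observability. Layer sizes and names are our own illustrative choices, not from the paper:

```python
import torch
import torch.nn as nn

class DRQNAgent(nn.Module):
    """Per-agent utility network of Equation (16): (o_t, a_{t-1}) -> Q_i(tau_i, .)."""

    def __init__(self, obs_dim: int, n_actions: int, hidden_dim: int = 64):
        super().__init__()
        self.fc_in = nn.Linear(obs_dim + n_actions, hidden_dim)
        self.rnn = nn.GRUCell(hidden_dim, hidden_dim)   # carries history state h_t
        self.fc_out = nn.Linear(hidden_dim, n_actions)

    def forward(self, obs, last_action_onehot, h):
        # Condition on the current observation and the previous action.
        x = torch.relu(self.fc_in(torch.cat([obs, last_action_onehot], dim=-1)))
        h_next = self.rnn(x, h)         # updated recurrent summary of tau_i
        q = self.fc_out(h_next)         # Q_i(tau_i, a) for every action a
        return q, h_next
```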
(2) Global Training Based on Hybrid Networks
The QMIX algorithm uses centralized training to obtain distributed policies. Training the joint action value function does not require recording the individual value of each agent's action $a_t^i$; it only needs to ensure that the optimal behavior extracted from the joint value function coincides with the set of optimal behaviors executed by the individual agents, as shown in Equation (17):
$$
\arg\max_{\mathbf{a}} Q_{\mathrm{tot}}(\boldsymbol{\tau}, \mathbf{a}) =
\begin{pmatrix}
\arg\max_{a_1} Q_1(\tau^1, a_1) \\
\vdots \\
\arg\max_{a_n} Q_n(\tau^n, a_n)
\end{pmatrix}
$$
where $\arg\max_{a_i} Q_i$ denotes the action maximizing the action value function of the i-th agent, and $\arg\max_{\mathbf{a}} Q_{\mathrm{tot}}$ denotes the joint action maximizing the joint value function. As a result, each agent only needs to adopt a greedy policy during training, selecting the action $a_i$ that maximizes $Q_i$, to participate in the decentralized decision-making. To satisfy this equation, QMIX converts it into a monotonicity constraint and enforces it with a mixing network (as illustrated in Figure 3b). The constraint is shown as Equation (18).
$$
\frac{\partial Q_{\mathrm{tot}}}{\partial Q_i} \ge 0, \quad \forall i \in \{1, 2, \ldots, n\}
$$
The basic functions of hybrid networks can be expressed as Equation (19):
$$
\left( \left\{ Q_i(\tau^i, a_t^i) \right\}, s_t \right) \rightarrow \left( \{W_j\}, b \right)
$$
The mixing network takes as input the Q values of the optimal actions $a_t^i$ chosen by each agent at time t, together with the system state $s_t$, and produces the weights $W_j$ and biases b of the mixing network. To guarantee non-negative weights, a linear network combined with an absolute-value activation function is used. The bias of the last layer of the mixing network is generated by a two-layer network with a ReLU activation function to achieve a non-linear mapping.
The global training loss function of QMIX is expressed as Equation (20):
$$
L(\theta) = \sum_{i=1}^{m} \left( y_i^{\mathrm{tot}} - Q_{\mathrm{tot}}(\boldsymbol{\tau}, \mathbf{a}, s; \theta) \right)^2
$$
where $y_i^{\mathrm{tot}}$ is the target value of the i-th global sample, m is the number of samples in a training batch, and $\theta$ denotes the network parameters.
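A compact PyTorch sketch of the mixing network and the loss of Equation (20) follows, under the constraints described above (absolute-value weights for monotonicity, a two-layer ReLU network for the final bias). Dimensions and names are illustrative assumptions:

```python
import torch
import torch.nn as nn

class QMixer(nn.Module):
    """Monotonic mixing network: per-agent Q values + global state s_t -> Q_tot.
    Hypernetworks generate the mixing weights from s_t; taking absolute values
    keeps them non-negative, so dQ_tot/dQ_i >= 0 as required by Eq. (18)."""

    def __init__(self, n_agents: int, state_dim: int, embed_dim: int = 32):
        super().__init__()
        self.n_agents, self.embed_dim = n_agents, embed_dim
        self.hyper_w1 = nn.Linear(state_dim, n_agents * embed_dim)
        self.hyper_w2 = nn.Linear(state_dim, embed_dim)
        self.hyper_b1 = nn.Linear(state_dim, embed_dim)
        # Final bias via a two-layer net with ReLU, as described above.
        self.hyper_b2 = nn.Sequential(
            nn.Linear(state_dim, embed_dim), nn.ReLU(), nn.Linear(embed_dim, 1))

    def forward(self, agent_qs: torch.Tensor, state: torch.Tensor) -> torch.Tensor:
        # agent_qs: (batch, n_agents); state: (batch, state_dim)
        w1 = torch.abs(self.hyper_w1(state)).view(-1, self.n_agents, self.embed_dim)
        b1 = self.hyper_b1(state).view(-1, 1, self.embed_dim)
        hidden = torch.relu(torch.bmm(agent_qs.unsqueeze(1), w1) + b1)
        w2 = torch.abs(self.hyper_w2(state)).view(-1, self.embed_dim, 1)
        return (torch.bmm(hidden, w2) + self.hyper_b2(state).unsqueeze(1)).view(-1)

def qmix_loss(q_tot: torch.Tensor, y_tot: torch.Tensor) -> torch.Tensor:
    """TD loss of Equation (20) over a batch of m samples (targets precomputed)."""
    return ((y_tot - q_tot) ** 2).sum()
```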

4.4. An Example Using Two Air Conditioners

To test the training process, we take two air conditioners as an example. The state set is {off, ModeC-18 °C, ModeC-22 °C, ModeC-26 °C, ModeC-30 °C}, where "off" indicates the air conditioner is turned off and "ModeC-18 °C" indicates that it is cooling with a setpoint of 18 °C. The action set is {ctr-off, ctr-18, ctr-22, ctr-26, ctr-30}, where "ctr-off" turns the air conditioner off and "ctr-22" sets the cooling setpoint to 22 °C. At an ambient temperature of 22 °C, the power consumption of the five states is {0, 1.5, 1.33, 1.1, 0.76} kW, respectively.
Given a specified zero-carbon index, the results of the self-organized aggregation training for two air conditioners using the QMIX algorithm during peak periods (when load reduction is required) are shown in Table 1.
In Table 1, the first column and the first row represent the initial states of the two air conditioners, denoted agents A and B; each cell gives the target states and the resulting fitness. For instance, the cell in the third row and fifth column indicates that when the initial states of A and B are ModeC-22 °C and ModeC-26 °C, respectively, the joint actions ctr-30 and ctr-30 are obtained through QMIX training, resulting in the joint state (ModeC-30 °C, ModeC-30 °C). At this point, the fitness of A and B is 1.07 and 1.08, respectively.
Table 1 details the decision-making behaviors of the air conditioners during a peak period when the system has limited renewable energy, necessitating a reduction in their load. The training results show that the maximum fitness is obtained when both air conditioners are set to 30 °C (1.08, 1.08), resulting in the lowest load. These results suggest that the air conditioners’ behaviors are crucial in determining the fitness, reflecting the system’s regulation needs and aligning with its guidance objectives.

4.5. Framework and Workflow of Algorithm

Through the centralized training method described above, any combination of agents can quickly achieve the maximum zero-carbon index and determine the corresponding optimal joint behavior for actions such as "aggregate" or "split". The algorithm for the self-organized aggregation of agents is shown in Figure 4.

5. Experimental Results

5.1. Experimental Setup

We use prosumers in three typical parks as the experimental subjects. The parks collectively contain 560 agents, including 42 charging piles with a maximum load of 7 kW each and three photovoltaic systems with installed capacities of 2800 kW, 430 kW, and 150 kW, respectively. The remaining adjustable load is air conditioning, with a maximum power of 1.5 kW per unit. The zero-carbon indices of the system and the local renewable energy are shown in Figure 5, and the load curves are shown in Figure 6. The aggregation hierarchy of the agents is limited to five levels.

5.2. Result Analysis

(1) Flexibility Analysis of Aggregators
According to the aforementioned aggregation rules, all activated air conditioners will self-organize and aggregate under the guidance of the zero-carbon index.
Figure 7a shows the distribution of aggregates at the 4th level after a series of aggregations and splitting in the 10th hour. From the scatter plot, it can be observed that the purple and green agents have higher fitness, indicating a higher flexibility of loads during the current period and thus occupying the majority. Conversely, the blue and red agents have lower fitness, indicating a lower flexibility of loads during the current period and thus occupying the minority.
Based on the incentive direction of the zero-carbon index, the flexibility distribution of all agents throughout the day, obtained through self-organized aggregation, is shown in Figure 7b. During operating hours from 7:00 a.m. to 7:00 p.m., the load flexibility is primarily adjusted downwards; during non-operating hours, the load is likewise mainly adjusted downwards.
(2) Tracking of Zero-Carbon Index
The experimental results targeting the zero-carbon index are shown in Figure 8. The blue line represents the zero-carbon index curve, the black line represents the original load curve of parks, and the green line represents the load curve of the parks after optimization. Through self-organized aggregation control methods, the peak-to-valley difference in the park load has been reduced, and the overall load curve is approaching the target encouraged by the zero-carbon index.
Simultaneously, the overall change in the external load of the park provides a certain equivalent adjustment capability for the power grid. Additionally, from the perspective of the adjustment direction in each period, emission reduction in the parks is no longer merely about reducing the load: during periods of low power-grid demand and high renewable-energy output, it is also possible to promote the utilization of renewable energy by appropriately increasing the load, thereby alleviating the shortage of power grid regulation resources. The Euclidean distance between the park load and the zero-carbon index is 0.13 before optimization; load regulation reduces it to 0.084, increasing the similarity by 35.4%.

5.3. Comparison Analysis with Existing Methods

(1) Analysis of Assessment Methods
This paper compares two low-carbon assessment methods: assessment against a baseline load and assessment by the zero-carbon index. The baseline load is taken as the average load at the same time of day over the previous week.
In Figure 9, when the adjustment amount is greater than 0, it indicates an increase in the overall power consumption of the park. Conversely, when the adjustment amount is less than 0, it leads to a decrease in the overall load. As shown in Figure 9, using the energy-saving operation method based on baseline load, emission reduction in the park can only be achieved by reducing the load. Consequently, the park’s load is adjusted downward during various periods, and its benefit is calculated by comparing it with the historical average level. This method of evaluating low-carbon efficiency is tied to past average electricity consumption levels. While it may achieve some results at certain times, it is difficult to sustain, and it is uncertain whether this load reduction can achieve substantial emission reduction benefits for the overall grid.
On the other hand, the energy-saving operation method based on the zero-carbon index, which is standardized across the entire grid, allows for the possibility of both increasing and decreasing the load. This means that emission reductions in the park do not necessarily have to be achieved solely by reducing the load. Even with the same total electricity consumption, as long as the temporal usage pattern aligns with that of renewable energy, the same emission reduction benefits can be realized. Additionally, driven by carbon constraints, the grid can fully leverage the temporal flexibility of the load to enhance the utilization of renewable energy. Therefore, the method proposed in this paper can provide richer regulatory resources for the grid, facilitating a higher utilization of renewable energy into the power system, while effectively promoting the reduction in direct carbon emissions within the power system.
(2) Analysis of Low-Carbon Benefits
Compared to traditional emission reduction methods, the approach proposed in this paper shifts the focus from local individual optimization to the overall transformation of the power system towards low-carbonization. The modification in park load shape will transform it into a low-carbon-friendly user from the perspective of the power grid. As the scale of participating responsive agents increases, these large-scale, low-carbon-friendly users will be able to significantly enhance the power system’s ability to utilize a higher proportion of renewable energy.
To verify the effectiveness of this method, this paper conducts simulation analyses based on the IEEE 14-node and 39-node systems. These simulations consider the impact of different scales of intelligent agents on renewable energy utilization, as shown in Figure 10.
As seen in Figure 10, with the increasing scale of agent integration, the proportion of renewable energy utilization steadily rises, and the system’s carbon emissions gradually decrease. There are two primary reasons for this:
First, when the agent scale is small, the system’s flexibility is limited, leading to minimal changes in load shape. Some of the park’s load may be closely tied to thermal power units, resulting in a limited promotion effect on renewable energy utilization and a smaller impact on encouraging renewable energy use across various periods.
Second, in more complex power grid scenarios, such as the IEEE 39-node system, renewable energy units are more dispersed, and the spatial constraints have a diminished impact on load flexibility. Thus, under complex network conditions with a high proportion of renewable energy integration, the flexibility and emission reduction capabilities of the agents are enhanced.
(3) Analysis of Algorithm Performance
To verify the effectiveness of the algorithm proposed in this paper, a comparative analysis was conducted using centralized, distributed, and self-organized aggregation algorithms. The centralized algorithm employs mixed integer programming (MIP), while the distributed approach utilizes the Lagrange multiplier method. The comparison index is convergence time, and the simulation results are shown in Table 2.
From Table 2, it can be seen that when the number of agents is relatively small (for example, 56 agents), the MIP algorithm takes only 0.71 s to converge, whereas the QMIX algorithm requires 0.81 s. This is because, with a small number of agents, the centralized solver can exploit complete information to quickly find the optimal solution, whereas the time required for information exchange in a distributed environment lowers the overall convergence efficiency.
However, when the number of agents reaches 5600 or 56,000, the efficiency of centralized solving degrades significantly; at 56,000 agents its convergence time is 14.99 s. The distributed algorithms perform better: the Lagrange multiplier method takes 8.65 s, and the QMIX method takes only 6.20 s for a single optimization, approximately 8.79 s faster than the centralized method. This is because the QMIX algorithm solves for the optimal combination through concurrent coordination, achieving higher efficiency than the centralized approach.

5.4. Recommendations

We introduce the concept of the zero-carbon index, which establishes users’ low-carbon benefits based on their contribution to renewable energy consumption and their provision of flexibility to the power grid. Consequently, building a zero-carbon power system is not solely a power generation issue; it requires the coordinated cooperation of generation, grid, and load. To promote the implementation of this approach, we propose the following three recommendations:
(1) Improve forecasting accuracy: Enhance the accuracy of renewable energy output forecasts to obtain a more precise zero-carbon curve in the day-ahead stage.
(2) Assess grid regulation capabilities: The grid must accurately evaluate the regulation capabilities of each bus node to ensure users can effectively track the zero-carbon curve.
(3) Link benefits with economics: Tie users’ low-carbon benefits to economic incentives, gradually reduce carbon quotas annually, and thereby encourage large-scale user participation in grid regulation to promote renewable energy consumption.

6. Conclusions

The paper proposes a self-organized aggregation method for virtual power plants using multi-agent reinforcement learning, and experiments are conducted to verify the effectiveness of the method. The conclusions are summarized as follows:
(1) The proposed zero-carbon index evaluation metric expands the issue of carbon reduction for demand-side agents from a local perspective to a holistic view of the power grid. This approach provides substantial flexibility for large-scale grids and effectively promotes the integration of renewable energy.
(2) The self-organized aggregation algorithm achieves bottom-up, simple-to-complex aggregation, effectively aligning the behavior of agents with the incentive direction of the zero-carbon index.
(3) The QMIX algorithm can be effectively applied to the integration and optimization of large-scale flexible loads in complex environments, significantly improving the coordination efficiency of a massive number of agents in time-varying environments.
In this paper, the authors’ primary focus is on the technical aspects of the methods and strategies for the self-organized aggregation of large-scale loads to track the zero-carbon curve. In the next phase, we plan to build on this foundation by incorporating economic factors. From the user’s perspective, we aim to develop bottom-up aggregation rules and state optimization control strategies for large-scale loads, taking into account the trading mechanisms of both the electricity market and the carbon market.

Author Contributions

G.H. (Gengsheng He), methodology; Y.H., software; G.H. (Guori Huang), validation; X.L., formal analysis; P.L., investigation; Y.Z., resources; G.H. (Gengsheng He), data curation; G.H. (Gengsheng He). All authors have read and agreed to the published version of the manuscript.

Funding

This research was funded by Southern Power Grid Corporation Technology Project 066600KK52222044/GZKJXM20222165.

Data Availability Statement

Data are contained within the article.

Conflicts of Interest

Author Yu Huang was employed by the company Electric Power Research Institute of Guizhou Power Grid Co., Ltd. The remaining authors declare that the research was conducted in the absence of any commercial or financial relationships that could be construed as a potential conflict of interest.

References

  1. Shokouhinejad, H.; Guerra, E.C. Self-Scheduling Virtual Power Plant for Peak Management. Energies 2024, 17, 2705. [Google Scholar] [CrossRef]
  2. Ramos, L.F.; Canha, L.N.; Prado, J.C.d.; de Menezes, L.R.A.X. A Novel Virtual Power Plant Uncertainty Modeling Framework Using Unscented Transform. Energies 2022, 15, 3716. [Google Scholar] [CrossRef]
  3. Zhang, L.; Li, F.; Wang, Z.; Zhang, B.; Qu, D.; Lv, Q.; Liu, D. A Two-Stage Optimization Model of Capacity Allocation and Regulation Operation for Virtual Power Plant. Math. Probl. Eng. 2022, 2022, 7055106. [Google Scholar] [CrossRef]
  4. Zhang, J.; Liu, D.; Lyu, L.; Zhang, L.; Du, H.; Luan, H.; Zheng, L. Multi-Time-Scale Low-Carbon Economic Dispatch Method for Virtual Power Plants Considering Pumped Storage Coordination. Energies 2024, 17, 2348. [Google Scholar] [CrossRef]
  5. Israr, A.; Yang, Q.; Israr, A. Emission-Aware Sustainable Energy Provision for 5G and B5G Mobile Networks. IEEE Trans. Sustain. Comput. 2023, 8, 670–681. [Google Scholar] [CrossRef]
  6. Huang, J.H.; Ciren, O.Z.; Zhou, H.; Xie, B.P.; Huang, T.; Hu, Y.L. Normalized Low-carbon Operation Scheme of Park Based on Green Power Index. High Volt. Eng. 2022, 48, 2554–2562. [Google Scholar]
  7. Li, Y.; Zhang, N.; Du, E.; Liu, Y.; Cai, X.; He, D. Mechanism study and benefit analysis on power system low carbon demand response based on carbon emission flow. Proc. CSEE 2022, 42, 2830–2842. [Google Scholar]
  8. Zhang, Y.; Ge, X.; Li, M.; Li, N.; Wang, F.; Wang, L.; Sun, Q. Demand Response Potential Day-Ahead Forecasting Approach Based on LSSA-BPNN Considering the Electricity-Carbon Coupling Incentive Effects. IEEE Trans. Ind. Appl. 2024, 60, 4505–4516. [Google Scholar] [CrossRef]
  9. Yuxia, D.; Mingjie, L. Enterprise digital transformation and Carbon Emissions: Reduce or promote?—Experience from listed companies in China (October 2023). IEEE Access 2024, 12, 15726–15734. [Google Scholar] [CrossRef]
  10. Yang, Y.; Qiu, J.; Zhang, C. Distribution Network Planning Towards a Low-Carbon Transition: A Spatial-Temporal Carbon Response Method. IEEE Trans. Sustain. Energy 2023, 15, 429–442. [Google Scholar] [CrossRef]
  11. Hua, H.; Chen, X.; Gan, L.; Sun, J.; Dong, N.; Liu, D.; Qin, Z.; Li, K.; Hu, S. Demand-Side Joint Electricity and Carbon Trading Mechanism. IEEE Trans. Ind. Cyber-Phys. Syst. 2024, 2, 14–25. [Google Scholar] [CrossRef]
  12. Liu, B.; Dong, J.; Park, B.; Lian, J.; Kuruganti, T. Impact of Price-Responsive Load and Renewables in an Emission-Aware Power Systems. IEEE Open Access J. Power Energy 2024, 11, 15–26. [Google Scholar] [CrossRef]
  13. Zhou, H.; Huang, T.; Lu, S.X.; He, G.Y.; Zhou, Y.C.; Yan, Z. Discrete Analysis Theory and Calculation Method of Electricity-carbon Decoupling Sharing by Contribution to Carbon Emission Reduction. Proc. CSEE 2023, 43, 9033–9045. [Google Scholar]
  14. Kang, C.; Zhou, T.; Chen, Q.; Wang, J.; Sun, Y.; Xia, Q.; Yan, H. Carbon Emission Flow from Generation to Demand: A Network-Based Model. IEEE Trans. Smart Grid 2015, 6, 2386–2394. [Google Scholar] [CrossRef]
  15. Chen, X.; Wang, J.; Xie, J.; Xu, S.; Yu, K.; Gan, L. Demand response potential evaluation for residential air conditioning loads. IET Gener. Transm. Distrib. 2018, 12, 4260–4268. [Google Scholar] [CrossRef]
  16. Hui, H.; Tang, J.Y.; Wang, Y.F.; Xia, X.R.; Wang, F.; Hu, P.F. Long-time-scale Charging and Discharging Scheduling of Electric Vehicles Under Joint Price and Incentive Demand Response. Autom. Electr. Power Syst. 2022, 46, 46–55. [Google Scholar]
  17. Chen, C.S.; Duan, S.X.; Cai, T.; Dai, Q. Short-Term Photovoltaic Generation Forecasting System Based on Fuzzy Recognition. Trans. China Electrotech. Soc. 2011, 26, 83–89. [Google Scholar]
  18. Zhou, H.; Shuai, F.; Wu, Q.; Dong, L.X.; Li, Z.Y.; He, G.Y. Stimulus-response control strategy based on autonomous decentralized system theory for exploitation of flexibility by virtual power plant. Appl. Energy 2021, 285, 116424. [Google Scholar] [CrossRef]
  19. Zhai, S.; Zhou, H.; Wang, Z.; He, G. Analysis of dynamic appliance flexibility considering user behavior via non-intrusive load monitoring and deep user modeling. CSEE J. Power Energy Syst. 2020, 6, 41–51. [Google Scholar] [CrossRef]
  20. Chao, H.; Quan, D.; Yang, T. Multi-time Scale Simulation of Optimal Scheduling Strategy for Virtual Power Plant Considering Load Response. In Proceedings of the 2018 International Conference on Power System Technology (POWERCON), Guangzhou, China, 6–8 November 2018; pp. 2142–2148. [Google Scholar]
  21. Zhu, Z.; Chan, K.W.; Xia, S.; Bu, S. Optimal Bi-Level Bidding and Dispatching Strategy Between Active Distribution Network and Virtual Alliances Using Distributed Robust Multi-Agent Deep Reinforcement Learning. IEEE Trans. Smart Grid 2022, 13, 2833–2843. [Google Scholar] [CrossRef]
  22. Zhao, L.; Chang, T.; Chu, K.; Guo, L.; Zhang, L. Survey of Fully Cooperative Multi-Agent Deep Reinforcement Learning. Comput. Eng. Appl. 2023, 59, 14. [Google Scholar] [CrossRef]
  23. Han, O.; Ding, T.; Bai, L.; He, Y.; Li, F.; Shahidehpour, M. Evolutionary Game Based Demand Response Bidding Strategy for End-Users Using Q-Learning and Compound Differential Evolution. IEEE Trans. Cloud Comput. 2022, 10, 97–110. [Google Scholar] [CrossRef]
  24. Foerster, J.; Farquhar, G.; Afouras, T.; Nardelli, N.; Whiteson, S. Counterfactual Multi-Agent Policy Gradients. Proc. AAAI Conf. Artif. Intell. 2018, 32, 11794. [Google Scholar] [CrossRef]
  25. Dietzenbacher, B.; Borm, P.; Hendrickx, R. Decomposition of network communication games. Math. Methods Oper. Res. 2017, 85, 407–423. [Google Scholar] [CrossRef]
  26. Han, C.; Yao, H.; Mai, T.; Zhang, N.; Guizani, M. QMIX Aided Routing in Social-Based Delay-Tolerant Networks. IEEE Trans. Veh. Technol. 2022, 71, 1952–1963. [Google Scholar] [CrossRef]
Figure 1. Schematic diagram of agent evolution.
Figure 2. Hierarchical self-organizing architecture for agents.
Figure 3. Self-organized aggregation training process of agents using the QMIX algorithm.
Figure 4. Large-scale load-tracking control process for the zero-carbon curve index.
Figure 5. The zero-carbon index of the power grid and the local renewable energy curve of the park.
Figure 6. Different types of load curves within the park.
Figure 7. Flexibility analysis of aggregators.
Figure 8. Control performance of park load tracking for the zero-carbon index.
Figure 9. Comparative analysis of the two evaluation methods.
Figure 10. Analysis of low-carbon benefits in the IEEE 14-node and IEEE 39-node systems.
Table 1. The self-organized aggregation training results of the air conditioners (target states and fitness of agents A and B; rows give A's initial state, columns B's initial state).

| A \ B | off | ModeC-18 °C | ModeC-22 °C | ModeC-26 °C | ModeC-30 °C |
|---|---|---|---|---|---|
| off | (off, off) (0, 0) | (off, ModeC-26 °C) (0, 1.01) | (off, ModeC-30 °C) (0, 1.07) | (off, ModeC-30 °C) (0, 1.08) | (off, ModeC-30 °C) (0, 0.84) |
| ModeC-18 °C | (ModeC-26 °C, off) (1.01, 0) | (ModeC-26 °C, ModeC-26 °C) (1.01, 1.01) | (ModeC-26 °C, ModeC-30 °C) (1.01, 1.07) | (ModeC-26 °C, ModeC-30 °C) (1.01, 1.08) | (ModeC-26 °C, ModeC-30 °C) (1.01, 0.84) |
| ModeC-22 °C | (ModeC-30 °C, off) (1.07, 0) | (ModeC-30 °C, ModeC-26 °C) (1.07, 1.01) | (ModeC-30 °C, ModeC-30 °C) (1.07, 1.07) | (ModeC-30 °C, ModeC-30 °C) (1.07, 1.08) | (ModeC-30 °C, ModeC-30 °C) (1.07, 0.84) |
| ModeC-26 °C | (ModeC-30 °C, off) (1.08, 0) | (ModeC-30 °C, ModeC-26 °C) (1.08, 1.01) | (ModeC-30 °C, ModeC-30 °C) (1.08, 1.07) | (ModeC-30 °C, ModeC-30 °C) (1.08, 1.08) | (ModeC-30 °C, ModeC-30 °C) (1.08, 0.84) |
| ModeC-30 °C | (ModeC-30 °C, off) (0.84, 0) | (ModeC-30 °C, ModeC-26 °C) (0.84, 1.01) | (ModeC-30 °C, ModeC-30 °C) (0.84, 1.07) | (ModeC-30 °C, ModeC-30 °C) (0.84, 1.08) | (ModeC-30 °C, ModeC-30 °C) (0.84, 0.84) |
Table 2. Performance comparison and analysis of algorithms (convergence time in seconds).

| Number of Agents | Centralized Algorithm (Mixed Integer Programming) | Distributed Algorithm (Lagrangian Multipliers) | Self-Organized Aggregation (QMIX) |
|---|---|---|---|
| 56 | 0.71 | 1.11 | 0.81 |
| 560 | 0.84 | 1.28 | 0.96 |
| 5600 | 2.60 | 1.59 | 1.35 |
| 56,000 | 14.99 | 8.65 | 6.20 |
Disclaimer/Publisher’s Note: The statements, opinions and data contained in all publications are solely those of the individual author(s) and contributor(s) and not of MDPI and/or the editor(s). MDPI and/or the editor(s) disclaim responsibility for any injury to people or property resulting from any ideas, methods, instructions or products referred to in the content.
