
Online EVs Vehicle-to-Grid Scheduling Coordinated with Multi-Energy Microgrids: A Deep Reinforcement Learning-Based Approach

Weiqi Pan, Xiaorong Yu, Zishan Guo, Tao Qian and Yang Li
1 School of Electrical Engineering, Southeast University, Nanjing 210096, China
2 State Grid Jiangsu Electric Vehicle Service Co., Ltd., Nanjing 320105, China
* Author to whom correspondence should be addressed.
Energies 2024, 17(11), 2491; https://doi.org/10.3390/en17112491
Submission received: 9 April 2024 / Revised: 14 May 2024 / Accepted: 20 May 2024 / Published: 22 May 2024

Abstract

The integration of electric vehicles (EVs) into vehicle-to-grid (V2G) scheduling offers a promising opportunity to enhance the profitability of multi-energy microgrid operators (MMOs). MMOs aim to maximize their total profits by coordinating V2G scheduling and the multi-energy flexible loads of end-users while adhering to operational constraints. However, scheduling V2G strategies online poses challenges due to uncertainties such as electricity prices and EV arrival/departure patterns. To address this, we propose an online V2G scheduling framework based on deep reinforcement learning (DRL) to optimize EV battery utilization in microgrids with different energy sources. Firstly, our approach proposes an online scheduling model that integrates the management of V2G and multi-energy flexible demands, modeled as a Markov Decision Process (MDP) with an unknown transition function. Secondly, a DRL-based Soft Actor-Critic (SAC) algorithm is utilized to efficiently train neural networks and dynamically schedule EV charging and discharging activities in response to real-time grid conditions and energy demand patterns. Extensive case studies are conducted to verify the effectiveness of our proposed approach. The overall results validate the efficacy of the DRL-based online V2G scheduling framework, highlighting its potential to drive profitability and sustainability in multi-energy microgrid operations.

1. Introduction

Electric vehicles (EVs) are increasingly being recognized not just as modes of transportation but also as valuable energy storage resources within smart grid contexts [1,2]. Vehicle-to-grid (V2G) technology enables bidirectional energy flow between EVs and the grid, allowing EVs to serve as mobile energy storage units that can both charge from and discharge to the grid [3,4]. This concept has garnered significant attention for its potential to enhance grid stability, facilitate renewable energy integration [5], and provide additional revenue streams for EV owners and grid operators [6,7].
A substantial body of literature exists on V2G technology, exploring aspects such as its technical feasibility, economic viability, and environmental impacts [8,9,10,11,12]. Researchers have investigated different V2G control strategies, charging/discharging algorithms, and optimization techniques to maximize the benefits of V2G integration while minimizing its drawbacks. Reference [8] investigates approaches to reducing charging costs for electric utilities, thereby increasing profits for EV owners. Reference [9] designs a two-stage optimization formulation to determine the charging and discharging schedule for EVs participating in a V2G service at an office building. A principles-based techno-economic model is developed in [10] to estimate the levelized cost of storage (LCOS) of V2G technology for energy arbitrage and frequency regulation. A two-stage stochastic optimization framework is built in [11] to derive the optimal power management strategy for V2G systems, aiming to minimize operating costs. Reference [12] assesses the feasibility of a hybrid backup system within a microgrid that incorporates V2G technology to minimize the operational costs, the energy wasted in the dummy load, and the loss-of-power-supply probability. Additionally, studies [13,14] have examined the potential challenges and barriers to widespread V2G adoption, including interoperability issues [15,16], regulatory hurdles [17], and consumer acceptance [18].
Despite the considerable body of research on V2G systems, a notable gap persists in the literature concerning the coordinated integration of V2G with multi-energy microgrids [19,20,21]. While existing studies have predominantly examined V2G in isolation, scant attention has been directed toward exploring its synergistic interactions with other distributed energy resources within the framework of multi-energy microgrids. This gap in the literature is particularly noteworthy given the escalating complexity of contemporary energy systems and the imperative for holistic approaches to energy management. The challenges inherent in this integration stem from the myriad sources of uncertainties in V2G online scheduling and the management of multi-energy flexible loads.
Deep Learning (DL) and Reinforcement Learning (RL) algorithms, while powerful in various domains, face specific limitations when applied to V2G scheduling problems. DL algorithms, known for their ability to learn complex patterns from data, often struggle with the dynamic and stochastic nature of V2G environments: they may require extensive training data and adapt poorly to real-time changes in EV charging patterns and grid conditions. RL algorithms, while effective in learning decision-making policies through trial and error, face challenges in V2G scenarios due to the high-dimensional action spaces and continuous state transitions involved. Additionally, RL algorithms may exhibit slow convergence and struggle with reward-function design in V2G scheduling, where the optimization goals are multifaceted and dynamic. These limitations underscore the need for more sophisticated approaches, such as Deep Reinforcement Learning (DRL) [22,23,24,25], which can leverage the strengths of DL for feature learning and RL for decision-making to effectively tackle the intricate challenges of online V2G scheduling and optimize multi-energy microgrid utilization while ensuring operator profitability.
DRL has demonstrated remarkable success in various domains of smart energy management, including demand response (DR), EV charging management, and frequency regulation. In reference [26], the authors implement a DRL-based approach to facilitate the pricing strategy of demand response providers (DRPs), whose objective is to maximize their profits by strategically scheduling DR activities to enhance system reliability. In reference [27], a DR scheme is introduced for virtual power plants, aimed at minimizing the deviation penalties incurred from participation in the electricity market; notably, the DRL-based framework is adept at handling the inherent randomness within the system model. In addition, references [28,29] formulate the single-EV V2G scheduling problem as a constrained Markov Decision Process (MDP), aiming to find a constrained charging/discharging strategy that minimizes the charging cost while guaranteeing that the EV can be fully charged. Reference [30] formalizes the EV charging scheduling problem as an MDP and utilizes DRL algorithms to minimize the total charging time of EVs and maximize the reduction in origin-destination distance. References [31,32] apply advanced DRL algorithms to jointly optimize vehicle path planning and energy usage schemes for IoT networks. Finally, references [33,34] propose DRL-based approaches to enhance the overall operational efficiency of coordinated power and transportation networks. However, the potential application of DRL in the context of V2G and multi-energy microgrids remains largely unexplored. By leveraging the power of DRL algorithms, it may be possible to develop intelligent and adaptive online V2G scheduling frameworks that dynamically respond to changing grid conditions and user preferences in real time, thereby maximizing the economic and environmental benefits of V2G integration within multi-energy microgrids.
In this paper, we propose an innovative framework for online scheduling of V2G and multi-energy microgrids, leveraging DRL techniques. We begin by developing an integrated scheduling model that combines V2G management with the optimization of multi-energy flexible demands within microgrids. This model is formulated as an MDP, considering the uncertain preferences of end-users and fluctuating electricity prices. Our key contribution lies in the application of a novel Soft Actor-Critic (SAC) algorithm tailored specifically for dynamic V2G scheduling in response to real-time grid conditions and energy demand patterns. Using extensive simulations and case studies, we demonstrate the effectiveness of our proposed approach in maximizing the total profits of multi-energy microgrid operators (MMOs) while ensuring operational constraints and satisfying user requirements. Additionally, we highlight the scalability and practical implications of deploying our V2G scheduling framework in real-world microgrid deployments, paving the way for more efficient and sustainable energy management practices.
Our main contributions are twofold:
(1) Integration of V2G and multi-energy DR: Unlike the existing literature, which often focuses solely on V2G scheduling or multi-energy DR in isolation, our framework integrates both aspects into a unified scheduling model. This integration allows for coordinated management of V2G and multi-energy flexible loads, maximizing the overall profitability of MMOs while adhering to operational constraints.
(2) DRL-based online scheduling framework with a novel SAC algorithm: A novel online scheduling framework is proposed that leverages DRL to optimize the utilization of EV batteries within multi-energy microgrids. By formulating the scheduling problem as an MDP and employing a SAC algorithm, our framework dynamically schedules V2G activities in response to real-time grid conditions and energy demand patterns.
The remaining sections of the paper are structured as follows: Section 2 outlines the formulation of V2G scheduling coordinated with multi-energy microgrids. Section 3 presents the MDP formulation of the MMOs' online V2G scheduling. Section 4 details the SAC algorithm used to solve the MDP. Section 5 covers the simulation experiments and their corresponding results. Lastly, Section 6 provides conclusions and explores avenues for future research.

2. Formulation of V2G Scheduling Coordinated with Multi-Energy Microgrids

In this section, the V2G scheduling coordinated with multi-energy microgrids is formulated as a bi-level model. Specifically, the lower level of the bi-level model captures the EVs, with uncertain arrival and departure patterns, and the responses of end-users in microgrids with different energy sources. The upper level tackles the efficient coordination of the multi-energy microgrids, integrating the dynamic pricing strategies established by operators to manage microgrid operations effectively.

2.1. The Overall Framework

Figure 1 depicts the proposed bi-level model of online V2G scheduling, strategically formed to optimize the operations of microgrids with the primary objective of minimizing total operational costs. The microgrids with different energy sources encompass three fundamental energy flows. Firstly, the power flow encompasses various components such as electricity transactions with the upper-level grid (either purchase or sale), interactions with EVs, the unpredictable output of Photovoltaic systems (PVs), production from a gas turbine, usage by electric heat pumps, and end-user consumption. Notably, EVs participate in bi-directional power flow within the microgrid, facilitated by V2G scheduling capabilities.
Secondly, the heat flow component represents heat generation from the electric heat pump and gas turbine, along with heat consumption by end users. Additionally, the gas flow component delineates the balance between gas supply, gas utilization by the gas turbine, and gas consumption by end-users. Energy prices established by operators influence end-users. End-users adapt their energy consumption patterns based on personal preferences, which are not disclosed to operators. Consequently, operators encounter two primary challenges in resolving the proposed model: (1) devising an online V2G scheduling strategy that navigates uncertainties in electricity prices and EV arrival/departure patterns and (2) orchestrating the harmonious integration of diverse energy sources within the microgrid to minimize overall operational costs. The formulation of the proposed model employs bilevel programming techniques, which will be elaborated upon in subsequent sections.

2.2. The Upper-Level Problem

The upper level pertains to the optimal operation of microgrids with electricity, heat, and gas demands, with dynamic prices set by operators from the perspective of MMOs. Specifically, the objective function minimizes the total operational cost and is defined as follows:
$$\min \sum_{t \in T} \Big( \lambda_t^{e,in} p_t^{in} + \lambda_t^{g,in} g_t^{in} - \lambda_t^{e,out} p_t^{out} - \sum_{k \in K} \big( \lambda_t^{k,e} p_t^{k} + \lambda_t^{k,g} g_t^{k} + \lambda_t^{k,h} h_t^{k} \big) \Big) \quad (1)$$
$$\lambda_t^{e,in} \le \lambda_t^{k,e} \le \lambda_t^{e,\max}, \quad \forall t, k \quad (2)$$
$$\lambda_t^{g,in} \le \lambda_t^{k,g} \le \lambda_t^{g,\max}, \quad \forall t, k \quad (3)$$
$$\lambda_t^{h,in} \le \lambda_t^{k,h} \le \lambda_t^{h,\max}, \quad \forall t, k \quad (4)$$
where $\lambda_t^{e,in}$, $\lambda_t^{e,out}$, and $\lambda_t^{g,in}$ represent the electricity purchase price, the electricity selling price, and the gas purchase price at period $t$; $p_t^{in}$, $g_t^{in}$, and $p_t^{out}$ denote the purchased power, the purchased gas, and the power sold to the upper-level grid. In addition, $p_t^{k}$, $g_t^{k}$, and $h_t^{k}$ are the electricity, gas, and heat demands of end-user $k$ at period $t$. These quantities are the feedback from the end-users at the lower level, based on their preferences regarding the electricity prices $\lambda_t^{k,e}$, gas prices $\lambda_t^{k,g}$, and heat prices $\lambda_t^{k,h}$ set by MMOs. The relationships among these prices are formulated as constraints (2)–(4), which impose limits on the demand pricing ranges set by MMOs.
Consequently, end-users will respond to these prices according to their preferences, adjusting their consumption behavior accordingly. Thus, the formulation of balance constraints of different energy sources can be expressed as follows:
$$p_t^{in} - p_t^{out} + p_t^{g} + p_t^{PV} + \sum_i p_t^{i,V2G} = \sum_k p_t^{k} + p_t^{h}, \quad \forall t \quad (5)$$
$$h_t^{p} + h_t^{g} = \sum_k h_t^{k}, \quad \forall t \quad (6)$$
$$g_t^{in} = g_t^{g} + \sum_k g_t^{k}, \quad \forall t \quad (7)$$
where constraints (5)–(7) ensure the balance between demand and supply of electricity, heat, and gas, respectively. Specifically, $p_t^{i,V2G}$ is the V2G power of EV $i$; $p_t^{g}$ and $p_t^{PV}$ are the generation power of the gas turbine and the uncertain PV output; and $p_t^{h}$ is the power consumed by the electric heat pump. In the heat and gas balance constraints, $h_t^{p}$ and $h_t^{g}$ describe the heat output of the electric heat pump and the gas turbine, respectively, while $g_t^{in}$ and $g_t^{g}$ are the procured gas and the gas used by the gas turbine. The decision variables are subject to the following constraints:
$$p_t^{g,\min} \le p_t^{g} \le p_t^{g,\max}, \quad \forall t \quad (8)$$
$$0 \le p_t^{PV} \le \bar{p}_t^{PV}, \quad \forall t \quad (9)$$
$$0 \le p_t^{h} \le p_t^{h,\max}, \quad \forall t \quad (10)$$
$$\eta^{h} p_t^{h} = h_t^{p}, \quad \forall t \quad (11)$$
$$\eta^{gh} g_t^{g} = h_t^{g}, \quad \forall t \quad (12)$$
$$\eta^{gp} g_t^{g} = p_t^{g}, \quad \forall t \quad (13)$$
where constraints (8) impose the lower and upper limits $p_t^{g,\min}$ and $p_t^{g,\max}$ on gas turbine generation. Constraints (9) bound the PV dispatch by the available output $\bar{p}_t^{PV}$. Constraints (10) limit the power consumption of the electric heat pump. Constraints (11)–(13) describe the energy conversion processes, where $\eta^{h}$, $\eta^{gh}$, and $\eta^{gp}$ denote the corresponding conversion efficiencies: $\eta^{h}$ is the electricity-to-heat efficiency of the heat pump, while $\eta^{gh}$ and $\eta^{gp}$ denote the gas-to-heat and gas-to-power conversion ratios of the gas turbine.
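To make the multi-energy coupling concrete, the following minimal Python sketch checks whether a candidate dispatch satisfies the balance constraints (5)–(7) and the conversion relations (11)–(13). The function name, efficiency values, and tolerance are illustrative assumptions, not values from the paper.

```python
def balances_hold(p_in, p_out, p_g, p_pv, p_v2g, p_users, p_hp,
                  h_hp, h_gt, h_users, g_in, g_gt, g_users,
                  eta_h=0.9, eta_gh=0.4, eta_gp=0.35, tol=1e-6):
    """Check one period's energy balances (5)-(7) and conversions (11)-(13).
    p_v2g, p_users, h_users, g_users are iterables over EVs / end-users."""
    power_ok = abs(p_in - p_out + p_g + p_pv + sum(p_v2g)
                   - sum(p_users) - p_hp) < tol              # (5)
    heat_ok = abs(h_hp + h_gt - sum(h_users)) < tol          # (6)
    gas_ok = abs(g_in - g_gt - sum(g_users)) < tol           # (7)
    conv_ok = (abs(eta_h * p_hp - h_hp) < tol                # (11)
               and abs(eta_gh * g_gt - h_gt) < tol           # (12)
               and abs(eta_gp * g_gt - p_g) < tol)           # (13)
    return power_ok and heat_ok and gas_ok and conv_ok
```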
In addition, EV i is required to charge its battery to a predetermined State of Charge (SoC) level before departing. It is assumed that during the parking intervals, MMOs have access to chargers and EVs, enabling them to regulate the charging power and switch between charging and discharging modes to facilitate V2G scheduling. This capability allows MMOs to achieve additional reductions in operational costs and enhance overall energy efficiency. The corresponding SoC constraints can be formulated as follows:
$$L_t^{i} = L_{t-1}^{i} + \eta^{i,ch} p_t^{i,ch} - p_t^{i,dis} / \eta^{i,dis}, \quad \forall i, t \quad (14)$$
$$L_t^{i,\min} \le L_t^{i} \le L_t^{i,\max}, \quad \forall i, t \quad (15)$$
$$0 \le p_t^{i,ch} \le P_t^{ch,\max}, \quad \forall i, t \quad (16)$$
$$0 \le p_t^{i,dis} \le P_t^{dis,\max}, \quad \forall i, t \quad (17)$$
where $L_t^{i}$ denotes the SoC level of EV $i$ at period $t$, and $L_t^{i,\min}$ and $L_t^{i,\max}$ denote its minimum and maximum SoC levels. $\eta^{i,ch}$ and $\eta^{i,dis}$ are the charging and discharging efficiencies of EV $i$. $p_t^{i,ch}$ and $p_t^{i,dis}$ are the charging and discharging power of EV $i$ at period $t$, bounded by the maximum charging and discharging power limits $P_t^{ch,\max}$ and $P_t^{dis,\max}$. Constraints (14) describe the relationship between the SoC level and the charging/discharging power. Constraints (15)–(17) impose the upper and lower limits on the EV states. Based on that, the total V2G power can be represented as follows:
$$p_t^{i,V2G} = p_t^{i,dis} - p_t^{i,ch}, \quad \forall t, i \quad (18)$$
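As a concrete illustration of the battery model, the sketch below implements the SoC transition (14), the SoC bounds (15), and the net V2G power (18). The efficiencies and the charging profile are illustrative assumptions.

```python
import numpy as np

def soc_update(soc_prev, p_ch, p_dis, eta_ch=0.95, eta_dis=0.95):
    """SoC transition (14): charging adds eta_ch * p_ch; discharging
    removes p_dis / eta_dis (power expressed per unit of capacity)."""
    return soc_prev + eta_ch * p_ch - p_dis / eta_dis

def v2g_power(p_ch, p_dis):
    """Net V2G injection (18): positive when the EV discharges to the grid."""
    return p_dis - p_ch

# Illustrative check: charge at 0.1 p.u. for three periods, then discharge once.
soc = 0.5
for p_ch, p_dis in [(0.1, 0.0), (0.1, 0.0), (0.1, 0.0), (0.0, 0.05)]:
    soc = np.clip(soc_update(soc, p_ch, p_dis), 0.0, 1.0)  # enforce (15)
print(f"final SoC: {soc:.3f}")  # ~0.732 with these assumed numbers
```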

2.3. The Formulation of Lower-Level Problem

After MMOs establish the V2G scheduling and set prices for the multi-energy demands, end-users respond to these price signals by adapting their consumption patterns according to their preferences, aiming to maximize their welfare. Subsequently, MMOs monitor the multi-energy demands and make additional operational decisions. The response of the $k$-th end-user in the lower-level problem can be formulated as follows:
$$\max_{p_t^{k}, g_t^{k}, h_t^{k}} \sum_{t \in T} \Big( U(p_t^{k}, g_t^{k}, h_t^{k}) - \lambda_t^{k,e} p_t^{k} - \lambda_t^{k,g} g_t^{k} - \lambda_t^{k,h} h_t^{k} \Big) \quad (19)$$
$$U(p_t^{k}, g_t^{k}, h_t^{k}) = a_t^{k,p} (p_t^{k})^2 + b_t^{k,p} p_t^{k} + a_t^{k,g} (g_t^{k})^2 + b_t^{k,g} g_t^{k} + a_t^{k,h} (h_t^{k})^2 + b_t^{k,h} h_t^{k}, \quad \forall t, k \quad (20)$$
$$0 \le p_t^{k} \le p_t^{k,\max}, \quad \forall k, t \quad (21)$$
$$0 \le g_t^{k} \le g_t^{k,\max}, \quad \forall k, t \quad (22)$$
$$0 \le h_t^{k} \le h_t^{k,\max}, \quad \forall k, t \quad (23)$$
where expression (20) defines the welfare of the $k$-th end-user, and $a_t^{k,p}$, $b_t^{k,p}$, $a_t^{k,g}$, $b_t^{k,g}$, $a_t^{k,h}$, and $b_t^{k,h}$ denote the preference parameters that shape the responses of end-users to the prices of the different energy sources. Furthermore, constraints (21)–(23) delineate the upper limits on the consumption of the various energy sources. Notably, the end-users' preference parameters remain private and undisclosed to MMOs. This confidentiality constraint underscores the necessity for the MDP-based formulation discussed in subsequent sections.
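Since the utility (20) is quadratic and separable across energy carriers, each end-user's best response can be computed carrier-by-carrier in closed form, assuming the curvature parameters are negative (concave utility). The sketch below illustrates this; all parameter values are illustrative assumptions, not values from the paper.

```python
import numpy as np

def best_response(lam, a, b, x_max):
    """Maximize a*x^2 + (b - lam)*x over [0, x_max], i.e., one carrier of
    (19)-(23). For a < 0 the stationary point is x* = (lam - b) / (2a),
    clipped to the feasible box."""
    assert a < 0, "concavity of the utility (20) requires a < 0"
    x_star = (lam - b) / (2.0 * a)
    return float(np.clip(x_star, 0.0, x_max))

# Illustrative preference parameters (assumptions):
print(best_response(lam=0.12, a=-0.05, b=0.25, x_max=2.0))  # -> 1.3
print(best_response(lam=0.20, a=-0.05, b=0.25, x_max=2.0))  # -> 0.5 (higher price, lower demand)
```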

3. MDP Formulation for MMOs' Online V2G Scheduling in Multi-Energy Microgrids

To devise the scheduling strategy for MMOs, the previously discussed bi-level programming is restructured as an MDP problem. This transformation facilitates the development of a DRL-based framework aimed at learning an effective online V2G scheduling strategy, considering the uncertainties surrounding EVs’ arrival and departure patterns, electricity prices, and the undisclosed end-users’ preference behavior patterns. The proposed framework is formulated within a finite MDP framework, empowering the DRL process to devise an online V2G scheduling that is synchronized with multi-energy microgrids.
Properly selecting states in an MDP is crucial, particularly in the context of online V2G scheduling, to achieve optimal performance and devise effective strategies. This importance becomes particularly pronounced when considering the substantial presence of EVs as distributed resources within the system. The choice of states in an MDP significantly influences the decision-making process of the reinforcement learning algorithm, impacting the quality of the resulting V2G scheduling strategy. Specifically, in the case of a large number of EVs acting as distributed resources, the states should adequately capture the dynamic nature of EV-related parameters such as arrival and departure patterns, battery state of charge, charging and discharging rates, and EV owners’ preferences and behaviors. By accurately defining these states, the MDP-based framework can learn and adapt to the complex and uncertain environment of V2G scheduling, leading to improved performance, enhanced energy efficiency, and better coordination of multi-energy microgrids with EVs as integral components.
In our proposed MDP formulation, the MMOs receive information on the required remaining SoC of each EV $L_t^{i,re}$, the remaining charging time $T^{i,re}$, and the electricity and gas prices $\lambda_t^{e,in}$, $\lambda_t^{e,out}$, and $\lambda_t^{g,in}$. Given this information and the states at previous time periods, the MMO constructs the state tensor $s_t$. On that basis, the neural network of the MMOs determines its action $a_t$, which is then communicated to each end-user at the corresponding time period. Upon receiving the dynamic pricing information, each end-user formulates its own optimization problem and adjusts its response accordingly. MMOs then calculate the total operation costs based on the end-users' responses. Subsequently, MMOs observe the next state and generate a new action. This iterative process continues in an online fashion until the MMOs' scheduling and pricing strategy terminates.

3.1. States

The upper level of the bi-level model describes the optimal online scheduling of V2G in microgrids with different energy sources and the time-varying prices set by an MMO. The states of the MMOs include the required remaining SoC of each EV $L_t^{i,re}$, the remaining charging time $T^{i,re}$, the end-users' demands over the previous $N$ periods, and the electricity selling/purchase prices and gas prices $\lambda_t^{e,in}$, $\lambda_t^{e,out}$, and $\lambda_t^{g,in}$ over the previous $M$ periods, as listed below:
$$s_t = \Big[ \big( L_t^{i,re}, T^{i,re} \big)_{i=1}^{\#I}, \; \big( p_{t-n}^{k}, g_{t-n}^{k}, h_{t-n}^{k} \big)_{k=1,n=1}^{\#K,\#N}, \; \big( \lambda_{t-m+1}^{e,in}, \lambda_{t-m+1}^{e,out}, \lambda_{t-m+1}^{g,in} \big)_{m=1}^{\#M} \Big], \quad \forall t \in T \quad (24)$$

3.2. Actions

Upon receiving the state $s_t$, the MMOs determine an action, denoting the online V2G scheduling and the time-varying prices of the different energy demands, which are sent to each end-user $k$:
$$a_t = \Big[ \big( p_t^{i,V2G} \big)_{i=1}^{\#I}, \; \big( \lambda_t^{k,e}, \lambda_t^{k,g}, \lambda_t^{k,h} \big)_{k=1}^{\#K} \Big], \quad \forall t \quad (25)$$

3.3. Reward

In response to the action of MMOs, EVs perform the charging and discharging orders, and end users determine the energy usage plan, resulting in the total operation cost. The reward signals can be designed and expressed as follows:
$$r_t = - \Big( \lambda_t^{e,in} p_t^{in} + \lambda_t^{g,in} g_t^{in} - \lambda_t^{e,out} p_t^{out} - \sum_{k \in K} \big( \lambda_t^{k,e} p_t^{k} + \lambda_t^{k,g} g_t^{k} + \lambda_t^{k,h} h_t^{k} \big) + C_t^{EV} \Big), \quad \forall t \quad (26)$$
where $C_t^{EV}$ is the penalty for failing to meet the required SoC when an EV leaves at period $t$, formulated as follows:
$$C_t^{EV} = \sum_i \lambda^{pe} \max \big\{ 0, \, L_t^{set} - L_t^{i} \big\}, \quad \forall t \quad (27)$$
Note that the max term is positive only when the SoC of EV $i$ at departure falls short of the target level $L_t^{set}$.

3.4. Transition Function

The state transition from $s_t$ to $s_{t+1}$ is determined by the action $a_t$ of the MMOs and the inaccessible information of end-users as follows:
$$s_{t+1} = f(a_t, s_t, \omega) \quad (28)$$
where $\omega$ denotes exogenous random information. In particular, the MMOs' scheduling explicitly influences the EVs' SoC.
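To fix ideas, the following gym-style skeleton sketches how the state (24), action (25), reward (26)–(27), and transition (28) could be wired together in simulation. The user-response model, the price signals, and the battery dynamics here are crude placeholders (assumptions on our part), not the paper's actual environment.

```python
import numpy as np

class V2GMicrogridEnv:
    """Minimal sketch of the MMO's MDP. All dynamics are illustrative stubs."""

    def __init__(self, n_evs=5, horizon=24, soc_set=0.8, lam_pe=10.0, seed=0):
        self.n_evs, self.T = n_evs, horizon
        self.soc_set, self.lam_pe = soc_set, lam_pe
        self.rng = np.random.default_rng(seed)

    def reset(self):
        self.t = 0
        self.soc = self.rng.uniform(0.2, 0.6, self.n_evs)
        self.depart = self.rng.integers(6, self.T, self.n_evs)
        return self._state()

    def _state(self):
        # Echo of (24): per-EV SoC gap and remaining time, plus a price stub.
        return np.concatenate([self.soc_set - self.soc,
                               (self.depart - self.t).astype(float),
                               [0.1, 0.1, 0.1]])

    def step(self, action):
        # Echo of (25): first n_evs entries are V2G power, the rest are prices.
        p_v2g, prices = action[:self.n_evs], action[self.n_evs:]
        self.soc = np.clip(self.soc - 0.1 * p_v2g, 0.0, 1.0)  # crude stand-in for (14)
        demands = np.maximum(0.0, 1.0 - prices)               # stub for the lower level (19)
        cost = 0.1 * np.abs(p_v2g).sum() - (prices * demands).sum()  # stand-in for (1)
        leaving = self.depart == self.t
        penalty = self.lam_pe * np.maximum(0.0, self.soc_set - self.soc[leaving]).sum()  # (27)
        self.t += 1
        return self._state(), -(cost + penalty), self.t >= self.T, {}
```

An agent interacts with this environment exactly as in the loop of Algorithm 1 below: observe $s_t$, pick $a_t$, and receive $r_t$ and $s_{t+1}$.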

4. SAC Algorithm for Solving MDP Formulation

In this section, the SAC algorithm is introduced as a method to approximate the solution to the online V2G scheduling challenge. This approach enables the algorithm to obtain an optimized pricing strategy adaptively.

4.1. Preliminaries

The SAC algorithm represents a maximum-entropy off-policy DRL approach renowned for its enhanced sample efficiency and robust training outcomes. Diverging from conventional deep Q networks (DQN), SAC implements soft-version updates to the Q-network, fostering exploration across the action space. Moreover, the agent also receives an entropy bonus that encapsulates the policy’s stochastic nature, mitigating premature convergence towards suboptimal solutions. Consequently, the training objective aims to maximize the cumulative reward (or minimize the total cost) expressed as:
$$\pi_\theta^{*} = \arg\max_\theta \sum_{t=1}^{T} \gamma^{t} \, \mathbb{E} \big[ r_t(s_t, a_t) + \alpha H(\pi_\theta(\cdot \mid s_t)) \big] \quad (29)$$
where $\pi_\theta$ denotes the policy network parameterized by $\theta$, $\gamma$ is the discount factor, $H(\cdot)$ is the entropy, and $\alpha$ is the temperature factor of the entropy term.

4.2. Training Process

Figure 2 illustrates the structural layout of the SAC algorithm: the actor network, the critic networks, and the temperature parameter. The critic network, parameterized by $\varphi$, is trained to satisfy the soft Bellman optimality equation. To mitigate the overestimation of Q-values, the clipped double-Q approach is implemented, utilizing target networks parameterized by $\bar{\varphi}$.
$$y = r_t + \gamma \Big( \min_{i=1,2} Q_{\bar{\varphi}_i}(s_{t+1}, a_{t+1}) - \alpha \log \pi_\theta(a_{t+1} \mid s_{t+1}) \Big) \quad (30)$$
$$J(\varphi_i) = \mathbb{E}_{(s_t, a_t, r_t, s_{t+1}) \sim B} \Big[ \big( Q_{\varphi_i}(s_t, a_t) - y \big)^2 \Big], \quad i = 1, 2 \quad (31)$$
where $y$ is the shared target value for the two Q-functions, and $\log \pi_\theta(\cdot)$ indicates the policy entropy. The actor samples actions via the reparameterization trick and is updated as follows:
$$a_\theta = \tanh(\mu_\theta + \varepsilon \odot \sigma_\theta), \quad \varepsilon \sim \mathcal{N}(0, I) \quad (32)$$
$$J(\theta) = \max_\theta \mathbb{E}_{\varepsilon \sim \mathcal{N}} \Big[ \min_{i=1,2} Q_{\varphi_i}(s_t, a_\theta) - \alpha \log \pi_\theta(a_\theta \mid s_t) \Big] \quad (33)$$
Hence, the automated entropy-adjustment method is applied to adaptively tune $\alpha$ by minimizing the following expression:
$$J(\alpha) = \mathbb{E}_{\varepsilon \sim \mathcal{N}} \big[ -\alpha \log \pi_\theta(a_\theta \mid s_t) - \alpha \bar{H} \big] \quad (34)$$
where the target entropy $\bar{H}$ is set to the negative of the action dimension. In addition, target networks are applied to smooth the approximation of the Q-functions. Based on that, the detailed training process of the SAC algorithm is presented in Algorithm 1.
Algorithm 1 The Proposed DRL-based Online V2G in Multi-energy Microgrids with SAC
1: Initialize replay buffer $B$
2: Initialize actor $\theta$, critics $\varphi_i$, temperature $\alpha$, and target networks $\bar{\varphi}_i$
3: for each epoch do
4:   for each state transition step do
5:     Given $s_t$, take action $a_t$ based on (32)
6:     Observe the multi-energy demands (19) with $a_t$ as the V2G scheduling and multi-energy prices
7:     Solve the scheduling model and obtain the operation costs
8:     Receive $r_t$, $s_{t+1}$ and record them in buffer $B$
9:   end for
10:  for each gradient step do
11:    $\theta \leftarrow \theta - \lambda_\theta \nabla_\theta J(\theta)$
12:    $\varphi_i \leftarrow \varphi_i - \lambda_\varphi \nabla_{\varphi_i} J(\varphi_i)$
13:    $\alpha \leftarrow \alpha - \lambda_\alpha \nabla_\alpha J(\alpha)$
14:    $\bar{\varphi}_i \leftarrow \tau \varphi_i + (1 - \tau) \bar{\varphi}_i$
15:  end for
16: end for
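To complement Algorithm 1, the following PyTorch sketch shows one gradient step corresponding to lines 11–14: the critic update (30)–(31), the actor update (33), the temperature update (34), and the soft target update. It assumes `actor(s)` returns a tanh-squashed reparameterized action and its log-probability as in (32); network definitions and the replay buffer are omitted, so this is a sketch under those assumptions, not the authors' implementation.

```python
import torch
import torch.nn.functional as F

def sac_update(batch, actor, critics, targets, log_alpha,
               opt_actor, opt_critics, opt_alpha,
               gamma=0.99, tau=0.005, target_entropy=-1.0):
    """One SAC gradient step. target_entropy should be the negative of
    the action dimension, as stated after (34)."""
    s, a, r, s2 = batch
    alpha = log_alpha.exp()

    # Critic target (30): clipped double-Q with the entropy bonus.
    with torch.no_grad():
        a2, logp2 = actor(s2)
        q_next = torch.min(targets[0](s2, a2), targets[1](s2, a2))
        y = r + gamma * (q_next - alpha * logp2)

    # Critic losses (31), one optimizer step per Q-network.
    for critic, opt in zip(critics, opt_critics):
        loss_q = F.mse_loss(critic(s, a), y)
        opt.zero_grad(); loss_q.backward(); opt.step()

    # Actor loss (33): maximize min-Q minus the entropy cost.
    a_new, logp = actor(s)
    q_new = torch.min(critics[0](s, a_new), critics[1](s, a_new))
    loss_pi = (alpha.detach() * logp - q_new).mean()
    opt_actor.zero_grad(); loss_pi.backward(); opt_actor.step()

    # Temperature loss (34), optimized through log_alpha.
    loss_alpha = -(log_alpha * (logp.detach() + target_entropy)).mean()
    opt_alpha.zero_grad(); loss_alpha.backward(); opt_alpha.step()

    # Soft target update (Algorithm 1, line 14).
    for c, tgt in zip(critics, targets):
        for p, tp in zip(c.parameters(), tgt.parameters()):
            tp.data.mul_(1.0 - tau).add_(tau * p.data)
```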

5. Case Studies and Discussion

In this section, we conduct different case studies to verify the effectiveness of the proposed DRL-based framework for online V2G scheduling in microgrids with different energy sources. First, the training process of the policy network in SAC, which approximates the optimal online V2G scheduling and multi-energy pricing strategy, is shown. Then, we compare the V2G power of the proposed approach against the optimal values solved in hindsight. Finally, we illustrate the ability of the proposed approach to satisfy the SoC constraints given different arrival and departure times, showcasing its potential for real-world V2G applications. The case studies comprise two parts, with five EVs and ten EVs in the microgrids, denoted as Case I and Case II, respectively. Specifically, the EVs' arrival and departure patterns are sampled from the given distributions, as shown in Table 1 and Table 2. Additionally, the electricity prices are taken from the open data of the New York Independent System Operator (NYISO); specifically, the prices of the NYC zone on 1 March 2023 are used.
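The distributions in Tables 1 and 2 can be sampled directly to generate EV arrival/departure scenarios. The sketch below reproduces the table values; clipping departures to the end of the day is our assumption, since the paper does not state how overnight stays are handled.

```python
import numpy as np

# Hourly arrival probabilities from Table 1 (hours 1-24).
arrival_p = np.array([0.070, 0.070, 0.062, 0.060, 0.023, 0.033, 0.050, 0.060,
                      0.060, 0.050, 0.040, 0.030, 0.030, 0.040, 0.040, 0.060,
                      0.040, 0.060, 0.040, 0.040, 0.030, 0.005, 0.005, 0.002])
# Plug-in interval probabilities from Table 2 (1-10 h).
duration_p = np.array([0.00, 0.10, 0.15, 0.20, 0.15, 0.15, 0.13, 0.05, 0.05, 0.02])

def sample_ev(rng):
    """Draw one EV's (arrival hour, departure hour) from Tables 1 and 2."""
    arrive = int(rng.choice(24, p=arrival_p / arrival_p.sum())) + 1
    stay = int(rng.choice(10, p=duration_p)) + 1
    return arrive, min(arrive + stay, 24)  # clip to the day (assumption)

rng = np.random.default_rng(0)
print([sample_ev(rng) for _ in range(5)])
```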

5.1. Training Process and Results of Case I

This part presents the outcomes derived from the extensive training regimen of neural networks involving 5 EVs conducted over a span of 5000 epochs, as delineated in Figure 3. The efficacy of the DRL-based online V2G framework underwent scrutiny via the computation of the ratio between the optimal returns yielded by the bi-level model and those achieved by the neural network. It is imperative to note that the bi-level model benefitted from complete access to EV arrival and departure times as well as end users’ private preferences, contrasting with the proposed DRL-based approach that operates without any prior information.
The graphical representation illustrates a substantial variance in the neural network's ratio, ranging between 0.20 and 0.65 during the initial 1500 training epochs, owing to the exploration behavior of the SAC algorithm. Designed to foster exploration, the algorithm inclines the neural network towards selecting actions with elevated entropy to traverse the action space comprehensively. Such actions, albeit infrequently chosen, contribute to refining the neural network's accuracy in estimating global values. Consequently, the policy network begins to adopt an optimal strategy grounded in the current fit of the Q-function. As a result, the ratio rapidly increases to 0.90 between the 3500th and 5000th training epochs, indicating the neural network's remarkable ability to approximate the optimal online V2G scheduling. After the 4000th training epoch, the neural network's performance stabilizes, converging around 0.95.
The effectiveness of the proposed approach in approximating optimal values for V2G power management is evident from the charging and discharging strategy shown in Figure 4. The analysis of calculated mean absolute error (MAE) values reveals that the proposed approach approximates optimal values, showcasing its robustness and accuracy in handling charging and discharging power operations. Specifically, the MAE for charging power approximates 0.1029, indicating a relatively small deviation from the optimal values. Similarly, the MAE for discharging power is even lower at approximately 0.0701, underscoring the proposed approach’s exceptional ability to approximate optimal values with high precision. This close alignment is further evidenced by the shape of the charging and discharging power curves, which exhibit a significant resemblance to the optimal curves. The negligible discrepancies observed in the MAE values reinforce the notion that the proposed approach accurately captures the intricacies of V2G power dynamics, resulting in effective and reliable management strategies for energy systems optimization.
The analysis of the net purchase data for gas and electricity reveals distinct patterns indicative of strategic decision-making in energy procurement, as shown in Figure 5. Notably, during periods of low electricity prices, such as between 0:00 and 3:00, the net purchase of electricity peaks while gas purchases decline. Conversely, during daytime hours, when electricity prices are higher, there is a corresponding increase in gas purchases and a reduction in electricity procurement. This strategic shift in purchasing behavior aligns with optimizing the utilization of gas turbine power generation, effectively leveraging the complementary characteristics of multi-energy microgrids. By prioritizing electricity purchases during off-peak hours and strategically adjusting gas procurement based on market dynamics, the system efficiently meets the demand for power generation while minimizing costs. These findings underscore the adaptive and responsive nature of energy procurement strategies, showcasing the effectiveness of leveraging diverse energy sources within microgrid systems.
The analysis of State of Charge (SoC) constraints for EVs under the proposed approach reveals a commendable performance in meeting desired SoC levels, as shown in Figure 6. The data from 5 EVs, with SoC values ranging from approximately 0.793 to 0.810, showcases the effectiveness of the proposed approach in maintaining SoC close to the optimal target of 0.8. The proximity of the observed SoC values to the desired 0.8 thresholds indicates a high degree of compliance with SoC constraints, reflecting the precision and accuracy of the proposed approach in managing EV battery states. This capability is crucial for optimizing EV operations and ensuring optimal battery utilization while meeting performance requirements. Overall, the results highlight the effectiveness of the proposed approach in successfully managing SoC constraints for EVs, contributing to efficient and reliable EV operation within energy systems.

5.2. Training Process and Results of Case II

This part presents the outcomes derived from the extensive training regimen of neural networks involving 10 EVs conducted throughout 5000 epochs, as shown in Figure 7. The training trajectory of the proposed methodology, under the guidance of the SAC algorithm, manifests a gradual evolution towards heightened performance, substantiated by meticulous ratio curve analysis. Throughout the exploration phase, the ratio demonstrates a dynamic variation ranging from approximately 0.1 to 0.7, portraying the algorithm’s active engagement in exploratory learning processes. However, a discernible upward surge in the ratio becomes apparent with progressive training, achieving notable peaks reaching up to 0.90 between the 3500th and 5000th training epochs. This steep ascent underscores the algorithm’s adeptness in refining policies and executing judicious decisions, resulting in a substantial performance enhancement. After the 4000th training epoch, the ratio stabilizes at approximately 0.90, denoting a consistent and elevated level of performance closely aligned with optimal objectives. Compared with the training process in Case I, the neural network exhibits the ability to adapt effectively to the scenario with more EVs. The discerned pattern within the ratio curve serves as a testament to the DRL-based algorithm’s efficacy in fostering efficacious learning processes, adaptive behaviors, and performance amplification within the proposed approach.
In addition, the SAC algorithm is compared with Deep Deterministic Policy Gradient (DDPG) and Twin Delayed DDPG (TD3) to illustrate its superiority in solving online V2G problems, as shown in Figure 7. It can be observed that SAC achieves a higher ratio than DDPG and TD3 thanks to its exploration policy, which is driven by the maximum-entropy mechanism. In online V2G scheduling, the policy is challenged by the high dimensionality of the action space, and SAC is capable of efficiently updating the temperature factor and fine-tuning the policy network. Specifically, the ratio achieved by SAC is 103.5% and 66.1% higher than those of DDPG and TD3, respectively.
The efficacy of the proposed methodology in approximating optimal values for V2G power management is demonstrated in Figure 8. The MAE assessment reveals an effective approximation between the proposed approach and optimal values, highlighting its robustness and precision in managing charging and discharging power operations. Specifically, the MAE for charging power approximates 0.1382, indicating a minor deviation from optimal values. Similarly, the MAE for discharging power is even lower at approximately 0.1014, underscoring the remarkable accuracy of the proposed approach in approximating optimal values with exceptional precision. This alignment is further substantiated by the striking similarity between the charging and discharging power curves and the optimal curves, further reinforcing the accuracy of the proposed approach in capturing the complexities of V2G power dynamics. These findings underscore the effectiveness of the proposed methodology in devising efficient and reliable management strategies for optimizing energy systems.
Examining gas and electricity net purchase data reveals discernible patterns that reflect strategic decision-making in energy procurement, as depicted in Figure 9. Particularly noteworthy is the observed trend during periods of low electricity prices, notably between 0:00 and 3:00, where electricity procurement peaks while gas purchases diminish. In contrast, as daytime ensues and electricity prices rise, there is a corresponding surge in gas procurement coupled with a reduction in electricity acquisition. This strategic realignment of procurement strategies aligns with optimizing the utilization of gas turbine power generation, effectively capitalizing on the synergistic attributes of multi-energy microgrids. The strategic emphasis on electricity procurement during off-peak intervals, alongside the adept adjustment of gas acquisition in response to market dynamics, ensures efficient meeting of power generation demands while mitigating costs. These observations underscore the adaptive and responsive nature inherent in energy procurement strategies, highlighting the efficacy of integrating diverse energy sources within microgrid systems.
The assessment of SoC constraints for EVs within the proposed framework demonstrates commendable performance, as illustrated in Figure 10. The dataset comprising SoC values from 10 EVs, ranging between approximately 0.783 and 0.819, underscores the effectiveness of the proposed approach in maintaining SoC levels near the optimal target of 0.8 at the period of departure. This proximity to the desired SoC threshold signifies a high level of adherence to SoC constraints, indicative of the precision and accuracy embedded in the proposed methodology for managing EV battery states. This capability is of significant importance in optimizing EV operations and ensuring the optimal utilization of batteries while meeting performance benchmarks. In summary, these findings affirm the effectiveness of the proposed approach in adeptly managing SoC constraints for EVs, thus enhancing the efficiency and reliability of EV operations within energy systems.

5.3. Tests on the Robustness and Efficiency of the SAC Algorithm

In this section, we further test the robustness and efficiency of the SAC algorithm under different levels of uncertainty. We define the level of uncertainty as the ratio of the variance to the mean of the random variables. Specifically, the uncertainties of electricity prices and the various demands of end-users are considered. We note that only the uncertain patterns of EV arrivals and departures are included in the training. The detailed performance of the trained SAC under different levels of uncertainty is shown in Figure 11. The range of total profits becomes wider as the level of uncertainty grows. Nonetheless, the average total profit drops only slightly, from $318.3 to $313.6 (approximately 1.5%), which demonstrates the robustness and efficiency of the SAC algorithm.
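One way to realize this uncertainty measure in simulation, under our interpretation that each random variable is perturbed by Gaussian noise whose variance-to-mean ratio equals the chosen level, is sketched below; the flat price profile is purely illustrative.

```python
import numpy as np

def perturb(base, level, rng):
    """Add noise so that Var/mean of each perturbed value equals `level`.
    Var = level * mean  =>  std = sqrt(level * mean), elementwise."""
    std = np.sqrt(level * np.abs(base))
    return base + rng.normal(0.0, std)

rng = np.random.default_rng(1)
prices = np.full(24, 0.10)              # illustrative flat hourly price profile
noisy_prices = perturb(prices, level=0.02, rng=rng)
```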

5.4. Tests on the Scenarios with Different Numbers of EVs

In this section, we increase the number of EVs participating in the online V2G scheduling to further validate the effectiveness of the SAC algorithm, as illustrated in Figure 12. It can be observed that increasing the number of EVs affects the performance of the SAC algorithm due to the growing dimension of the action space: with the same number of training epochs, the policy network has less chance to find the optimal V2G scheduling strategy. Hence, the ratio, which represents the effectiveness of SAC, decreases as the number of EVs rises. Specifically, compared with the scenario with 10 EVs, the ratios drop by 5.1%, 6.6%, 11.4%, and 18.9% in the scenarios with 15, 20, 25, and 30 EVs, respectively. We note that the above comparison is conducted with the same size of policy network; with larger network capacities, the performance improves.

5.5. The Daily Battery Profiles of EVs

In this section, we examine the daily battery profiles of EVs to illustrate the effectiveness of the proposed approach. Specifically, the daily battery profiles of five and ten EVs are shown in Figure 13 and Figure 14, respectively. It can be observed that the online V2G scheduling controlled by the SAC algorithm adapts to the EVs' arrival and departure patterns and regulates the SoC levels to satisfy the charging requirements. In addition, SAC utilizes the V2G capability to increase profits by charging during periods with lower electricity prices and discharging during periods with higher prices. As the departure period approaches, SAC adjusts the SoC to the predefined level (80%).

6. Conclusions

In this paper, we introduce a novel approach for online Electric Vehicle (EV) Vehicle-to-Grid (V2G) scheduling coordinated with multi-energy microgrids, leveraging Deep Reinforcement Learning (DRL) techniques. Our proposed framework effectively tackles challenges arising from uncertainties such as fluctuating electricity prices and EV arrival/departure patterns. By formulating the scheduling problem as a Markov Decision Process (MDP) and employing a Soft Actor-Critic (SAC) algorithm, we achieve dynamic and optimized scheduling of EV charging and discharging activities in response to real-time grid conditions and energy demand fluctuations. The results from extensive simulations in the case studies demonstrate the efficacy of our approach. We observed significant cost reductions for multi-energy microgrid operators (MMOs) due to improved coordination of V2G scheduling and multi-energy flexible loads. The average deviation from optimal V2G schedules was minimized, leading to enhanced profitability and sustainability in microgrid operations. Furthermore, our approach exhibited robustness and adaptability in handling uncertainties, showcasing its ability to dynamically adjust scheduling strategies based on evolving grid conditions and energy demand patterns. Overall, our work contributes to advancing smart grid technologies and promoting efficient utilization of renewable energy resources. The intelligent and adaptive online V2G scheduling strategies proposed in our framework have the potential to transform energy management practices, paving the way for a more resilient and environmentally sustainable energy infrastructure.
For future work, we aim to expand our research in several directions. Firstly, we plan to investigate the integration of advanced forecasting techniques, such as machine learning-based prediction models, to enhance the accuracy of demand forecasting and improve the adaptability of our scheduling framework. Additionally, we will explore the incorporation of demand response mechanisms and energy storage technologies to further optimize energy utilization and grid stability. Moreover, we seek to evaluate the scalability and deployment feasibility of our approach in real-world multi-energy microgrid environments, considering factors like communication protocols, computational efficiency, and scalability.

Author Contributions

Conceptualization, W.P. and Y.L.; methodology, W.P.; software, W.P.; validation, W.P.; formal analysis, W.P., Z.G. and T.Q.; investigation, W.P. and X.Y.; resources, W.P.; data curation, W.P.; writing—original draft preparation, W.P.; writing—review and editing, Z.G. and T.Q.; visualization, W.P.; supervision, Y.L. and X.Y. All authors have read and agreed to the published version of the manuscript.

Funding

This research is supported by Jiangsu Province Key Research and Development Program (BE2023093-2), and Jiangsu Key Laboratory of Smart Grid Technology and Equipment.

Data Availability Statement

The code used in this study is available from the authors upon request.

Conflicts of Interest

Author Xiaorong Yu was employed by the company State Grid Jiangsu Electric Vehicle Service Co., Ltd. The remaining authors declare that the research was conducted in the absence of any commercial or financial relationships that could be construed as a potential conflict of interest.

References

  1. Li, K.; Shao, C.; Zhang, H.; Wang, X. Strategic Pricing of Electric Vehicle Charging Service Providers in Coupled Power-Transportation Networks. IEEE Trans. Smart Grid 2023, 14, 2189–2201. [Google Scholar] [CrossRef]
  2. Shao, C.; Li, K.; Qian, T.; Shahidehpour, M.; Wang, X. Generalized User Equilibrium for Coordination of Coupled Power-Transportation Network. IEEE Trans. Smart Grid 2023, 14, 2140–2151. [Google Scholar] [CrossRef]
  3. Martin, X.A.; Escoto, M.; Guerrero, A.; Juan, A.A. Battery Management in Electric Vehicle Routing Problems: A Review. Energies 2024, 17, 1141. [Google Scholar] [CrossRef]
  4. Armenta-Déu, C.; Demas, L. Optimization of Grid Energy Balance Using Vehicle-to-Grid Network System. Energies 2024, 17, 1008. [Google Scholar] [CrossRef]
  5. Belany, P.; Hrabovsky, P.; Florkova, Z. Probability Calculation for Utilization of Photovoltaic Energy in Electric Vehicle Charging Stations. Energies 2024, 17, 1073. [Google Scholar] [CrossRef]
  6. Qian, T.; Shao, C.; Wang, X.; Shahidehpour, M. Deep Reinforcement Learning for EV Charging Navigation by Coordinating Smart Grid and Intelligent Transportation System. IEEE Trans. Smart Grid 2020, 11, 1714–1723. [Google Scholar] [CrossRef]
  7. Qian, T.; Shao, C.; Wang, X.; Zhou, Q.; Shahidehpour, M. Shadow-Price DRL: A Framework for Online Scheduling of Shared Autonomous EVs Fleets. IEEE Trans. Smart Grid 2022, 13, 3106–3117. [Google Scholar] [CrossRef]
  8. Panchanathan, S.; Vishnuram, P.; Rajamanickam, N.; Bajaj, M.; Blazek, V.; Prokop, L.; Misak, S. A Comprehensive Review of the Bidirectional Converter Topologies for the Vehicle-to-Grid System. Energies 2023, 16, 2503. [Google Scholar] [CrossRef]
  9. Chai, Y.T.; Che, H.S.; Tan, C.; Tan, W.-N.; Yip, S.-C.; Gan, M.-T. A Two-Stage Optimization Method for Vehicle to Grid Coordination Considering Building and Electric Vehicle User Expectations. Int. J. Electr. Power Energy Syst. 2023, 148, 108984. [Google Scholar] [CrossRef]
  10. Rahman, M.M.; Gemechu, E.; Oni, A.O.; Kumar, A. The Development of a Techno-Economic Model for Assessment of Cost of Energy Storage for Vehicle-to-Grid Applications in a Cold Climate. Energy 2023, 262, 125398. [Google Scholar] [CrossRef]
  11. Hou, L.; Dong, J.; Herrera, O.E.; Mérida, W. Energy Management for Solar-Hydrogen Microgrids with Vehicle-to-Grid and Power-to-Gas Transactions. Int. J. Hydrogen Energy 2023, 48, 2013–2029. [Google Scholar] [CrossRef]
  12. Elkholy, M.H.; Said, T.; Elymany, M.; Senjyu, T.; Gamil, M.M.; Song, D.; Ueda, S.; Lotfy, M.E. Techno-Economic Configuration of a Hybrid Backup System within a Microgrid Considering Vehicle-to-Grid Technology: A Case Study of a Remote Area. Energy Convers. Manag. 2024, 301, 118032. [Google Scholar] [CrossRef]
  13. Wan, M.; Yu, H.; Huo, Y.; Yu, K.; Jiang, Q.; Geng, G. Feasibility and Challenges for Vehicle-to-Grid in Electricity Market: A Review. Energies 2024, 17, 679. [Google Scholar] [CrossRef]
  14. Jia, H.; Ma, Q.; Li, Y.; Liu, M.; Liu, D. Integrating Electric Vehicles to Power Grids: A Review on Modeling, Regulation, and Market Operation. Energies 2023, 16, 6151. [Google Scholar] [CrossRef]
  15. Wang, W.; Chen, J.; Pan, Y.; Yang, Y.; Hu, J. A Two-Stage Scheduling Strategy for Electric Vehicles Based on Model Predictive Control. Energies 2023, 16, 7737. [Google Scholar] [CrossRef]
  16. Zhang, G.; Liu, H.; Xie, T.; Li, H.; Zhang, K.; Wang, R. Research on the Dispatching of Electric Vehicles Participating in Vehicle-to-Grid Interaction: Considering Grid Stability and User Benefits. Energies 2024, 17, 812. [Google Scholar] [CrossRef]
  17. Eltamaly, A.M. Smart Decentralized Electric Vehicle Aggregators for Optimal Dispatch Technologies. Energies 2023, 16, 8112. [Google Scholar] [CrossRef]
  18. Ahsan, S.M.; Khan, H.A.; Sohaib, S.; Hashmi, A.M. Optimized Power Dispatch for Smart Building and Electric Vehicles with V2V, V2B and V2G Operations. Energies 2023, 16, 4884. [Google Scholar] [CrossRef]
  19. Xu, C.; Huang, Y. Integrated Demand Response in Multi-Energy Microgrids: A Deep Reinforcement Learning-Based Approach. Energies 2023, 16, 4769. [Google Scholar] [CrossRef]
  20. Chen, T.; Bu, S.; Liu, X.; Kang, J.; Yu, F.R.; Han, Z. Peer-to-Peer Energy Trading and Energy Conversion in Interconnected Multi-Energy Microgrids Using Multi-Agent Deep Reinforcement Learning. IEEE Trans. Smart Grid 2022, 13, 715–727. [Google Scholar] [CrossRef]
  21. Good, N.; Mancarella, P. Flexibility in Multi-Energy Communities With Electrical and Thermal Storage: A Stochastic, Robust Approach for Multi-Service Demand Response. IEEE Trans. Smart Grid 2019, 10, 503–513. [Google Scholar] [CrossRef]
  22. Bahrami, S.; Chen, Y.C.; Wong, V.W.S. Deep Reinforcement Learning for Demand Response in Distribution Networks. IEEE Trans. Smart Grid 2021, 12, 1496–1506. [Google Scholar] [CrossRef]
  23. Agostinelli, F.; McAleer, S.; Shmakov, A.; Baldi, P. Solving the Rubik’s Cube with Deep Reinforcement Learning and Search. Nat. Mach. Intell. 2019, 1, 356–363. [Google Scholar] [CrossRef]
  24. Duan, J.; Shi, D.; Diao, R.; Li, H.; Wang, Z.; Zhang, B.; Bian, D.; Yi, Z. Deep-Reinforcement-Learning-Based Autonomous Voltage Control for Power Grid Operations. IEEE Trans. Power Syst. 2020, 35, 814–817. [Google Scholar] [CrossRef]
  25. Huang, Y.; Li, G.; Chen, C.; Bian, Y.; Qian, T.; Bie, Z. Resilient Distribution Networks by Microgrid Formation Using Deep Reinforcement Learning. IEEE Trans. Smart Grid 2022, 13, 4918–4930. [Google Scholar] [CrossRef]
  26. Zhang, Z.; Chen, Z.; Lee, W.-J. Soft Actor–Critic Algorithm Featured Residential Demand Response Strategic Bidding for Load Aggregators. IEEE Trans. Ind. Appl. 2022, 58, 4298–4308. [Google Scholar] [CrossRef]
  27. Kuang, Y.; Wang, X.; Zhao, H.; Qian, T.; Li, N.; Wang, J.; Wang, X. Model-Free Demand Response Scheduling Strategy for Virtual Power Plants Considering Risk Attitude of Consumers. CSEE J. Power Energy Syst. 2023, 9, 516–528. [Google Scholar] [CrossRef]
  28. Li, H.; Wan, Z.; He, H. Constrained EV Charging Scheduling Based on Safe Deep Reinforcement Learning. IEEE Trans. Smart Grid 2020, 11, 2427–2439. [Google Scholar] [CrossRef]
  29. Wan, Z.; Li, H.; He, H.; Prokhorov, D. Model-Free Real-Time EV Charging Scheduling Based on Deep Reinforcement Learning. IEEE Trans. Smart Grid 2019, 10, 5246–5257. [Google Scholar] [CrossRef]
  30. Zhang, C.; Liu, Y.; Wu, F.; Tang, B.; Fan, W. Effective Charging Planning Based on Deep Reinforcement Learning for Electric Vehicles. IEEE Trans. Intell. Transp. Syst. 2021, 22, 542–554. [Google Scholar] [CrossRef]
  31. Liu, R.; Xie, M.; Liu, A.; Song, H. Joint Optimization Risk Factor and Energy Consumption in IoT Networks with TinyML-Enabled Internet of UAVs. IEEE Internet Things J. 2024. [Google Scholar] [CrossRef]
  32. Liu, R.; Qu, Z.; Huang, G.; Dong, M.; Wang, T.; Zhang, S.; Liu, A. DRL-UTPS: DRL-Based Trajectory Planning for Unmanned Aerial Vehicles for Data Collection in Dynamic IoT Network. IEEE Trans. Intell. Veh. 2022, 8, 1204–1218. [Google Scholar] [CrossRef]
  33. Qian, T.; Shao, C.; Li, X.; Wang, X.; Chen, Z.; Shahidehpour, M. Multi-Agent Deep Reinforcement Learning Method for EV Charging Station Game. IEEE Trans. Power Syst. 2022, 37, 1682–1694. [Google Scholar] [CrossRef]
  34. Qian, T.; Shao, C.; Li, X.; Wang, X.; Shahidehpour, M. Enhanced Coordinated Operations of Electric Power and Transportation Networks via EV Charging Services. IEEE Trans. Smart Grid 2020, 11, 3019–3030. [Google Scholar] [CrossRef]
Figure 1. The overall scheme of V2G coordinated with multi-energy microgrids.
Figure 2. The architecture of the SAC algorithm.
Figure 3. The training process of the proposed approach with 5 EVs.
Figure 4. The comparison of charging and discharging loads of 5 EVs between the proposed approach and the optimal values.
Figure 5. The net purchase of electricity and gas of the multi-energy microgrids with 5 EVs.
Figure 6. The SoC of 5 EVs at the period of departure.
Figure 7. The training process of the proposed approach with 10 EVs.
Figure 8. The comparison of the charging and discharging loads of 10 EVs between the proposed approach and the optimal values.
Figure 9. The net purchase of electricity and gas of the multi-energy microgrids with 10 EVs.
Figure 10. The SoC of 10 EVs at the period of departure.
Figure 11. The total profits of SAC under different levels of uncertainties.
Figure 12. The performance of SAC with different numbers of EVs.
Figure 13. The daily battery profiles of 5 EVs.
Figure 14. The daily battery profiles of 10 EVs.
Table 1. Probability of EVs' hourly arrival.

Hour    Probability of Arrival
1       0.070
2       0.070
3       0.062
4       0.060
5       0.023
6       0.033
7       0.050
8       0.060
9       0.060
10      0.050
11      0.040
12      0.030
13      0.030
14      0.040
15      0.040
16      0.060
17      0.040
18      0.060
19      0.040
20      0.040
21      0.030
22      0.005
23      0.005
24      0.002
Table 2. Probability of EVs' plug-in intervals.

Lasting Hours    Probability
1       0.00
2       0.10
3       0.15
4       0.20
5       0.15
6       0.15
7       0.13
8       0.05
9       0.05
10      0.02