Distributed and Multi-Agent Reinforcement Learning Framework for Optimal Electric Vehicle Charging Scheduling
Abstract
1. Introduction
1.1. Related Work
1.2. Main Contributions
2. Materials and Methods
2.1. Markov Decision Process Framework
2.2. State
2.3. Action
2.4. Transition Function
2.5. Reward
3. Proposed DR-MARL Method
3.1. Problem Transformation
3.2. DR-MARL—Global Coordination
3.3. Local Agent—DDPG
4. Results
4.1. Setup of Numerical Study
4.2. Cost Comparison with Baseline Methods
4.3. Scalability Analysis
5. Discussion
5.1. Limitations
5.2. Conclusions and Future Work
Author Contributions
Funding
Data Availability Statement
Conflicts of Interest
Notations and Symbols
| Abbreviation/Symbol | Description |
|---|---|
| DDPG | Deep Deterministic Policy Gradient |
| DQN | Deep Q-Network |
| DR | Demand Response |
| DR-MARL | Difference Reward Multi-Agent Reinforcement Learning |
| EV | Electric Vehicle |
| EVCS | Electric Vehicle Charging Station |
| HATRPO | Heterogeneous-Agent Trust Region Policy Optimisation |
| MADDPG | Multi-Agent DDPG |
| MARL | Multi-Agent Reinforcement Learning |
| MAPPO | Multi-Agent Proximal Policy Optimization |
| MDP | Markov Decision Process |
| ML | Machine Learning |
| PEVs | Plug-in Electric Vehicles |
| PV | Photovoltaic |
| R | reward function |
| RES | Renewable Energy Sources |
| RBC | Rule-Based Controller |
| RL | Reinforcement Learning |
| S | state space |
| SAC | Soft Actor–Critic |
| SARSA | State–Action–Reward–State–Action |
| SoC | State of Charge |
| TD3 | Twin Delayed Deep Deterministic Policy Gradient |
| V2G | Vehicle-to-Grid |
| | discrepancy for each vehicle i at timestep t |
| | weights of the actor network |
| | weights of the critic network |
| | actor network |
| | target actor network |
| | policy of the DDPG algorithm |
| A | space of actions |
| | EV maximum battery capacity |
| | solar radiation value at timestep t |
| | unknown, nonlinear function |
| | non-negative, nonlinear, scalar function |
| | global cost function |
| L | mean squared error between the estimated and expected reward |
| | charging/discharging power of vehicle i at timestep t |
| | charging/discharging efficiency |
| | Ornstein–Uhlenbeck (O–U) noise |
| P | state transition probability |
| | maximum charging power output of each charging spot |
| | power demand of vehicle i at timestep t |
| | total power demand of the EVCS, as seen by the energy company |
| | overall power requested from the grid at timestep t |
| | mean solar power production at timestep t |
| | electricity price at timestep t |
| | critic network |
| | target critic network |
| | remaining hours until departure for an EV |
| | decision set-point of charging spot i at timestep k |
| | measurement vector of vehicle i at timestep t |
| | immediate reward |
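To see how the DDPG quantities above fit together in the local agent of Section 3.3, the following is a minimal sketch of the standard formulation (after Lillicrap et al.); the discount factor γ and the bootstrap target are generic DDPG defaults, not values reproduced from the paper:

```latex
% MDP tuple assembled from the symbols above, and the usual bootstrap target y_t
% built from the target actor mu' and target critic Q'.
\mathcal{M} = (S, A, P, R), \qquad
y_t = r_t + \gamma \, Q'\big(s_{t+1}, \mu'(s_{t+1})\big)

% Critic loss L (mean squared error between estimated and expected reward)
% and the exploratory action produced by perturbing the actor output.
L = \frac{1}{N} \sum_{t=1}^{N} \big( Q(s_t, a_t) - y_t \big)^2, \qquad
a_t = \mu(s_t) + \mathcal{N}_t
```

The exploration term is the Ornstein–Uhlenbeck noise listed above; it is temporally correlated, which suits charging set-points that should not jump erratically between timesteps. A self-contained generator is sketched below (the parameters θ = 0.15 and σ = 0.2 are common defaults, not taken from the paper):

```python
import numpy as np

class OrnsteinUhlenbeckNoise:
    """O-U process commonly used for DDPG exploration (illustrative defaults)."""

    def __init__(self, size, mu=0.0, theta=0.15, sigma=0.2, dt=1.0, seed=None):
        self.mu = mu * np.ones(size)
        self.theta = theta
        self.sigma = sigma
        self.dt = dt
        self.rng = np.random.default_rng(seed)
        self.reset()

    def reset(self):
        # Restart the process at its long-run mean at the start of each episode.
        self.x = self.mu.copy()

    def sample(self):
        # Euler-Maruyama step: dx = theta*(mu - x)*dt + sigma*sqrt(dt)*dW
        dx = (self.theta * (self.mu - self.x) * self.dt
              + self.sigma * np.sqrt(self.dt)
              * self.rng.standard_normal(self.mu.shape))
        self.x = self.x + dx
        return self.x

# Usage: perturb the deterministic actor output, then clip to the action bounds.
noise = OrnsteinUhlenbeckNoise(size=1, seed=0)
action = np.clip(0.5 + noise.sample(), -1.0, 1.0)  # 0.5 stands in for mu(s_t)
```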
Cost comparison with baseline methods (Section 4.2):

| Pricing Scenario | RBC | DDPG | SAC | DR-MARL |
|---|---|---|---|---|
| Price 1 | −35.29 | −29.45 | −31.41 | −30.79 |
| Price 2 | −36.96 | −25.91 | −27.11 | −26.09 |
| Price 3 | −36.06 | −29.53 | −31.43 | −30.34 |
| Price 4 | −34.50 | −29.58 | −30.59 | −30.86 |
| Price Random | −35.88 | −30.21 | −30.84 | −30.94 |
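As a quick sanity check of the table, the relative saving of DR-MARL over the RBC baseline can be computed directly from the reported costs (the percentages below are derived here, not quoted from the paper):

```python
# Relative cost saving of DR-MARL vs. the rule-based controller (RBC),
# using the (negative) costs reported in the table above.
rbc     = {"Price 1": -35.29, "Price 2": -36.96, "Price 3": -36.06,
           "Price 4": -34.50, "Price Random": -35.88}
dr_marl = {"Price 1": -30.79, "Price 2": -26.09, "Price 3": -30.34,
           "Price 4": -30.86, "Price Random": -30.94}

for scenario in rbc:
    saving = (dr_marl[scenario] - rbc[scenario]) / abs(rbc[scenario])
    print(f"{scenario}: {saving:.1%} lower cost than RBC")
# Price 1: 12.8%, Price 2: 29.4%, Price 3: 15.9%, Price 4: 10.6%, Random: 13.8%
```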
Standard deviation of the achieved cost per controller:

| | RBC | DDPG | SAC | DR-MARL |
|---|---|---|---|---|
| Std | – | 2.84 | 2.33 | 3.32 |
Scalability analysis (Section 4.3):

| | RBC | DDPG | SAC | DR-MARL |
|---|---|---|---|---|
| | −71.83 | −66.76 | −61.24 | −61.69 |
| | −141.45 | −250.89 | −152.33 | −123.45 |
| | −207.34 | −380.76 | −344.69 | −175.12 |
| | DDPG | SAC | DR-MARL |
|---|---|---|---|
| | 110k | 60k | 20k |
| | 400k | 200k | 20k |
| | – | – | 25k |
| | – | – | 27k |
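Assuming the entries above are training steps to convergence, the implied sample-efficiency margin of DR-MARL over the single-agent baselines is easy to quantify:

```latex
\frac{110\text{k}}{20\text{k}} = 5.5\times \ \text{(DDPG)}, \quad
\frac{60\text{k}}{20\text{k}} = 3\times \ \text{(SAC)}; \qquad
\frac{400\text{k}}{20\text{k}} = 20\times, \quad
\frac{200\text{k}}{20\text{k}} = 10\times \ \text{(second row)}.
```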
Disclaimer/Publisher’s Note: The statements, opinions and data contained in all publications are solely those of the individual author(s) and contributor(s) and not of MDPI and/or the editor(s). MDPI and/or the editor(s) disclaim responsibility for any injury to people or property resulting from any ideas, methods, instructions or products referred to in the content.
© 2024 by the authors. Licensee MDPI, Basel, Switzerland. This article is an open access article distributed under the terms and conditions of the Creative Commons Attribution (CC BY) license (https://creativecommons.org/licenses/by/4.0/).
Citation: Korkas, C.D.; Tsaknakis, C.D.; Kapoutsis, A.C.; Kosmatopoulos, E. Distributed and Multi-Agent Reinforcement Learning Framework for Optimal Electric Vehicle Charging Scheduling. Energies 2024, 17, 3694. https://doi.org/10.3390/en17153694