2.2. EMP as a Markov Decision Process
The challenge at hand can be formalized as a Markov decision process (MDP), aiming to derive the optimal EMS for the simulated DGMG. The objective is to achieve optimal cost-efficiency and demand coverage amidst uncertainties associated with renewable energy generation and grid disruptions.
In the MDP framework, defined by a tuple consisting of states S, actions A, transition probabilities P between states, and immediate rewards R for actions taken, the system state comprises the current renewable energy generation for month m and time step t, denoted $E_{m,t}$; the energy storage level (state of charge, $SoC_t$); and the energy demanded by the assigned facilities, $D_t$.
At each time step, an agent selects an action representing charging, discharging, or idling of the ESS. This action, chosen in state $s_t$ based on $E_{m,t}$, $SoC_t$, and $D_t$, determines the ESS charge level for the next time step, thereby constraining the actions subsequently available to the RL agent. The transition dynamics are driven by the stochastic nature of $E_{m,t}$, $D_t$, and grid disruptions.
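As a concrete illustration, the state and action spaces described above can be captured in a few lines of Python; the field names and the unit-interval encoding of the state of charge are illustrative assumptions, not the paper's exact representation.

```python
from dataclasses import dataclass

@dataclass
class MGState:
    month: int        # m, selects the monthly PV generation distribution
    hour: int         # t, drives both PV generation and demand
    pv_output: float  # realized renewable generation E_{m,t} (kWh)
    soc: float        # ESS state of charge, 0.0 (empty) to 1.0 (full)
    demand: float     # demand D_t of the assigned facilities (kWh)

ACTIONS = ("charge", "discharge", "idle")  # the three ESS actions
```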
The projected PV energy output $E_{m,t}$ is modeled as a distribution over month, time of day, and unit size (Equation (1)). Similarly, the energy demands $D_t$ of the assigned facilities are modeled as a distribution linked to the time of day (see Equation (2)). These distributions characterize the probability of the PV output meeting or exceeding the projected energy demand (high scenario, labeled H) or falling short of it (low scenario, labeled L).
The subsequent state of the MG, concerning its ability to supply supplemental energy or store excess energy, hinges on the action taken at the start of the current state $s_t$, which sets the ESS state of charge for the next state $s_{t+1}$. Constraints dictate that if the ESS is fully discharged ($SoC_t = 0$) it can only charge, and if fully charged ($SoC_t = SoC_{max}$) it can only discharge, in the following state. The action chosen in any state, together with the probability $p_H$ of high PV output, determines the next state of the system (Equation (3)), illustrating a Markov process in which actions in the current state influence subsequent states.
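A minimal sketch of these transition dynamics is given below; the hourly charge/discharge rate and the clamping of the state of charge to [0, 1] are assumptions for illustration, with the paper's exact ESS dynamics given by its equations.

```python
import random

def next_soc(soc: float, action: str, rate: float = 0.25) -> float:
    # Advance the ESS state of charge one hour under the chosen action;
    # the rate and the [0, 1] bounds are illustrative assumptions.
    if action == "charge":
        soc += rate
    elif action == "discharge":
        soc -= rate
    return min(max(soc, 0.0), 1.0)

def sample_scenario(p_high: float) -> str:
    # Draw the next node label: 'H' with probability p_H, else 'L'.
    return "H" if random.random() < p_high else "L"
```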
When modeling the process as a Markov chain decision tree (see Figure 3), each node labeled H or L denotes the beginning of an hour during which the ESS unit charges, discharges, or remains idle according to the chosen action. At each node, the probability $p_H$ determines whether the PV output is sufficient to meet demand or falls short. When demand exceeds PV output, power must be procured from the grid at a cost, or the ESS unit must be discharged to bridge the gap between supply and demand. Conversely, when the combined output of the PV and ESS units exceeds demand while discharging, the surplus can be sold back to the grid for a marginal profit.
The decision on how to operate the unit hinges on the state of charge ($SoC$) at the time of decision making, as captured by Equation (17) in Section 3. The overarching objective is to derive an optimal policy that recommends actions in a given state, maximizing not only the reward in the subsequent state but also the cumulative reward across all future states. While tools such as mixed-integer linear and stochastic programming can address such problems, they have inherent limitations.
As depicted in the figure, the complexity of the model grows exponentially with the number of time steps or stages considered: each time step adds two potential new grid states, labeled H and L, at every node. Equation (4) describes how the number of nodes, or stages, to be modeled and solved with methods such as stochastic programming grows with the time horizon T. For short planning horizons of 1 or 2 h, the number of nodes remains manageable, at only 3 and 7 nodes, respectively. However, when attempting to model a single day for an hourly energy management system (EMS), the number of nodes exceeds thirty-three million, rendering the problem intractable. This scalability issue limits the feasibility of long-term modeling using tools such as linear or stochastic programming.
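Although Equation (4) is not reproduced here, a node count consistent with the figures just quoted is that of a full binary scenario tree: $N(T) = \sum_{k=0}^{T} 2^{k} = 2^{T+1} - 1$, which gives $N(1) = 3$, $N(2) = 7$, and $N(24) = 2^{25} - 1 = 33{,}554{,}431$ for a 24 h day.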
In contrast, RL solutions do not suffer from this scalability limitation and can provide optimal policies over large time windows by leveraging adequate models and simulations of the systems to be optimized. RL operates within the framework of MDPs and employs machine learning techniques to discover optimal policies without requiring explicit formulation. These aspects of RL methods make them well-suited for complex, uncertain problem optimizations.
The subsequent section delineates the two RL methods employed in addressing the energy management problem in this study and elucidates how RL methods are grounded in the MDP framework.
2.3. A2C and PPO Reinforcement Learning Optimal EMS
RL is a form of machine learning in which agents learn much as biological organisms do, through interaction with their environment. Learning agents accumulate experiential episodes of some task for which they are to produce an optimal policy, which provides the action to take based on the state of the system. The agents store sampled state, action, and reward experiences in a limited memory and use these “memories” in stochastic mini-batches to learn. The learner must discover which actions to take in an environment to maximize some future reward provided by the environment in response to those actions. The goal is to learn the optimal mapping from states to actions, that is, the optimal policy. Reinforcement learning requires a formulation following the MDP framework, consisting of the environment, a description of the potential states the environment can take, a set of actions an actor can take to affect the state of the environment, and a function that assigns values to states, or state-action pairs, based on some reward. The reward is a value that indicates the improvement or deterioration of the state, or the achievement of some objective. RL has been applied with great success to tasks such as playing video games, moving robots in more naturalistic ways, and fine-tuning large language models. The following sections briefly describe the basic concepts of RL and the two DRL methods used in this work.
Various methods, such as dynamic programming and temporal-difference RL, are value-focused: they approximate the values of specific states or state-action pairs based on the returns of some series of actions. The application of deep learning enables deep reinforcement learning (DRL) and policy gradient methods, which estimate the policy directly. Policy gradient methods use gradient descent to estimate the probability of each action leading to the greatest return in the current state. The methods used in this work, A2C and PPO, combine both approaches. Each uses a dual-network structure in which one network is trained to produce, for a given input state, a probability distribution indicating how likely each action is to lead to the largest future return, while a second network “critiques” the advantage gained from the chosen action by estimating the value of the resulting state.
This network structure helps alleviate some of the harder-to-handle aspects of DRL. Because of the experiential nature of RL, learning agents may collect unrepresentative or misleading samples to learn from. Agents store state, action, and reward experiences in a limited memory and use these “memories” in stochastic mini-batches to update their weights, learning which probability estimates to produce for the potential actions so as to achieve the optimal future return. This return may be measured at the end of the overall task or over some finite set of future rewards determined by the discount factor, which controls how much weight the agent gives to future rewards for a given action. Bad samples can lead the RL agent to “learn” policy behaviors that produce optimal returns on those samples but suboptimal results when applied to the general system.
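To make the role of the discount factor concrete, the following sketch computes the discounted return from a sequence of rewards; the value gamma = 0.99 is a common default, not necessarily the one used in this work.

```python
def discounted_return(rewards, gamma=0.99):
    # Sum future rewards weighted by the discount factor: a gamma near 0
    # makes the agent myopic, while a gamma near 1 makes it far-sighted.
    g = 0.0
    for r in reversed(rewards):
        g = r + gamma * g
    return g
```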
The reinforcement learning algorithms used in this work are A2C and PPO. The A2C network is an RL architecture that combines elements of value-based and policy-based methods. As the name suggests, the two components of the network structure are linked in the gradient learning step of the algorithm. The “actor” component provides the policy by learning probability estimates for the actions (discharge, charge, idle) leading to the greatest future reward. The goal of the actor is to find an optimal policy such that, in any given state s, it provides the action most likely to lead to the greatest return. In this formulation of the problem, the reward for a given action is the current cost or profit in the grid-connected scenario and the amount of demand met in the islanded case.
The “critic” component evaluates the actions chosen by the actor by estimating the value function (Equation (5)) of the state produced by the action. The value function represents how good a particular state is in terms of expected future rewards. The goal of the critic is to help the actor learn by providing feedback, or critiques, on the chosen actions.
When training an A2C network, the actor and critic are updated based on the rewards for actions and the state-value estimates obtained from interactions with the environment. The critic provides a temporal-difference (TD) error signal indicating the difference between the predicted value of a state and the observed value of the resulting state after an action. This error is used to update the critic’s parameters, typically via gradient descent, and to adjust the predicted likelihood of actions leading to greater returns. For the actor, the policy is improved using the critic’s evaluations: during training, the actor comes to prefer actions that lead to higher estimated future rewards according to the critic.
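A minimal PyTorch sketch of this update for a single transition is shown below; the one-step TD target and the tensor shapes are illustrative assumptions rather than the paper's exact training code.

```python
import torch

def a2c_losses(logits, value, next_value, reward, action, gamma=0.99):
    # logits: actor's action preferences; value/next_value: critic's
    # estimates of the current and resulting states; action: index taken.
    td_target = reward + gamma * next_value.detach()
    td_error = td_target - value                 # critic's TD error
    critic_loss = td_error.pow(2)                # regress value to target
    log_prob = torch.log_softmax(logits, dim=-1)[action]
    actor_loss = -log_prob * td_error.detach()   # reinforce good actions
    return actor_loss, critic_loss
```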
Like other RL methods, actor–critic networks face the challenge of balancing the exploration of new actions to discover potentially better ones against the exploitation of the best actions known from experience. This balance can be achieved through exploration strategies such as adding noise to the actor’s policy, introducing randomness into the actions taken during training. This work treats the policy output as a probability distribution for action selection during training, allowing exploration, and switches to deterministic maximum-value selection afterward. Actor–critic networks often converge faster than pure policy gradient methods because they learn both a policy and a value function; the value function provides a more stable estimate of the expected return, which guides the actor’s learning more efficiently.
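The train-versus-evaluation action selection described here can be sketched as follows, assuming a PyTorch actor that outputs one logit per ESS action.

```python
import torch
from torch.distributions import Categorical

def select_action(logits, training=True):
    # During training, sample from the policy distribution (exploration);
    # at evaluation time, take the deterministic argmax (exploitation).
    probs = torch.softmax(logits, dim=-1)
    if training:
        return Categorical(probs=probs).sample().item()
    return int(torch.argmax(probs))
```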
PPO is a popular reinforcement learning algorithm known for its stability and robustness in training neural network policies. It belongs to the family of policy gradient methods, which directly optimize the policy function to maximize the expected cumulative reward. PPO aims to improve the policy while ensuring that the changes made to the network during training are not so large as to cause instability; without such a constraint, the RL agent could adjust its weights drastically and become trapped in local optima. To achieve this, PPO introduces a clipped surrogate objective function that constrains policy updates to a small region around the old policy, preventing large policy changes that could have catastrophic outcomes. PPO offers several advantages, including improved stability during training, better sample efficiency, and straightforward implementation compared with other policy gradient methods. By constraining policy updates, PPO tends to exhibit smoother learning curves and is less prone to diverging or getting stuck in local optima. PPO can be used with actor–critic architectures as described above, which is the approach taken in this work.
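The clipped surrogate objective itself is compact; the sketch below follows the standard PPO-clip formulation, with eps = 0.2 as a common default rather than the value necessarily used in this work.

```python
import torch

def ppo_clip_loss(new_log_probs, old_log_probs, advantages, eps=0.2):
    # Probability ratio between the updated and data-collecting policies.
    ratio = torch.exp(new_log_probs - old_log_probs)
    unclipped = ratio * advantages
    # Clamping the ratio to [1 - eps, 1 + eps] caps how far one update
    # can move the policy away from the old one.
    clipped = torch.clamp(ratio, 1.0 - eps, 1.0 + eps) * advantages
    return -torch.min(unclipped, clipped).mean()
```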
In actor–critic PPO, the actor’s policy is updated using the clipped surrogate objective, ensuring that policy changes stay within a safe range, while the critic’s role remains the same: providing feedback to the actor based on the estimated value of the state reached by each action. Combining the stability of PPO with the value estimation of the actor–critic architecture leverages the advantages of both approaches and often yields efficient, stable training of neural network policies. Four RL networks in total were trained and tested in this work to analyze whether training with grid connectivity only, or with intermittent monthly disruptions, yields more disruption-resilient policies. The networks have slightly different reward structures depending on the scenarios in which they are trained. The always-grid-connected models focus on maximizing daily energy cost coverage while meeting maximal demand; the reward function for the two grid-connected networks (PPO-G, A2C-G) uses the hourly cost of purchasing, or profit from selling, energy at the utility company’s on-peak and off-peak rates. In contrast, the networks trained with intermittent disruptions use a modified reward structure that considers cost when connected and focuses solely on demand coverage when islanded.
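A hedged sketch of the two reward regimes is given below; the sign conventions, prices, and normalization are illustrative assumptions, not the paper's exact reward functions.

```python
def hourly_reward(grid_connected, surplus, demand_met, demand_total,
                  price_buy, price_sell):
    # surplus > 0: excess energy sold to the grid; surplus < 0: shortfall
    # purchased from the grid at the on/off-peak rate in effect.
    if grid_connected:
        return surplus * (price_sell if surplus >= 0 else price_buy)
    # Islanded: reward only the fraction of demand covered this hour.
    return demand_met / demand_total if demand_total > 0 else 1.0
```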
The RL networks for this work are coded in Python (v3.9) using PyTorch (v2.2.1) and rely on an AnyLogic (v8.8.6) simulation of an MG for training and testing. The simulation models the MG at the hourly level, with the EMS agents tasked with providing an action for the ESS unit (charge, discharge, idle) for each hour based on the current state. To train the networks, simulations run for a total of 72 months. Each month is treated as an episode, and the network layers and sizes, learning rate, momentum, and learning-rate decay of the two RL models were tuned through experimentation.
2.3.1. EMP as an RL Formulation
Formulating the EMP as an RL problem entails treating the energy grid as the environment, comprising both the MG and the main utility. The state of the environment reflects the dynamics of the MG, encompassing the interactions among energy sources and loads. In the EMP context, actions pertain to operations affecting the ESS, while the reward corresponds to the energy costs or profits resulting from those actions, or to the overall load coverage, depending on the chosen objective. Optimal action selection depends on the current MG state, which is influenced by the probability distributions of the energy output levels and demanded loads for a given hour h.
The RL agent learns from experience gained by operating the ESS within this environment. In this work, a simulation is developed to replicate the energy grid, comprising distributed generation units with solar power generation and energy storage systems. The energy grid model is conceptualized at a high level, with simplified energy interactions. Emphasis is placed on exploring capabilities, training strategies, and policies, particularly in scenarios involving connectivity uncertainty and the inherent uncertainties in renewable energy generation and load demands.
2.3.2. Training and Tuning the RL Networks
The PPO and A2C models are tested with two training methods. First, two models (PPO-G, A2C-G) are trained in a grid-connected-only scenario where the option to buy energy from the grid is always available; this scenario assumes a buyback program in which any excess energy produced by the DG unit is sold back to the grid for a profit. The other two models (PPO-D and A2C-D) are trained in a simulation in which, each hour of a month, there is a one-in-four chance that a seven-hour outage will occur; once an outage occurs, no further outages occur during that month. Each model is given a total of 72 months of training.
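Read literally, the disruption scenario can be sampled as below; whether the one-in-four probability applies per hour or per month is ambiguous in the text, so this sketch is only one plausible reading.

```python
import random

def sample_outage(hours_in_month, p_hourly=0.25, duration=7):
    # Walk the month hour by hour; with probability p_hourly an outage
    # begins, and at most one outage occurs per month.
    for h in range(hours_in_month):
        if random.random() < p_hourly:
            return (h, min(h + duration, hours_in_month))  # (start, end)
    return None  # no outage this month
```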
The training and testing of the models are performed using the AnyLogic simulation described above. The RL agents are tasked with providing ESS actions at the beginning of each hour of the simulation. A learning interval is set so that, after a set number of episodes (months, for this problem), the learner uses its experience to optimize the policy and state-advantage estimates toward the optimal EMS. Various network structures, learning intervals, and hyperparameter settings were explored to find those that allowed the models to learn.
Figure 4 visualizes the average cost returns of the grid-connected PPO model (PPO-G) over the course of the 72 training months.