1. Introduction
Battery storage is emerging as a key component of intelligent green electricity systems [1]. The investment in a battery needs to be justified by using the battery to provide services of financial value [2,3]. Four categories of such services can be identified. The first category is energy arbitrage, in which participants store energy in the battery during low prices and sell during high prices [4]. The key electricity markets for energy arbitrage are the day-ahead, intraday, balancing, and real-time markets. Energy arbitrage can involve a standalone battery or a battery paired with other energy resources such as photovoltaic (PV) generation [5]. The second category of services provided by batteries is reserves. Reserve market participants are compensated for adjusting power consumption or generation in response to power grid imbalances, such as grid frequency deviations from a nominal value. Reserve markets have different names and specifications in different parts of the world. In the literature, some commonly used names for these markets are frequency reserves [6], frequency regulation [7], ancillary services [8], and spinning reserves [9]. The third category is local markets such as peer-to-peer trading [10] and the recently emerging nodal markets, in which prices vary according to the geographic location of the energy-producing or energy-consuming units, giving the market tools to avoid congestion in the power grid [11,12]. The fourth category of services involves coordinating a battery and controllable loads [13] for the provision of demand response.
The above-mentioned services are monetized through market participation, which usually involves bidding. Bidding is a multi-objective optimization problem, involving targets such as maximizing market compensation, minimizing penalties for failing to provide the service, and minimizing battery aging costs. A subset of research in this field minimizes battery aging costs by embedding them into the multi-objective optimization problem (e.g., [14,15]). However, the optimization problem is challenging to solve, because the aging phenomenon is non-linear [16] and dependent on the type of service provided, as some services, such as fast regulation services, may involve frequent charge–discharge cycling [17].
Recently, numerous researchers have proposed reinforcement learning (RL) as a multi-objective optimization technique for monetizing battery storage, and several of them consider aging in the reward formulation [18,19]. The RL problem formulation involves a reward, which gives the RL agent feedback about how advantageous or disadvantageous its actions have been. The reward does not need to be derived from physics, so the reward formula may include a term that simplifies the aging phenomenon by ignoring the non-linear dynamics of the battery. Such simplifications are commonly used by RL practitioners, and various formulations have been proposed by different authors, such as [20]. The shortcoming of such an approach is that any benefit of reduced battery aging is not demonstrated with either a physical battery or an accurate battery model. This prevents researchers from validating effective formulations of battery aging in the RL reward function. Simplified aging cost models also prevent direct comparisons between results reported by different research groups, preventing the identification of seminal works with respect to battery aging management.
The contribution of this article is a methodology involving a realistic battery simulation to assess the performance of the trained RL agent with respect to battery aging, in order to inform the selection of the weighting of the aging term in the RL reward formula. Our research is conducted in the context of bidding battery storage on frequency reserves. A few works have investigated using RL for bidding a battery on frequency reserves, but without considering battery aging effects [21,22] or only through simple approaches such as penalizing the agent for exceeding the minimum and maximum state of charge (SoC) limits of the battery [23].
3. Methodology
The methodology is elaborated in this section. As frequency reserve market rules vary between countries, the Finnish Frequency Containment Reserve for Normal operation (FCR-N) is taken as the case market. The symbols used to define the methodology are presented in Table 1.
We propose to model the reward for the FCR-N market using three components: market revenue, market penalty, and aging penalty. The reward can then be defined as:

r = rev − pen − w · A, (1)

where the first two terms, rev and pen, are the FCR-N market revenue and market penalty, respectively [21]. The market penalty is due to the battery SoC exceeding its minimum or maximum limits, in which case the battery is not available for reacting to grid frequency deviations. The third term is an aging penalty A multiplied by a weight w. The value of w determines the weight of the aging penalty relative to the net revenues in the reward function. It is notable that the market prices in one hour can be much higher than in another, so a high reward can be due to good decisions by the agent, to high prices in that particular hour, or both.
For the aging penalty in the reward, we model aging as a linear approximation of the non-linear battery dynamics. The aging can be modelled as in Equation (2), where the coefficient scales the sum to the same level as the other components of the reward.
The approximation in (2) builds on the previous research reviewed in Section 2.2 but introduces two key differences. Firstly, the works discussed in Section 2.2 identify either the SoC or the current i as a significant factor impacting aging. Our formulation recognizes that aging depends on the SoC level as well as on the magnitude of the charging/discharging current. Secondly, the step of our RL agent is the market interval, which for FCR-N is 1 h at the time of writing. The power grid frequency can change numerous times during this interval, resulting in corresponding changes in the current i, which will impact the SoC and the DoD. Thus, a much more accurate approximation of aging can be obtained by performing the approximation once per second and taking the sum over the market interval, which is the duration of the RL step. The reason for performing the approximation once per second is that a control step of one second is sufficient for meeting the dynamic and stability requirements of the FCR-N market [74].
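The exact form of the per-second aging approximation of Equation (2) is not reproduced in this text. As an illustration only, the sketch below assumes one plausible form, in which each second contributes the product of the SoC level and the magnitude of the current, summed over the 3600 s market interval and scaled by a coefficient k; both the functional form and the value of k are assumptions, not the paper's.

```python
def aging_penalty(soc, current, k=1e-4):
    """Per-second aging approximation summed over one market interval.
    The product form SoC * |i| is an illustrative assumption: the paper
    only states that aging depends on the SoC level and on the magnitude
    of the charging/discharging current, and that the per-second terms
    are summed over the market interval."""
    assert len(soc) == len(current)  # one sample per second
    return k * sum(s * abs(i) for s, i in zip(soc, current))

# One hour of samples: constant 50% SoC, constant 10 A current magnitude.
A = aging_penalty([0.5] * 3600, [10.0] * 3600)
```
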
Figure 1 presents an architecture for training the RL agent. Since the step of the RL agent is 1 h, the state and reward are updated once per hour, and the RL agent determines the bid capacity C once per hour. However, the calculation of pen in (1) requires a once-per-minute resolution [21], and the calculation of A in (2) requires a once-per-second resolution. Thus, the environment requires simulation at a 1 s time step. As presented in Figure 1, this is driven by time series data of the power grid frequency f, which has been obtained from the transmission system operator (TSO) Fingrid, which also operates the FCR-N market. The frequency data have been preprocessed to obtain one data point per second. A ‘Power calculator’ module determines the required momentary charging/discharging power P based on f, C, and the stationary requirement of FCR-N [74,75]. The required current i is determined according to P and the battery voltage u, which is not assumed to be constant as it is affected by the SoC. This current is fed as an input to a battery simulation model, which outputs the SoC. This information is sufficient for calculating the reward according to (1) and (2). These calculations are done based on the actual market price FCRact. However, this price is not known at the time of making the bid, so it cannot be used as state information for the RL agent. Thus, the state includes the forecasted price FCRfcast, which is obtained using the machine learning time series forecasting method for FCR-N of [76]. In addition to this forecast, the state information includes R, an integer specifying the number of hours since the battery last rested. Resting is defined as not participating in the market, which occurs when the bid capacity C is 0 MW. During the rest, the battery is charged or discharged so that the SoC will reach 50%, reducing the likelihood of SoC out-of-bounds events that result in market penalties.
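The simulation loop just described can be sketched as follows. The droop parameters, battery energy capacity, and simplified SoC update below are stand-ins for the ‘Power calculator’ and battery simulation modules of Figure 1; they are illustrative assumptions, not the authors' implementation or the official FCR-N activation curve.

```python
F_NOM = 50.0  # nominal grid frequency, Hz

def power_from_frequency(f, capacity_mw, dead_band=0.05, full_at=0.1):
    """Stand-in for the 'Power calculator': linear droop response with
    full activation at +/-0.1 Hz deviation (illustrative numbers only).
    Positive output = discharge to the grid (under-frequency support)."""
    dev = f - F_NOM
    if abs(dev) <= dead_band:
        return 0.0
    frac = min((abs(dev) - dead_band) / (full_at - dead_band), 1.0)
    return -capacity_mw * frac if dev > 0 else capacity_mw * frac

def simulate_hour(capacity_mw, freq_hz, soc, capacity_mwh=5.0):
    """Run one market hour second by second: returns the updated SoC,
    the penalty minutes (minutes with SoC outside 5-95%, matching the
    once-per-minute resolution of pen), and the per-second aging sum."""
    penalty_minutes, aging_sum = 0, 0.0
    for minute in range(60):
        violated = False
        for sec in range(60):
            p = power_from_frequency(freq_hz[minute * 60 + sec], capacity_mw)
            soc -= p / capacity_mwh / 3600.0  # discharging reduces SoC
            soc = min(max(soc, 0.0), 1.0)
            aging_sum += soc * abs(p)  # illustrative per-second aging term
            if soc < 0.05 or soc > 0.95:
                violated = True
        if violated:
            penalty_minutes += 1
    return soc, penalty_minutes, aging_sum
```
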
Several RL algorithms are available for optimizing the agent. Many RL applications for battery management define continuous action spaces, which motivates the selection of algorithms capable of handling continuous as well as discrete spaces: Advantage Actor-Critic (A2C) [77], Proximal Policy Optimization (PPO) [22], Deep Deterministic Policy Gradient (DDPG) [78,79], and Twin Delayed DDPG (TD3) [80] have all been applied in the context of batteries. However, the task of the RL agent in this paper is to select the value of the bid capacity C. This selection must be made from a discrete set of possible values, due to the rules of the FCR-N market: the range of bids is between 0.1 MW and 5 MW with a resolution of 0.1 MW [21]. Since our state and action spaces are discrete and not large, computationally heavy methods such as DDPG and TD3 are not investigated. In this article, the suitability of the REINFORCE [81], A2C [81], and PPO algorithms is experimentally evaluated.
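The discrete action set, the rest action (a bid capacity of 0 MW) plus bids from 0.1 MW to 5 MW in 0.1 MW steps, can be enumerated directly; the names below are illustrative.

```python
# Discrete FCR-N action set: index 0 is the rest action (no bid, 0 MW),
# indices 1..50 map to bid capacities 0.1-5.0 MW in 0.1 MW steps.
ACTIONS_MW = [0.0] + [round(0.1 * k, 1) for k in range(1, 51)]

def bid_from_action(a: int) -> float:
    """Map a discrete action index to a bid capacity in MW."""
    return ACTIONS_MW[a]
```
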
Equation (2) is proposed as a reasonable approximation of the battery aging cost for the purpose of training the RL agent. As it does not capture a battery’s non-linear dynamics, it cannot be used for an accurate evaluation of how well the trained RL agent mitigates battery aging. A modification of the architecture of Figure 1 is therefore used for performance evaluation, in which the battery age is obtained from a battery simulation. The modified architecture is presented in Figure 2. No reward calculation is conducted at this stage since the training process of the agent has already been completed. The setup in Figure 2 determines the net revenue in EUR and the battery aging in terms of equivalent full cycles when the trained agent is run against historical market and grid frequency data for a period of several days. In our RL formulation, an episode is one day. Net revenue is defined as market compensation minus market penalties, and this calculation is performed in the ‘Revenue calculator’ of Figure 2. The ‘Battery simulation’ in Figure 2 includes the Matlab Simulink battery model, which implements the aging behavior modeled in [82].

In the training phase (Figure 1), Equation (2) is used instead of the aging output of the Matlab Simulink battery model (Figure 2). The reason is that the Simulink battery model is not intended for applications in which the age needs to be updated frequently, such as every time the RL environment is stepped forward. The age output of the Simulink battery is updated only once every half cycle, and it cannot be assumed that these updates occur at the same time as the RL environment is stepped forward.
4. Implementation
The implementations of the two architectures in Figure 1 and Figure 2 are presented in Figure 3 and Figure 4, respectively. In both implementations, the battery voltage depends on the extracted capacity [83], which directly affects the SoC. The battery voltage is used as an input variable to the CurrentCalculator function, which generates the control signal for the controllable current source. The CurrentCalculator function is also responsible for preventing the SoC from exceeding the 5% and 95% limits. The PowerCalculator function is responsible for controlling the power in the case of the rest action (i.e., no bid). If the rest action is taken, the battery is charged or discharged to 50% SoC at constant power. The blocks between the penaltyIn and penaltyOut variables keep track of penalty minutes, which are used to calculate pen.
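The rest-action behavior of the PowerCalculator, driving the battery toward 50% SoC at constant power, can be sketched as follows. The constant-power magnitude, battery energy capacity, and sign convention (positive = charging) are illustrative assumptions, not values from the paper.

```python
def rest_power(soc, target=0.50, p_rest_mw=1.0, capacity_mwh=5.0, dt_s=1.0):
    """Constant-power command during a rest hour (no bid): charge if the
    SoC is below 50%, discharge if above, stop once within one time step
    of the target. p_rest_mw and capacity_mwh are assumed values."""
    step = p_rest_mw * dt_s / 3600.0 / capacity_mwh  # SoC change per step
    if abs(soc - target) < step:
        return 0.0  # close enough to 50%: stop
    return p_rest_mw if soc < target else -p_rest_mw  # + = charging
```
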
The setup in Figure 3 is used to compute the reward function in (1). The aging output of the Simulink battery is not used in this context, since it is updated only every half cycle, so in general the age output is not up to date at the end of each RL step. For this reason, the simplified approximation of the aging behavior defined in (2) is used. However, it is important to assess how well an RL agent trained with this reward performs against the more realistic battery dynamics. The performance evaluation setup in Figure 4 is used for this purpose. The age output of the Simulink battery in Figure 4 quantifies the aging in equivalent full cycles in the performance evaluation phase. The aging dynamics of the Simulink battery model are based on [82]. Equations (3)–(5) are from the Mathworks ‘Battery’ documentation [84]. The aging output is calculated as in Equation (3), where the first parameter is the number of cycles when the battery is fully charged and discharged at the nominal charge and discharge currents; it is an input parameter that determines how many full cycles the battery lasts. The other quantity in (3) is the battery aging factor, which is calculated as in Equation (4). The half-cycle update occurs when the battery starts to discharge after charging or when the battery is full, i.e., SoC = 100%. The DoD values are taken from the previous three timesteps.
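The half-cycle update trigger just described can be sketched as a boolean condition; the sign convention (positive current = charging) is an assumption for illustration.

```python
def half_cycle_boundary(prev_current, current, soc_percent):
    """True when the Simulink aging update would fire: the battery starts
    to discharge after charging, or the battery is full (SoC = 100%).
    Assumed convention: positive current = charging."""
    starts_discharging = prev_current > 0 and current < 0
    return starts_discharging or soc_percent >= 100.0
```
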
The maximum number of cycles is calculated as in Equation (5). It depends on the average currents during the latest half-cycle, the previous DoD, and the ambient and reference temperatures. The symbols in (5) are presented in Table 2. The constant values in (5) are set by the Matlab battery model and are not available from the documentation.
The parameters of the assessed algorithms for training and validation are presented in Table 3. The parameters of the battery are presented in Table 4. The additional parameters for the performance evaluation are presented in Table 5.
For training and validation, the predicted and actual prices of the Finnish FCR-N market and the Finnish power grid frequency data from 2020 were used. One episode is one day, and one step of the RL agent is one hour, since the market interval is one hour. The RL environment was reset at the beginning of each episode. The days were shuffled and then split into training and validation datasets with a ratio of 9:1. In the data preprocessing phase, any days with missing data were excluded, resulting in 315 training days and 35 validation days. Ten random seeds were used to train ten agents for each RL algorithm.
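The shuffle-and-split step can be sketched as follows; the seed and the helper name are illustrative, and days with missing data are assumed to have been excluded beforehand.

```python
import random

def split_days(days, train_ratio=0.9, seed=0):
    """Shuffle the available days and split them 9:1 into training and
    validation sets, as described in the text."""
    days = list(days)
    random.Random(seed).shuffle(days)
    n_train = round(train_ratio * len(days))
    return days[:n_train], days[n_train:]

# 350 complete days after preprocessing -> 315 training + 35 validation.
train, val = split_days(range(350))
```
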
The tunable weight term w in the reward constrains the learning of the agents. Training is meaningful only if the aging penalty is significant but does not dominate the net revenue. If the net revenue dominates the reward, the agent is expected to ignore the aging penalty; if the aging penalty dominates, there is no business case, since costs outweigh the revenues. The different components of the reward function were plotted for several values of w, and it was determined that a value of 2.63 was in the meaningful range described above. The methodology is presented in detail using this value of w. The performance evaluation is performed for several values of w in Section 5.2.
Since the state space has only two variables, the mapping from the state space to the actions learned by the RL agent can be visualized as a heatmap, in which the horizontal and vertical axes are the values of the state variables and the color is the value of the bidding action. Figure 5 shows this mapping for each of the three algorithms: REINFORCE (a), A2C (b), and PPO (c). Since 10 random seeds are used, the action values are the means of the actions selected by the 10 agents. The learned policies of the three algorithms display a triangular pattern, which can be intuitively explained. The longer the time since the previous rest, the higher the likelihood of the SoC going out of bounds and incurring market penalties, and the higher the market price, the higher the revenue. The agent learns to capitalize on this phenomenon by using higher bids toward the top right corner of the heatmap.
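A seed-averaged policy heatmap of this kind can be computed as sketched below. The policy representation (a callable mapping the two state variables to a bid) and the toy ‘triangular’ policy are illustrative assumptions, not the authors' trained agents.

```python
def mean_action_heatmap(policies, prices, rest_hours):
    """For each (forecast price, hours-since-rest) state, average the bid
    selected by each of the trained seeds; returns a nested list that can
    be plotted as a heatmap."""
    return [[sum(pol(p, r) for pol in policies) / len(policies)
             for r in rest_hours] for p in prices]

# Toy policy mimicking the reported triangular pattern: bid more when
# both the price and the time since the last rest are high.
toy = [lambda price, r: 0.1 * min(50, int(price * r))] * 10
hm = mean_action_heatmap(toy, prices=[0.0, 1.0, 2.0], rest_hours=[0, 1, 2])
```
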
6. Discussion
The validation rewards in Figure 8 and Figure 9 show that most of the learning occurs in the first 100 days. Only very minor improvements can be expected from continuing the training beyond the 1575 episodes used in this paper. The standard deviations in Figure 9 show that it is not possible to make any statistically significant statements about the superiority of any of the three RL algorithms.
Figure 10 and Figure 11 illustrate the performance obtained by our agents with a realistic lithium-ion battery model. The dots lie on a diagonal from the lower left to the upper right corner, which illustrates the tradeoff involved in adjusting the aging penalty weight w: a lower penalty results in higher net revenues and faster aging. This is to be expected intuitively, since a lower value of w decreases the negative aging penalty term in the reward function without affecting the positive market compensation term. As w is lowered, the positive market compensation term dominates the reward function and encourages the agent to take actions that increase the compensation; in other words, the agent is encouraged to bid higher capacities. According to Figure 4, higher capacities result in higher charging and discharging currents, which cause faster aging. Figure 10 shows that a straight line could be fitted to the dots with w values in the range 1.1–3.3. For higher w values, a significant drop in net revenues is observed. Our RL agent is not intended to be used in situations in which the aging penalty is very large compared to the market revenues; in such a situation, the business case for participating in the frequency reserve market is questionable.
The standard deviations in Figure 11 show that it is not possible to make statistically meaningful statements about the superiority of any of the RL algorithms. However, the diagonal trend discussed in the previous paragraph remains evident when the shaded areas are considered. Further, it is noted that when w is larger than 3.3, outside the relevant range of 1.1–3.3 identified in the previous paragraph, the shaded boxes are much larger. This can be due to the fact that with large w values, the aging cost dominates the reward.
The relevant value for w depends on the actual cost of an equivalent full cycle of a particular battery. This cost depends greatly on raw material costs, supply chain disruptions, and government subsidies, which in turn can change drastically in response to global events such as pandemics and military conflicts. Thus, in this article, the aging penalty weight w is a parameter. If the aging cost of a specific battery is known in terms of EUR per equivalent full cycle, the horizontal axis of Figure 10 can be converted to EUR by multiplying by this aging cost. The w value can then be adjusted so that the difference between net revenue and aging cost is maximized.
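The suggested selection of w can be sketched as follows. The sweep values below are made up for illustration; only the selection rule (maximize net revenue minus aging cost in EUR) follows the text.

```python
def best_w(results, cost_per_cycle_eur):
    """Given performance-evaluation results as (w, net_revenue_eur,
    equivalent_full_cycles) tuples, convert aging to EUR and return the
    w that maximizes net revenue minus aging cost."""
    return max(results,
               key=lambda r: r[1] - r[2] * cost_per_cycle_eur)[0]

# Hypothetical sweep: lower w -> higher net revenue but more cycles.
sweep = [(1.1, 1200.0, 30.0), (2.63, 1000.0, 18.0), (3.3, 900.0, 12.0)]
w_star = best_w(sweep, cost_per_cycle_eur=20.0)
```

With an expensive battery (high EUR per cycle) the rule favors a large w, while a cheap battery shifts the optimum toward small w and aggressive bidding.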
The results are specific to the lithium-ion battery chemistry and the parameters of our case study battery presented in Table 3 and Table 4. The methodology can be readily adapted to another lithium-ion battery by updating these parameters. The methodology can also be easily adapted to other battery chemistries by replacing the battery simulation block in Figure 3 and Figure 4.
The results are specific to the Finnish FCR-N market. It is straightforward to generalize the approach to other auction-based frequency reserve markets in Finland or in other countries, with the following modifications. In the simulation environment, the current calculation and the calculation of penalty minutes should follow the technical specification of the target market. The market price data and power grid frequency data used in our study were obtained from the Finnish transmission system operator Fingrid’s open data portal, so equivalent data need to be obtained from the relevant TSO in another country. Notably, the RL problem formulation does not need to be changed. Batteries can also be traded on other kinds of electricity markets; for example, a battery can perform energy arbitrage on day-ahead markets. However, it is not straightforward to generalize beyond frequency reserve markets to other auction-based electricity markets, since significant changes to the RL problem formulation would be needed.
It is notable that RL practitioners generally use unique reward formulations, so it is not possible to make performance comparisons between different works. In this article, a physics-based performance evaluation environment has been proposed that enables direct comparisons even between works with different reward formulations.
7. Conclusions
For each of the algorithms, learning was observed in the form of a reward that increased and eventually plateaued (Figure 6 and Figure 8). The main statistical findings are summarized in Table 6. For each of the algorithms, the reward is within the standard deviation of the other algorithms. It is concluded that each of the algorithms was successfully trained and that none of them was statistically superior to the others.
As stated in Section 1, the contribution of this article is a methodology involving a realistic battery simulation to assess the performance of the trained RL agent with respect to battery aging, in order to inform the selection of the weighting of the aging term in the RL reward formula. The results presented in Section 5.1 demonstrate that learning occurs and that none of the investigated RL algorithms is statistically superior to the others. As there is no statistically significant difference between the performance of the different algorithms, we conclude that the optimization problem was successfully addressed with all of them. These kinds of results are frequently presented in the RL literature. Stopping the investigation at this point would have two weaknesses. Firstly, it would not be known whether an RL agent trained in the RL environment could generalize to realistic battery dynamics if it were tasked with managing a real battery. Secondly, since the results are quantified in terms of reward, they do not permit direct comparisons to other RL investigations of the same phenomenon, even with identical battery parameters, if the RL reward formulations differ. Further, it is not possible to make performance comparisons to non-RL methods when the results are expressed in terms of the reward value.

To overcome these two weaknesses, a performance evaluation was performed after confirming the learning on the validation dataset. The concept for the performance evaluation is presented in Figure 2, its implementation is presented in Figure 4, and the results are presented in Figure 10 and Figure 11. The implementation in Figure 4 uses a realistic battery model and conforms to the technical specification of the Finnish FCR-N market. The results in Figure 10 and Figure 11 are expressed in terms of net revenue and aging (equivalent full cycles). The battery parameters are presented in Table 3 and Table 4, and the FCR-N market data and power grid frequency data are from the year 2020. Thus, any researcher is able to develop a battery FCR bidding optimizer, using either RL or non-RL methods, run it against this battery model and the openly available market and power grid data, and obtain results in terms of net revenue and aging that are directly comparable to the results we have presented in Figure 10 and Figure 11.