## 3.1.2. Action

*A* represents the set of eligible actions of MGOs, namely the variations of bidding price and quantity in each bidding round. Whereas most previous works aim to increase market participants' profits solely through dynamic adjustment of the bidding price, we propose a two-dimensional action formulation for Q-learning. By covering both bidding price and quantity, the action space is extended from a set of single price actions to a two-dimensional space, formulated as Equation (9):

$$a^n = (p^n, q^n) \qquad n = 1, 2, \cdots, N. \tag{9}$$

The basic idea behind actions in this paper is that each MGO optimistically assumes that all other MGOs behave optimally (though they often will not, owing to their exploration-exploitation behavior). In addition, all MGOs compete fairly in the bidding process. Considering the particularities of the energy trading market and the agent-based simulation environment, the concept of a 'Basic Action' is introduced to describe the rational, conventional action of each MGO. One point to emphasize is that the 'Basic Action' is simply a point in the action space, representing the typical choice of bidding price and quantity for an MGO. The mathematical expressions of the basic price action are presented as follows:

$$p_i^{n,basic} = p_{i,step}^T \cdot (1 + TP^n) \cdot SDRF^n,\tag{10}$$

$$p_{i,step}^T = \frac{\left| price_i^{initial} - price_i^{history,T} \right|}{\beta \cdot N},\tag{11}$$

$$TP^n = 1 - \left(1 - \frac{n}{N}\right)^{\varepsilon^{-1}},\tag{12}$$

where $p_{i,step}^T$ represents the price changing step of MGO *i*, determined by MGO *i*'s initial bidding price $price_i^{initial}$ and historical trading price $price_i^{history,T}$ in time slot *T*, as shown in Equation (11). *β* is a regulation factor for the price changing step. As the QLCDA approaches the time deadline, both buyer and seller MGOs are willing to concede on the bidding price in order to close more deals. The time pressure $TP^n$ in Equation (12) describes this degree of urgency over the bidding rounds. The choice of time pressure function has been discussed in previous research [19]; in this paper, we adopt a simplified form in which the time pressure of each microgrid depends only on the bidding round index, ignoring the historical trading records of each microgrid. $SDRF^n$ is a modification factor based on the real-time SDR. Different expressions are adopted for buyer and seller MGOs, as follows, in which *π* is an adjustment coefficient in the range [0.3, 0.5] that measures the influence of the SDR on the basic bidding price:

$$SDRF_i^n = \begin{cases} \pi \cdot (1 - SDR^n) + 1 & \text{for buyers,} \\ \pi \cdot (SDR^n - 1) + 1 & \text{for sellers.} \end{cases} \tag{13}$$
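As an illustration, the basic price action of Equations (10)-(13) could be evaluated as in the following Python sketch; the function and argument names (`basic_price`, `eps` for *ε*, `pi_coef` for *π*) are ours and not part of the original formulation:

```python
def time_pressure(n, N, eps):
    """Time pressure TP^n over bidding rounds, Eq. (12)."""
    return 1.0 - (1.0 - n / N) ** (1.0 / eps)

def sdr_factor(sdr, pi_coef, is_buyer):
    """SDR-based modification factor SDRF^n, Eq. (13); pi_coef lies in [0.3, 0.5]."""
    return pi_coef * (1.0 - sdr) + 1.0 if is_buyer else pi_coef * (sdr - 1.0) + 1.0

def basic_price(price_initial, price_history, beta, N, n, eps, sdr, pi_coef, is_buyer):
    """Basic price action p_i^{n,basic}, Eq. (10)."""
    p_step = abs(price_initial - price_history) / (beta * N)   # Eq. (11)
    return p_step * (1.0 + time_pressure(n, N, eps)) * sdr_factor(sdr, pi_coef, is_buyer)
```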

Accordingly, the basic quantity action is calculated as follows:

$$q_i^{n,basic} = \begin{cases} q_i^{n} \cdot \left( SDR^{n} \cdot PS_i^{n} - 1 \right) & \text{for buyers,} \\ q_i^{n} \cdot \left( \left( 2 - SDR^{n} \right) \cdot PS_i^{n} - 1 \right) & \text{for sellers,} \end{cases} \tag{14}$$

$$PS_i^n = \rho + 2 \cdot (1 - \rho) \cdot N(PR_i^n, \mu, \sigma), \tag{15}$$

$$PR_i^n = \lambda \cdot \frac{p_i^{history,T} - p_i^{reference,n}}{p_{hl}^t - p_{ll}^t},\tag{16}$$

where $PR_i^n$ is a reference price factor used as a parameter of the normal distribution; *λ* = 1 when MGO *i* is a buyer and *λ* = −1 when MGO *i* is a seller. $p_i^{reference,n}$ is the reference price of MGO *i* in round *n*, calculated as the average price of potential transactions in the market. The values of *μ* and *σ* in Equation (15) are the same as those in Equation (7). *ρ* is a pre-defined adjustment coefficient in the range [0.95, 1], chosen to coordinate with the change rate of the SDR in Equation (14).
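A companion sketch for the basic quantity action of Equations (14)-(16) is given below. It assumes that *N*(·, *μ*, *σ*) in Equation (15) denotes the Gaussian density value, mirroring Equation (7) of the paper (not reproduced in this excerpt), and that `p_upper` and `p_lower` stand for the market price bounds in the denominator of Equation (16); all identifiers are illustrative:

```python
from math import exp, sqrt, pi

def normal_value(x, mu, sigma):
    """Gaussian density evaluated at x; assumed to match Equation (7)."""
    return exp(-0.5 * ((x - mu) / sigma) ** 2) / (sigma * sqrt(2.0 * pi))

def basic_quantity(q_n, sdr, price_history, price_reference,
                   p_upper, p_lower, mu, sigma, rho, is_buyer):
    """Basic quantity action q_i^{n,basic}, Eqs. (14)-(16)."""
    lam = 1.0 if is_buyer else -1.0
    pr = lam * (price_history - price_reference) / (p_upper - p_lower)  # Eq. (16)
    ps = rho + 2.0 * (1.0 - rho) * normal_value(pr, mu, sigma)          # Eq. (15)
    factor = sdr * ps if is_buyer else (2.0 - sdr) * ps                 # Eq. (14)
    return q_n * (factor - 1.0)
```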

Since the action space is continuous, it is impossible to explore it exhaustively. The idea of 'local search' from heuristic algorithms is therefore applied in the proposed Q-learning algorithm: we explore the neighborhood of the basic action along the price and quantity dimensions to search for better bidding performance in the QLCDA process. Starting from the basic action obtained above, we search symmetrically in both directions of each dimension, so the number of actions per dimension is odd. If we chose more than one neighborhood of the basic action in each direction, the total number of actions would be at least 25, which is impractical in both modeling and simulation. To limit the number of bidding actions and reduce computational complexity, only the closest neighborhoods are taken into account. The neighborhood actions are calculated as follows, where *ξ* and *τ* indicate the proximity of the bidding price and quantity, respectively, according to bidding experience; *ξ* and *τ* are independent variables that only describe the neighborhood relationship of the bidding price and quantity. A 3×3 action matrix is thereby created as the set of alternative behaviors of one MGO under a certain state. One factor needs highlighting in particular: the nine actions under a certain state represent nine bidding preferences and tendencies of each microgrid. Since the real-time SDR within one state may differ, the nine actions are also SDR-based and are not identical for a given state:

$$p_i^{n,-} = p_i^{n,basic} - \xi \cdot p_{i,step}^n, \tag{17}$$

$$p_i^{n,+} = p_i^{n,basic} + \xi \cdot p_{i,step}^n, \tag{18}$$

$$q_{i}^{n,-} = q_{i}^{n,basic} \cdot (1 - \tau),\tag{19}$$

$$q_{i}^{n,+} = q_{i}^{n,basic} \cdot (1 + \tau). \tag{20}$$
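The 3×3 action matrix described above can be assembled directly from Equations (17)-(20); the sketch below is illustrative, with `xi` and `tau` standing for *ξ* and *τ*:

```python
def action_matrix(p_basic, q_basic, p_step, xi, tau):
    """3x3 grid of candidate (price, quantity) actions around the basic action."""
    prices = [p_basic - xi * p_step, p_basic, p_basic + xi * p_step]      # Eqs. (17), (18)
    quantities = [q_basic * (1.0 - tau), q_basic, q_basic * (1.0 + tau)]  # Eqs. (19), (20)
    return [[(p, q) for q in quantities] for p in prices]
```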

## 3.1.3. Q-Values and Rewards

The goal of the Q-learning algorithm for bidding strategy optimization is to choose appropriate actions under different states for each MGO, and the Q-values indicate the long-term values of state-action pairs. In conventional Q-learning, the Q-values of state-action pairs are arranged in a so-called Q-table. However, based on the two-dimensional action space introduced above, a Q-cube framework for the Q-learning algorithm is proposed, as shown in Figure 6, in which the colors of the state slices correspond to those in Figure 5.

**Figure 6.** A Q-cube framework designed for the Q-learning algorithm.

The Q-value of taking one bidding action under a certain state is located in this Q-cube, as indicated by the small blue cube inside. Generally speaking, the proposed Q-cube is a continuous three-dimensional space, but, for practical purposes, we discretize the problem domain in this paper by considering eight states, three bidding prices, and three bidding quantities. Each MGO has a unique Q-cube showing the Q-value distribution over the problem domain. The Q-values in the Q-cube are not reset to zero at the end of each time slot but are carried over as the initial configuration of the next time slot. This rolling iteration of the Q-cube accumulates bidding experience in the energy trading market.
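A minimal data-structure sketch of the Q-cube follows. The class and method names are ours; the 8×3×3 discretization matches the states, price actions, and quantity actions described above, and the array is deliberately reused across time slots rather than reset:

```python
import numpy as np

N_STATES, N_PRICES, N_QUANTITIES = 8, 3, 3   # discretization adopted in this paper

class QCube:
    """Per-MGO Q-cube over (state, price action, quantity action)."""

    def __init__(self):
        # Initialized to zero once; carried over between time slots thereafter.
        self.q = np.zeros((N_STATES, N_PRICES, N_QUANTITIES))

    def greedy_action(self, state):
        """Indices (price_idx, quantity_idx) of the highest-valued action in a state slice."""
        return divmod(int(np.argmax(self.q[state])), N_QUANTITIES)
```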

$r(s, a)$ is the reward function for adopting action *a* in state *s*. The selection of the reward function is crucial, since it induces the behavior of the MGOs. Because QLCDA considers the dual effects of bidding price and quantity, the contributions of both should be taken into account in the reward function. The mathematical expression of the reward function is presented in Equation (21), where *ω* is the weighting factor between bidding price and quantity. As price is the decisive factor in whether a deal is closed, we pay more attention to the bidding price, so *ω* is usually set greater than 0.5:

$$r(s, a) = \omega \cdot r_p(s, a) + (1 - \omega) \cdot r_q(s, a). \tag{21}$$

$r_p(s, a)$ and $r_q(s, a)$ represent the contributions of the bidding price and quantity updates to the reward function, calculated as follows. All variable definitions are the same as those in Equations (10)-(16):

$$r_p(s, a) = \frac{\left| \left| p_i^n - p_i^{history,T} \right| - \left| p_i^{reference,n} - p_i^{history,T} \right| \right|}{p_{i,step}^T}, \tag{22}$$

$$r_q(s, a) = \frac{\lambda \cdot (q_i^n - q_i^{initial,T})}{q_i^{initial,T} \cdot (SDR^n - 1)}.\tag{23}$$
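The reward of Equations (21)-(23) can be evaluated as in the sketch below, which assumes $SDR^n \neq 1$ so that the denominator of Equation (23) is nonzero; the argument names are illustrative:

```python
def reward(p_n, q_n, price_history, price_reference, p_step,
           q_initial, sdr, omega, is_buyer):
    """Weighted reward r(s, a) of Eq. (21) for the chosen bidding action."""
    r_p = abs(abs(p_n - price_history)
              - abs(price_reference - price_history)) / p_step        # Eq. (22)
    lam = 1.0 if is_buyer else -1.0
    r_q = lam * (q_n - q_initial) / (q_initial * (sdr - 1.0))          # Eq. (23), SDR != 1
    return omega * r_p + (1.0 - omega) * r_q                           # Eq. (21)
```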

···

···

#### *3.2. Update Mechanism of the Proposed Q-Cube Framework*

In the proposed Q-learning-based continuous double auction energy trading market, the two dimensions of an MGO's action play different roles: the bidding price is the key factor in deciding whether a deal is closed, while the bidding quantity affects the real-time SDR of the overall market. Meanwhile, the SDR (as the MGOs' state) has a decisive influence on the MGOs' actions through the updating of Q-values. This coupling relationship between MGOs' actions and the market SDR is modeled in this chapter, as shown in Figure 7. One MGO takes $a^{n-1}$ in round *n* − 1 and the state transfers from $s^{n-1}$ to $s^n$. After calculating the reward and updating the Q-value, the probability of choosing each action in the action space is modified. Given the new Q-cube and the market SDR, the MGO might then choose $a^n$ as its action in round *n* and repeat the above process. Therefore, the state-action pair of each MGO in each bidding round is formed in a spiral iterative manner, considering both local private information and the public environment. The Q-cube framework connects the state-perception process and the decision-making process, which is the core innovation of this paper.

**Figure 7.** The coupling relationship between microgrid operators' actions and market supply demand relationship.
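For completeness, one iteration of this coupling loop can be sketched as a standard tabular Q-learning update over the Q-cube; the exact update rule and the learning parameters *α* and *γ* used in the paper are not reproduced in this excerpt, so the values below are assumptions for illustration only:

```python
import numpy as np

def q_cube_update(q_cube, state, action_idx, next_state, r, alpha=0.1, gamma=0.9):
    """One textbook Q-learning step on an (8, 3, 3) Q-cube (illustrative parameters).

    state / next_state : SDR-based state indices before and after the bidding round
    action_idx         : (price_idx, quantity_idx) of the chosen action a^{n-1}
    r                  : reward r(s, a) from Eq. (21)
    """
    p_idx, q_idx = action_idx
    td_target = r + gamma * np.max(q_cube[next_state])
    q_cube[state, p_idx, q_idx] += alpha * (td_target - q_cube[state, p_idx, q_idx])
```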
