3.2.2. Action Update

In each round of QLCDA, each MGO first computes its basic action from the market SDR and its historical trading records, shown as red balls in the action space in Figure 8. The colors of the action-space slices represent the market state. Neighborhood actions are then formed around the basic action, shown as blue blocks. Finally, one of the nine optional actions is selected according to a probability matrix.

**Figure 8.** The update process of bidding action in the proposed Q-learning algorithm.

In Equation (25), the action matrix **A**<sub>*i*</sub><sup>*n*</sup> is composed by combining the optional bidding prices and quantities. Correspondingly, the elements of the probability matrix **Pro**<sub>*i*</sub><sup>*n*</sup> are formed according to the '*ε*-greedy' strategy. The probability *x*<sup>*bb*</sup> of the basic action *a*<sub>*i*</sub><sup>*bb*</sup> = (*p*<sub>*i*</sub><sup>basic</sup>, *q*<sub>*i*</sub><sup>basic</sup>) is given preferential treatment and equals *ε*. For each microgrid, the setting of *ε* represents, in theory, the degree of attention it pays to the optimal bidding action choice, and this setting differs among microgrids. The probabilities of the other neighborhood actions share the remaining probability 1 − *ε*, weighted by their Q-values (Equation (26)). The nine action probabilities sum to 1:

$$\mathbf{A}\_{i}^{n} = \begin{bmatrix} (p\_{i}^{-}, q\_{i}^{+}) & (p\_{i}^{\text{basic}}, q\_{i}^{+}) & (p\_{i}^{+}, q\_{i}^{+}) \\ (p\_{i}^{-}, q\_{i}^{\text{basic}}) & (p\_{i}^{\text{basic}}, q\_{i}^{\text{basic}}) & (p\_{i}^{+}, q\_{i}^{\text{basic}}) \\ (p\_{i}^{-}, q\_{i}^{-}) & (p\_{i}^{\text{basic}}, q\_{i}^{-}) & (p\_{i}^{+}, q\_{i}^{-}) \end{bmatrix} \quad \Rightarrow \quad \mathbf{Pro}\_{i}^{n} = \begin{bmatrix} \mathbf{x}^{-+} & \mathbf{x}^{b+} & \mathbf{x}^{++} \\ \mathbf{x}^{-b} & \mathbf{x}^{bb} & \mathbf{x}^{+b} \\ \mathbf{x}^{--} & \mathbf{x}^{b-} & \mathbf{x}^{+-} \end{bmatrix}. \tag{25}$$

For example, *x*<sup>−+</sup> represents the probability of choosing action *a*<sub>*i*</sub><sup>−+</sup> = (*p*<sub>*i*</sub><sup>−</sup>, *q*<sub>*i*</sub><sup>+</sup>) under the current state, which is calculated as follows:

$$\mathbf{x}^{-+} = (1 - \varepsilon) \cdot \frac{\mathbf{Q}^{\rm n}(s\_i^{\rm n}, a\_i^{-+})}{\sum\_{\forall a/a^{bb}} \mathbf{Q}^{\rm n}(s\_i^{\rm n}, a\_i)}.\tag{26}$$

This selection mechanism means that all MGOs have a higher possibility of choosing actions with higher Q-values in each round of QLCDA. By putting the MGOs' best possible local actions together, the most suitable actions for the current global state are generated in a distributed non-cooperative way.
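The selection step in Equations (25) and (26) can be sketched as follows. This is a minimal illustration rather than the authors' implementation: the step sizes `dp` and `dq`, the function names, and the uniform fallback used when all neighborhood Q-values are zero are assumptions, and Q-values are assumed to be non-negative.

```python
import random

def build_action_matrix(p_basic, q_basic, dp, dq):
    """3x3 action matrix laid out as in Equation (25).

    Rows run from q+ (top) to q- (bottom); columns run from p- to p+.
    dp and dq are hypothetical price and quantity step sizes.
    """
    prices = (p_basic - dp, p_basic, p_basic + dp)
    quantities = (q_basic + dq, q_basic, q_basic - dq)
    return [[(p, q) for p in prices] for q in quantities]

def action_probabilities(q_values, epsilon):
    """Probability matrix of Equation (25) from a 3x3 grid of Q-values.

    The basic action (center cell) receives epsilon; the eight neighbors
    share 1 - epsilon in proportion to their Q-values, as in Equation (26).
    """
    center = (1, 1)
    neighbor_sum = sum(
        q_values[r][c] for r in range(3) for c in range(3) if (r, c) != center
    )
    probs = [[0.0] * 3 for _ in range(3)]
    for r in range(3):
        for c in range(3):
            if (r, c) == center:
                probs[r][c] = epsilon
            elif neighbor_sum > 0:
                probs[r][c] = (1 - epsilon) * q_values[r][c] / neighbor_sum
            else:
                # Assumption: split evenly when no neighbor has any Q-value yet.
                probs[r][c] = (1 - epsilon) / 8
    return probs

def choose_action(actions, probs, rng=random):
    """Sample one (price, quantity) pair according to the probability matrix."""
    flat_actions = [a for row in actions for a in row]
    flat_probs = [p for row in probs for p in row]
    return rng.choices(flat_actions, weights=flat_probs, k=1)[0]
```

With `epsilon = 0.4` and neighborhood Q-values summing to 12, a neighbor with Q-value 2 is chosen with probability 0.6 × 2/12 = 0.1, and the nine probabilities sum to 1.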

#### **4. Case Studies and Simulation Results**

In this section, we investigate the performance of the Q-learning algorithm for continuous double auction among microgrids by Monte Carlo simulation. The proposed algorithm is tested on a realistic case in Guizhou Province, China. The distribution network near Hongfeng Lake consists of 14 microgrids with different scales and internal configurations. The detailed topology of the networked microgrids is given in Figure 9. As power flow calculation and safety checks are not the focus of this paper, distance information and transmission prices in this distribution network are not provided here. The interested reader may refer to [24] for more details.

We simulate this non-cooperative energy trading market within a scheduling cycle of 24 h. The QLCDA is performed every Δ*t* = 0.5 h, and a scheduling cycle starts at 9:00 a.m. The internal coordinated dispatch of each microgrid is accomplished in advance, and its results are treated as the initial bidding information in QLCDA. The BESS properties of the 14 microgrids are provided in Table A1, including capacity, initial SOC, charge and discharge restrictions, and charge and discharge efficiencies. Guizhou Grid adopts a peak/flat/valley pricing strategy for energy trading, which divides a 24-hour scheduling cycle into three types of time intervals. Surplus energy injected into the grid is paid at 0.300 CNY/kWh throughout the day, while energy bought from the grid is charged at 1.197/0.744/0.356 CNY/kWh in peak/flat/valley intervals, respectively (see Table A2).
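The timing and grid tariff described above can be illustrated with a short sketch. The prices, the 0.5 h slot length and the 9:00 a.m. start come from the text; the 1-based slot numbering, the function names and the sign convention are assumptions, and the hours belonging to each tariff interval are not reproduced here because they are only given in Table A2.

```python
# Grid tariff from the text, in CNY/kWh.
GRID_BUY_PRICE = {"peak": 1.197, "flat": 0.744, "valley": 0.356}
GRID_SELL_PRICE = 0.300

SLOT_MINUTES = 30          # delta t = 0.5 h
CYCLE_START_MINUTE = 9 * 60  # scheduling cycle starts at 9:00 a.m.

def slot_start_time(slot):
    """Clock time (hour, minute) at which a 1-based time slot begins.

    Assumption: slots are numbered 1..48 from the start of the cycle.
    """
    minutes = (CYCLE_START_MINUTE + (slot - 1) * SLOT_MINUTES) % (24 * 60)
    return divmod(minutes, 60)

def grid_settlement(net_kwh, interval):
    """Signed profit (CNY) of settling a residual quantity with the grid.

    net_kwh > 0 means energy bought from the grid (a cost, returned as a
    negative number); net_kwh < 0 means surplus sold to the grid.
    """
    if net_kwh >= 0:
        return -net_kwh * GRID_BUY_PRICE[interval]
    return -net_kwh * GRID_SELL_PRICE
```

Under these assumptions, time slot 12 begins at 14:30, and buying 27.016 kWh from the grid in a flat interval costs 27.016 × 0.744 ≈ 20.10 CNY.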

**Figure 9.** Network topology of 14 microgrids in Guizhou Province, China.

To simulate the microgrids' preferences in decision-making, different risk strategies are adopted by setting diverse Q-learning parameters. The learning rate, discount rate and greedy degree of the fourteen microgrids are given in Table A3. Three risk strategies are defined and discussed according to different Q-learning parameter choices:


• Risk-Taking Strategy: the MGO is not greedy about new bidding information (low value of *α* in [0, 0.4]) but likes to explore more potential actions (low value of *ε* in [0, 0.4]) as a risk-taker. Meanwhile, it is eager for more future profits (high value of *γ* in [0.6, 1]).

Other hyper parameters in the proposed Q-learning algorithm are given in Table A4.
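As an illustration of the parameter ranges stated for the risk-taking strategy, the following sketch checks whether a parameter triple falls inside them. The function names are hypothetical, and only the risk-taking ranges given in the text are encoded.

```python
def in_range(value, low, high):
    """True when low <= value <= high."""
    return low <= value <= high

def is_risk_taking(alpha, gamma, epsilon):
    """Check the risk-taking ranges from the text:
    learning rate alpha in [0, 0.4], greedy degree epsilon in [0, 0.4],
    discount rate gamma in [0.6, 1].
    """
    return (
        in_range(alpha, 0.0, 0.4)
        and in_range(epsilon, 0.0, 0.4)
        and in_range(gamma, 0.6, 1.0)
    )
```

For instance, the MG 6 parameters reported in Case Study 2 (*α* = 0.2617, *γ* = 0.6721, *ε* = 0.1680) satisfy all three ranges, whereas those of MG 13 (*α* = 0.6812, *γ* = 0.333, *ε* = 0.8462) do not.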

The proposed energy trading market model and QLCDA algorithm are implemented and simulated using MATLAB R2019a on an Intel Core i7-4790 CPU at 3.60 GHz. Three case studies on bidding performance and profit comparison are discussed in this section. Each of the three case studies is simulated 30 times. The bidding result of one particular Monte Carlo run is analyzed in detail in Case Studies 1 and 2, and the average bidding profits are compared with the profits of two other energy trading mechanisms in Case Study 3.

#### *4.1. Case Study 1: Bidding Performance of the Proposed Continuous Double Auction Mechanism*

#### 4.1.1. Bidding Performance of the Overall Energy Trading Market

The proposed continuous double auction energy trading mechanism has a significant effect on the energy trading among microgrids. Figure 10 shows the price bidding process in time slot 12. Figure 10a presents the bidding prices of all microgrids over the whole time slot. Starting from different initial bidding prices, the slopes of the price curves indicate the different bidding strategies of the MGOs. Because the bidding price is the key factor in deciding whether a deal is closed, the intersection points of the price curves represent deals under various market conditions. Buyer/seller MGOs with a stronger willingness to reach deals prefer to raise/drop their prices quickly, expecting their energy demand/supply to be satisfied in the early stage of a time slot. Patient MGOs may wait until the deadline for a better trading price, but they then face fierce price competition near the deadline and the possibility that no energy is left to trade.
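The relation between price slopes and deal timing can be illustrated with a deliberately simplified sketch. It assumes linear price adjustments and a deal as soon as the buyer's bid reaches the seller's ask; neither assumption is taken from the paper's bidding rules, and all numbers are hypothetical. Prices are expressed as integers in units of 0.001 CNY/kWh to avoid floating-point ties.

```python
def first_deal_round(bid0, bid_step, ask0, ask_step, max_rounds):
    """First round in which a rising bid meets a falling ask.

    bid0, ask0: initial buyer and seller prices (0.001 CNY/kWh).
    bid_step, ask_step: hypothetical per-round price changes.
    Returns (round, bid, ask), or None if the deadline passes without a deal.
    """
    for n in range(max_rounds + 1):
        bid = bid0 + n * bid_step
        ask = ask0 - n * ask_step
        if bid >= ask:
            return n, bid, ask
    return None
```

With a seller starting at 700 and dropping by 20 per round, an eager buyer (start 400, step 10) meets the ask in round 10, while a more patient buyer (step 5) only does so in round 12 and risks reaching the deadline without a deal.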

**Figure 10.** The bidding process of price in time slot 12.

Figure 10b shows the bidding price details in rounds 105–135 of time slot 12. According to the historical records, MG 11 had not traded energy with other microgrids for a long time. With a stronger willingness to sell energy, MG 11 drops its bidding price quickly and reaches a deal with MG 4 at 0.530 CNY/kWh. The unmet energy demand of MG 4 is satisfied by MG 13 at a higher price (0.579 CNY/kWh). MG 6 raises its bidding price slowly and closes deals with MG 9 and MG 10 at 0.481 CNY/kWh and 0.489 CNY/kWh, respectively. However, 27.016 kWh of its energy demand has to be bought from the grid at a higher price (0.744 CNY/kWh), as all the energy supply from other microgrids is sold out. This shows a trade-off between price and trading opportunity: an MGO that is eager to close a deal may not obtain a satisfactory trading price. On the other hand, the energy trading market follows the principle of 'First Bid, First Deal', which means that the closer the time is to the deadline, the less energy one is able to trade.

A comparison of the clear power curves before and after CDA is presented in Figure 11. With the proposed CDA mechanism, the distribution network achieves a better balance of energy supply and demand. Because market conditions are more balanced, more energy is transacted within the distribution network rather than with the grid, which reduces long-distance energy transmission loss. With the help of BESS, an alternative form of 'demand response' is performed among microgrids by exploiting the potential capacity of elastic loads, which expands the concept of demand response from time-slot-based to multi-agent-based through CDA. In addition, trading prices are more reasonable and profitable, and they reflect each MGO's personal preferences.

**Figure 11.** Clear power of the overall energy trading market before/after continuous double auction.

The comparison of trading quantity before and after the proposed CDA is given in Table 1. Adopting CDA has a significant effect, as the quantity traded with the grid decreases to different degrees. For example, only 10.8% of the energy demand of MG 3 is provided by the grid, while microgrids with heavy demand such as MG 4 and MG 6 still depend on the grid to a large extent, at 65.5% and 57.1%, respectively. Seller microgrids depend on the grid noticeably less than buyer microgrids, with an average percentage of 26.1%, as they prefer to sell energy within the distribution network. The BESS storage change and (dis)charge energy loss are also presented in Table 1, which shows that the BESS of most microgrids reaches a higher SOC at the end of one scheduling cycle. The larger the BESS capacity and the more active the participation in the trading market, the greater the BESS (dis)charge energy loss.


**Table 1.** Comparison of trading quantity before/after continuous double auction.

#### 4.1.2. Bidding Results of Specific Microgrids with Different Roles

The bidding results of specific microgrids with different roles, including bidding price and quantity, are presented in this subsection. Figure 12 gives the energy trading prices of MG 4 and MG 12. MG 4 plays the role of buyer in the whole scheduling cycle, and it successfully reaches deals with other microgrids in most of the time slots, as shown in Figure 12a. In time slots without a deal, it buys energy from the grid at higher prices. During the valley interval, although the grid purchase price is low (0.356 CNY/kWh), there are still plenty of opportunities to trade with other microgrids given the real-time SDR. MG 4 succeeds in buying energy at lower prices in almost all the time slots in this interval. Unlike MG 4, MG 12 plays two roles in different time slots. The detailed trading prices of MG 12 in time slots 9 to 32 are presented in Figure 12b. MG 12 performs well in both roles: during buyer intervals, it reaches deals with other microgrids at prices lower than the grid's, while, in seller intervals, it sells energy in every time slot for higher profits. The overall profit of MG 12 rose by 33.9% after joining the CDA energy trading market.

**Figure 12.** Energy trading price of MG 4 and MG 12: (**a**) MG 4 in the whole scheduling cycle; (**b**) MG 12 in time slots 9 to 32.

For MG 7, the bidding performance on quantity is presented in Figure 13a. MG 7 is a buyer microgrid in the whole scheduling cycle, and the gaps between its original bidding quantity curve and its actual trading quantity curve correspond to the real-time SDR. When *SDR* ≥ 1 (the original clear power ≥ 0, shown above the blue horizontal line in Figure 13a) in the earlier and later time slots, MG 7 raises its trading quantity and stores more energy in its BESS to absorb the surplus energy in the market. During the middle time slots when *SDR* < 1 (the original clear power < 0), part of the energy demand is provided by its own BESS, which helps to balance the excessive energy demand in the market. The two curves coincide at the end of the scheduling cycle, as the BESS stores enough energy in time slots 32 to 38 and the SOC is near 1. The same characteristics can be found in the bidding performance of MG 12. In Figure 13b, when energy demand exceeds supply, as shown below the purple horizontal line, the BESS of MG 12 discharges to satisfy the energy demand. More energy is sold in these time slots to improve the market SDR, while, during the nighttime, MG 12 charges the surplus energy to its BESS rather than selling it to the grid. The actual trading quantity curves clearly follow the real-time SDR better than the original bidding quantity curves, from the standpoints of both buyer and seller microgrids.

**Figure 13.** Bidding performance on quantity of MG 7 and MG 12.

The BESS SOC of MG 7 and MG 12 is presented in Figure 14, which shows that the trend of the SOC curves follows that of the SDR. When *SDR* < 1, both MG 7 and MG 12 discharge their BESS to compensate for the lack of energy supply. The BESS of MG 12 releases all its energy, and its SOC stays at 0 from time slot 16 onward. However, when the energy supply exceeds demand during the nighttime, the BESS starts to charge and save surplus energy for later use. The SOC of MG 7 reaches 100% from time slot 40 onward. Unlike the earlier work in [25], the charge and discharge behaviors of BESS are restricted by ramp constraints, which makes the simulation results closer to reality. Because of BESS capacity and (dis)charge energy loss, the regulatory ability of BESS in the energy trading market is limited. When *SOC* = 0 or *SOC* = 1, internal re-scheduling of each microgrid could be developed for greater bidding potential.
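The interplay of ramp constraints, (dis)charge losses and the SOC limits described above can be sketched as follows. This is a generic storage update under stated assumptions, not the paper's BESS model: the parameter names, the symmetric per-slot ramp limit `max_step_kwh`, and the way efficiencies are applied are all hypothetical.

```python
def update_soc(soc, energy_kwh, capacity_kwh, max_step_kwh,
               eta_charge, eta_discharge):
    """One-slot SOC update with a ramp limit and (dis)charge losses.

    energy_kwh > 0 requests charging (energy taken from the market);
    energy_kwh < 0 requests discharging (energy delivered to the market).
    Returns (new_soc, accepted_kwh), where accepted_kwh is the part of the
    request the battery can actually serve.
    """
    # Ramp constraint: limit the energy exchanged within one time slot.
    request = max(-max_step_kwh, min(max_step_kwh, energy_kwh))
    stored = soc * capacity_kwh
    if request >= 0:
        # Charging: only eta_charge of the incoming energy is stored.
        headroom = capacity_kwh - stored
        accepted = min(request, headroom / eta_charge)
        stored += accepted * eta_charge
    else:
        # Discharging: delivering E kWh drains E / eta_discharge from storage.
        deliverable = stored * eta_discharge
        accepted = -min(-request, deliverable)
        stored += accepted / eta_discharge
    # Guard against floating-point drift outside the physical range.
    stored = min(max(stored, 0.0), capacity_kwh)
    return stored / capacity_kwh, accepted
```

Once the SOC reaches 0 or 1, `accepted_kwh` drops to zero in the corresponding direction, which mirrors the limited regulatory ability of the BESS noted in the text.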

#### *4.2. Case Study 2: Effectiveness Verification of the Proposed Q-Cube Framework*

The Q-cube of an MGO is updated in each round of the whole scheduling cycle. Q-values are iteratively accumulated following the proposed update rules. To display this distribution in three-dimensional space, bidding actions are abstracted to nine actions. MG 6 and MG 13 are chosen as examples of the risk-taking strategy and the conservative strategy, respectively. The Q-value distributions of these two microgrids are shown in Figure 15.
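To make the roles of the learning rate *α* and discount factor *γ* concrete, the sketch below applies the textbook Q-learning update to a table indexed by state and by position in the 3×3 action neighborhood. The paper's own update rules are defined elsewhere and may differ; the number of states, the zero-based state index, the reward values and the row-major numbering that makes the basic action 'action 5' are assumptions.

```python
def make_q_cube(n_states, initial=0.0):
    """Q-table with one 3x3 grid of action values per state."""
    return [[[initial] * 3 for _ in range(3)] for _ in range(n_states)]

def action_number(row, col):
    """Row-major action label 1..9, so the center cell (1, 1) is action 5."""
    return 3 * row + col + 1

def q_update(q_cube, state, action, reward, next_state, alpha, gamma):
    """Standard Q-learning update (an assumption, not the paper's rule):

    Q(s, a) <- (1 - alpha) * Q(s, a) + alpha * (reward + gamma * max_a' Q(s', a'))
    """
    row, col = action
    best_next = max(max(r) for r in q_cube[next_state])
    old = q_cube[state][row][col]
    q_cube[state][row][col] = (1 - alpha) * old + alpha * (reward + gamma * best_next)
    return q_cube[state][row][col]
```

In this form, a small *α* keeps the stored value close to its history, while a large *γ* gives more weight to the best value reachable from the next state.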

**Figure 14.** Battery energy storage system's SOC on MG 7 and MG 12.

For the risk-taker MG 6, the Q-value distribution in its Q-cube is non-uniform, with a slight trench in the middle of the action dimension, as shown in Figure 15a. According to the Q-cube framework proposed in this paper, the low value of MG 6's greedy degree (*ε* = 0.1680) results in its curiosity about the neighborhood actions of the basic action (action 5) in all states. Neighborhood actions are given more opportunities to accumulate Q-values under the action selection mechanism. The eagerness to obtain more future profits aggravates this phenomenon, as the discount factor (*γ* = 0.6721) is high. A low learning rate (*α* = 0.2617) indicates that new bidding information in the real-time market has little impact on the choice of actions.

**Figure 15.** The Q-value distribution of microgrids adopting two risk strategies.

In contrast, MG 13 chooses to be conservative in the QLCDA process, and its Q-value distribution in the Q-cube is presented in Figure 15b. MG 13 likes to keep in touch with the latest market information and prefers to choose the basic action under states near *SDR* = 1, which corresponds to high values of learning rate (*α* = 0.6812) and greedy degree (*ε* = 0.8462). It is satisfied with current revenues and does not have much interest in exploring new actions, so the discount factor of MG 13 is at a low level (*γ* = 0.333). Therefore, there is an obvious hump on the surface of the Q-value plane around the middle part (near Q (state 4, action 5) and Q (state 5, action 5)), showing that MG 13 is rational and not greedy.

The iteration results of the Q-values of different microgrids show that the proposed Q-Cube framework for the Q-learning algorithm is capable of reflecting the microgrids' characteristics.

#### *4.3. Case Study 3: Profit Analysis on Different Energy Trading Mechanisms*

To verify the performance of the proposed QLCDA, a profit analysis of different energy trading mechanisms is carried out. The previous work in [19] on a peer-to-peer energy trading mechanism is introduced here for comparison. As shown in Table 2, three energy trading mechanisms are simulated on the same case from Guizhou Grid 30 times, and the average energy trading profits are calculated and analyzed for statistical significance. Negative values indicate the cost paid to peer microgrids and the DNO. As expected, the proposed QLCDA mechanism shows superior performance over the traditional energy trading mechanism. In addition, for most microgrids, a certain increase in profits is obtained compared with the P2P mechanism. The profits of seller microgrids are commonly raised, as clean energy generated during valley intervals can be stored until it is needed rather than sold to the grid at lower prices. Compared with the traditional energy trading mechanism and the P2P mechanism, the QLCDA mechanism raises the overall profits of the distribution network by 65.7% and 10.9%, respectively.
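The comparison procedure, averaging profits over repeated Monte Carlo runs and expressing the change relative to a baseline mechanism, can be sketched as below. The function names are hypothetical, and dividing by the absolute value of the baseline (so that a reduced cost counts as a rise when profits are negative) is an assumption about how such percentages may be computed, not a statement of the authors' formula.

```python
from statistics import mean

def average_profit(run_profits):
    """Mean profit over repeated Monte Carlo runs (e.g., 30 runs)."""
    return mean(run_profits)

def relative_rise(new_profit, baseline_profit):
    """Relative profit change with respect to a baseline mechanism.

    Uses |baseline| in the denominator so the sign stays meaningful when
    the baseline is a net cost (negative profit).
    """
    return (new_profit - baseline_profit) / abs(baseline_profit)
```

For example, with purely illustrative numbers, reducing a net cost from −100 CNY to −50 CNY corresponds to a relative rise of 0.5, i.e., 50%.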

However, for some buyer microgrids (particularly MG 6), the profits under the QLCDA mechanism are lower than those under the P2P mechanism. This can be explained as follows: (1) As presented in Table 1, the trading quantity is adjustable in the QLCDA mechanism, and most of the microgrids reach a higher BESS SOC at the end of one scheduling cycle; among them, MG 6 stores the largest quantity of energy (151.1 kWh). The profits from selling this stored energy are not counted in Table 2, whereas the P2P mechanism does not take the effect of BESS and changes in bidding quantity into consideration. (2) MG 6 is a risk-taker according to its Q-learning parameters. The low values of greedy degree (*ε* = 0.1680) and learning rate (*α* = 0.2617) indicate that MG 6 cares less about new bidding information and wants to explore more potential actions rather than sticking to the basic action. The high discount rate (*γ* = 0.6721) reflects its eagerness for more future profits; it therefore prefers to keep its BESS at a high SOC and to seek deals with lower trading prices near the deadline. From another point of view, the profit analysis supports the effectiveness of the proposed Q-Cube framework for the Q-learning algorithm on energy trading problems.


**Table 2.** Contrast of profits among three energy trading mechanisms <sup>1</sup>.

<sup>1</sup> TM: Traditional Mechanism; P2PM: Peer-to-Peer Mechanism; QLCDAM: Q-learning based Continuous Double Auction Mechanism.

Considering the equipment and operation costs of BESS, the proposed QLCDA mechanism might not be the best choice for energy trading among microgrids, but the simulation results prove its potential in increasing profits for microgrids with different configurations and preferences.
