#### **1. Introduction**

The power system has evolved from the traditional power grid to the smart grid and then to the Energy Internet (EI), driven by economic, technological and environmental incentives. Distributed energy resources (DERs), including distributed generation (DG), battery energy storage systems (BESS), electric vehicles (EVs) and dispatchable loads (DLs), are emerging and reshaping the structure of power systems. In the future EI, renewable energy sources (RESs) are regarded as the main primary energy source owing to the wide application of solar panels, wind turbines and other new energy technologies [1]. According to a recent report from the U.S. Energy Information Administration (EIA), U.S. electricity generation from RESs surpassed that from coal this April for the first time in history, providing 23% of total electricity generation compared to coal's 20%. Meanwhile, the proportion of RES generation in Germany had already reached 40% in 2018. The considerable increase in RESs encourages a significant decrease in energy prices, which drives the reform of energy trading patterns and behaviors in the power system. In addition, the flexible location and bidirectional energy trading ability of DERs lead to the transformation of the management mode from centralized to decentralized [2]. In this process, the introduction of economic models into this decentralized system equips the power grid with genuine market characteristics [3].

As the aggregators of DERs in certain geographical regions, microgrids are important participants in the power market [4]. By implementing internal dispatch, microgrids can provide economic benefits by applying demand response projects and avoiding long-distance energy transmission [5]. Moreover, microgrids offer solutions for emergencies in which power from the grid is disrupted. Energy trading among networked microgrids in the distribution network forms the local energy market [6]. In early research and practice, cooperative energy trading mechanisms were proposed to achieve better profit and management performance for the overall market. Models and algorithms have been investigated to describe the features of the multi-microgrid system [7,8] and to solve the optimization problem [9,10]. However, given their diverse internal network topologies and device configurations, microgrids differ in their willingness to join such a cooperative energy trading market. Although some mixed-strategy Nash equilibrium points have been found by theoretical proofs, the freedom of energy trading has to be sacrificed in exchange for the global optimum, as the solutions to this NP-hard problem often fail to satisfy every participant. In addition, a cooperative energy trading mechanism requires detailed information on the power prediction and operating data of every device in the microgrids. This exposes residential energy consumption habits and behavioral preferences, raising privacy protection issues. Non-cooperative energy trading mechanisms are therefore urgently needed. The development of information and communication technologies (ICTs) provides ideas for solving the above problems.

With the application of advanced ICT in the energy market, the degree of informatization has been greatly improved. Smart meters, mobile internet, blockchain, 5G, etc. help to extend the traditional power system to a three-layer architecture [11,12]. The bottom layer is the network of power devices and transmission lines. The middle layer is the network of information nodes, in which the ICTs play a very important role. In the top layer, software-based negotiation agents participate in the energy trading market [13]. In the energy trading market of networked microgrids, microgrid operators (MGOs) trade energy with each other and with the grid under the regulations formulated by distribution network operators (DNOs). Different economic models are implemented in this layer based on the personalized behaviors of the participants, which is an emerging topic in both academic and practical fields. As a common method for allocating resources, continuous double auction (CDA) is frequently used to address the bidding problem in energy markets with multiple buyers and multiple sellers [14]. The authors in [15] discussed the efficiency of applying CDA in a computational electricity market with the midpoint price method. In [16], an adaptive aggressiveness strategy was presented in the CDA market to adjust the bidding price according to market changes. A stable CDA mechanism was proposed in [17], which alleviated the unnecessarily volatile behavior of normal CDA. Furthermore, peer-to-peer (P2P) energy trading mechanisms are drawing attention as ICTs like blockchain are making real-time P2P energy trading possible [18]. Wang et al. [19] proposed a parallel P2P energy trading framework with multidimensional willingness, mimicking the personalized behaviors of microgrids. In [20], a canonical coalition game was utilized to propose a P2P energy trading scheme, which demonstrated its potential to support sustainable prosumer participation in P2P energy trading.

To summarize, the literature mentioned above is mainly concerned with the bidding price in the energy trading market, as the intersection of price sequences decides whether a deal is closed. However, with the wide use of advanced ICT in the power grid, the uncertainty of DERs can be compensated by real-time behavior adjustment; meanwhile, DERs respond to price signals faster than ever before. The bidding price is not the only factor that affects the bidding results: the bidding quantity simultaneously affects the real-time supply and demand relationship. Moreover, the capacity of the BESS is only taken into consideration in the internal scheduling of each microgrid, neglecting the potential of the BESS to participate in energy market dispatching.

In parallel with research on energy trading mechanisms, significant efforts have been devoted to modeling the complex bidding behaviors of negotiation agents in energy trading markets, among which interest in applying reinforcement learning (RL) algorithms to power grid problems is emerging [21]. Reinforcement learning is a formal framework for studying sequential decision-making problems and is particularly relevant for modeling the behavior of financial agents [22]. The authors in [23] made a comprehensive review of the application of RL algorithms to electric power system decision and control. A few research works have begun to pay attention to this problem and made an effort to establish better bidding mechanisms [15,24–28]. Nicolaisen et al. [15] applied a modified Roth–Erev RL algorithm to determine the bidding price and quantity offers in each auction round. The authors in [25] presented an exact characterization of the design of adaptive learning rules for contained energy trading game concerning privacy policy. Cai et al. [26] analyzed the performance of evolutionary game-theory based trading strategies in the CDA market, which highlighted the practicability of the Roth–Erev algorithm. The authors in [27] presented a general methodology for searching for CDA equilibrium strategies through the RL algorithm. Residential demand response enhanced by the RL algorithm was studied in [28] using a consumer automated energy management system. In particular, Q-learning (QL) stands out because it is a model-free algorithm and easy to implement. The authors in [29] considered the application of QL with temperature variation for bidding strategies. Rahimiyan's work [30,31] concentrated on the adaptive adjustment of QL parameters to the energy market environment. Salehizadeh et al. [32] proposed a fuzzy QL approach in the presence of renewable resources under both normal and stressful cases. The authors in [33] introduced the concept of scenario extraction into a QL-based energy trading model for decision support.

The existing literature shows the potential of combining QL algorithms and energy trading mechanisms in obtaining better market performance. However, suitable answers to the following three issues are still unsettled, which are the motivations for this paper's research:


To tackle the above issues, we formulate the energy trading problem among microgrids as a Markov Decision Process (MDP) and investigate the potential of applying a Q-learning algorithm to a continuous double auction mechanism. Taking inspiration from related research on P2P trading and heuristic algorithms, a Q-cube framework for the Q-learning algorithm is proposed to describe the Q-value distribution of microgrids, which is updated iteratively in each bidding round. To the best of the authors' knowledge, no previous work has proposed a non-tabular formulation of Q-values for decision-making in the power grid.
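For orientation, the textbook tabular Q-learning update that such an MDP formulation builds on can be written as below. This is the standard rule only, not the Q-cube update proposed in this paper; the symbols $s_t$ (state), $a_t$ (action), $r_{t+1}$ (reward), $\alpha$ (learning rate) and $\gamma$ (discount factor) are generic notation assumed here for illustration.

$$
Q(s_t, a_t) \leftarrow Q(s_t, a_t) + \alpha \left[ r_{t+1} + \gamma \max_{a'} Q(s_{t+1}, a') - Q(s_t, a_t) \right]
$$

In a bidding context, one would expect the state to summarize market information available to an MGO and the action to encode a bidding decision, but the exact definitions are those given in Section 3.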

The contributions of this paper are summarized as follows:


The rest of this paper is organized as follows. Section 2 presents an overview of the non-cooperative energy trading market, along with a description of the proposed Q-learning based continuous double auction mechanism. Section 3 introduces the Q-cube framework for the Q-learning algorithm. Section 4 presents case studies and analyses to verify the efficiency of the proposed Q-cube framework and continuous double auction mechanism. Finally, Section 5 draws conclusions and outlines future work.

#### **2. Mechanism Design for Continuous Double Auction Energy Trading Market**

In this section, we provide an overview of the non-cooperative energy trading market and an analytical description of the Q-learning based continuous double auction mechanism.

#### *2.1. Non-Cooperative Energy Trading Market Overview*

In a future distribution network, the DNO is the regulator of the local energy trading market, as it provides related ancillary services for market participants: (1) by gathering and analyzing the operation data from ICT, the DNO monitors and regulates the operation status of the distribution network; (2) by carrying out centralized safety checks and congestion management, the DNO guarantees that the power flow in every transmission line stays within its limit; (3) by adopting reasonable economic models, the DNO affects the energy trading patterns and preferences of market participants. With the reform of the traditional energy market, along with the application of advanced metrology and ICT, a peer-to-peer energy trading pattern is emerging. We assume that MGOs, as peers in this energy market, have no information on their peers' energy trading preferences and internal configurations, which addresses the concern about privacy protection. In addition, each peer in this energy market is blind to its bidding target: it joins the market to satisfy its own energy needs to the greatest extent rather than to seek cooperation. Each MGO can adjust its bidding price and quantity according to public real-time market information and private historical trading records. Accordingly, energy trading among microgrids in the distribution network can be formulated as a non-cooperative peer-to-peer energy trading problem. Figure 1 shows the process of the non-cooperative energy trading market discussed in this paper.

Consider a distribution network containing a number of networked microgrids in a certain area. In the hour-ahead energy trading market before Time Slot *N*, each MGO handles the internal coordinated dispatch (ICD) of local DERs and residents based on the DERs' power prediction and the BESS's state of charge (SOC) restriction information. Meanwhile, the DNO performs the distribution network scheduling for further procedures. A Q-learning based continuous double auction among microgrids is implemented according to the ICD results and the BESS's SOC status; detailed descriptions are presented in Section 2.2. After the safety check and congestion management performed by the DNO, energy trading commands are confirmed and transmitted to each MGO. As the MGOs are empowered to set the real-time price for regional energy, internal pricing for DER power and charge and discharge scheduling for the BESS are completed in this period.

**Figure 1.** The process of the proposed non-cooperative energy trading market.

Energy is exchanged according to the pre-agreed trading contracts in Time Slot *N* under the regulation of the DNO. Sufficient back-up energy supply and storage capacity are provided to cover the impact of extreme weather and dishonest behavior by market participants. A market clearing process is carried out after Time Slot *N* to ensure the accurate and timely settlement of energy transactions. Penalties are also imposed for the above abnormal market behaviors. The security and timeliness of the market clearing process can be guaranteed by advanced ICT such as blockchain, smart meters, 5G, etc.

#### *2.2. Q-Learning Based Continuous Double Auction Mechanism*

This paper proposes a Q-learning based continuous double auction (QLCDA) mechanism for the energy trading market. Figure 2 presents the process of the proposed QLCDA in one time slot.

Before the QLCDA starts in a time slot, each MGO solves the ICD problem and generates its initial bidding information. The SOC check and the charge and discharge restrictions of the BESS are also completed in this initialization stage. In each round of CDA (indexed by *n*), each MGO reports its energy trading price and quantity to the DNO. Because the trading quantity is updated in each round, an MGO may change its role between buyer and seller during the bidding process. Thus, identity confirmation is the first step of each CDA round, which yields the number of buyers ($n_b$) and sellers ($n_s$). Then, the DNO calculates the overall supply and demand relationship (SDR) and releases it to the networked microgrids. Meanwhile, the reference prices for buyer and seller microgrids, which are the average prices of selling and buying energy in the real-time market, are calculated and released. MGOs update their bidding price and quantity according to the real-time SDR and historical trading records based on the Q-learning algorithm; the SOC restrictions are also taken into consideration to limit the behavior of the BESS in each microgrid. The DNO sorts the bidding prices of buyers and sellers in increasing order, which gives $price_{b_{n_b}} < price_{b_{n_b-1}} < \cdots < price_{b_1}$ and $price_{s_1} < price_{s_2} < \cdots < price_{s_{n_s}}$. Once the price sequences of sellers and buyers intersect, i.e., $price_{s_1} < price_{b_1}$, the MGOs whose bidding prices lie in this interval are chosen to join the energy sharing process. The actual trading price and quantity are decided in this step, and the bidding quantity of each microgrid is updated based on the sharing results. If there is still untraded energy in the market, the QLCDA repeats until the deadline of bidding rounds (*N* represents the maximum bidding round in one time slot). If energy demand or supply is fully satisfied before the deadline, the QLCDA stops in the current round.

Results of the QLCDA are confirmed by the MGOs and sent to the DNO for further energy allocation and safety checks. Detailed descriptions of the initialization and the energy sharing mechanism are presented in the following sections.
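The DNO-side steps of one CDA round described above (identity confirmation, SDR calculation, price sorting and the intersection check) can be sketched as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: the `Bid` class, the signed-quantity convention (positive for selling, negative for buying), the definition of SDR as total supply divided by total demand, and all function names are hypothetical. The Q-learning price update and the energy sharing rule that fixes the actual trading price and quantity are not covered.

```python
from dataclasses import dataclass


@dataclass
class Bid:
    mgo: str         # hypothetical microgrid operator label
    price: float     # bidding price reported to the DNO
    quantity: float  # assumed convention: > 0 energy to sell, < 0 energy to buy


def confirm_identities(bids):
    """Identity confirmation: split the bids into buyers and sellers."""
    buyers = [b for b in bids if b.quantity < 0]
    sellers = [b for b in bids if b.quantity > 0]
    return buyers, sellers


def supply_demand_ratio(buyers, sellers):
    """Assumed SDR definition: total offered supply over total requested demand."""
    supply = sum(s.quantity for s in sellers)
    demand = sum(-b.quantity for b in buyers)
    return supply / demand if demand > 0 else float("inf")


def select_traders(bids):
    """Sort prices and return the MGOs inside the intersection interval.

    Buyers are ordered so that the first entry has the highest price and
    sellers so that the first entry has the lowest price. A deal is possible
    only if the lowest selling price is below the highest buying price.
    """
    buyers, sellers = confirm_identities(bids)
    buyers.sort(key=lambda b: b.price, reverse=True)
    sellers.sort(key=lambda s: s.price)
    if not buyers or not sellers or sellers[0].price >= buyers[0].price:
        return [], []  # price sequences do not intersect in this round
    low, high = sellers[0].price, buyers[0].price
    chosen_buyers = [b for b in buyers if low <= b.price <= high]
    chosen_sellers = [s for s in sellers if low <= s.price <= high]
    return chosen_buyers, chosen_sellers


# Illustrative bids only; the numbers are made up for this sketch.
example_bids = [
    Bid("MG1", 0.50, -10.0),
    Bid("MG2", 0.40, -5.0),
    Bid("MG3", 0.45, 8.0),
    Bid("MG4", 0.60, 4.0),
]
chosen_buyers, chosen_sellers = select_traders(example_bids)
```

In a full QLCDA loop, this selection would be repeated for at most *N* rounds, with each MGO revising its price and quantity between rounds.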

**Figure 2.** The process of Q-learning based continuous double auction in one time slot.
