3.1.1. State Space

*S* represents the state space, which describes the state of MGOs in a real-time energy market. Because the market is a multi-agent system, selecting a different state description for each agent is impractical, and a common formulation is preferred. We propose to use the real-time supply and demand relationship (SDR) to form the state space for the following reasons: (1) The SDR has a decisive impact on bidding results. When energy supply exceeds demand in a distribution network, seller microgrids are more willing to cut their bidding prices to close more deals, and surplus supply is preferably stored in the BESS rather than sold to the grid at lower prices. Meanwhile, buyer microgrids are not eager to raise their bidding prices quickly, but they tend to buy more energy for later use because the trading prices are much lower than those of the grid. The same interplay between price and quantity for the two roles of market participants persists when energy demand exceeds supply. (2) The SDR reflects the status of external energy transactions of the networked microgrids. The more balanced the supply and demand relationship is, the less energy the networked microgrids exchange with the distribution network. (3) The SDR describes the bidding situation as public information of the energy trading market, which addresses the issue of privacy protection.

In this paper, the real-time SDR of round *n* in time slot *T* is formulated as the cumulative distribution function of a normal distribution with *μ* = 0 and *δ* = 0.3, scaled so that its value lies in the interval [0, 2]:

$$SDR^n = 2 \int_{-\infty}^{CP^n} \frac{1}{\sqrt{2\pi}\delta} \exp\left(-\frac{\left(x - \mu\right)^2}{2\delta^2}\right) dx,\tag{7}$$

$$CP^{n} = \frac{\sum q_{seller}^{n} - \sum q_{buyer}^{n}}{A},\tag{8}$$

where *CP<sup>n</sup>* is the clear power index, representing the clear power of the energy market in round *n* (total seller quantity minus total buyer quantity) divided by a pre-defined constant *A*.
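As a minimal sketch of how the clear power index and the SDR could be evaluated, the snippet below computes Equation (8) directly and reads the SDR as twice the cumulative normal distribution at *CP<sup>n</sup>*. That reading is an assumption made here because it gives an SDR of 1 at equilibrium and values within [0, 2]. The function names, the bid quantities, and the value of *A* are hypothetical and chosen only for illustration.

```python
from statistics import NormalDist  # Python 3.8+


def clear_power_index(seller_quantities, buyer_quantities, A):
    """Eq. (8): total seller quantity minus total buyer quantity, divided by A."""
    return (sum(seller_quantities) - sum(buyer_quantities)) / A


def sdr(cp, mu=0.0, delta=0.3):
    """SDR taken as twice the normal CDF at cp (assumed reading), so it lies in (0, 2)."""
    return 2.0 * NormalDist(mu=mu, sigma=delta).cdf(cp)


# Hypothetical quantities for one bidding round and a hypothetical constant A
cp = clear_power_index([30.0, 20.0], [25.0, 10.0], A=100.0)
print(f"CP = {cp:.3f}, SDR = {sdr(cp):.3f}")
```

Under this reading, a positive clear power index (supply above demand) maps to an SDR above 1, and a negative one maps to an SDR below 1.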

A pre-selection of the value of *δ* was performed, and the results are shown in Figure 5a. A small *δ* (*δ* = 0.1) causes the SDR to rise sharply within the interval [−0.25, 0.25], which leaves the SDR uninformative over a large part of the clear power index range. Conversely, a large *δ* (*δ* = 0.5) reduces the sensitivity of the SDR when energy supply and demand are close to equilibrium. Therefore, a compromise value (*δ* = 0.3) is preferred.
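The trade-off between the three candidate values can be checked numerically. The sketch below tabulates the SDR at a few clear power index values for each *δ*, again assuming that the SDR is twice the cumulative normal distribution. The sample points in `cp_values` are arbitrary and are not taken from Figure 5a.

```python
from statistics import NormalDist  # Python 3.8+


def sdr(cp, delta, mu=0.0):
    """SDR taken as twice the normal CDF at cp (assumed reading)."""
    return 2.0 * NormalDist(mu=mu, sigma=delta).cdf(cp)


# Arbitrary sample points of the clear power index, symmetric around equilibrium
cp_values = [-0.5, -0.25, -0.05, 0.0, 0.05, 0.25, 0.5]

# One row of rounded SDR values per candidate delta
table = {
    delta: [round(sdr(cp, delta), 3) for cp in cp_values]
    for delta in (0.1, 0.3, 0.5)
}

for delta, row in table.items():
    print(f"delta = {delta}: {row}")
```

With *δ* = 0.1 the SDR at a clear power index of 0.25 is already within about 0.02 of its upper bound, while with *δ* = 0.5 the SDR moves least for a small deviation of 0.05 from equilibrium; *δ* = 0.3 sits between the two.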

**Figure 5.** Supply and demand relationship function and state division of the proposed Q-cube framework.

The blue curve in Figure 5a shows the SDR for different values of the clear power index. When ∑ *q<sup>n</sup><sub>seller</sub>* = ∑ *q<sup>n</sup><sub>buyer</sub>*, *SDR<sup>n</sup>* = 1 and energy supply and demand are in equilibrium. When ∑ *q<sup>n</sup><sub>seller</sub>* ≥ ∑ *q<sup>n</sup><sub>buyer</sub>*, *SDR<sup>n</sup>* ≥ 1, and vice versa. The SDR is most sensitive near 1, because the equilibrium between energy supply and demand is dynamic and shifts in real time.

Because the SDR of the energy trading market is a continuous variable, this MDP problem cannot be treated over an infinite state space, and it is impractical to model and simulate the energy trading market with an unlimited number of state descriptions. As is common when applying the Q-learning algorithm to practical problems, the state space is divided into a finite number of intervals to better characterize the SDR. For the Q-learning algorithm proposed in this paper, the number of states should be even because the SDR function is symmetric. In addition, the probability of falling into each state should be equal. Without loss of fairness, the probability density distribution of the SDR function is divided into eight blocks with equal integral areas, as shown in Figure 5b. These eight SDR intervals are defined as the eight states of the proposed state space *S* for all the MGOs. The clear power index is likewise divided into eight intervals, corresponding to the eight intervals of the SDR. When the clear power index is close to 0 (the market is near equilibrium between energy supply and demand), the state intervals are short, because in most time slots the SDR oscillates only slightly in the bidding rounds near the deadline. For the states whose clear power index is far from 0, the intervals are long because the SDR is not sensitive there, and the microgrids in the distribution network seek to leave these states.
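The equal-area division can be sketched with the quantiles of the normal distribution. The snippet below assumes that the eight blocks are the equal-probability bins of a normal distribution with *μ* = 0 and *δ* = 0.3 over the clear power index, and that the SDR is twice the cumulative distribution. The names `cp_edges`, `sdr_edges`, and `state_index`, and the 0-to-7 state numbering, are hypothetical.

```python
from bisect import bisect_right
from statistics import NormalDist  # Python 3.8+

N_STATES = 8
dist = NormalDist(mu=0.0, sigma=0.3)

# Interior boundaries of the clear power index: quantiles at 1/8, 2/8, ..., 7/8,
# so each of the eight intervals carries probability 1/8
cp_edges = [dist.inv_cdf(k / N_STATES) for k in range(1, N_STATES)]

# Matching SDR boundaries under the assumed reading SDR = 2 * CDF
sdr_edges = [2.0 * dist.cdf(c) for c in cp_edges]


def state_index(cp):
    """Return the state (0..7) whose clear power index interval contains cp."""
    return bisect_right(cp_edges, cp)


print("CP boundaries: ", [round(c, 3) for c in cp_edges])
print("SDR boundaries:", [round(s, 3) for s in sdr_edges])
print("state for CP = 0.15:", state_index(0.15))
```

In this sketch the intervals next to a clear power index of 0 are the shortest and the outermost ones are unbounded, which matches the description of short intervals near equilibrium and long intervals far from it.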
