3.2.1. Q-Value Update

The common Q-Value update rule for the model-free Q-learning algorithm with learning rate *α* and discount factor *γ* is given as follows:

$$\mathcal{Q}^{n+1}(\mathbf{s}\_i^n, a\_i^n) = (1 - \alpha) \cdot \mathcal{Q}^n(\mathbf{s}\_i^n, a\_i^n) + \alpha \cdot \left[ r(\mathbf{s}\_i^n, a\_i^n) + \gamma \cdot \max\_{a\_i^{n+1}} \mathcal{Q}^n(\mathbf{s}\_i^{n+1}, a\_i^{n+1}) \right],\tag{24}$$

where $\mathcal{Q}^{n+1}(\mathbf{s}\_i^n, a\_i^n)$ represents the updated Q-value for MGO $i$ adopting action $a\_i^n$ under state $\mathbf{s}\_i^n$ in the $n$th bidding round. Upon observing the subsequent state $\mathbf{s}\_i^{n+1}$ and the reward $r(\mathbf{s}\_i^n, a\_i^n)$, the Q-value is updated immediately. We adopt this common Q-value update rule for Q-learning in this paper.
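
To make Eq. (24) concrete, the following is a minimal Python sketch of one application of the update rule, assuming a tabular Q-function stored in a dictionary; the function name `q_update` and the state/action encodings are illustrative choices, not part of the paper's formulation.

```python
from collections import defaultdict

# Tabular Q-function: maps (state, action) pairs to Q-values, defaulting to 0.
Q = defaultdict(float)

def q_update(Q, state, action, reward, next_state, actions, alpha, gamma):
    """One application of the Q-value update rule in Eq. (24).

    alpha: learning rate, weight given to new bidding information.
    gamma: discount factor, weight given to estimated future revenue.
    """
    # Greedy estimate of future revenue from the observed next state.
    best_next = max(Q[(next_state, a)] for a in actions)
    # Blend the old Q-value with the new reward-plus-future estimate.
    Q[(state, action)] = (1 - alpha) * Q[(state, action)] \
        + alpha * (reward + gamma * best_next)
```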

The learning rate *α* and discount factor *γ* are two critical parameters of the MGOs, as they reflect each MGO's bidding preference. The learning rate determines how much the updated Q-value learns from the new state-action pair: *α* = 0 means the MGO learns nothing from new market bidding information, while *α* = 1 means the Q-value of the new state-action pair is the only information retained. The discount factor defines the importance of future revenues. An MGO whose *γ* is near 0 is regarded as a short-sighted agent, as it cares only about short-term profits, whereas an MGO whose *γ* is close to 1 tends to wait for the proper time to secure more future revenue. The two extremes of the learning rate can be seen directly in the sketch below.
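As a usage example of the `q_update` sketch above (with hypothetical numbers chosen purely for illustration), *α* = 0 leaves the stored Q-value untouched, while *α* = 1 overwrites it with the new estimate:

```python
actions = [0, 1]
Q[("s0", 0)] = 5.0

q_update(Q, "s0", 0, reward=10.0, next_state="s1",
         actions=actions, alpha=0.0, gamma=0.9)
print(Q[("s0", 0)])  # 5.0 -- alpha = 0: new bidding information is ignored

q_update(Q, "s0", 0, reward=10.0, next_state="s1",
         actions=actions, alpha=1.0, gamma=0.9)
print(Q[("s0", 0)])  # 10.0 -- alpha = 1: old value fully replaced
```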
