*4.1. Q(λ) Learning*

Multi-step backtrack Q(λ) learning is a conventional RL algorithm in which Q-learning is combined with the idea of multi-step TD(λ) returns [38] and an eligibility trace is introduced, so that the convergence speed of the algorithm can be improved to a certain extent. The eligibility trace can be described as [38]

$$e\_k(s, a) = \begin{cases} \gamma \lambda e\_{k-1}(s, a) + 1, & \text{if } (s, a) = (s\_k, a\_k) \\ \gamma \lambda e\_{k-1}(s, a), & \text{otherwise} \end{cases} \tag{7}$$

where *ek*(*s*, *a*) stands for the eligibility trace of the state-action pair (*s*, *a*) at the *k*th iteration; (*sk*, *ak*) denotes the actual state-action pair of the *k*th iteration; γ is the discount factor; and λ is the trace-decay factor.
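As a minimal sketch of the trace update in Eq. (7), assuming a tabular setting in which the traces are stored as a NumPy array indexed by state and action (the function and variable names are illustrative, not from the source):

```python
import numpy as np

def update_traces(traces, s_k, a_k, gamma, lam):
    """Eq. (7): decay every trace by gamma * lam, then bump the visited pair."""
    traces = gamma * lam * traces   # decay all state-action traces
    traces[s_k, a_k] += 1.0         # accumulate on the pair actually visited
    return traces
```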

Q(λ) uses the "backward estimation" mechanism to approximate the optimal value function matrix *Q*\*, taking *Qk* as the *k*th iterative estimate of *Q*\*; the value function of the algorithm is thus updated iteratively as follows [39]:

$$\rho\_k = R(s\_k, s\_{k+1}, a\_k) + \gamma Q\_k(s\_{k+1}, a\_g) - Q\_k(s\_k, a\_k) \tag{8}$$

$$\delta\_k = R(s\_k, s\_{k+1}, a\_k) + \gamma Q\_k(s\_{k+1}, a\_g) - Q\_k(s\_k, a\_g) \tag{9}$$

$$Q\_{k+1}(s, a) = Q\_k(s, a) + \alpha \delta\_k e\_k(s, a) \tag{10}$$

$$Q\_{k+1}(s\_k, a\_k) = Q\_{k+1}(s\_k, a\_k) + \alpha \rho\_k \tag{11}$$

where α is the learning factor; *R*(*sk*, *sk*+1, *ak*) is the reward function value obtained at the *k*th iteration when the environment transitions from state *sk* to *sk*+1 through the selected action *ak*; and *a*g is the greedy action, i.e., the action with the highest *Q*-value in the current state, which can be written as [39]

$$a\_g = \underset{a \in A}{\operatorname{argmax}} \, Q\_k(s\_{k+1}, a) \tag{12}$$

where *A* represents the action set, i.e., the set of alternative actions available for each variable.
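Putting Eqs. (7)–(12) together, one iteration of the tabular update could be sketched as follows. This is an illustrative reading rather than the authors' code: it assumes *Q* and the traces are |*S*| × |*A*| NumPy arrays, and it interprets *Qk*(*sk*, *a*g) in Eq. (9) as the greedy value of state *sk* (the reading used in Peng's Q(λ)):

```python
import numpy as np

def q_lambda_step(Q, traces, s_k, a_k, r, s_next, alpha, gamma, lam):
    """One Q(lambda) iteration following Eqs. (7)-(12).

    r stands for R(s_k, s_{k+1}, a_k), the reward of the observed transition.
    """
    a_g = np.argmax(Q[s_next])                     # Eq. (12): greedy action in s_{k+1}
    # Eq. (8): TD error of the state-action pair actually taken
    rho = r + gamma * Q[s_next, a_g] - Q[s_k, a_k]
    # Eq. (9): TD error relative to the greedy value of the current state
    delta = r + gamma * Q[s_next, a_g] - np.max(Q[s_k])
    # Eq. (7): decay all traces, then accumulate on the visited pair
    traces = gamma * lam * traces
    traces[s_k, a_k] += 1.0
    # Eq. (10): trace-weighted update of every state-action value
    Q = Q + alpha * delta * traces
    # Eq. (11): extra correction applied only to the visited pair
    Q[s_k, a_k] += alpha * rho
    return Q, traces
```

In a full episode loop, *ak* would typically be chosen ε-greedily from *Q*(*sk*, ·) and the traces reset to zero at the start of each episode; these details are standard practice for trace-based methods rather than taken from the source.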
