### 4.2. MCR-Q(λ) Learning

#### 4.2.1. Reduced Dimension of the Solution Space

As shown in Figure 2, the traditional single-objective Q(λ) algorithm does not decompose the action space over the variables. Assuming that the *i*th variable *x*<sub>*i*</sub> has *m*<sub>*i*</sub> alternative solutions, the action set contains |*A*| = *m*<sub>1</sub>*m*<sub>2</sub> ··· *m*<sub>*n*</sub> elements. When the number of variables *n* is large, the number of alternative action combinations grows multiplicatively, which leads to slow convergence and difficulties in the iterative calculation. To date, the most common way to address this "curse of dimensionality" is hierarchical reinforcement learning (HRL) [40]. However, the hierarchical structure and the connections between layers are difficult to design, which often causes the algorithm to converge to a local optimum.
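As a quick, hypothetical illustration of this multiplicative growth (the per-variable counts *m*<sub>*i*</sub> below are made up), the joint action set explodes while the per-variable action sets used by the decomposition described in the next subsection stay small:

```python
import math

# Hypothetical numbers of alternative solutions m_i for n = 6 variables.
m = [10, 8, 12, 10, 9, 11]

# Joint action set of single-objective Q(lambda): |A| = m_1 * m_2 * ... * m_n.
joint_size = math.prod(m)   # 950,400 action combinations

# Decomposed action sets (one per variable), as in MCR-Q(lambda).
decomposed_total = sum(m)   # only 60 actions across all variables

print(joint_size, decomposed_total)
```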

**Figure 2.** Difference between Q(λ) and MCR-Q(λ).

Under the framework of the proposed MCR-Q(λ) learning algorithm, each variable has a corresponding value-function matrix *Q*<sup>*i*</sup>, and the action set is divided into (*A*<sub>1</sub>, *A*<sub>2</sub>, ···, *A*<sub>*n*</sub>) with |*A*<sub>*i*</sub>| = *m*<sub>*i*</sub>. In the iterative optimization of each *Q* matrix, the difficulty of optimization is greatly reduced because the action space is much smaller. Meanwhile, the action space of each variable serves as the state space of the next variable, which strengthens the internal relationship between the variables, as illustrated in Figure 2. The state space of the first variable is divided according to the load scenario.
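A minimal data-structure sketch of this chaining, with hypothetical sizes: each variable gets its own *Q* matrix, and for *i* ≥ 2 its rows (states) are indexed by the actions of variable *i* − 1:

```python
import numpy as np

m = [10, 8, 12, 10, 9, 11]   # hypothetical |A_i| for each of n = 6 variables
n_load_scenarios = 4         # hypothetical state space of the first variable

# The state space of variable i (i >= 2) is the action set of variable i - 1.
state_sizes = [n_load_scenarios] + m[:-1]
Q = [np.zeros((n_s, n_a)) for n_s, n_a in zip(state_sizes, m)]

for i, q in enumerate(Q, start=1):
    print(f"Q^{i}: {q.shape}")   # Q^1: (4, 10), Q^2: (10, 8), ...
```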

#### 4.2.2. Multi-Agent Cooperative Search

In the iterative optimization of Q(λ) learning, which employs only a single agent for exploration and exploitation, the *Q* matrix is updated inefficiently, since only one element is updated per iteration. In contrast, MCR-Q(λ) learning uses multiple agents for exploration and exploitation simultaneously, so multiple elements of each *Q* matrix can be updated at every iteration, and the update speed of the *Q* matrix is greatly improved. The value function of MCR-Q(λ) learning is updated iteratively as follows [23]:

$$\rho_k^{ij} = R^{ij}\left(s_k^{ij}, s_{k+1}^{ij}, a_k^{ij}\right) + \gamma Q_k^i\left(s_{k+1}^{ij}, a_g^i\right) - Q_k^i\left(s_k^{ij}, a_k^{ij}\right) \tag{13}$$

$$\delta_k^{ij} = R^{ij}\left(s_k^{ij}, s_{k+1}^{ij}, a_k^{ij}\right) + \gamma Q_k^i\left(s_{k+1}^{ij}, a_g^i\right) - Q_k^i\left(s_k^{ij}, a_g^i\right) \tag{14}$$

$$Q_{k+1}^i\left(s^i, a^i\right) = Q_k^i\left(s^i, a^i\right) + \alpha \delta_k^{ij} e_k^i\left(s^i, a^i\right) \tag{15}$$

$$Q_{k+1}^i\left(s_k^{ij}, a_k^{ij}\right) = Q_{k+1}^i\left(s_k^{ij}, a_k^{ij}\right) + \alpha \rho_k^{ij} \tag{16}$$

where the superscript *i* represents the *i*th variable or the *i*th *Q*-value matrix; the superscript *j* represents the *j*th objective; *e*<sub>*k*</sub><sup>*i*</sup>(*s*<sup>*i*</sup>, *a*<sup>*i*</sup>) and *a*<sub>*g*</sub><sup>*i*</sup> are defined in the same way as in Equations (7) and (12), respectively.
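A minimal sketch of one such update for a single agent *j* acting on the *i*th *Q* matrix, following Equations (13)–(16); the trace-decay step at the end is an assumption in the standard Watkins Q(λ) style, and all names are illustrative:

```python
import numpy as np

def mcr_q_lambda_update(Q_i, e_i, s, s_next, a, reward,
                        alpha=0.1, gamma=0.9, lam=0.9):
    """One MCR-Q(lambda) update of the i-th Q matrix for a single agent j,
    following Eqs. (13)-(16). Q_i and e_i are |S_i| x |A_i| arrays."""
    a_g = int(np.argmax(Q_i[s_next]))              # greedy action a_g^i

    # Eq. (13): TD error with respect to the action actually taken.
    rho = reward + gamma * Q_i[s_next, a_g] - Q_i[s, a]
    # Eq. (14): TD error with respect to the greedy action.
    delta = reward + gamma * Q_i[s_next, a_g] - Q_i[s, a_g]

    # Accumulate the eligibility trace for the visited pair, then apply
    # Eq. (15) to every (state, action) element of the matrix at once.
    e_i[s, a] += 1.0
    Q_i += alpha * delta * e_i

    # Eq. (16): additional correction for the visited state-action pair.
    Q_i[s, a] += alpha * rho

    # Watkins-style trace decay (assumption): traces are cut whenever a
    # non-greedy action was taken.
    e_i *= gamma * lam if a == a_g else 0.0
    return Q_i, e_i
```

Because Equation (15) touches every element weighted by its trace, a single step propagates credit along the agent's recent path, and running several agents in parallel updates many elements of each *Q* matrix per iteration.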

As in the Ant-Q algorithm, MCR-Q(λ) does not calculate the global reward function until each individual has selected all the variables, i.e., has completed a path from the start to the end, as shown in Figure 2. The reward function value is then calculated as follows [24]:

$$R^{ij}\left(s_k^{ij}, s_{k+1}^{ij}, a_k^{ij}\right) = \begin{cases} \dfrac{W}{L_{\mathrm{Best}}}, & \text{if } \left(s_k^{ij}, a_k^{ij}\right) \in SA_{\mathrm{Best}} \\ 0, & \text{otherwise} \end{cases} \tag{17}$$

where *L*<sub>Best</sub> represents the objective function value of the best individual, i.e., the individual with the lowest objective function value at the *k*th iteration; *W* is a positive constant; *SA*<sub>Best</sub> denotes the set of state-action pairs executed by the best individual at the *k*th iteration.
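A sketch of this delayed reward, assuming each finished individual is summarized by its objective value and the set of state-action pairs it executed (all names and values are hypothetical):

```python
def global_reward(s, a, best_pairs, L_best, W=1.0):
    """Eq. (17): reward W / L_Best for pairs on the best individual's
    path at this iteration, and zero for every other state-action pair."""
    return W / L_best if (s, a) in best_pairs else 0.0

# Usage: after all individuals finish their paths, pick the one with the
# lowest objective value, then reward only the pairs it executed.
individuals = [  # hypothetical (objective value, executed pairs)
    (12.5, {(0, 3), (3, 1), (1, 4)}),
    (10.2, {(0, 2), (2, 5), (5, 1)}),
]
L_best, SA_best = min(individuals, key=lambda ind: ind[0])
print(global_reward(0, 2, SA_best, L_best))  # ~0.098 (on the best path)
print(global_reward(0, 3, SA_best, L_best))  # 0.0 (off the best path)
```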
