*3.3. Exploration and Exploitation*

In general, wide exploration increases the probability of finding the global optimum, but consumes additional computation time. In contrast, deep exploitation accelerates convergence, but can easily trap the search in a low-quality local optimum. To keep exploitation and exploration in balance, the ε-Greedy rule [31] is adopted to select actions on the basis of the current knowledge matrices, which yields

$$a\_{i,k+1}^{l,m} = \begin{cases} \underset{a\_i^l \in A\_i^l}{\operatorname{argmax}}\, \mathbf{Q}\_{i,k}^l \left\{ s\_{i,k+1}^{l,m}, a\_i^l \right\}, & \text{if } q\_0 < \varepsilon \\ a\_{\text{rand}}, & \text{otherwise} \end{cases} \tag{10}$$

where *q*0 is a uniform random number between 0 and 1; ε is the rate of exploitation, i.e., the probability of selecting the greedy action; and *a*rand represents a stochastic action drawn from the action space, which provides the global search capability for avoiding a low-quality local optimum.
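
The following is a minimal NumPy sketch of the selection rule in Eq. (10), assuming the knowledge matrix is stored as a 2-D array indexed by (state, action); the names `epsilon_greedy_action`, `Q`, and `rng` are illustrative, not from the original implementation. Note that, following the paper's convention, ε here is the exploitation rate rather than the exploration rate used in some ε-greedy formulations.

```python
import numpy as np

def epsilon_greedy_action(Q, state, epsilon, rng=None):
    """Select an action via the epsilon-greedy rule of Eq. (10).

    Q       : 2-D array, Q[state, action] holds the current knowledge values
    state   : integer index of the current state
    epsilon : exploitation rate, i.e., probability of taking the greedy action
    """
    rng = rng or np.random.default_rng()
    q0 = rng.random()  # uniform random number q0 in [0, 1)
    if q0 < epsilon:
        # Exploit: greedy action maximizing the current knowledge matrix
        return int(np.argmax(Q[state]))
    # Explore: stochastic action a_rand drawn uniformly from the action space
    return int(rng.integers(Q.shape[1]))
```

For example, with `Q = np.zeros((4, 3))` and `epsilon = 0.9`, the rule returns the greedy action about 90% of the time and a uniformly random action otherwise, so early in learning (when Q is near zero and ties are broken by `argmax`) the random branch supplies most of the exploration.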
