#### *4.1. Q-Learning Model*

The *Q*-Learning model, a milestone in reinforcement learning research, is a learning method that is not constrained by a model of the problem. The optimal policy of *Q*-Learning is obtained by executing the action with the highest expected *Q*-value, i.e., the action that maximizes the discounted cumulative reward. The optimal step size control strategy can therefore be transformed into the optimal action of the agent. The *Q* function is defined as the discounted cumulative reward. In general, the environment is the current state in which the agent makes decisions, and the agent has a set of feasible actions, each of which affects both the next state and the reward. In essence, *Q*-Learning is a mapping from state–action pairs to predicted returns. The output for state vector *s* and action *a* is denoted by the *Q*-value *Q*(*s*, *a*):

$$Q(s\_t, a\_t) \leftarrow (1 - \alpha)Q(s\_t, a\_t) + \alpha \left[ r\_{t+1} + \gamma \max\_{a\_{t+1}} Q(s\_{t+1}, a\_{t+1}) \right] \tag{8}$$

where *Q*(*st*, *at*) represents the cumulative reward of the action taken in state *st* at time *t*, and *Q*(*st*+1, *at*+1) indicates the cumulative reward of the action taken in state *st*+1 at time *t* + 1. *rt*+1 is the reward received for action *at* at time *t* + 1. When *st*+1 is terminal, *Q*(*st*+1, *at*+1) is set to zero. α and γ denote the learning rate and the discount factor, respectively (0 < α < 1, 0 ≤ γ < 1); γ determines the impact of delayed returns on the optimal action. *Q*-Learning has a strong convergence guarantee: the *Q* values converge with probability 1 to the optimal values when each state–action pair is visited repeatedly, because the error of *Q*(*s*, *a*) is reduced by a factor of γ whenever it is updated. Thus, when each state–action pair is visited infinitely often, the estimates *Qn*(*s*, *a*) converge to the true values *Q*(*s*, *a*) as *n* → ∞.
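For concreteness, the following is a minimal sketch of the tabular update in Equation (8); the state and action counts, the learning rate, and the discount factor are illustrative assumptions rather than values used in this work.

```python
import numpy as np

N_STATES, N_ACTIONS = 10, 3
ALPHA, GAMMA = 0.1, 0.9                 # assumed learning rate and discount factor

Q = np.zeros((N_STATES, N_ACTIONS))     # tabular Q(s, a)

def q_update(s, a, r_next, s_next, terminal=False):
    """One Q-Learning update for the transition (s, a) -> (s_next, r_next), Equation (8)."""
    target = 0.0 if terminal else np.max(Q[s_next])   # Q(s_{t+1}, .) vanishes at terminal states
    Q[s, a] = (1 - ALPHA) * Q[s, a] + ALPHA * (r_next + GAMMA * target)
```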

#### *4.2. Step Size Control Model by Using Q-Learning*

In the CS algorithm, the most important parameter, besides the population size, the number of iterations, and the discovery probability, is the step size scaling factor that governs the Lévy flight. The step size scaling factor is treated as the action that is selected to control an individual's search process. The accuracy of the selected parameter can be improved by prediction before an action decision is made. When an individual selects an action, the advantages and disadvantages of the candidate actions can be evaluated through their multi-step effect on that individual. *Q*-Learning helps to learn the optimal step size control strategy and transforms it into the optimal action selected by the agent.

During the iterations of the CS algorithm, a fixed step size cannot meet the dynamic requirements of the search. Considering this, at the later stage of the CS algorithm we add three step size control methods to the iterative process: (1) a dynamic linear decreasing strategy (L1), defined by Equation (9); (2) a dynamic non-linear decreasing strategy (L2), defined by Equation (10); and (3) an adaptive step-size strategy (L3), defined by Equation (11). Each individual obtains the optimal step size control strategy by learning multiple steps forward and thus moves closer to the optimal solution. We therefore evaluate the step size control strategies with a multi-step evolution method, which increases the adaptability of individual evolution and improves the performance of the algorithm. The currently best step size control strategy is selected for the next iteration by the *Q*-Learning method.

$$\alpha = (\alpha\_1 - \alpha\_0) \times (t\_{\text{max}} - t) / t\_{\text{max}} + \alpha\_0 \tag{9}$$

$$\alpha = (\alpha\_1 - \alpha\_0) \cdot (t/t\_{\text{max}})^2 + (\alpha\_0 - \alpha\_1) \cdot (2t/t\_{\text{max}}) + \alpha\_1 \tag{10}$$

$$\alpha = \alpha\_0 + (\alpha\_1 - \alpha\_0) \cdot d\_i \tag{11}$$

$$d\_i = \frac{||\mathbf{x}\_i - \mathbf{x}\_{\text{best}}||}{d\_{\text{max}}} \tag{12}$$

where *t*max is the total number of iterations, *t* is the current iteration number, and *d*max is the maximum distance between the optimal nest and all other nests. α0 and α1 are the minimum and maximum values of the step size (α0 < α1); the decreasing strategies of Equations (9) and (10) start from α1 and shrink towards α0.
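As a reading aid, the three strategies of Equations (9)–(12) can be sketched as follows; the bounds α0 < α1 and the nest positions passed in are illustrative assumptions.

```python
import numpy as np

def linear_step(alpha0, alpha1, t, t_max):                 # Equation (9), strategy L1
    return (alpha1 - alpha0) * (t_max - t) / t_max + alpha0

def nonlinear_step(alpha0, alpha1, t, t_max):              # Equation (10), strategy L2
    u = t / t_max
    return (alpha1 - alpha0) * u ** 2 + (alpha0 - alpha1) * 2 * u + alpha1

def adaptive_step(alpha0, alpha1, x_i, x_best, d_max):     # Equations (11)-(12), strategy L3
    d_i = np.linalg.norm(x_i - x_best) / d_max             # normalized distance to the best nest
    return alpha0 + (alpha1 - alpha0) * d_i
```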

In the *Q*-Learning algorithm, the agent receives feedback, called the reward, for each action. Let the state be *s*, the action be *a*, and the set of feasible actions be *A* = {*a*1, *a*2, ... , *an*}; the agent then has *n* actions to choose from in each state, and the maximum discounted reward for the agent is:

$$Q(s, a) = r(s, a) + \gamma \cdot \max\_{a'} Q(s', a') \tag{13}$$

where *r*(*s*, *a*) is the immediate benefit in state *s*, max*a'* *Q*(*s'*, *a'*) is the maximum return that the agent can obtain by selecting among the actions available in the next state *s'*, and *a'* is the action selected in state *s'*. γ is the discount factor. The benefit that the agent receives by selecting action *a* is:

$$Q(a) = r(a) + \gamma \cdot Q(a^{(1)}) + \gamma^2 \cdot Q(a^{(2)}) + \dots + \gamma^m \cdot Q(a^{(m)}) \tag{14}$$

where *m* represents the number of steps forward and *a*, *a*(*i*) ∈ *A*, 1 ≤ *i* ≤ *m*. When γ = 0, *Q* reduces to one step forward; when γ is close to 1, the delayed benefits of the optimal action gradually gain weight. *r*(*a*) is the immediate benefit obtained when the agent selects action *a*, which means the individual has evolved once; the new individual then uses *a*(1) to generate a new individual again, and the corresponding benefit is recorded as *Q*(*a*(1)). By analogy, after *m* evolutions a new individual is generated by using *a*(*m*), and the corresponding benefit is recorded as *Q*(*a*(*m*)).
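A minimal sketch of the *m*-step discounted benefit in Equation (14); the list of per-step benefits is a hypothetical input.

```python
def multi_step_benefit(r_a, step_benefits, gamma):
    """Equation (14): Q(a) = r(a) + sum over i of gamma^i * Q(a^(i)), i = 1..m."""
    return r_a + sum(gamma ** i * q for i, q in enumerate(step_benefits, start=1))
```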

*n* offspring are generated after each evolution. These offspring are evolved again by adopting the *n* strategies, so *n<sup>m</sup>* offspring are produced after *m* evolutions. The Boltzmann distribution is used to calculate the probability that a new individual is retained, and is defined by Equation (15):

$$p(a\_i) = e^{\frac{r(a\_i)}{T}} \Big/ \sum\_{j=1}^{n} e^{\frac{r(a\_j)}{T}} \tag{15}$$

where *r*(*ai*) indicates the immediate benefit of the *i*th strategy and *T* represents the temperature.
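A sketch of the Boltzmann selection in Equation (15), assuming the immediate benefits *r*(*ai*) of the candidate offspring and a temperature *T* are given.

```python
import numpy as np

def boltzmann_probabilities(rewards, T=1.0):
    """Equation (15): softmax of the immediate benefits at temperature T."""
    z = np.exp(np.asarray(rewards, dtype=float) / T)
    return z / z.sum()

def select_offspring(rng, rewards, T=1.0):
    """Pick one offspring index according to its Boltzmann probability."""
    p = boltzmann_probabilities(rewards, T)
    return rng.choice(len(rewards), p=p)
```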

The step size control strategy corresponding to the maximum probability is selected, so the results of each generation are simplified by the Boltzmann distribution. *fp*(*a*) is defined as the fitness of the parent individual in the population and *fo*(*a*) is the fitness of the individual obtained after adopting the parameter selection strategy. Substituting *r*(*a*) = *fp*(*a*) − *fo*(*a*) into Equation (14), Equation (16) is obtained.

$$Q(a) = f\_p(a) - (1 - \gamma) \cdot f\_o(a^{(1)}) - \gamma \cdot (1 - \gamma) \cdot f\_o(a^{(2)}) - \dots - \gamma^m \cdot f\_o(a^{(m)}) \tag{16}$$

where, for any fixed *m*, limγ→1 (1 − γ)·γ<sup>*i*−1</sup> = 0 (1 ≤ *i* < *m*) and limγ→1 γ<sup>*m*</sup> = 1. According to Equation (16), it follows that limγ→1 *Q*(*a*) = *fp*(*a*) − *fo*(*a*(*m*)), so *a* = argmax*a*∈*A* limγ→1 *Q*(*a*) = argmax*a*∈*A* (*fp*(*a*) − *fo*(*a*(*m*))). The step size control strategy model with *Q*-Learning is described in Algorithm 2 and Figure 1.
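A sketch of the strategy evaluation in Equation (16) and of the final argmax selection; the parent fitness and the per-step offspring fitnesses are hypothetical inputs.

```python
def q_of_strategy(f_parent, offspring_fitness, gamma):
    """Equation (16): Q(a) = f_p(a) minus the discounted offspring fitnesses f_o(a^(i))."""
    m = len(offspring_fitness)
    q = f_parent
    for i, f in enumerate(offspring_fitness, start=1):
        coeff = gamma ** m if i == m else gamma ** (i - 1) * (1 - gamma)
        q -= coeff * f
    return q

def best_strategy(f_parent, fitness_traces, gamma):
    """Return the index of the step size control strategy with the largest Q value."""
    qs = [q_of_strategy(f_parent, trace, gamma) for trace in fitness_traces]
    return max(range(len(qs)), key=qs.__getitem__)
```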

#### **Algorithm 2** Step size with *Q*-Learning.

(1) Each individual is expressed as (*x*, σ), and the number of learning steps *m* is set;

(2) Generate three new offspring for each individual by using the given step size control strategies (the linear decreasing strategy, the non-linear decreasing strategy, and the adaptive step-size strategy), and set *t* = 1;

(3) Do while *t* < *m*

Each individual generates three offspring by using the given step size control strategies, as shown in Equations (9)–(12).

Calculate the retention probability of the newly generated offspring by using the Boltzmann distribution, and select one individual according to this probability.

*t* = *t* + 1;

(4) Calculate the corresponding *Q* value of each retained individual according to the three-step selection strategy. The step size corresponding to the step size control strategy that maximizes *Q* is retained, the corresponding offspring are selected, and the other offspring are discarded.
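Putting the pieces together, the following is a high-level sketch of Algorithm 2 under stated assumptions: `generate_offspring` and `fitness` are hypothetical problem-specific hooks, and `boltzmann_probabilities` and `q_of_strategy` are the sketches given above.

```python
import numpy as np

def q_learning_step_selection(individual, strategies, m, gamma, T, rng,
                              generate_offspring, fitness):
    """Select the offspring produced by the step size control strategy with the largest Q."""
    f_parent = fitness(individual)
    best_q, best_child = -np.inf, None
    for strategy in strategies:                        # the three strategies L1, L2, L3
        child = generate_offspring(individual, strategy)
        current, trace = child, [fitness(child)]       # f_o(a^(1)), f_o(a^(2)), ...
        for _ in range(m - 1):                         # look m steps ahead in total
            candidates = [generate_offspring(current, s) for s in strategies]
            rewards = [fitness(current) - fitness(c) for c in candidates]
            p = boltzmann_probabilities(rewards, T)    # Equation (15)
            current = candidates[rng.choice(len(candidates), p=p)]
            trace.append(fitness(current))
        q = q_of_strategy(f_parent, trace, gamma)      # Equation (16)
        if q > best_q:
            best_q, best_child = q, child
    return best_child
```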

**Figure 1.** Step selection model with *Q-*Learning.
