**Algorithm 1** Reward Function

The algorithm employs two Q networks: the Q network itself and a target Q network, which is required. Both networks have exactly the same architecture but hold different weight parameters. To smooth convergence in DQN, the target network is updated not continuously but periodically, by copying the weights of the online Q network.
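A minimal sketch of this periodic synchronization, assuming PyTorch; the network architecture and the sync interval `TARGET_SYNC_PERIOD` are hypothetical placeholders, since the text does not specify them:

```python
import copy
import torch.nn as nn

# Hypothetical Q-network architecture (4 state features, 2 actions).
q_net = nn.Sequential(nn.Linear(4, 64), nn.ReLU(), nn.Linear(64, 2))
target_net = copy.deepcopy(q_net)   # same structure, separate weights
TARGET_SYNC_PERIOD = 1000           # assumed sync interval, in steps

for step in range(10_000):
    # ... one gradient step on q_net would be performed here ...
    if step % TARGET_SYNC_PERIOD == 0:
        # Periodic (not continuous) update: copy online weights to the target.
        target_net.load_state_dict(q_net.state_dict())
```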

RMSProp, an adaptive learning rate method, was used as the optimizer; it adjusts the effective learning rate of each parameter according to that parameter's gradient history. This matters because the training set changes continuously in reinforcement learning: unlike training on a fixed data set, the per-parameter step sizes must keep adapting as the data distribution shifts.
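As an illustration of this per-parameter adaptation, here is a minimal NumPy sketch of the RMSProp update rule; the hyperparameter values (`lr`, `rho`, `eps`) are common defaults, not values taken from this work:

```python
import numpy as np

def rmsprop_step(theta, grad, sq_avg, lr=0.001, rho=0.9, eps=1e-8):
    """One RMSProp update: a running average of squared gradients scales
    each parameter's step, so the effective learning rate adapts per
    parameter as the gradient statistics change."""
    sq_avg = rho * sq_avg + (1.0 - rho) * grad ** 2
    theta = theta - lr * grad / (np.sqrt(sq_avg) + eps)
    return theta, sq_avg

# Toy usage: minimize f(theta) = theta^2, whose gradient is 2 * theta.
theta, sq_avg = np.array([3.0]), np.zeros(1)
for _ in range(2000):
    theta, sq_avg = rmsprop_step(theta, 2.0 * theta, sq_avg, lr=0.01)
print(theta)  # close to the minimum at 0
```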

The algorithm can be summarized as follows:

1. Given the current state $s_t$, select an action $a$ that maximizes $Q(s_t, a; \theta_t)$ (with occasional random selection for exploration, as in an $\varepsilon$-greedy policy).
2. Execute action $a$ to obtain the next state $s_{t+1}$ and the reward $r_t$.
3. Feed the state $s_{t+1}$ into the target network and compute $\max_a Q(s_{t+1}, a; \theta)$.
4. Update the parameters $\theta_t$ toward the target value:

$$y_i = r_t + \gamma \max_a Q(s_{t+1}, a; \theta) \tag{6}$$

$$L(\theta_t) = \frac{1}{2}\left(y_i - Q(s_t, a; \theta_t)\right)^2 \tag{7}$$

$$\theta_{i+1} = \theta_i - \alpha \nabla_{\theta_i} L(\theta_i) \tag{8}$$
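Putting Eqs. (6)-(8) together, a minimal PyTorch sketch of one such update might look as follows; the network shape, discount factor `gamma`, and learning rate are assumptions rather than values from the text, and the loss is averaged over a toy batch:

```python
import torch
import torch.nn as nn

# Hypothetical online and target networks with identical structure.
q_net = nn.Sequential(nn.Linear(4, 64), nn.ReLU(), nn.Linear(64, 2))
target_net = nn.Sequential(nn.Linear(4, 64), nn.ReLU(), nn.Linear(64, 2))
target_net.load_state_dict(q_net.state_dict())
optimizer = torch.optim.RMSprop(q_net.parameters(), lr=2.5e-4)

def dqn_update(s_t, a_t, r_t, s_next, gamma=0.99):
    with torch.no_grad():
        # Eq. (6): y_i = r_t + gamma * max_a Q(s_{t+1}, a; theta)
        y = r_t + gamma * target_net(s_next).max(dim=1).values
    # Eq. (7): L(theta_t) = 1/2 * (y_i - Q(s_t, a; theta_t))^2
    q_sa = q_net(s_t).gather(1, a_t.unsqueeze(1)).squeeze(1)
    loss = 0.5 * (y - q_sa).pow(2).mean()
    # Eq. (8): theta_{i+1} = theta_i - alpha * grad L(theta_i), via RMSProp
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()

# Toy batch of transitions (random placeholders, not real data).
s = torch.randn(8, 4); a = torch.randint(0, 2, (8,))
r = torch.randn(8); s2 = torch.randn(8, 4)
print(dqn_update(s, a, r, s2))
```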
