#### **2. Reinforcement Learning**

Reinforcement learning is the training of machine learning models to make a sequence of decisions through trial and error in a dynamic environment. The robot learns to achieve a goal in an uncertain, potentially complex environment by receiving a reward or penalty for the actions it takes.

Figure 1 shows the general reinforcement learning setting: a robot observes the environment, receives its state, s, and selects an action, a, as output. The action changes the state of the environment, and the resulting state transition is delivered to the agent together with a scalar feedback called the reinforcement signal. The robot selects its behavior so as to increase the sum of the reinforcement signal values over a long period of time.

**Figure 1.** Q Learning concept.

One of the clearest differences between reinforcement learning and supervised learning is that it does not rely on an explicit mapping between inputs and outputs: after choosing an action, the agent receives a reward and observes the resulting situation. Another difference is that the online character of reinforcement learning matters, since computation and learning take place at the same time. Deep Q-learning, one form of reinforcement learning, is characterized by the temporal-difference method, which learns directly from experience and, unlike dynamic programming, does not need to wait until the final result to calculate the value of each state.
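As a minimal illustration of this temporal-difference idea, the sketch below updates the value of a state immediately after each observed transition instead of waiting for the end of the episode; the learning rate, discount factor, and state names are illustrative choices, not values from the paper.

```python
# Tabular TD(0) sketch: state values are updated from each observed
# transition (s, r, s') right away, without waiting for the final result
# of the episode as dynamic programming would.
# alpha (learning rate) and gamma (discount factor) are illustrative.

def td0_update(V, s, r, s_next, alpha=0.1, gamma=0.9):
    """Update the value estimate of state s from a single transition."""
    td_target = r + gamma * V.get(s_next, 0.0)   # bootstrapped estimate
    td_error = td_target - V.get(s, 0.0)         # temporal-difference error
    V[s] = V.get(s, 0.0) + alpha * td_error
    return V

# Example: one transition updates V without any full dynamic-programming sweep.
V = {}
V = td0_update(V, s="s0", r=1.0, s_next="s1")
```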

When a robot moves in a discrete, restricted environment, it chooses one action from a finite set at every time step, and the environment is assumed to be Markovian; that is, the state changes from x to y with the transition probability given in Equation (1).

$$P\_{xy}[a\_t] = \Pr[s\_{t+1} = y \mid s\_t = x,\, a\_t] \tag{1}$$

At every time step t, the robot obtains the state st from the environment and then takes an action at. It receives a stochastic reward r that depends on the state and the action. To find the optimal policy, the agent seeks to maximize the expected discounted reward *R*st defined in Equation (2).

$$R\_{s\_t}(a\_t) = E\left\{\sum\_{j=0}^{\infty} \gamma^{\,j}\, r\_{t+j}\right\} \tag{2}$$
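To make Equation (2) concrete, the short sketch below computes the discounted return for one observed reward sequence; the reward values and γ = 0.9 are illustrative.

```python
# Discounted return of Equation (2), truncated to the rewards actually observed.
# gamma = 0.9 and the reward sequence are illustrative values.

def discounted_return(rewards, gamma=0.9):
    return sum((gamma ** j) * r for j, r in enumerate(rewards))

# A reward of 1.0 received two steps in the future counts as roughly 0.81 today.
print(discounted_return([0.0, 0.0, 1.0]))  # ~0.81
```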

The discount factor γ means that rewards received t time steps in the future contribute less than rewards received now. The operational value function *Va* is calculated using the policy function π and the policy value function *Vp*, as shown in Equation (3): the state value function, i.e., the expected reward when starting from state s and following the policy, is expressed by the following equation.

$$V\_a(s\_t) \equiv R\_{s\_t}(\pi(s\_t)) + \gamma \sum\_{y} P\_{xy}[\pi(s\_t)]\, V\_p(y) \tag{3}$$

It can be shown that at least one optimal policy exists. The goal of Q-learning is to establish an optimal policy without initial conditions. For a policy, the Q value is defined as follows.

$$Q\_p(s\_t, a\_t) = R\_{s\_t}(a\_t) + \gamma \sum\_{y} P\_{xy}[a\_t]\, V\_p(y) \tag{4}$$

*Q*(st, at) is computed iteratively: the newly calculated value is obtained from the current value *Q*(st−1, at−1) and the Q value corresponding to the next state.
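A minimal tabular sketch of this iterative Q update follows; the learning rate α, the discount factor γ, and the state and action names are illustrative assumptions, not the paper's exact settings.

```python
# Tabular Q-learning update: the new Q(s, a) is obtained from the current
# estimate and the best Q value available in the next state s'.
# alpha and gamma are illustrative hyperparameters.

def q_update(Q, s, a, r, s_next, actions, alpha=0.1, gamma=0.9):
    best_next = max(Q.get((s_next, a2), 0.0) for a2 in actions)
    target = r + gamma * best_next
    Q[(s, a)] = Q.get((s, a), 0.0) + alpha * (target - Q.get((s, a), 0.0))
    return Q

Q = {}
Q = q_update(Q, s="s0", a="up", r=1.0, s_next="s1",
             actions=["up", "down", "left", "right"])
```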

#### **3. Proposed Algorithm**

Figure 2 shows the structure of the proposed algorithm. The proposed algorithm uses an experience replay technique: the learning experience that occurs at each time step over multiple episodes is stored in a dataset called the replay memory. At each update, learning samples are drawn from the replay memory with a certain probability. Reusing experience data in this way improves data efficiency and reduces the correlation between samples.
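A minimal sketch of such a replay memory is given below; the buffer capacity and batch size are illustrative values, not taken from the paper.

```python
import random
from collections import deque

# Replay memory sketch: transitions from many episodes are stored and sampled
# uniformly at random, which reuses experience and breaks up the correlation
# between consecutive samples. Capacity and batch size are illustrative.

class ReplayMemory:
    def __init__(self, capacity=10000):
        self.buffer = deque(maxlen=capacity)

    def push(self, state, action, reward, next_state, done):
        self.buffer.append((state, action, reward, next_state, done))

    def sample(self, batch_size=32):
        return random.sample(self.buffer, batch_size)

    def __len__(self):
        return len(self.buffer)
```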

Rather than treating each pixel independently, we use a CNN to understand the information in the images. The convolution layer passes the feature information of the image to the neural network while considering local regions of the image and maintaining the spatial relationships between the objects on the screen; the CNN extracts only the feature information from the image. The replay memory that stores the experience keeps the agent's experience and samples it randomly when training the neural network. This prevents the network from learning only from its most recent behavior in the environment, so the experience is retained and reused for updates. In addition, the target value is used to calculate the loss over all actions during learning.
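As a sketch of such a convolutional Q-network, the model below uses PyTorch (the paper does not state the framework); the layer sizes, the 84x84 grayscale input, and the four-frame stack are illustrative choices, not the paper's architecture.

```python
import torch
import torch.nn as nn

# Sketch of a convolutional Q-network: convolution layers extract feature
# information from the input image while preserving spatial relationships,
# and fully connected layers map those features to one Q value per action.
# All layer sizes and the input shape are illustrative assumptions.

class ConvQNetwork(nn.Module):
    def __init__(self, in_channels=4, n_actions=4):
        super().__init__()
        self.features = nn.Sequential(
            nn.Conv2d(in_channels, 32, kernel_size=8, stride=4), nn.ReLU(),
            nn.Conv2d(32, 64, kernel_size=4, stride=2), nn.ReLU(),
            nn.Conv2d(64, 64, kernel_size=3, stride=1), nn.ReLU(),
        )
        self.head = nn.Sequential(
            nn.Flatten(),
            nn.Linear(64 * 7 * 7, 512), nn.ReLU(),  # 7x7 assumes an 84x84 input
            nn.Linear(512, n_actions),
        )

    def forward(self, x):
        return self.head(self.features(x))

# One forward pass produces a Q value for every action.
q_values = ConvQNetwork()(torch.zeros(1, 4, 84, 84))
```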

**Figure 2.** The concept of the proposed algorithm.

The proposed algorithm uses experience data according to the role assigned to each robot. To learn how to set the goal value across several different situations, a different expected value is set for each role before learning starts. For example, suppose that the path planning must take less time than the A\* algorithm in every case and that this is the success criterion for each episode. The proposed algorithm then designates an arbitrary position and number of obstacles on a map of a given size and runs A\*; the resulting search time is used in the reward function of the deep Q-learning algorithm.
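A minimal sketch of this reward signal is shown below; the function names, the timing helper, and the simple linear difference are illustrative assumptions rather than the paper's exact reward function.

```python
import time

# Reward sketch based on the A* baseline: each episode uses the time A* needs
# on the same randomly generated map as a reference, and the agent is rewarded
# for finishing its search faster than that baseline.

def episode_reward(agent_search_time, astar_search_time):
    # Positive when the agent is faster than A*, negative when it is slower.
    return astar_search_time - agent_search_time

def timed(search_fn, *args):
    """Measure the wall-clock time of one search call."""
    start = time.perf_counter()
    search_fn(*args)
    return time.perf_counter() - start
```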

The agent learns so that the reward value keeps increasing. If the search time becomes longer than that of the A\* algorithm, the reward decreases, so learning drives the search time not to increase. Figure 3 shows the learning part of the overall system, which consists of a preprocessing part that finds singular points in the video using a CNN and a post-processing part that learns from the data using those singular points.

**Figure 3.** Simulation environment.

In the preprocessing part, image features are extracted from the input images, and these features are collected and learned. Here, a Q value is learned for each robot according to its assigned role, but the CNN receives the same input with a different expected value for each role. Therefore, the Q values are shared during learning and used by the learner. To optimize the update of the Q value, an objective function must be defined as the error between the target value and the predicted Q value. Equation (5) describes this objective function.

$$\mathcal{L} = \frac{1}{2}\left[r + \gamma \max\_{a'} Q(s', a') - Q(s, a)\right]^2 \tag{5}$$

The basic information needed to obtain the loss function is a transition <s, a, r, s'>. First, a forward pass of the Q-network is performed with the state s as input to obtain the action values for all actions. After obtaining the environment's return values <r, s'> for the action a, a second forward pass with the state s' yields the action values for all actions a'. This provides all the information needed to compute the loss function, which is used to update the weight parameters so that the Q value of the selected action converges, that is, so that the predicted value becomes as close as possible to the target value. Algorithm 1 is the reward function: it increases the reward substantially if the distance to the current target point has decreased compared with the previous step, and decreases the reward if the distance has become longer.
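The sketch below shows these two pieces under the same assumptions as the earlier network sketch: the loss of Equation (5) computed from one transition, and a distance-based reward in the spirit of Algorithm 1. The network `q_net`, the gain factor, and the default γ are illustrative assumptions.

```python
import torch

# Loss of Equation (5) computed from a single transition <s, a, r, s'>:
# one forward pass with s gives Q(s, .), a second pass with s' gives Q(s', .),
# and half the squared error between target and prediction is minimized.
# q_net is assumed to map a state batch to per-action Q values.

def dqn_loss(q_net, s, a, r, s_next, gamma=0.9):
    q_pred = q_net(s)[0, a]                         # Q(s, a) for the chosen action
    with torch.no_grad():
        q_target = r + gamma * q_net(s_next).max()  # r + gamma * max_a' Q(s', a')
    return 0.5 * (q_target - q_pred) ** 2

# Distance-based reward in the spirit of Algorithm 1: the reward grows when the
# distance to the current target point shrinks and falls when it grows.
def distance_reward(prev_dist, curr_dist, gain=1.0):
    return gain * (prev_dist - curr_dist)
```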
