#### *3.4. Improvements Based on Action Selection*

The USV explores the map under the policy given by the initial neural network. Since the parameters of the initial network are untrained or have been trained only a few times, the network is likely to produce a policy that collides with obstacles or leaves the map boundary. When the USV collides with an obstacle or leaves the map, the full-coverage task fails and the USV returns to the initial position to attempt the next full-coverage path planning task, so a large amount of training is needed before the USV learns to avoid obstacles. The worst behaviors are colliding with obstacles and moving outside the boundaries of the map. Moreover, during training, even if the USV is at the same position, the state matrix of the map differs because of the different preceding actions. Therefore, when the USV performs one of these worst behaviors, the reinforcement learning model only reduces the Q-value of that action in the state encountered at that moment, without extending the reduction to all similar states, as shown in Figure 10.

**Figure 10.** Dangerous actions in different states.

As shown in Figure 10, states *s*<sub>1</sub> and *s*<sub>2</sub> are two different states: black squares are obstacles, white squares are traversable cells, gray squares are cells already traveled by the USV, the black circle is the USV, and *a* represents the action taken by the USV. If the USV takes the dangerous action only in state *s*<sub>1</sub>, deep reinforcement learning trains the neural network so that *Q*–*eval*(*s*<sub>1</sub>, *a*) becomes smaller without making *Q*–*eval*(*s*<sub>2</sub>, *a*) smaller. Consequently, when the USV encounters state *s*<sub>2</sub>, there is still a probability of choosing the dangerous action *a*, which leads to the failure of the task.

When humans perform similar tasks, they actively avoid such dangerous actions, so this paper proposes a method that greatly reduces the probability of selecting them. In deep reinforcement learning, the USV's policy is derived from the current (evaluation) network, whose output is the Q-value of each action in the current environment. To reduce the selection of dangerous actions, it is only necessary to subtract a large value from the Q-values of those actions. The specific implementation is as follows.

Step 1: Assuming the state map has size n × n, first pad the state map with −1 on all sides to represent the map boundary, as described in the previous section. Then restore the state map to its initial state, that is, set the values of all covered grids to 0.

Step 2: The USV is marked with the value 7 in the state matrix. Find the position of the USV and multiply this value by the neighboring element selected by each action. For example, for the action "up", the value 7 is multiplied by the element directly above the USV, which gives the danger of taking that action in the current state. Repeating this for the four actions yields the danger value of each action.

Step 3: Multiply each danger value by the risk coefficient *t*, and then add it to the final output layer of the DQN to form the final output. The structure is shown in Figure 11, and a code sketch of Steps 1–3 is given after the figure.

**Figure 11.** Framework for reducing the probability of choosing a dangerous action.
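The following is a minimal sketch of Steps 1–3 in Python. It assumes the grid encoding used above (−1 for obstacles and the padded boundary, 7 for the USV position), together with an illustrative uncovered value of 0 and risk coefficient *t* = 100; the function name `masked_q_values` and the action ordering are hypothetical.

```python
import numpy as np

# Assumed encoding: -1 = obstacle/boundary, 0 = uncovered, 1 = covered, 7 = USV.
OBSTACLE, UNCOVERED, COVERED, USV = -1, 0, 1, 7
ACTIONS = ((-1, 0), (1, 0), (0, -1), (0, 1))  # up, down, left, right

def masked_q_values(q_values, state_map, t=100.0):
    """Add t * danger to the raw DQN output so dangerous actions are rarely chosen."""
    # Step 1: pad with -1 so leaving the map looks like hitting an obstacle,
    # and reset covered cells to the initial value so they are not penalised.
    grid = np.where(state_map == COVERED, UNCOVERED, state_map)
    grid = np.pad(grid, 1, constant_values=OBSTACLE)

    # Step 2: locate the USV (value 7) and multiply it by the cell each action
    # would enter; the product is -7 for an obstacle/boundary and 0 otherwise.
    r, c = np.argwhere(grid == USV)[0]
    danger = np.array([USV * grid[r + dr, c + dc] for dr, dc in ACTIONS], dtype=float)

    # Step 3: scale by the risk coefficient and add to the final network output.
    return q_values + t * danger
```

With this sketch, the masked vector replaces the raw Q-values only when the USV selects an action, so dangerous actions are effectively suppressed while the network parameters themselves are left unchanged.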

#### *3.5. Design of the Reward Function*

To improve the efficiency of full-coverage path planning, the USV should avoid repeatedly passing through already covered areas and, whenever it selects an action, should prefer a move that enters an uncovered area. Therefore, the reward function defined in this paper is given in Equation (11).

$$r = \begin{cases} 1, & m(i, j) = 0 \\ 0, & m(i, j) = 1 \text{ and } m_{-1}(i \pm 1, j \pm 1) = 1 \\ -1, & m(i, j) = 1 \text{ and } \exists\, m_{-1}(i \pm 1, j \pm 1) = 0 \\ -20, & m(i, j) = -1 \text{ or } i, j < 0 \text{ or } i, j > N \\ 20, & \text{done} \end{cases} \tag{11}$$

where *m*(*i*, *j*) represents the value of the grid that the USV occupies after taking an action, and *m*<sub>−1</sub>(*i* ± 1, *j* ± 1) describes the grids surrounding the USV before the action is taken. When the USV enters an uncovered grid, it obtains a reward of 1. When there is no uncovered grid around the USV, that is, all surrounding grids have already been covered or are obstacles, any action that neither collides with an obstacle nor leaves the map boundary is rewarded with 0. When there is an uncovered grid around the USV but the vehicle still takes an action that enters a covered grid, the reward is −1. When the action collides with an obstacle or goes out of the map boundary, the reward is −20. When the full-coverage task is completed, the reward is 20.
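As a concrete illustration, the reward in Equation (11) can be computed as in the following sketch. The cell encoding (0 = uncovered, 1 = covered, −1 = obstacle) follows the equation above, while the function name, the `done` flag, and the neighbor check against the USV's previous position are assumptions made for this example.

```python
import numpy as np

def reward(prev_map, prev_pos, next_pos, n, done=False):
    """Reward of Equation (11); prev_map is the n x n grid (NumPy array) before
    the action (0 = uncovered, 1 = covered, -1 = obstacle), positions are (row, col)."""
    i, j = next_pos
    if done:                                            # full-coverage task completed
        return 20
    if i < 0 or j < 0 or i >= n or j >= n or prev_map[i, j] == -1:
        return -20                                      # left the map or hit an obstacle
    if prev_map[i, j] == 0:
        return 1                                        # entered an uncovered grid
    # Entered an already covered grid: was there an uncovered grid next to the USV?
    pi, pj = prev_pos
    neighbours = [prev_map[pi + di, pj + dj]
                  for di, dj in ((-1, 0), (1, 0), (0, -1), (0, 1))
                  if 0 <= pi + di < n and 0 <= pj + dj < n]
    return -1 if 0 in neighbours else 0                 # -1 only if an uncovered grid was ignored

# Example: a 3 x 3 map where the USV at (0, 0) moves right into an uncovered cell.
m = np.array([[1, 0, 0], [0, -1, 0], [0, 0, 0]])
print(reward(m, (0, 0), (0, 1), n=3))                   # -> 1
```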

#### **4. Simulation Results and Discussion**
