#### *2.3. Velocity Obstacle*

A crucial problem in exploration is how to avoid static and dynamic obstacles in real time. Known static obstacles are usually handled by global path planning, while unknown or dynamic obstacles are the focus of local path planning. Common collision avoidance methods include the artificial potential field (APF) method [42], the dynamic window approach (DWA) [43] and behaviour-based methods [44]. These methods are highly adaptable and efficient, so many researchers combine intelligent control algorithms with them for obstacle avoidance [45,46]. In addition, the lazy rapidly-exploring random tree (lazy RRT) method [47] is also used for local path planning. However, these methods either cannot completely avoid collisions with moving obstacles or involve a degree of randomness that lowers the efficiency of obstacle avoidance, as in [47]. Alternatively, the Velocity Obstacle (VO), first proposed by Fiorini et al. [48], is a simple and efficient algorithm that can completely avoid both static and moving obstacles. It generates a conical velocity obstacle region in the agent's velocity space: as long as the current velocity vector lies outside this region, the agent will not collide with the obstacle at any time in the future. However, the basic VO has several disadvantages. First, if the agent and moving obstacles (or other agents) use VO for local path planning simultaneously, oscillatory motion arises on both sides [49]. Second, the VO region excludes every velocity that may eventually lead to a collision, including velocities that would cause a collision only after a long time. This shrinks the range of admissible collision-free velocities in some scenarios, sometimes leaving no admissible velocity at all. To overcome these problems, Abe and Matsuo [50] proposed the common velocity obstacle (CVO) method, which provides collision detection between moving agents and enables them to share collision information without explicit communication; this information allows agents to cooperate implicitly using the general VO method and thereby avoid collisions. Guy et al. [51] proposed the finite-time velocity obstacle (FVO) algorithm, which enlarges the set of admissible velocities of the traditional VO algorithm by computing the collision velocity cone only within a finite time horizon. To resolve the local oscillation problem, Fulgenzi et al. [49] proposed reciprocal velocity obstacles (RVO), which account for the velocity changes of both agents.
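To make the VO test concrete, the following minimal sketch (our own illustration, not the implementation of [48]; the disc-shaped obstacle model and all identifiers are assumptions) checks whether a candidate velocity lies inside the cone induced by a single moving obstacle:

```python
import numpy as np

def in_velocity_obstacle(p_a, p_b, v_b, v, r):
    """True if candidate agent velocity v lies inside the velocity obstacle
    induced by a disc obstacle at p_b (combined radius r) moving with v_b."""
    d = np.asarray(p_b, float) - np.asarray(p_a, float)  # line of centres
    dist = np.linalg.norm(d)
    if dist <= r:                        # discs already overlap
        return True
    v_rel = np.asarray(v, float) - np.asarray(v_b, float)
    speed = np.linalg.norm(v_rel)
    if speed == 0.0:                     # no relative motion, never collides
        return False
    # Collision occurs iff the relative velocity points into the cone of
    # half-angle asin(r / dist) around the line of centres.
    cos_angle = np.dot(v_rel, d) / (speed * dist)
    angle = np.arccos(np.clip(cos_angle, -1.0, 1.0))
    return angle <= np.arcsin(r / dist)
```

A finite-time variant in the spirit of FVO [51] would additionally require the collision to occur within a horizon τ, e.g. by also testing whether (dist − r) / speed ≤ τ.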

The model proposed in this paper combines DRL, intrinsic motivation and the velocity obstacle. Owing to the characteristics of exploration in 2D dynamic spaces, we reshape the paradigm for generating the intrinsic reward. To ensure safe and fast movement of the agent, we propose another hierarchical approach that combines a variant of the A\* path planning method (called optimistic A\*) with an improved FVO (called self-adaptive FVO, SFVO).

#### **3. Problem Formulation**

Before giving the details of the proposed model, this section first formulates the exploration problem in a 2D environment.

**Definition 1.** *A Working Space, denoted as WSM, represents a 2D grid map of size M × M. Any element in WSM can be represented as (x, y), 1 ≤ x, y ≤ M. Each cell of the grid is labelled by T(x, y): T(x, y) = 0 denotes a free cell, while T(x, y) = 1 denotes a location occupied by an obstacle. In addition, we assume that the area of each cell is 1.*

**Definition 2.** *Observation Range (ObsR) of an agent is the set of points whose vertical and horizontal distances to the agent's current position are not more than the observation radius n:*

$$\text{ObsR}(x_i, y_i) = \left\{ (x, y) \,\middle|\, |x - x_i| \le n,\ |y - y_i| \le n \right\} \tag{1}$$

**Definition 3.** *Exploration Range (ExpR) of an agent is the set of points whose vertical and horizontal distances to the agent's current position are not more than the exploration radius m, and more than half of whose area can be covered by the "radar wave" emitted by the agent:*

$$\text{ExpR}(x_i, y_i) = \left\{ (x, y) \,\middle|\, |x - x_i| \le m,\ |y - y_i| \le m,\ S\big((x_i, y_i) \to (x, y)\big) > \frac{1}{2} \right\} \tag{2}$$

*S*((*x<sub>i</sub>*, *y<sub>i</sub>*) → (*x*, *y*)) denotes the area of cell (*x*, *y*) covered by the "radar wave" emitted from (*x<sub>i</sub>*, *y<sub>i</sub>*); since each cell has unit area, the condition in Equation (2) requires more than half of the cell to be covered. A specific example is shown in Figure 1.

**Figure 1.** An example of Exploration Range. (**a**) shows the obstacles around the agent, where the red solid circle represents the agent and the black squares represent two obstacles. (**b**) shows the range that can be covered by the "radar wave" emitted from the agent; the gray shaded areas are not covered by the "radar wave". (**c**) shows whether each cell in this scenario is regarded as explored under Definition 3 when *m* = 3. The blue cells are the areas that the agent has explored, while the white areas remain unexplored.

Note that the region observed by the agent is not necessarily the region it has explored. As a simple example, imagine searching for gold that cannot be seen from the earth's surface: we cannot find the gold with our eyes, but a gold detector can, so the explored region is the region the detector can reach. In general, the "detection range" (*m*) should not be greater than the "length of field of view" (*n*), i.e., *m* ≤ *n* and *ExpR*(*x<sub>i</sub>*, *y<sub>i</sub>*) ⊆ *ObsR*(*x<sub>i</sub>*, *y<sub>i</sub>*).
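To illustrate Definitions 2 and 3, the sketch below builds both masks on a 0-indexed grid (our own approximation: a simple sampled line-of-sight test stands in for the coverage term *S*, whose exact geometry is abstracted away here):

```python
import numpy as np

def obs_range_mask(T, xi, yi, n):
    """Observation Range, Equation (1): all cells within Chebyshev
    distance n of the agent located at (xi, yi)."""
    M = T.shape[0]
    mask = np.zeros((M, M), dtype=bool)
    mask[max(0, xi - n):min(M, xi + n + 1), max(0, yi - n):min(M, yi + n + 1)] = True
    return mask

def line_of_sight(T, a, b, steps=32):
    """Approximate 'radar wave' reachability: sample the segment a -> b and
    report False if it passes through an obstacle cell (endpoint excluded)."""
    (x0, y0), (x1, y1) = a, b
    for t in np.linspace(0.0, 1.0, steps):
        x, y = round(x0 + t * (x1 - x0)), round(y0 + t * (y1 - y0))
        if T[x, y] == 1 and (x, y) != (x1, y1):
            return False
    return True

def exp_range_mask(T, xi, yi, m):
    """Approximate Exploration Range, Equation (2): cells within Chebyshev
    distance m whose centres the 'radar wave' can reach."""
    M = T.shape[0]
    mask = np.zeros((M, M), dtype=bool)
    for x in range(max(0, xi - m), min(M, xi + m + 1)):
        for y in range(max(0, yi - m), min(M, yi + m + 1)):
            mask[x, y] = line_of_sight(T, (xi, yi), (x, y))
    return mask
```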

#### **4. The Proposed Model**

This paper combines the advantages of DRL algorithms, traditional non-learning planning algorithms and real-time collision avoidance algorithms, and proposes a novel approach to the exploration problem in 2D dynamic grids. The proposed model is modular and hierarchical, so that it can not only exploit the structural regularities of the environment but also improve the training efficiency of DRL methods. The overall structure of our model is shown in Figure 2. GEM determines the next long-term target point to be explored based on a spatial map *m<sub>t</sub>* maintained by the agent. LMM takes the next target point as input and computes the specific actions needed to reach it. We use *t<sub>g</sub>* to index the target-selection steps in GEM only. For example, suppose we select a target point at the initial time, *t* = *t<sub>g</sub>* = 1, and the agent takes 10 steps to reach this target before selecting the next target point; then *t* = 11 and *t<sub>g</sub>* = 2 at that moment.
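The interplay between the two indices can be sketched as follows (the interfaces `env`, `global_policy` and `local_module` are hypothetical, introduced only to show the two time scales):

```python
def explore_episode(env, global_policy, local_module, max_steps=500):
    """Two time scales: t counts environment steps, t_g counts GEM's
    target selections; LMM consumes many t-steps per t_g-step."""
    m_t = env.reset()                      # initial spatial map
    t, t_g = 1, 0
    while t <= max_steps:
        goal = global_policy(m_t)          # GEM: next long-term target
        t_g += 1
        # LMM: low-level actions until the target is reached
        while not local_module.reached(goal) and t <= max_steps:
            m_t = env.step(local_module.step(goal))
            t += 1
```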

**Figure 2.** The overview of our IRHE-SFVO. In the Global Exploration Module (GEM), the agent uses its current location and observation to build a spatial map *m<sub>t</sub>*, which is input into the Global Policy to output the next target point to be explored. The Local Movement Module (LMM) determines the specific actions needed to reach the target point quickly and safely, based on the agent's current location, the next target point, and the obstacle map maintained by the agent.

#### *4.1. Global Exploration Module*

We want to learn an exploration policy *π<sup>g</sup>* that enables the agent to select locations to explore such that the information gain about the environment is maximized. For this purpose, we design an intrinsic reward function favouring states in which the agent increases its exploration range at the fastest rate. Proximal Policy Optimization (PPO) [52] is used for training *π<sup>g</sup>*. Importantly, *π<sup>g</sup>* is learned on one set of training maps and tested on another set of unseen maps; this setting demonstrates the desirable generalization of our method across different environments.

#### 4.1.1. Spatial Map Representation

First, as shown in the top block of Figure 2, GEM maintains a four-channel spatial map, *m<sub>t</sub>*, as the input of the global policy. Then, the policy network outputs the next target point (*g<sub>t<sub>g</sub></sub>*) to be explored. To be specific, the spatial map contains four matrices of the same size, i.e., *m<sub>t</sub>* ∈ {0, 1}<sup>4×*M*×*M*</sup>, where *M* is the height and width of the explored maps. Each element in the first channel represents whether the location is an obstacle (*OM<sub>t</sub>*): 0 is for a free cell and 1 for a blocked one. In the beginning, *OM*<sub>0</sub> = {0}<sup>*M*×*M*</sup> based on the free-space assumption. Each element in the second channel represents whether the location has been explored (*EM<sub>t</sub>*). The third channel encodes the current location (*P<sub>t</sub>*) in a one-hot manner, i.e., the element corresponding to the agent's location is set to 1 and the others to 0. The fourth channel labels the locations visited (*P*<sub>1:*t*</sub>) from the initial time to the current time. The rationale for these four channels is that the agent can fully exploit all spatiotemporal information useful for target decision-making. In particular, this design aims: (a) to enable the agent to use the structural regularities of the spatial environment to make correct decisions, (b) to prevent the agent from selecting points that have already been explored when choosing the next target point, and (c) to make the agent select the best next target point from its current location, weighing the time cost against the exploration utility.
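Assembling the four channels is straightforward (a sketch; the function and argument names follow the notation above but are otherwise our own):

```python
import numpy as np

def build_spatial_map(OM, EM, pos, visited):
    """Stack the four channels into m_t of shape (4, M, M).

    OM      : (M, M) 0/1 obstacle map maintained by the agent
    EM      : (M, M) 0/1 explored map
    pos     : (x, y) current location, one-hot encoded in channel 3
    visited : (M, M) 0/1 map of all locations visited up to time t
    """
    P_t = np.zeros_like(OM)
    P_t[pos] = 1
    return np.stack([OM, EM, P_t, visited]).astype(np.float32)
```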

#### 4.1.2. Network Architecture

The policy network takes *m<sub>t</sub>* as input and outputs the next target point *g<sub>t<sub>g</sub></sub>* ∼ *π<sup>g</sup>*(*m<sub>t</sub>*; *θ<sup>g</sup>*), where *θ<sup>g</sup>* are the parameters of the global policy. As shown in Figure 3, the spatial map *m<sub>t</sub>* is first passed through an embedding layer, which outputs a four-dimensional tensor of size 4 × *N* × *M* × *M*, where *N* represents the length of each embedding vector. The four constituent 3D tensors are then summed along the first dimension, giving an information-rich tensor of size *N* × *M* × *M*. This 3D tensor is passed through three convolution layers and three fully connected layers successively, finally outputting the next target point *g<sub>t<sub>g</sub></sub>*. Note that the embedding layer is essential for preserving the information in the input, because the input consists of very sparse 0–1 matrices of size 4 × *M* × *M*. Although convolution and pooling operations can extract spatial structure information, feeding such matrices directly into the CNN and pooling layers would lose much valuable information and ignore the associations between the whole and its parts. Therefore, to preserve the integrity of the information, it is necessary to map *m<sub>t</sub>* to higher-dimensional vectors first.
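A PyTorch sketch of this pipeline is given below (the layer widths, kernel sizes and the output head over the *M* × *M* candidate targets are our assumptions for illustration; only the actor branch of the actor-critic network in Figure 3 is shown):

```python
import torch
import torch.nn as nn

class GlobalPolicyNet(nn.Module):
    """Embedding -> channel-wise sum -> 3 conv layers -> 3 FC layers."""
    def __init__(self, M, N=16):
        super().__init__()
        self.embed = nn.Embedding(2, N)   # map each 0/1 cell to an N-dim vector
        self.conv = nn.Sequential(
            nn.Conv2d(N, 32, 3, stride=2, padding=1), nn.ReLU(),
            nn.Conv2d(32, 64, 3, stride=2, padding=1), nn.ReLU(),
            nn.Conv2d(64, 64, 3, stride=2, padding=1), nn.ReLU(),
        )
        flat = 64 * ((M + 7) // 8) ** 2   # spatial size after three stride-2 convs
        self.fc = nn.Sequential(
            nn.Linear(flat, 256), nn.ReLU(),
            nn.Linear(256, 256), nn.ReLU(),
            nn.Linear(256, M * M),        # logits over candidate target points
        )

    def forward(self, m_t):               # m_t: (B, 4, M, M), dtype int64
        e = self.embed(m_t)               # (B, 4, M, M, N)
        e = e.sum(dim=1)                  # sum the four channels -> (B, M, M, N)
        e = e.permute(0, 3, 1, 2)         # -> (B, N, M, M)
        return self.fc(self.conv(e).flatten(1))
```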

**Figure 3.** The structure of the actor-critic network in GEM. N represents the size of each embedding vector.

#### 4.1.3. Intrinsic Reward

The effectiveness of DRL relies heavily on rewards. However, the exploration task is a reward-sparse RL problem. To alleviate this, we design an intrinsic reward (denoted *r<sup>i</sup><sub>t<sub>g</sub></sub>*) and combine it with the extrinsic reward (denoted *r<sup>e</sup><sub>t<sub>g</sub></sub>*) given by the environment, i.e., *r<sub>t<sub>g</sub></sub>* = *r<sup>i</sup><sub>t<sub>g</sub></sub>* + *r<sup>e</sup><sub>t<sub>g</sub></sub>*, so that the rewards along the exploration trajectory become denser. This is critical for speeding up the convergence of the policy and for the emergence of directed exploration. In the literature, possible IM formulations include "curiosity" [34], "novelty" [53] and "empowerment" [40], as described in Section 2. However, these approaches use black-box models whose neural network weights cannot be reset between episodes, so the intrinsic reward becomes smaller and smaller after each episode in the same scenario. To solve this problem, we design a simple yet effective intrinsic reward function that resets *r<sup>i</sup>* at each episode: we use the increase in the explored area, deduced from *EM<sub>t</sub>* when the agent arrives at a new target point, as the intrinsic reward *r<sup>i</sup><sub>t<sub>g</sub></sub>*.
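Because this reward is read directly off the episode's own explored map rather than off learned network weights, it can be written in a few lines (a sketch; the function name is ours):

```python
def intrinsic_reward(EM_prev, EM_curr):
    """r^i_{t_g}: increase in explored area between consecutive target
    points, computed from the 0/1 explored maps. Since each cell has unit
    area, this is the count of newly explored cells; it resets naturally
    at the start of every episode together with the maps themselves."""
    return float(EM_curr.sum() - EM_prev.sum())
```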

#### *4.2. Local Movement Module*

To explore in dynamic spaces, the agent must both reach the target point quickly and avoid colliding with moving obstacles. To achieve this, we design another hierarchical framework in the local movement module, comprising two levels: planning and controlling. In the planning stage, we use the optimistic A\* algorithm to plan an optimal path under partial observability and then divide the path into several segments according to certain rules; the end point of each segment is called a key point. In the controlling stage, we design an SFVO (self-adaptive FVO) that drives the agent to reach these key points sequentially, thereby completing the movement along the path.

#### 4.2.1. Planning Stage

There are many global path planning algorithms, such as breadth-first search, depth-first search and Dijkstra's algorithm. Instead of using the less efficient generalized Dijkstra's algorithm to solve the Shortest Path Problem (SPP) as in [54], we use the A\* algorithm, which has better search efficiency, to plan the optimal global path. The basic A\* algorithm performs well in fully observable environments, but it does not work directly in our task since *OM<sub>t</sub>* does not reflect the whole map. We therefore use a variant of A\*, called the optimistic A\* algorithm: we assume that all unknown cells of the obstacle map are traversable and plan a path between the agent's current position and the target point. If the agent observes new static obstacles while moving, it replans the path with A\*.
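A compact sketch of optimistic A\* on the partial obstacle map follows (our illustration; the 4-connected grid and Manhattan heuristic are assumptions):

```python
import heapq
from itertools import count

def optimistic_astar(OM, start, goal):
    """A* over the agent's obstacle map OM (1 = known obstacle, 0 = free or
    unknown): unknown cells are optimistically treated as traversable. The
    caller replans whenever newly observed obstacles invalidate the path."""
    M = OM.shape[0]
    h = lambda p: abs(p[0] - goal[0]) + abs(p[1] - goal[1])  # Manhattan heuristic
    tie = count()                             # heap tie-breaker
    open_set = [(h(start), next(tie), 0, start, None)]
    parents, g_cost = {}, {start: 0}
    while open_set:
        _, _, g, cur, par = heapq.heappop(open_set)
        if cur in parents:                    # already expanded
            continue
        parents[cur] = par
        if cur == goal:                       # walk back to the start
            path = [cur]
            while parents[path[-1]] is not None:
                path.append(parents[path[-1]])
            return path[::-1]
        x, y = cur
        for nxt in ((x + 1, y), (x - 1, y), (x, y + 1), (x, y - 1)):
            if 0 <= nxt[0] < M and 0 <= nxt[1] < M and OM[nxt] == 0:
                if g + 1 < g_cost.get(nxt, float("inf")):
                    g_cost[nxt] = g + 1
                    heapq.heappush(open_set, (g + 1 + h(nxt), next(tie), g + 1, nxt, cur))
    return None                               # unreachable under current knowledge
```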

Once an optimal path is computed, we select several key points on it to guide the agent to the target point, so that the motion controller, presented below, can drive the agent between them. As shown in Figure 4, this paper distinguishes three types of key points: (a) turning points on the path, (b) boundary points where the path crosses from the known into the unknown region, and (c) the destination of the path, i.e., the target point.

**Figure 4.** Examples of key points. The left figure shows the first and third types of key points, while the right figure shows the second and third types. The green squares are the current locations of the agent. The blue squares represent the target points, which are also the third type of key points. The red lines represent the optimal paths generated by the A\* algorithm, and the orange squares are the first or second type of key points. The shaded area represents the region unknown to the agent, while the remaining area has been observed by the agent.

Note that the second type of key points is selected in the known area. Otherwise, if we selected the boundary point in the unknown area (the neighbouring square above the orange square in Figure 4b), an obstacle might be chosen as the key point.

In particular, the rationale for this selection strategy is that: (a) each path segment between key points is straight when dynamic obstacles are ignored, which makes it convenient for the controller to execute the movement; and (b) it is applicable to unknown-space exploration under partial observability, since we always choose locations known to the agent as key points, making its behaviour more similar to human exploration.
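The three selection rules can be applied in a single pass over the planned path (a sketch; `known` is a boolean visibility map, and rule (b) keeps the key point on the known side of the boundary, as required above):

```python
def extract_key_points(path, known):
    """Key points on a planned path: (a) turning points, (b) the last known
    cell before the path enters unknown space, (c) the target point."""
    keys = []
    for i in range(1, len(path) - 1):
        prev, cur, nxt = path[i - 1], path[i], path[i + 1]
        turning = (nxt[0] - cur[0], nxt[1] - cur[1]) != (cur[0] - prev[0], cur[1] - prev[1])
        boundary = known[cur] and not known[nxt]   # rule (b)
        if turning or boundary:
            keys.append(cur)
    keys.append(path[-1])                          # rule (c): the target point
    return keys
```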
