#### 4.2.2. Controlling Stage

To avoid colliding with moving obstacles, we propose the Self-adaptive Finite-time Velocity Obstacle (SFVO) algorithm, built on the finite-time velocity obstacle (FVO). Let A be the agent and B be an obstacle. For ease of calculation, we assume the agent and obstacles are disk-shaped. We use $D(\mathbf{p}, r)$ to represent a circular region with center $\mathbf{p}$ and radius $r$:

$$D(\mathbf{p}, r) = \{\mathbf{q} \mid \|\mathbf{q} - \mathbf{p}\| < r\} \tag{3}$$

The finite-time velocity obstacle $FVO^{\tau}_{A|B}$ represents the set of relative velocities between A and B that will cause a collision within time $\tau$ in the future:

$$FVO^{\tau}_{A|B} = \{\mathbf{v} \mid \exists t \in [0, \tau],\; t\mathbf{v} \in D(\mathbf{p}_B - \mathbf{p}_A,\, r_A + r_B)\}\tag{4}$$

The collision-avoidance velocity set ($CA$) of the agent is:

$$CA^{\tau}_{A|B}(V_B) = \{\mathbf{v} \mid \mathbf{v} \notin FVO^{\tau}_{A|B} \oplus V_B\}\tag{5}$$

where $\oplus$ denotes the Minkowski sum and $V_B$ is the set of possible velocities of B.

According to the features of autonomous spatial exploration in dynamic environments, we replace the fixed time $\tau$ with an adaptive dynamic time $\tau_d$, i.e., $\tau_d(0) = \tau_{max}$ and $\tau_d(n) = \tau_{max} - \Delta\tau \cdot n$, where $n$ is the number of rounds within a cycle and $\Delta\tau$ is the reduction of the finite time per round. This method is called the Self-adaptive Finite-time Velocity Obstacle (SFVO). Figure 5 shows that the larger we set $\tau_d$, the larger the range of $FVO^{\tau_d}_{A|B} \oplus V_B$ and the smaller the range of $CA^{\tau_d}_{A|B}(V_B)$. Therefore, we adaptively adjust the velocity obstacle range of the agent by decreasing $\tau_d$ gradually. Specifically, the agent first calculates $CA^{\tau_d}_{A|B}(V_B)$ with $\tau_d = \tau_{max}$. If $CA^{\tau_d}_{A|B}(V_B) = \emptyset$, it decreases $\tau_d$ by a fixed time interval $\Delta\tau$ and calculates the collision-avoidance velocity set again. If there is still no feasible velocity when $\tau_d = 0$, the agent stays idle until the next time step and then repeats the process above. The pseudo-code of SFVO is shown in Algorithm 1.

**Figure 5.** FVO algorithm diagram. In the left figure, the shaded area shows the relative velocities that will cause a collision within time $\tau$ in the future. The right figure shows the collision velocities (shaded area) and collision-avoidance velocities (white area) of the agent, given the velocity of the obstacle.

#### **Algorithm 1** SFVO

1: $\tau_d \leftarrow \tau_{max}$
2: **for** $\tau_d > 0$ **do**
3: &emsp; $FVO^{\tau_d}_{A|B} = \{\mathbf{v} \mid \exists t \in [0, \tau_d],\; t\mathbf{v} \in D(\mathbf{p}_B - \mathbf{p}_A,\, r_A + r_B)\}$
4: &emsp; $CA^{\tau_d}_{A|B}(V_B) = \{\mathbf{v} \mid \mathbf{v} \notin FVO^{\tau_d}_{A|B} \oplus V_B\}$
5: &emsp; **if** $CA^{\tau_d}_{A|B}(V_B) = \emptyset$ **then**
6: &emsp;&emsp; $\tau_d \leftarrow \tau_d - \Delta\tau$
7: &emsp;&emsp; **continue**
8: &emsp; **else**
9: &emsp;&emsp; **return** $CA^{\tau_d}_{A|B}(V_B)$
10: &emsp; **end if**
11: **end for**
12: **return** $\emptyset$
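For concreteness, the following is a minimal Python sketch of Algorithm 1, assuming a discrete candidate velocity set and a sampling-based membership test for $FVO^{\tau_d}_{A|B}$; the function names and the sampling step `dt` are our own illustrative choices, not part of the paper's implementation.

```python
import numpy as np

def in_fvo(v_rel, p_rel, r_sum, tau, dt=0.1):
    """Sampled test of Equation (4): does t * v_rel enter D(p_B - p_A, r_A + r_B)
    for some t in (0, tau]? Here v_rel is the agent's velocity relative to B."""
    for t in np.arange(dt, tau + dt, dt):
        if np.linalg.norm(p_rel - t * v_rel) < r_sum:
            return True
    return False

def sfvo(candidates, v_B, p_A, p_B, r_A, r_B, tau_max, d_tau):
    """Algorithm 1: shrink tau_d until the collision-avoidance set is non-empty."""
    p_rel, r_sum = p_B - p_A, r_A + r_B
    tau_d = tau_max
    while tau_d > 0:
        # CA^{tau_d}_{A|B}(V_B): keep candidates v with v - v_B outside the FVO
        ca = [v for v in candidates if not in_fvo(v - v_B, p_rel, r_sum, tau_d)]
        if ca:
            return ca
        tau_d -= d_tau  # relax the finite-time horizon and try again
    return []  # empty set: the agent stays idle until the next time step
```

With the paper's discrete action space, `candidates` would be the four unit velocities (up, down, left, right) as numpy arrays.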

Based on SFVO, the agent can avoid static and dynamic obstacles in real time, and its collision-avoidance velocity satisfies $\mathbf{v}_A \in CA^{\tau_d}_{A|B}(V_B)$. However, if there is more than one element in this set, how do we choose an optimal velocity that not only drives the agent to the target point quickly but also minimizes the risk of collision?

Inspired by Kim et al. [55], we design an optimal velocity evaluation function which consists of two parts: an Expected Velocity Direction Evaluation Function ($f_v$) and a Relative Vertical Distance Evaluation Function ($f_d$), as shown in Figure 6. The target point of the agent is known, so the direction of its expected velocity ($\mathbf{v}_v$) is the direction from the agent's current position to the target point. Thus, $f_v$ can be expressed as Equation (6).

$$f_v = k_v|\mathbf{v}_v - \mathbf{v}_A| \Rightarrow k_v\cos\langle \mathbf{v}_v, \mathbf{v}_A \rangle = k_v \frac{\mathbf{v}_v \cdot \mathbf{v}_A}{|\mathbf{v}_v||\mathbf{v}_A|} \tag{6}$$

Note that the action space of the agent in our task is discrete and the agent moves one unit at each step, i.e., the magnitude of its velocity is fixed. Hence maximizing $\cos\langle\mathbf{v}_v, \mathbf{v}_A\rangle$ is equivalent to minimizing $|\mathbf{v}_v - \mathbf{v}_A|$.
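This equivalence can be seen by expanding the squared difference under a common fixed speed $s = |\mathbf{v}_v| = |\mathbf{v}_A|$ (a one-line check we add for completeness):

$$|\mathbf{v}_v - \mathbf{v}_A|^2 = |\mathbf{v}_v|^2 + |\mathbf{v}_A|^2 - 2\,\mathbf{v}_v \cdot \mathbf{v}_A = 2s^2\bigl(1 - \cos\langle \mathbf{v}_v, \mathbf{v}_A \rangle\bigr)$$

so minimizing $|\mathbf{v}_v - \mathbf{v}_A|$ is exactly maximizing $\cos\langle \mathbf{v}_v, \mathbf{v}_A \rangle$.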

**Figure 6.** Schematic diagram of the expected velocity and the relative vertical distance. The green disks represent the agent and the obstacle, respectively, and the red point represents the target point.

The relative vertical distance ($d_v$) is defined in Equation (7).

$$d_v = |\mathbf{P}_A - \mathbf{P}_B| \sin \theta = \frac{\mathbf{v} \times (\mathbf{P}_A - \mathbf{P}_B)}{|\mathbf{v}|} \tag{7}$$

$\mathbf{v}$ is the relative velocity of the agent and the obstacle, i.e., $\mathbf{v} = \mathbf{v}_B - \mathbf{v}_A$. A smaller $d_v$ means that the obstacle is more likely to collide with the agent. Note that $d_v$ can be negative according to Equation (7), indicating that the obstacle is moving away from the agent and poses no danger. Moreover, the larger $|d_v|$ is, the safer the agent is. According to this analysis, $f_d$ is designed as Equation (8):

$$f_d = \begin{cases} -d_v & d_v \le 0 \\ -\frac{1}{d_v} & d_v > 0 \end{cases} \tag{8}$$

Finally, the overall evaluation function is defined as a weighted sum of $f_v$ and $f_d$, as shown in Equation (9). The importance of each part is regulated by the weights $k_1$ and $k_2$.

$$f = k\_1 f\_v + k\_2 f\_d \tag{9}$$

When $k_1$ is increased and $k_2$ decreased, $f_v$ dominates and the agent is more inclined to approach the target; conversely, when $k_2$ dominates, the agent gives priority to obstacle avoidance.
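To make the selection concrete, here is a minimal Python sketch of Equations (6)–(9); the function names are our own, and `v_rel` denotes $\mathbf{v} = \mathbf{v}_B - \mathbf{v}_A$ as in Equation (7).

```python
import numpy as np

def f_v(v_v, v_A, k_v=1.0):
    """Equation (6): cosine alignment between expected and candidate velocity."""
    return k_v * np.dot(v_v, v_A) / (np.linalg.norm(v_v) * np.linalg.norm(v_A))

def f_d(p_A, p_B, v_rel):
    """Equations (7) and (8): relative-vertical-distance term."""
    r = p_A - p_B
    d_v = (v_rel[0] * r[1] - v_rel[1] * r[0]) / np.linalg.norm(v_rel)  # 2D cross product
    return -d_v if d_v <= 0 else -1.0 / d_v

def evaluate(v_A, v_v, p_A, p_B, v_B, k1=1.0, k2=1.0):
    """Equation (9): weighted sum, maximized over the collision-avoidance set."""
    return k1 * f_v(v_v, v_A) + k2 * f_d(p_A, p_B, v_B - v_A)

# e.g., v_opt = max(ca_set, key=lambda v: evaluate(v, v_v, p_A, p_B, v_B))
```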

#### 4.2.3. The Hybrid Algorithm

The optimistic A\* algorithm can compute the shortest global path in a partially known environment, but it cannot avoid moving obstacles in time in dynamic environments. The SFVO with the optimal velocity evaluation function can avoid collisions in real time, but it lacks global guidance and has only one expected direction. More specifically, in a space with many moving obstacles, it easily falls into local minima, resulting in longer planned paths or even planning failure. With this problem in mind, we combine the planning and controlling methods described above. The workflow of the hybrid algorithm is shown in Figure 7.

**Figure 7.** Flowchart of the hybrid algorithm.

At time $t$, the algorithm feeds the target point $g_{t_g}$ produced by the Global Exploration Module, together with $P_t$ and $OM_t$, into the A\* algorithm and plans a global path $path_t = f_{A*}(P_t, g_{t_g} \mid OM_t)$. It then extracts the key points $\{K_1, K_2, \ldots, K_n\}$ on $path_t$. Next, the set of collision-avoidance velocities $V_A = CA^{\tau_d}_{A|B}(V_B)$ calculated by the SFVO algorithm (Algorithm 1), the observation $O_t$ of the agent, and the sequence of key points are put into the evaluation function $f$, which yields an optimal velocity $\mathbf{v}_{opt} \leftarrow f(V_A, K_1, K_2, \ldots, K_n \mid O_t, OM_t, P_t)$. The agent then moves one step at this velocity and simultaneously updates $OM_t \rightarrow OM_{t+1}$ and $P_t \rightarrow P_{t+1}$. At time $t+1$, if there is an obstacle on $path_t$, i.e., $\exists (i,j) \in path_t,\; OM[i,j] = 1$, A\* path planning is conducted again, $path_{t+1} = f_{A*}(P_{t+1}, g_{t_g} \mid OM_{t+1})$, and the above process continues. Otherwise, $path_{t+1} = path_t$ and the algorithm checks whether the agent has reached the key point $K_1$. If so, the sequence of key points is updated; otherwise, the agent continues to use the SFVO algorithm for local movement control. When the agent reaches the target point $g_{t_g}$, the LMM stops running and the GEM chooses the next target point $g_{t_g+1}$.

In addition, to make the motion trajectory smoother and reduce unnecessary local oscillation, the agent is regarded as having reached a key point as long as it reaches one of the eight cells adjacent to that key point.
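The per-step logic of the hybrid algorithm can be sketched as follows; `astar`, `extract_key_points`, `sfvo_all_obstacles`, `evaluate_toward`, and `global_exploration_module` are hypothetical placeholders standing in for the modules described above, not the paper's actual interfaces.

```python
def chebyshev(a, b):
    """Chebyshev distance: 1 means one of the eight adjacent cells."""
    return max(abs(a[0] - b[0]), abs(a[1] - b[1]))

def hybrid_step(state):
    """One time step of the hybrid algorithm (Figure 7); `state` bundles the
    pose P, obstacle map OM, current path, key-point sequence, and target."""
    # Replan with A* if a newly observed obstacle blocks the current path
    if any(state.OM[i, j] == 1 for (i, j) in state.path):
        state.path = astar(state.P, state.goal, state.OM)
        state.keys = extract_key_points(state.path)

    # A key point counts as reached from any of its eight adjacent cells
    if state.keys and chebyshev(state.P, state.keys[0]) <= 1:
        state.keys.pop(0)
    if not state.keys:  # target reached: LMM stops, GEM picks the next target
        state.goal = global_exploration_module(state.OM)
        state.path = astar(state.P, state.goal, state.OM)
        state.keys = extract_key_points(state.path)

    # SFVO (Algorithm 1) plus the evaluation function (Equation (9))
    ca_set = sfvo_all_obstacles(state)
    if not ca_set:
        return state  # SFVO returned the empty set: stay idle this step
    v_opt = max(ca_set, key=lambda v: evaluate_toward(v, state.keys[0], state))
    state.P = state.P + v_opt  # move one unit; OM and P are then updated
    return state
```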

The above are all the functional modules of IRHE-SFVO. As shown in Figure 2, the Global Policy in the GEM generates the next target point for the agent; the planning stage then computes the best path between the current position and the target point and extracts its key points. In the controlling stage, the SFVO algorithm (Algorithm 1) and the evaluation function (Equation (9)) decide the specific action of the agent, after which the current knowledge of the spatial structure is updated, which in turn determines whether to replan with A\* or to update the key-point sequence. These functional modules run sequentially until the exploration is completed.

#### **5. Empirical Evaluation**

The goal of this paper is to build agents that can autonomously explore novel 2D dynamic environments with moving obstacles. To verify the effectiveness of the proposed method, we implement the components described above and evaluate their performance.

#### *5.1. Experimental Setup*

To evaluate the effect of the proposed model, we construct 2D grid maps, following reference [56], to represent the layout of indoor scenes such as offices. Each map has size $M = 40$, as shown in Figure 8. The first six maps make up the training set, and the rest are test maps. The maps have different spatial layouts, and there is no intersection between the training set and the test set.

We use the ratio of the explored region as the metric, calculated by dividing the covered area by the total area that can be explored. It is defined as:

$$ExpRatio = \frac{C(EM_t)}{\sum_{x=0}^{M}\sum_{y=0}^{M} \bigl(1 - T(x,y)\bigr)} \tag{10}$$

where $C(EM_t)$ represents the total area explored at time $t$, and $T(x,y)$ indicates whether cell $(x,y)$ is untraversable, so the denominator is the total explorable area.
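On boolean occupancy grids, Equation (10) reduces to the following sketch (`explored` and `untraversable` are our illustrative array names):

```python
import numpy as np

def exp_ratio(explored: np.ndarray, untraversable: np.ndarray) -> float:
    """Equation (10): explored cells divided by all explorable cells (T(x, y) = 0)."""
    explorable = ~untraversable
    return float((explored & explorable).sum() / explorable.sum())
```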

**Figure 8.** The different 2D grid maps without moving obstacles.

In these 2D grid maps, we place several obstacles that move independently of each other. The initial positions and moving directions of the moving obstacles are randomly selected, and their movement mode is similar to that of intelligent sweeping robots. As shown in Figure 9, an obstacle moves straight until it collides with another obstacle or touches the boundary of the map, and then changes its moving direction randomly.

**Figure 9.** The diagram of the initial distribution of moving obstacles and their movement trajectories. In the left figure, the red cells represent the dynamic obstacles, and the yellow dotted line represents a possible movement trajectory of an obstacle. When the environment is initialized, moving obstacles are randomly generated in the blank area of the map, and each obstacle is independent of the others. The right figure shows the initialization of the moving obstacles.
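The obstacle motion model described above can be sketched as follows (our illustrative implementation; `grid` is a 2D array in which 0 marks a free cell):

```python
import random

DIRS = [(0, 1), (0, -1), (1, 0), (-1, 0)]  # four-connected grid moves

def step_obstacle(pos, direction, grid):
    """Move straight; on hitting an obstacle or the map boundary,
    pick a new random direction (as in Figure 9)."""
    x, y = pos[0] + direction[0], pos[1] + direction[1]
    inside = 0 <= x < grid.shape[0] and 0 <= y < grid.shape[1]
    if inside and grid[x, y] == 0:
        return (x, y), direction
    return pos, random.choice(DIRS)  # blocked: turn randomly, move next step
```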

To simplify the calculation, both the agent and the moving obstacles are regarded as circles with a radius of 0.5. Table 1 shows the parameters used in this experiment.

**Table 1.** Dynamic environment parameter details.


**Training Details**. We use multi-process parallel PPO to train the global policy in the GEM, with a separate process for each training map. The hyperparameters of PPO and the global policy network are shown in Tables 2 and 3, respectively.
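As a sketch of such a setup (not the paper's code), one process per training map can be realized with stable-baselines3's `SubprocVecEnv`; `GridExplorationEnv` is a hypothetical Gym-style environment class, and the actual hyperparameters would come from Table 2:

```python
from stable_baselines3 import PPO
from stable_baselines3.common.vec_env import SubprocVecEnv

TRAIN_MAPS = [f"map_{i}" for i in range(6)]  # the six training maps

def make_env(map_name):
    # Each subprocess runs its own copy of the exploration environment
    return lambda: GridExplorationEnv(map_name)  # hypothetical environment class

if __name__ == "__main__":
    env = SubprocVecEnv([make_env(m) for m in TRAIN_MAPS])  # one process per map
    model = PPO("MlpPolicy", env, verbose=1)  # hyperparameters as in Table 2
    model.learn(total_timesteps=1_000_000)
```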

**Table 2.** PPO hyperparameter details.



**Table 3.** Global policy network details.

**Baselines**. We compare against classical methods and end-to-end DRL methods: Random, Straight, RND-PPO, and the frontier-based method. All methods were tested 15 times on the four test maps with random initial positions:


RND-PPO is an end-to-end method that takes the observation as input and outputs a specific action for the agent. Such methods are hard to train to a desirable policy. Compared with RND-PPO and Random, Straight is more stable, as it changes the agent's velocity only when the agent is about to collide with an obstacle. Besides, the frontier-based method is hierarchical like ours, in that its workflow is also to select a position and then move to it, and we find that even SOTA DRL exploration methods struggle to match its exploration ratio [16].

#### *5.2. Local Real-Time Obstacle Avoidance*

We first verify the effectiveness of SFVO and the optimal velocity evaluation function for real-time obstacle avoidance, as shown in Figures 10–12. The green squares represent the positions of the agent at the current time, and the blue squares represent the target points. The red squares represent a moving obstacle with downward velocity, and the orange squares represent key points. The task of the agent is to reach the target point quickly while avoiding static and dynamic obstacles.

Because of the key points and the four-direction (up, down, left, and right) action space, each segment of the path between two key points forms a straight line, so the agent has to move perpendicular to the line or in the opposite direction in order to avoid the moving obstacles. As a result, the expected velocity direction evaluation function $f_v$ of the agent during obstacle avoidance is not greater than 0, and the agent depends entirely on the relative vertical distance evaluation function $f_d$ to select the optimal velocity. In other words, if the theory described above is correct, then no matter how large $k_2$ is, it will always play a role in obstacle avoidance. After obstacle avoidance, the agent needs to change its velocity to approach the key point, and the velocity selection at this time depends entirely on the expected velocity direction evaluation function $f_v$. That is to say, no matter how large $k_1$ is, it always plays a role in the velocity selection when approaching the key point after obstacle avoidance. Therefore, only a few combinations of weighting coefficients in the evaluation function (Equation (9)) need to be examined in this experiment. We set three groups of weighting coefficients: $k_1 = 0, k_2 = 1$; $k_1 = 1, k_2 = 0$; and $k_1 = 1, k_2 = 1$. The trajectories of the agent under the three groups of coefficients are shown in Figures 10–12.

**Figure 10.** The trajectory of the agent under the condition of $k_1 = 0$, $k_2 = 1$ set in $f$.

As shown in Figure 10, when $k_1 = 0$ and $k_2 = 1$, the agent's trajectory moves farther and farther away from the obstacle but not towards the key points. In essence, the agent is still moving randomly, merely while avoiding collisions with obstacles.

**Figure 11.** The trajectory of the agent under the condition of $k_1 = 1$, $k_2 = 0$ set in $f$.

When $k_1 = 1$ and $k_2 = 0$, the agent ignores the risk of collision while moving. As shown in Figure 11, at $t = 3$ the agent determines through the SFVO algorithm that, on its current course, it will collide with the moving obstacle within the next two steps, so it must take an obstacle-avoidance action. The agent is close to the static obstacle on its right and far from the one on its left. Therefore, its best obstacle-avoidance action at this moment is to move left, which avoids the moving obstacle and reduces the risk of collision with the other obstacles, as shown in Figure 12. However, with $k_1 = 1$ and $k_2 = 0$ the agent does not consider the relative vertical distance $d_v$ from the obstacle and has a 50% probability of moving right, as shown in Figure 11, which increases the risk of collision with the other obstacles.

**Figure 12.** The trajectory of the agent under the condition of $k_1 = 1$, $k_2 = 1$ set in $f$.

Figures 11 and 12 show not only the effectiveness of the SFVO algorithm and the evaluation function for obstacle avoidance but also the efficacy of the hybrid algorithm for path planning. At $t = 0$, the orange square close to the agent is the second type of key point, the square far from the agent is the first type, and the blue square is the third type. When the agent reaches a key point or one of its eight adjacent squares, it is deemed to have reached the key point, and it then proceeds to the subsequent key points for local path planning, which finally guides the agent to the target point.

#### *5.3. Comparison with Baselines on Spatial Exploration*

We test IRHE-SFVO with weighting coefficients $k_1 = 1$, $k_2 = 1$ in the evaluation function and compare it with the baselines on the test maps. The results are shown in Figure 13 and Table 4. It is worth noting that, from a safety perspective, when the agent is about to collide with an obstacle, it should stop moving or change direction immediately; the vanilla frontier-based strategy has no such design.

**Table 4.** The average exploration ratios of the proposed method and baselines on the four test maps. The numbers in brackets are the average numbers of collisions with moving obstacles for the frontier-based strategy over 15 spatial explorations on each test map.


**Figure 13.** Coverage performance. Policies are tested for 800 steps (15 random start locations on each of the 4 test maps). The darker lines represent the mean exploration ratio, and the shaded areas represent the standard deviation across the 15 runs.

Figure 13 shows the coverage performance of the different methods on the four test maps. Together with Table 4, we can see the order of coverage from low to high: Random and RND-PPO, Straight, IRHE-SFVO, Frontier. Specifically, we first notice that RND-PPO and Random perform similarly poorly. The reasons are as follows. The core of a dynamics-based curiosity agent such as RND is that if a state is encountered many times, its novelty keeps declining as the network parameters update during training. In these grid maps, the agent needs to leave the familiar area around its initial location in each episode; however, after several episodes of training, the "novelty" of the states around the initial position drops to a low level while the agent has not yet touched the high-novelty world outside, so it becomes difficult for the agent to leave its familiar area. Second, the policy learned by the RND agent lacks the ability to explore, since the intrinsic reward encourages the agent to traverse more states rather than teaching it how to explore. Specifically, the intrinsic reward gradually decreases as training proceeds, owing to the black-box model, so the learned policy depends more and more on the external reward and, by the end of training, depends on it completely. In our experimental setting, RND has only a collision punishment as its external reward, so the agent ends up moving randomly and only learns to avoid collisions. This is why RND-PPO is only slightly better than Random.

Next, we notice that the Straight method performs much better than Random and RND-PPO because it takes random actions only when a collision occurs, making it more stable than Random and RND-PPO, which move randomly at every step.

Overall, these methods perform far worse than IRHE-SFVO because our algorithm has an instructive high-level exploration strategy and an effective local movement module. At the beginning of GEM training, the target points are selected almost randomly. As training proceeds, the agent, driven by the intrinsic rewards it generates itself, gradually chooses points that are distant but easily reachable. Intrinsic reward, generated from intrinsic motivation, can make an individual feel satisfied psychologically or emotionally because of an increase in obtained knowledge or control [60]. In our exploration task, the intrinsic rewards are calculated from the increase of the explored area, so the agent is more rewarded when it chooses a point in an unknown region that is farther away from it. Furthermore, since the number of target points chosen during an exploration task is fixed, the agent learns to adjust the distance between itself and the selected target point reasonably as training progresses. For example, the distances will be larger when the number of target points is 10 and smaller when it is 20. In addition, the LMM adapts the agent to the planned path and avoids collisions with moving obstacles according to its perception, so the agent reaches the target points quickly and safely.

Finally, we notice that the frontier-based strategy achieves the highest exploration ratio on all the test maps. This method selects frontier points that lie on the boundary between the known free space and the unknown region according to the maps built by the agent, and the experimental environments in this paper are idealized, without any perception noise or action errors, which is highly favourable to a frontier-based exploration agent. In the environment with moving obstacles, this method may miss some frontier points at a given time, so the spaces around them remain unexplored for a while; at a later time, however, it can always select these frontier points again for spatial exploration. Because the motion trajectories of the moving obstacles are random, they cannot permanently occupy the positions that cause the frontier method to misjudge these frontier points, so the exploration ratio of the frontier-based method is almost unaffected by dynamic obstacles. However, safety is a crucial concern in spatial exploration, and Table 4 shows that the frontier-based method collides with dynamic obstacles many times during exploration, while the other methods do not.

In addition, we visualize the paths and coverage areas of IRHE-SFVO and the baselines on test map 2, with the initial position of the agent at (1, 1) on the bottom left. As shown in Figure 14, the subfigures in each row show, from left to right, the trajectories at step 0, step 40, step 200, step 400, step 600, and step 800, respectively.

We can see that the IRHE-SFVO algorithm covers almost all the space, and its motion trajectory is smoother and more reasonable than those of its competitors; it arguably produces exploration strategies similar to those of human beings. Although the frontier-based method has high exploration coverage, its motion trajectory is mechanical and zigzagging, for example in the upper left corner and the blank area in the middle of the map. In addition, combined with Figure 13, we can see that the exploration efficiency of IRHE-SFVO is slightly higher than that of the frontier method in the initial exploration stage, because IRHE-SFVO aims to maximize the information gain of each step; this, however, is also the reason for its insufficient local exploration. Every time IRHE-SFVO selects a target point, it tends to select locations with large unexplored areas around them, so that the agent quickly obtains a large amount of map information, but the exploration of local details is insufficient.

In summary, we use DRL based on intrinsic motivation to simulate human high-level cognitive behavior during exploration, so that the agent always chooses particularly unknown places to explore. For quick and safe movement, instead of learning methods that are difficult to train jointly, we use a hierarchical framework consisting of planning and controlling to simulate human low-level real-time responses. Combining these two modules, the IRHE-SFVO algorithm meets the requirements of efficiency and quasi-humanity in spatial exploration.

**Figure 14.** Sample trajectories of the competing approaches along with the coverage region in Test map 2. The orange point represents the initial location of the agent. The solid green lines represent the trajectories that the agent has traversed and the blue shaded region shows the explored area.
