#### **1. Introduction**

Spatial cognitive behaviour modelling is a fundamental component of human cognitive behaviour modelling and one of the most active topics at the intersection of neuroscience and computer science. At its core, an agent in an AI system must explore its environment to gather sufficient information about the spatial structure. Possible applications include search and rescue (SAR) missions; intelligence, surveillance and reconnaissance (ISR); and planetary exploration. It is therefore important to design efficient and effective exploration strategies for unknown spaces.

At present, autonomous spatial exploration falls into two main categories: traditional rule-based exploration and intelligent machine-learning-based exploration. Rule-based methods such as the frontier-based method [1] are simple, convenient and efficient. These approaches rely on expert-designed map features: they expand the exploration scope by searching the explored map for the next best frontier point, i.e., a point on the boundary between free and unknown cells. However, the locomotion of an agent driven by such methods is mechanical and rigid, and it is difficult to balance exploration efficiency against computational burden. As an effective tool for learning strategies autonomously, deep reinforcement learning (DRL) has been applied to spatial exploration more and more widely. However, DRL suffers greatly from the inherent "exploration-exploitation" dilemma, which leads to sample inefficiency when extrinsic rewards are sparse or even nonexistent. To mitigate sparse rewards, many recent DRL approaches incorporate the concept of intrinsic motivation (IM) from cognitive psychology and produce intrinsic rewards that make the reward signal denser.
However, intrinsic-motivation-based enhancement alone is insufficient for efficient exploration in unknown spaces. The main reason is that IM treats all unseen states indiscriminately and ignores the structural regularities of physical spaces. In addition, it is difficult for an end-to-end DRL agent to simultaneously learn obstacle avoidance, path planning and spatial exploration from raw sensor data.

To this end, we extend our previous work [2] and propose a three-tiered hierarchical autonomous spatial exploration model, named Intrinsic Rewards based Hierarchical Exploration with Self-adaptive Finite-time Velocity Obstacle (IRHE-SFVO), to explore unknown static and dynamic 2D spaces. This model consists of two parts: a Global Exploration Module (GEM) and a Local Movement Module (LMM). GEM learns an exploration policy that produces a sequence of target points expected to maximize the information gain about the spatial structure, using the agent's location, its trace, and the explored portions of the map as its spatial memory. Specifically, to make the agent's motion pattern more human-like, GEM is not concerned with the agent's immediate neighbourhood but instead determines a distant yet reasonably reachable target to be explored next. Selected on the basis of intrinsic rewards, such targets usually have large unexplored areas around them.
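
As a minimal illustration of how such intrinsic-reward-based target selection can work, the sketch below scores each candidate target by the number of still-unknown cells within a fixed window around it and picks the maximiser. The grid encoding, window radius and function names are illustrative assumptions of this sketch, not the reward function defined later in the paper.

```python
import numpy as np

UNKNOWN, FREE, OCCUPIED = -1, 0, 1  # illustrative cell labels


def information_gain(grid: np.ndarray, target: tuple, radius: int) -> int:
    """Count still-unknown cells in a square window centred on `target`.

    A crude proxy for how much a sensor sweep at `target` could reveal.
    """
    r, c = target
    window = grid[max(r - radius, 0):r + radius + 1,
                  max(c - radius, 0):c + radius + 1]
    return int(np.count_nonzero(window == UNKNOWN))


def select_target(grid: np.ndarray, candidates: list, radius: int = 5):
    """Pick the candidate target with the largest expected information gain."""
    return max(candidates, key=lambda t: information_gain(grid, t, radius))
```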

In the local movement phase, we design a hierarchical framework to control the movement to the target point, separated into two stages: planning and controlling. In the planning stage, an optimistic A\* path planning algorithm, which can conduct self-adaptive path planning in a partially known environment, computes a shortest path between the agent's current location and the target point. It assumes that unknown areas are freely traversable and decides whether to replan the global path according to the agent's ongoing perception. In the controlling stage, we use an improved Finite-time Velocity Obstacle (FVO), called Self-adaptive Finite-time Velocity Obstacle (SFVO), and design an optimal velocity function to drive the agent around moving obstacles in real time. This allows the agent to reach the target point quickly while avoiding collisions with moving obstacles.
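
To make the planning stage concrete, the sketch below implements a grid-based A\* that optimistically treats unknown cells as traversable, plus a simple test that triggers replanning once newly perceived obstacles invalidate the remaining path. The cell labels, 4-connectivity, unit step costs and function names are assumptions of this sketch, not the paper's exact planner.

```python
import heapq
import itertools

UNKNOWN, FREE, OCCUPIED = -1, 0, 1  # illustrative cell labels


def optimistic_astar(grid, start, goal):
    """A* on a grid that optimistically treats UNKNOWN cells as traversable.

    4-connected moves with unit cost and a Manhattan heuristic.
    Returns a list of cells from start to goal, or None if unreachable.
    """
    rows, cols = len(grid), len(grid[0])
    h = lambda p: abs(p[0] - goal[0]) + abs(p[1] - goal[1])
    tie = itertools.count()                      # heap tiebreaker
    frontier = [(h(start), next(tie), start, None)]
    parents, g_cost = {}, {start: 0}
    while frontier:
        _, _, cur, parent = heapq.heappop(frontier)
        if cur in parents:
            continue                             # already expanded
        parents[cur] = parent
        if cur == goal:                          # reconstruct path
            path = []
            while cur is not None:
                path.append(cur)
                cur = parents[cur]
            return path[::-1]
        for dr, dc in ((1, 0), (-1, 0), (0, 1), (0, -1)):
            nxt = (cur[0] + dr, cur[1] + dc)
            if not (0 <= nxt[0] < rows and 0 <= nxt[1] < cols):
                continue
            if grid[nxt[0]][nxt[1]] == OCCUPIED:  # only known obstacles block
                continue
            ng = g_cost[cur] + 1
            if ng < g_cost.get(nxt, float("inf")):
                g_cost[nxt] = ng
                heapq.heappush(frontier, (ng + h(nxt), next(tie), nxt, cur))
    return None  # goal unreachable given current knowledge


def needs_replan(remaining_path, grid):
    """Trigger replanning when perception reveals an obstacle on the path."""
    return any(grid[r][c] == OCCUPIED for r, c in remaining_path)
```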

Working in synergy, the modules on the three levels apply a long-horizon decision-making paradigm instead of the step-by-step or state-by-state manner used by some other exploration methods [3]. This decomposition not only reduces the training difficulty, but also tends to generate smooth movements between targets instead of unnatural trajectories. In summary, the main novelties and technical contributions of this paper are: (a) a hierarchical framework for spatial exploration that exploits the structural regularities of unknown environments, (b) an information-maximizing intrinsic reward function for determining the next best target to explore, (c) a hierarchical framework for local movement that combines global path planning with local path planning to reach the target point rapidly and safely, and (d) an optimal velocity function for choosing the best velocity from the collision-avoidance velocity set.

This paper is organized as follows. Section 2 reviews related work on autonomous exploration, DRL based on IM, and real-time obstacle avoidance. Section 3 formulates the autonomous exploration problem. We then present the details of our proposed algorithm and its hyperparameter settings in Section 4. In Section 5, we compare our approach against several popular competitors in a series of simulation experiments, showing that IRHE-SFVO is promising for spatial exploration. Finally, in Section 6, we summarize the work of this paper and discuss future work.

#### **2. Related Work**

In this section, we describe and analyse the research status and development trends of autonomous spatial exploration, reinforcement learning based on IM, and various velocity obstacle methods.

#### *2.1. Autonomous Spatial Exploration*

At present, research on autonomous spatial exploration falls into two main categories: traditional rule-based approaches and intelligent machine-learning-based approaches. The mainstream rule-based method is the frontier-based method proposed by Yamauchi in 1997 [1]. This method detects "frontiers", that is, the edges between free areas and unknown areas, then selects the best "frontier point" according to some criterion; the agent moves from its current position to the selected frontier point via path planning and locomotion, eventually exploring the whole map. The frontier-based exploration strategy is similar to the Next Best View (NBV) problem in computer vision and graphics. There is likewise a large body of literature on the second step of frontier-based exploration, i.e., evaluating and choosing the best frontier. The metrics are generally of three types: (a) cost-based, which select the next target based on path length or time cost [4–7]; (b) utility-based, which select the next target based on information gain [8,9]; and (c) mixtures of the two [10]. Another typical rule-based family is rooted in information theory. These methods use metrics such as entropy [11] or mutual information (MI) [12] to evaluate the uncertainty of the agent's position and of the evidence grid map, and steer the agent in the direction that maximizes information gain. In general, although rule-based approaches are simple and efficient, the movement they induce is mechanical and rigid, and it is difficult to balance exploration efficiency with computational burden.
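
As an illustration of the detection step, the following minimal sketch marks every free cell adjacent to an unknown cell as a frontier cell on a 4-connected occupancy grid. The cell labels are assumptions of this sketch, and practical systems typically cluster the detected cells into frontier regions before scoring them with the cost- or utility-based metrics above.

```python
import numpy as np

UNKNOWN, FREE, OCCUPIED = -1, 0, 1  # illustrative occupancy labels


def find_frontiers(grid: np.ndarray) -> list:
    """Return all free cells bordering at least one unknown cell.

    A minimal sketch of Yamauchi-style frontier detection [1];
    4-connectivity is assumed.
    """
    frontiers = []
    rows, cols = grid.shape
    for r in range(rows):
        for c in range(cols):
            if grid[r, c] != FREE:
                continue
            for dr, dc in ((1, 0), (-1, 0), (0, 1), (0, -1)):
                nr, nc = r + dr, c + dc
                if 0 <= nr < rows and 0 <= nc < cols and grid[nr, nc] == UNKNOWN:
                    frontiers.append((r, c))
                    break
    return frontiers
```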

Owing to recent significant advances in DRL, a number of researchers have tried to cast the exploration problem as an optimal control problem. Tai Lei and Liu Ming [13] proposed an improved DQN framework that trains robots to master obstacle avoidance strategies in unknown environments through supervised learning based on convolutional neural networks (CNNs). However, they only addressed collision avoidance and did not tackle the spatial exploration task itself. Zhang et al. [14] trained an Asynchronous Advantage Actor-Critic (A3C) agent that learns from perceptual information and constructs a global map by combining it with a memory module. Similarly, the A3C network in [15] receives the current map and the agent's location and orientation as input and returns the next visiting direction, with the space around the agent divided equally into six sectors. Chen et al. [16] designed a spatial memory module, used the coverage-area gain as an intrinsic reward, and accelerated policy convergence through imitation learning. Razin et al. [17] used Faster R-CNN for collision avoidance and a double deep Q-learning (DDQN) model to explore unknown space. However, although DRL overcomes the dimensionality limitations of classical methods, it remains difficult to train for end-to-end control.

To address these problems, Niroui et al. [18] and Shrestha et al. [19] combined DRL with a frontier-based method to enable robots to learn exploration strategies from their own experience. Li et al. [20] proposed a modular robot-exploration framework comprising decision, planning and mapping modules; it uses DQN in the decision module to learn a policy for selecting the next exploration target, and an auxiliary edge-segmentation task to speed up training. Chaplot et al. [21] used an Active Neural SLAM module to address exploration in 3D environments under perception noise. We draw some inspiration from these two works [20,21] but are more interested in exploration in 2D environments.

#### *2.2. RL Based on Intrinsic Motivation*

To address the notorious sparse-reward problem, many recent DRL approaches incorporate intrinsic motivation from cognitive psychology. Intrinsic motivation arises from humans' natural interest in activities that provide novelty, surprise, curiosity, or challenge [22], without any external rewards such as food, money or punishment.

Applying IM to RL means that the agent generates an "intrinsic reward" by itself while interacting with the environment. Formulations of intrinsic rewards fall roughly into three categories: (a) visit-count and uncertainty-evaluation-based methods, (b) knowledge and information-gain-based methods, and (c) competence-based methods. The first class, based on the upper confidence bound (UCB), estimates state-visitation counts in high-dimensional feature spaces and large-scale state spaces in order to encourage the agent to visit poorly known states; this genre includes density-based methods [23,24], state-generalization-based methods [25–28] and inference-calculation-based methods [29]. Second, knowledge and information-gain-based methods generally build a dynamics model of the unknown environment and derive intrinsic rewards from the model's increasing accuracy as exploration progresses. Specific formal models of this type include prediction-inconsistency-based models [30–32], prediction-error-based models [3,33–36], learning-process-based models [37] and information-theory-based models [36,38,39]. The third class formulates intrinsic rewards by measuring the agent's competence to control the environment, or the difficulty and cost of completing a task [40]. To date, DRL based on IM has made great progress relative to classic RL in applications with complex state spaces and hard exploration (such as the Atari-57 games) [41].
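
As a concrete instance of category (a), the sketch below implements a tabular count-based bonus in the UCB spirit, r_int(s) = beta / sqrt(N(s)). The cited methods replace the raw count table with density models or learned state embeddings to scale to large state spaces; the class name and the value of beta here are illustrative.

```python
from collections import defaultdict
import math


class CountBonus:
    """Tabular count-based intrinsic reward: r_int = beta / sqrt(N(s)).

    Illustrative sketch only; scalable variants in the literature use
    pseudo-counts from density models or hashed state embeddings.
    """

    def __init__(self, beta: float = 0.1):
        self.beta = beta
        self.counts = defaultdict(int)  # N(s), keyed by hashable states

    def __call__(self, state) -> float:
        """Update the visit count for `state` and return the bonus."""
        self.counts[state] += 1
        return self.beta / math.sqrt(self.counts[state])
```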
