**1. Introduction**

Navigation in dynamic environments plays an important role in computer games and robotics [1], for example in generating realistic behaviors for Non-Player Characters (NPCs) and in practical applications of mobile robots in the real world. In this paper, we focus on the navigation of nonholonomic mobile robots [2] with continuous control in dynamic environments, as this is an essential capability for widely used mobile robots.

Conventionally, sampling-based and velocity-based methods are used to support the navigation of mobile robots [1]. Sampling-based methods, such as Rapidly-exploring Random Trees (RRTs) [3] and Probabilistic Roadmaps (PRMs) [4], handle environmental changes by reconstructing pre-built abstract representations, which is computationally expensive. Velocity-based methods [5] compute avoidance maneuvers by searching over a tree of candidate maneuvers, which is also time-consuming in complex environments.

Alternatively, many researchers have turned to learning-based methods, mainly Deep Learning (DL) and Deep Reinforcement Learning (DRL). Although DL achieves great performance in robotics [6], it is hard to apply to this navigation problem, since collecting large amounts of labeled data is costly and time-consuming. By contrast, DRL does not need supervised samples and has been widely used in video games [7] and robot manipulation [8]. DRL methods fall into two categories: value-based and policy-based. Compared with value-based methods, policy-based methods are better suited to continuous action spaces. Asynchronous Advantage Actor-Critic (A3C), a policy-based method, is widely used in video games and robotics [9–11], since it greatly decreases training time while increasing performance [12]. However, several challenges must be tackled before A3C can handle robot navigation in dynamic environments.

First, robots trained by A3C are vulnerable to local minima such as box canyons and long barriers, which most DRL algorithms cannot effectively escape. Second, rewards are sparsely distributed in the environment, since there may be only one goal location [9]. Third, due to partial observability [13], robots with limited visibility can gather only incomplete and noisy information about the current environmental state; in this paper, we consider robots that know only the relative positions of obstacles within a small range.

Given that an individual DRL approach can hardly drive robots out of local-minimum regions and navigate them through changing surroundings, this paper proposes a hierarchical navigation algorithm based on a two-level architecture: a high-level path planner efficiently searches for a sequence of subgoals placed at the exits of those regions, while a low-level motion controller learns to tackle moving obstacles in local environments. Since JPS+ (P), a variant of JPS+, answers a path query on average up to two orders of magnitude faster than traditional A\* [14], the path planner uses it to find subgoals between which the circumstances are relatively simple, making it easy for A3C to train and converge to a motion policy.
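The two-level loop described above can be sketched as follows. This is a minimal illustration, not the paper's implementation: all class and method names (`Planner`, `Controller`, `Robot`, etc.) are assumptions, the planner stands in for JPS+ (P), and the controller stands in for the learned IA3C policy.

```python
from dataclasses import dataclass

@dataclass
class Robot:
    position: tuple = (0.0, 0.0)

    def reached(self, subgoal, tol=0.5):
        # Manhattan distance check against a small tolerance
        return abs(self.position[0] - subgoal[0]) + abs(self.position[1] - subgoal[1]) <= tol

    def observe(self):
        # Partial observation: only local information is available
        return {"pos": self.position}

    def step(self, action):
        dx, dy = action
        self.position = (self.position[0] + dx, self.position[1] + dy)

class Planner:
    """Stand-in for the high-level JPS+ (P) planner: here it just returns
    a midpoint and the goal as the subgoal sequence."""
    def plan(self, start, goal):
        mid = ((start[0] + goal[0]) / 2, (start[1] + goal[1]) / 2)
        return [mid, goal]

class Controller:
    """Stand-in for the low-level learned policy: a greedy step
    of fixed speed toward the current subgoal."""
    def act(self, obs, subgoal, speed=0.5):
        x, y = obs["pos"]
        dx, dy = subgoal[0] - x, subgoal[1] - y
        norm = max((dx * dx + dy * dy) ** 0.5, 1e-9)
        return (speed * dx / norm, speed * dy / norm)

def navigate(robot, goal, planner, controller, max_steps=200):
    # High level: plan a sequence of subgoals once.
    # Low level: drive toward each subgoal from local observations.
    for subgoal in planner.plan(robot.position, goal):
        steps = 0
        while not robot.reached(subgoal) and steps < max_steps:
            robot.step(controller.act(robot.observe(), subgoal))
            steps += 1
    return robot.position
```

The point of the decomposition is that between consecutive subgoals the local environment is simple, so the low-level policy only ever has to solve short-range problems.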

To tackle the two remaining challenges regarding learning, we propose an improved A3C (IA3C) for the motion controller. IA3C combines a modified curriculum learning scheme with reward shaping to build a novel reward-function framework that addresses the sparse-reward problem. In this framework, the modified curriculum learning uses prior experience to define navigation tasks at different difficulty levels and adjusts the frequencies of these tasks according to features of the navigation problem; reward shaping plays an auxiliary role in each task to speed up training. Moreover, IA3C builds a network architecture based on long short-term memory (LSTM), which integrates the current observation with a hidden state derived from historical observations to estimate the current state. Briefly, the proposed method, named JPS-IA3C, integrates JPS+ (P) and IA3C; it realizes long-range navigation in complex, dynamic environments by addressing the above challenges.
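The reward-shaping and curriculum ideas can be illustrated with a small sketch. The constants, the potential function, and the success-rate-based task weighting below are all assumptions for illustration; the paper's actual reward framework and task schedule differ in detail.

```python
import math
import random

def shaped_reward(prev_dist, curr_dist, collided, reached, gamma=0.99):
    """Illustrative shaped reward: sparse terminal rewards plus a
    potential-based shaping term that rewards progress toward the subgoal.
    Uses the potential phi(s) = -distance, so the shaping term is
    gamma * phi(s') - phi(s)."""
    if reached:
        return 10.0
    if collided:
        return -10.0
    return gamma * (-curr_dist) - (-prev_dist)

def sample_task(levels, success_rates, temperature=0.5):
    """Toy curriculum sampler: tasks the agent currently fails more often
    are sampled more frequently. The exponential weighting is illustrative,
    not the paper's actual frequency-adjustment rule."""
    weights = [math.exp((1.0 - s) / temperature) for s in success_rates]
    total = sum(weights)
    r = random.random() * total
    acc = 0.0
    for level, w in zip(levels, weights):
        acc += w
        if r < acc:
            return level
    return levels[-1]
```

Potential-based shaping of this form is known to preserve the optimal policy while densifying the reward signal, which is why it pairs naturally with a curriculum over task difficulty.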

The rest of this paper is organized as follows: Section 2 discusses related work on model-based methods, DL, and DRL. Section 3 presents our hierarchical navigation approach. Section 4 evaluates its performance in simulation experiments. Finally, Section 5 concludes the paper.
