**5. Conclusions**

In this paper, we proposed a novel hierarchical approach to navigation in dynamic environments. It integrates a high-level path planner based on JPS+ (P) and a low-level motion controller based on an improved A3C (IA3C). JPS+ (P) efficiently plans subgoals for the motion controller, which eliminates first-move lag and addresses the challenge of the local-minima trap. To satisfy kinetic constraints and handle moving obstacles, IA3C learns near-optimal control policies that plan feasible trajectories between adjacent subgoals via continuous control. IA3C combines modified curriculum learning with reward shaping to build a novel reward-function framework, which avoids the learning inefficiency caused by sparse rewards. We additionally strengthen robots' temporal reasoning about their environments with a memory-based network. These improvements enable the IA3C controller to converge faster and adapt better to the incomplete, noisy information caused by partial observability.
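The hierarchical scheme described above can be illustrated with a minimal sketch: a global planner supplies subgoals, a local controller drives toward each one, and progress toward the current subgoal provides a dense, shaped reward. All names here (`plan_subgoals`, `controller_step`, `shaped_reward`) are hypothetical stand-ins, not the authors' JPS+ (P) or IA3C implementations; the planner and controller are replaced by trivial placeholders to keep the sketch self-contained.

```python
import math

def plan_subgoals(grid, start, goal):
    """Stand-in for the JPS+ (P) global planner: returns a list of subgoals.
    A real planner would place subgoals at the exits of local minima;
    here we simply return the final goal to keep the sketch runnable."""
    return [goal]

def shaped_reward(pos, next_pos, subgoal):
    """Distance-based reward shaping: reward equals the progress made
    toward the current subgoal, giving dense feedback instead of a
    sparse goal-reached signal."""
    def dist(p, q):
        return math.hypot(p[0] - q[0], p[1] - q[1])
    return dist(pos, subgoal) - dist(next_pos, subgoal)

def controller_step(pos, subgoal, step=1.0):
    """Stand-in for the learned IA3C controller: a greedy unit step
    toward the subgoal (a trained policy would respect kinetic
    constraints and avoid moving obstacles)."""
    dx, dy = subgoal[0] - pos[0], subgoal[1] - pos[1]
    d = math.hypot(dx, dy) or 1.0
    s = min(step, d)
    return (pos[0] + s * dx / d, pos[1] + s * dy / d)

def navigate(grid, start, goal, tol=0.5, max_steps=1000):
    """Hierarchical loop: the planner yields subgoals; the controller
    reaches each in turn, accumulating shaped reward along the way."""
    pos = start
    total_reward = 0.0
    for subgoal in plan_subgoals(grid, start, goal):
        for _ in range(max_steps):
            if math.hypot(pos[0] - subgoal[0], pos[1] - subgoal[1]) <= tol:
                break
            nxt = controller_step(pos, subgoal)
            total_reward += shaped_reward(pos, nxt, subgoal)
            pos = nxt
    return pos, total_reward
```

Because the shaping term telescopes, the accumulated reward equals the net distance closed toward each subgoal, so the shaping does not alter which trajectories are optimal; this is the property that makes reward shaping a safe remedy for sparse rewards.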

In simulation experiments, compared with A\*, JPS-IA3C plans more essential subgoals, placed at the exits of local minima. Moreover, JPS-IA3C's first-move lag is 271–1309 times shorter than A\*'s on large-scale maps. Compared with A3C+, which removes IA3C's memory-based network, IA3C integrates current observations with abstract history information to achieve higher success rates and mean V-values during training. Finally, compared with JPS-A3C+, JPS-IA3C navigates robots through unseen, large-scale environments with shorter path lengths and less execution time, achieving success rates above 94%.

In future work, we will consider the more complex behaviors of moving obstacles in the real world. For example, in large shopping malls, people with different intentions generate complicated trajectories that cannot be predicted easily. These problems can be addressed by adding an external memory network [37] and feedback control mechanisms [38] to the network architectures of DRL agents, enabling them to estimate state-action values more accurately.

**Author Contributions:** J.Z. and L.Q. conceived and designed the paper structure and the experiments; J.Z. performed the experiments; Y.H., Q.Y., and C.H. contributed materials and analysis tools.

**Funding:** This work was sponsored by the National Science Foundation of Hunan Province (NSFHP) project 2017JJ3371, China.

**Acknowledgments:** The work described in this paper was sponsored by the National Natural Science Foundation of Hunan Province under Grant No. 2017JJ3371. We appreciate the fruitful discussion with the Sim812 group: Qi Zhang and Yabing Zha.

**Conflicts of Interest:** The authors declare no conflicts of interest. The funding sponsors had no role in the design of the study; in the collection, analysis, or interpretation of data; in the writing of the manuscript; or in the decision to publish the results.
