*3.1. Problem Description*

In this paper, the problem of navigation can be stated as follows: on an initially known map, a mobile tracked robot equipped only with local laser sensors autonomously navigates, via continuous control, without colliding with moving obstacles. Since environments with moving obstacles are dynamic and uncertain, this navigation task is difficult to deal with and is a non-deterministic polynomial-hard (NP-hard) problem [27].

In this paper, we adopt the Cartesian coordinate system. The structure of the tracked robot is shown in Figure 1. The position state of the tracked robot is defined as *s* = (*xr*, *yr*, *θr*) in the global coordinate frame of the map, where *xr* and *yr* are the tracked robot's horizontal and vertical coordinates, respectively, and *θr* is the angle between the forward orientation of the tracked robot and the abscissa axis. Twelve laser sensors mounted at the front of the tracked robot can sense the nearby surroundings over a full 360 degrees; the detection angle of each laser sensor is 30 degrees, and the maximum detection distance is *Dmax*. The robot is controlled by the left and right tracks. The kinematic equations are as follows:

$$
\begin{pmatrix}
\dot{x}_r \\
\dot{y}_r \\
\dot{\theta}_r
\end{pmatrix} = R \begin{bmatrix}
\frac{\cos \theta_r}{2} & \frac{\cos \theta_r}{2} \\
\frac{\sin \theta_r}{2} & \frac{\sin \theta_r}{2} \\
\frac{1}{l_B} & \frac{-1}{l_B}
\end{bmatrix} \begin{pmatrix}
\omega_l \\
\omega_r
\end{pmatrix},
\tag{1}
$$

$$
\begin{pmatrix}
\dot{\omega}_l \\
\dot{\omega}_r
\end{pmatrix} = \begin{pmatrix}
\frac{\partial \omega_l}{\partial t} \\
\frac{\partial \omega_r}{\partial t}
\end{pmatrix},
\tag{2}
$$

where *R* is the radius of the driving wheels in the tracks; *ωl* and *ωr* denote the angular velocities (rad·s<sup>−1</sup>) of the left and right tracks, respectively; *ω̇l* and *ω̇r* denote the corresponding angular accelerations (rad·s<sup>−2</sup>); and *lB* denotes the distance between the left and right tracks.
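The kinematic model of Eq. (1) can be sketched in a few lines of Python. The block below maps track angular velocities to pose rates and Euler-integrates the pose over one control period; the numerical values of *R*, *lB*, and *dt* are illustrative placeholders, not parameters from this paper.

```python
import numpy as np

def track_kinematics(state, omega_l, omega_r, R=0.05, l_B=0.4):
    """Eq. (1): map track angular velocities (rad/s) to pose rates.

    state: (x_r, y_r, theta_r) in the global frame.
    R, l_B: wheel radius and track separation (placeholder values).
    """
    x_r, y_r, theta_r = state
    v = R * (omega_l + omega_r) / 2.0   # forward speed along the heading
    w = R * (omega_l - omega_r) / l_B   # yaw rate from track speed difference
    return np.array([v * np.cos(theta_r),
                     v * np.sin(theta_r),
                     w])

def step(state, omega_l, omega_r, dt=0.05, **kw):
    """Euler-integrate the pose over one control period dt."""
    return state + dt * track_kinematics(state, omega_l, omega_r, **kw)
```

With equal track speeds the yaw rate vanishes and the robot drives straight along its heading; a speed difference between the tracks produces rotation, matching the third row of the matrix in Eq. (1).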

**Figure 1.** Schematic diagram of the tracked robot. Gray dash-dotted lines denote the sensor lines of the robot.

## *3.2. The Architecture of the Proposed Algorithm*

As mentioned above, JPS-IA3C resolves the robot navigation in a two-level manner: the first for the efficient global path planner, and the second for the robust local motion controller. Figure 2 shows the architecture of the proposed algorithm.

**Figure 2.** Flowchart of the Jump Point Search improved Asynchronous Advantage Actor-Critic (JPS-IA3C) method.

As shown in Figure 2, in the offline phase of the path planner, whose goal is to efficiently find subgoals, we first define a circular warning area around every obstacle, taking the self-defined safety distance and the robot's size into consideration. It is perilous for the robot to enter these areas, since they are close to obstacles. Then, based on the modified map, JPS+ (P) calculates and stores the topological information of the map by preprocessing. In the online phase, JPS+ (P) uses the canonical ordering known as diagonal-first to efficiently find subgoals based on the preprocessed map data.
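Constructing the circular warning areas amounts to inflating every obstacle cell of the grid map by the robot's size plus the safety distance. A minimal sketch of this step, assuming a binary occupancy grid (the function and parameter names are ours, not the paper's):

```python
import numpy as np

def add_warning_areas(grid, robot_radius, safety_dist, cell_size=1.0):
    """Mark a circular warning area around every occupied cell.

    grid: 2-D array, 1 = obstacle, 0 = free.
    The inflation radius combines the robot's size and a
    self-defined safety distance, converted to grid cells.
    Returns a copy with warning cells set to 1.
    """
    r = int(np.ceil((robot_radius + safety_dist) / cell_size))
    inflated = grid.copy()
    h, w = grid.shape
    for oy, ox in np.argwhere(grid == 1):
        for dy in range(-r, r + 1):
            for dx in range(-r, r + 1):
                if dy * dy + dx * dx <= r * r:        # inside the circle
                    y, x = oy + dy, ox + dx
                    if 0 <= y < h and 0 <= x < w:     # stay on the map
                        inflated[y, x] = 1
    return inflated
```

JPS+ (P) then preprocesses this modified map, while (as noted below) the learned controller still operates on the original map.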

At the offline step of the motion controller, whose task is to plan feasible paths between adjacent subgoals, we first extract key information about the robot and its environment to build a partially observable Markov decision process (POMDP) model, a formal description of partially observable environments for IA3C. Then, we quantify the difficulty of navigation and set training plans via a modified curriculum learning. Next, IA3C is used to learn navigation policies with high training efficiency. At the online step, guided by subgoals, the robot receives sensor data as local environmental observations and then plans continuous control with the learned policies. The robot then executes actions in the dynamic environment and simultaneously updates its trajectory via the kinematic equations. The online phase of the local motion planner constitutes the interaction between the robot and the environment. Note that IA3C works on the original map, because warning areas cannot be detected by sensors.
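The online step described above can be sketched as a simple loop: for each subgoal, the robot observes, queries the learned policy (carrying a recurrent hidden state), and integrates the kinematics. The `robot` and `policy` interfaces below are our assumptions for illustration, not the paper's API.

```python
import numpy as np

def online_control_loop(robot, policy, subgoals, dt=0.05, goal_tol=0.3):
    """Online phase of the motion controller: guided by subgoals,
    observe locally, act via the learned policy, update the pose.

    robot:  assumed interface with position(), observe(goal), step(accel, dt)
    policy: assumed interface act(obs, hidden) -> (accel, hidden)
    """
    hidden = None                                   # recurrent (LSTM) state
    for goal in subgoals:
        while np.linalg.norm(robot.position() - goal) > goal_tol:
            obs = robot.observe(goal)               # laser readings + goal info
            accel, hidden = policy.act(obs, hidden) # continuous track accelerations
            robot.step(accel, dt)                   # integrate the kinematics
```

Keeping the hidden state across steps is what lets the LSTM-based policy accumulate information over time in the partially observable setting.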

There are two advantages of our method for robot navigation. The first is that its high-level path planner can plan subgoals, using preprocessed data and canonical ordering, within tens of microseconds on average, so as to respond quickly to online tasks and avoid so-called first-move lags [28]. In addition, it decomposes a long-range navigation task into a number of local control phases between every two consecutive subgoals. Traversing each of these segments, which contain no twisty terrain, rids our RL-driven controller of local minima. Therefore, benefiting from the path planner, the proposed algorithm can adapt to large-scale and complex environments.

The second advantage is that IA3C can learn near-optimal policies that navigate agents between adjacent subgoals in dynamic environments while satisfying kinematic and task constraints. IA3C has two strengths regarding learning to navigate. (1) To estimate complete environmental states, IA3C takes the time dimension into consideration by constructing an LSTM-based network architecture that can memorize information at different timescales. (2) Curriculum learning helps agents gradually master complex skills, but conventional curriculum learning has been shown to be ineffective at training LSTM layers [29]. Therefore, we improve curriculum learning by adjusting the distribution of samples of different difficulties over time, and combine it with reward shaping [30] to build a novel reward-function framework that addresses the challenge of sparse rewards. Moreover, owing to the generalization of the local motion controller, the proposed algorithm can navigate robots in unseen environments without retraining.
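One way to realize a curriculum that adjusts the distribution of sample difficulties over time, while always keeping some mass on every difficulty level (so the LSTM layers keep seeing a mix of samples), is a softmax schedule over difficulty levels. The softmax form and the `sharpness` parameter below are our assumptions for illustration, not the paper's exact schedule.

```python
import numpy as np

def difficulty_distribution(progress, n_levels=5, sharpness=4.0):
    """Sampling distribution over difficulty levels 0..n_levels-1.

    progress: training progress in [0, 1]; the distribution's mode
    shifts from the easiest level (0) to the hardest (n_levels - 1)
    as progress increases, but every level keeps nonzero probability.
    """
    levels = np.arange(n_levels)
    target = progress * (n_levels - 1)          # difficulty the schedule centers on
    logits = -sharpness * np.abs(levels - target)
    weights = np.exp(logits)
    return weights / weights.sum()              # normalize to a probability vector
```

At each training iteration, an episode's difficulty level can be drawn from this distribution, e.g. `np.random.choice(n_levels, p=difficulty_distribution(progress))`.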
