3.4.1. Construction of Navigation POMDP

In [26], the local navigation problem is decomposed into two subproblems (i.e., approaching targets and avoiding obstacles). Simplifying the problem in this way by injecting prior experience easily leads to locally optimal policies. In JPS-IA3C, to acquire optimal navigation policies, the motion controller models the entire navigation problem directly.

According to [24], current observations are regarded as states; in the proposed navigation problem, this representation is non-Markovian, since current observations, including sensor readings, do not capture the full state of dynamic environments, such as the velocities of obstacles. To tackle this problem, a POMDP is built to describe the process of navigating the robot toward targets without colliding with obstacles. A POMDP can capture the dynamics of the environment by explicitly acknowledging that the sensations received by agents are only partial glimpses of the underlying system state [13].

A POMDP is formally defined as a tuple (*S*, *A*, *P*, *R*, *O*, Ω), where *S*, *A*, and *O* are the state, action, and observation spaces, respectively. The state-transition function, *P*(*s*, *a*, *s*′), denotes the probability of transitioning from the current state *s* to the next state *s*′ by executing action *a*. The reward function, *R*(*s*, *a*, *s*′), denotes the immediate reward obtained by the agent when the current state *s* is transferred to the next state *s*′ by taking action *a*. The observation function, Ω(*o*, *s*, *a*), denotes the probability of receiving observation *o* after taking action *a* in state *s* [33].
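For concreteness, the tuple can be summarized as a simple container. The sketch below merely restates the definitions above in code; the class and field names are illustrative and not part of JPS-IA3C.

```python
# Illustrative container for the POMDP tuple (S, A, P, R, O, Omega);
# names are hypothetical and only mirror the definitions in the text.
from dataclasses import dataclass
from typing import Any, Callable

@dataclass
class POMDP:
    states: Any                                    # state space S
    actions: Any                                   # action space A
    observations: Any                              # observation space O
    transition: Callable[[Any, Any, Any], float]   # P(s, a, s'): transition probability
    reward: Callable[[Any, Any, Any], float]       # R(s, a, s'): immediate reward
    observe: Callable[[Any, Any, Any], float]      # Omega(o, s, a): observation probability
```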

In dynamic and partially observable environments, the robot cannot directly obtain the exact environmental state. However, it can construct its own state representation of the environment, called the belief state *Bs*. There are three ways of constructing belief states: the complete history, beliefs over the environment state, and recurrent neural networks. Owing to its power as a nonlinear function approximator, we adopt a typical recurrent neural network, the LSTM. The construction of *Bs* via the LSTM is described as follows:

$$Bs_t^a = \sigma\left( w_s \ast Bs_{t-1}^a + w_o \ast o_t \right) \tag{3}$$

where σ(·) denotes the activation function, *ws* and *wo* denote the weight parameters of the neural network, and *ot* denotes the observation at time *t*.
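The following PyTorch sketch shows one way to realize this recurrent belief-state update with an LSTM, in the spirit of Equation (3). The layer sizes, the use of torch.nn.LSTMCell, and the observation dimension (12 sensor readings plus the six scalar observations listed below) are assumptions for illustration, not the authors' exact network.

```python
# A minimal sketch of constructing the belief state Bs_t from the observation
# stream with an LSTM, as in Equation (3). Sizes are illustrative assumptions.
import torch
import torch.nn as nn

class BeliefStateEncoder(nn.Module):
    def __init__(self, obs_dim: int = 18, belief_dim: int = 128):
        super().__init__()
        self.cell = nn.LSTMCell(obs_dim, belief_dim)

    def forward(self, o_t, state):
        # o_t: observation at time t, shape (batch, obs_dim)
        # state: (h_{t-1}, c_{t-1}); h_{t-1} plays the role of Bs_{t-1}
        h_t, c_t = self.cell(o_t, state)
        return h_t, (h_t, c_t)                      # h_t is used as Bs_t

# Unroll over a sequence of observations to accumulate the belief state.
encoder = BeliefStateEncoder()
h, c = torch.zeros(1, 128), torch.zeros(1, 128)
for o_t in torch.randn(10, 1, 18):                  # ten dummy observations
    belief, (h, c) = encoder(o_t, (h, c))
```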

The definitions of this POMDP are described in Table 1.


**Table 1.** Definitions of the Partially Observable Markov Decision Process (POMDP).

In this POMDP, observation *Si* (*i* ∈ {1, 2, . . . , 12}) denotes the reading of sensor *i*; the range of *Si* is [0, 5]. Observation *dg* denotes the normalized distance between the robot and the target. The absolute value of observation *ag* denotes the angle between the direction from the robot toward the goal and the robot's orientation; the range of *ag* is [−π, π]. When the goal is on the left side of the robot's heading, the value of *ag* is positive; otherwise, it is negative. To support the navigation ability of our approach, the observation space of the robot is expanded to include the last angular accelerations and angular velocities. Observations *ωl*, *ωr*, *ω̇l*, and *ω̇r* are described in Section 3.1. The ranges of *ωl* and *ωr* are both [−0.5, 0.5].
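As a concrete illustration, the observation vector can be assembled and normalized from these quantities as in the sketch below; the helper name and the concatenation order are hypothetical, and only the value ranges stated above are taken from the text.

```python
# Hedged sketch of building the 18-dimensional observation vector; only the
# value ranges come from the text, the rest is illustrative.
import numpy as np

def build_observation(sonar, d_g, a_g, w_l, w_r, dw_l, dw_r):
    sonar = np.clip(np.asarray(sonar, dtype=np.float32), 0.0, 5.0) / 5.0  # 12 readings in [0, 5]
    a_g = a_g / np.pi                                  # angle to goal, originally in [-pi, pi]
    w = np.array([w_l, w_r], dtype=np.float32) / 0.5   # track angular velocities in [-0.5, 0.5]
    dw = np.array([dw_l, dw_r], dtype=np.float32)      # last angular accelerations
    return np.concatenate([sonar, [d_g, a_g], w, dw]).astype(np.float32)
```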

The action space is continuous, consisting of the angular accelerations of the left and right tracks, so that the robot has great potential to perform flexible and complex actions. Moreover, with continuous angular accelerations as actions, the trajectories updated by the kinematic equations preserve G2 continuity [34]. A simple illustration of how the chosen accelerations propagate to the pose is sketched below.
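The following Euler-integrated differential-drive sketch is only an assumption for illustration; the wheel radius, track width, and time step are hypothetical values and do not reproduce the exact kinematic equations of Section 3.1.

```python
# Hedged sketch: integrate the action (track angular accelerations) through
# simple differential-drive kinematics. r, L, and dt are illustrative values.
import math

def step_kinematics(x, y, theta, w_l, w_r, dw_l, dw_r, dt=0.1, r=0.1, L=0.4):
    # Update track angular velocities and keep them in [-0.5, 0.5] as in the text.
    w_l = max(-0.5, min(0.5, w_l + dw_l * dt))
    w_r = max(-0.5, min(0.5, w_r + dw_r * dt))
    v = r * (w_l + w_r) / 2.0        # body linear velocity
    omega = r * (w_r - w_l) / L      # body angular velocity
    x += v * math.cos(theta) * dt
    y += v * math.sin(theta) * dt
    theta += omega * dt
    return x, y, theta, w_l, w_r
```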

In the reward function, *dk* denotes the goal tolerance, *Smin* = min(*S1*, *S2*, . . . , *S12*), and *du* denotes the braking distance of the robot. When the distance between the robot and the subgoal is less than *dk*, the robot receives a +10 scalar reward. When the robot collides with an obstacle, it receives a −5 scalar reward. Otherwise, the reward is 0. However, this reward function is not well suited to the POMDP: few samples with positive rewards are collected during the initial learning phase, which makes training difficult and can even prevent convergence in complex environments. Improvements to the reward function are discussed in Section 3.4.3.
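For reference, the sparse reward described above can be written as the short sketch below; the collision test via a minimum-sensor-reading threshold is an illustrative assumption, since the text only states that a collision yields −5. This un-shaped form is what the improvements in Section 3.4.3 address.

```python
# Minimal sketch of the sparse reward from the text; collision_dist is an
# illustrative assumption for detecting a collision from sensor readings.
def sparse_reward(dist_to_subgoal, sensor_readings, d_k=0.2, collision_dist=0.1):
    if dist_to_subgoal < d_k:                   # subgoal reached within tolerance d_k
        return 10.0
    if min(sensor_readings) < collision_dist:   # robot collides with an obstacle
        return -5.0
    return 0.0
```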
