*3.2. Single-Robot End-to-End Navigation*

To work toward the final goal, we start with the single-robot navigation case in this section. In end-to-end mapless navigation tasks, the relationship between the perception inputs and the output control law can be very complex. In our work, we introduce a number of modifications to the Deep Deterministic Policy Gradient (DDPG) algorithm and use it as the basic framework.

#### 3.2.1. Basic Reinforcement Learning Framework

Classical Deep Deterministic Policy Gradient (DDPG) [41] is a well-known reinforcement learning algorithm for continuous control problems. The trade-off between exploration and exploitation is one of the central problems in reinforcement learning. To increase sample efficiency, DDPG uses a deterministic target policy *μ* : S → A rather than a stochastic policy *π*, and the resulting loss of exploration is compensated for by its off-policy design. The Bellman equation can then be rewritten as Equation (4), where *γ* denotes the discount factor:

$$Q^{\mu}\left(\mathbf{s}\_{t},\mathbf{a}\_{t}\right) = \mathbb{E}\_{r\_{t},\mathbf{s}\_{t+1}\sim E}\left[r\left(\mathbf{s}\_{t},\mathbf{a}\_{t}\right) + \gamma Q^{\mu}\left(\mathbf{s}\_{t+1},\mu\left(\mathbf{s}\_{t+1}\right)\right)\right].\tag{4}$$

At each timestep, the actor and critic networks are updated by sampling a minibatch uniformly from the memory buffer. The algorithm also creates target copies of the critic and actor networks, *Q*′(*s*, *a* | *θ*<sup>*Q*′</sup>) and *μ*′(*s* | *θ*<sup>*μ*′</sup>), respectively; these target networks are then updated softly from the learned networks, *θ*′ ← *τθ* + (1 − *τ*)*θ*′ with *τ* ≪ 1. This trick greatly improves the stability of learning.
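As a minimal sketch of this soft update (assuming a PyTorch-style implementation; the function name and the value of *τ* are illustrative, not taken from the paper):

```python
import torch

def soft_update(target_net, source_net, tau=0.001):
    """Softly track the learned network: theta' <- tau * theta + (1 - tau) * theta'."""
    for target_param, param in zip(target_net.parameters(), source_net.parameters()):
        target_param.data.copy_(tau * param.data + (1.0 - tau) * target_param.data)
```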

The loss function for the critic networks can be formulated as Equation (5):

$$L = \frac{1}{N} \sum\_{i} \left( y\_i - Q\left( s\_i, a\_i | \theta^Q \right) \right)^2,\text{ where } y\_i = r\_i + \gamma Q' \left( s\_{i+1}, \mu' \left( s\_{i+1} | \theta^{\mu'} \right) | \theta^{Q'} \right). \tag{5}$$

In addition, the actor networks can be updated by using the sampled policy gradient as shown in Equation (6):

$$\nabla\_{\theta^{\mu}} J \approx \frac{1}{N} \sum\_{i} \nabla\_{a} Q\left(s,a|\theta^{Q}\right)\Big|\_{s=s\_{i},a=\mu(s\_{i})} \nabla\_{\theta^{\mu}} \mu\left(s|\theta^{\mu}\right)\Big|\_{s\_{i}}.\tag{6}$$
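To make the update concrete, the following minimal sketch (in PyTorch, which the paper does not specify; the module and optimizer handles, the done mask and the value of *γ* are illustrative assumptions) shows how one minibatch update implements the critic loss of Equation (5) and the sampled policy gradient of Equation (6):

```python
import torch
import torch.nn.functional as F

def ddpg_update(batch, actor, critic, target_actor, target_critic,
                actor_opt, critic_opt, gamma=0.99):
    """One minibatch update following Equations (5) and (6)."""
    s, a, r, s_next, done = batch  # tensors sampled uniformly from the memory buffer

    # Critic: regress Q(s_i, a_i) toward y_i = r_i + gamma * Q'(s_{i+1}, mu'(s_{i+1}))
    with torch.no_grad():
        y = r + gamma * (1.0 - done) * target_critic(s_next, target_actor(s_next))
    critic_loss = F.mse_loss(critic(s, a), y)
    critic_opt.zero_grad()
    critic_loss.backward()
    critic_opt.step()

    # Actor: follow the sampled policy gradient by maximizing Q(s, mu(s))
    actor_loss = -critic(s, actor(s)).mean()
    actor_opt.zero_grad()
    actor_loss.backward()
    actor_opt.step()
```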

#### 3.2.2. Network Structure

Owing to the structure of the 2D lidar data, we add one-dimensional convolutional neural network (CNN) layers to the feature-extraction part. To enrich the perception information available to the policy module, several relative parameters describing the target are added to our framework. This enables the policy module to consider the sensor observation and the robot state information simultaneously; both are essential for efficient navigation.

As shown in Figure 1, the raw lidar sensor data are filtered into 180 dimensions as inputs. As mentioned before, there are two kinds of neural networks in our module: the actor network and the critic network. Both use three one-dimensional CNN layers for feature extraction. After feature extraction, the networks use residual building blocks with shortcut connections [42] to reduce the training complexity. We then pack the target distance, the target angle and the actions at the last timestep into the state parameter set. The state parameter set and the extracted lidar features are fused together and fed to the actor and critic networks at the same time. In the actor network, the fused data pass through three fully connected layers and output the linear and angular velocities through different activation functions. Besides controlling the robot, the output of the actor network is transmitted to the critic network. There, the output action, the state parameter set and the critic's extracted features are fused into one data stream; after passing through four fully connected layers, this stream finally becomes a Q value used to evaluate the policy and train the networks.

**Figure 1.** The structure of the modified Deep Deterministic Policy Gradient for single robot mapless navigation tasks.
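To illustrate the actor branch of this structure, the sketch below (PyTorch, illustrative only) follows the description above: three one-dimensional convolutional layers over the 180 lidar readings, fusion with the state parameter set, and three fully connected layers producing the two velocity commands. The residual building blocks are omitted for brevity, and all layer widths, kernel sizes and activation choices are assumptions rather than the paper's exact values.

```python
import torch
import torch.nn as nn

class Actor(nn.Module):
    """Illustrative actor: a 1D-CNN feature extractor over 180 lidar beams,
    fused with the state parameter set (target distance, target angle,
    previous action), followed by fully connected layers."""
    def __init__(self, state_param_dim=4):
        super().__init__()
        self.features = nn.Sequential(          # three 1D convolutional layers
            nn.Conv1d(1, 32, kernel_size=5, stride=2), nn.ReLU(),
            nn.Conv1d(32, 32, kernel_size=3, stride=2), nn.ReLU(),
            nn.Conv1d(32, 32, kernel_size=3, stride=2), nn.ReLU(),
            nn.Flatten(),
        )
        feat_dim = self.features(torch.zeros(1, 1, 180)).shape[1]
        self.fc = nn.Sequential(                # three fully connected layers
            nn.Linear(feat_dim + state_param_dim, 256), nn.ReLU(),
            nn.Linear(256, 128), nn.ReLU(),
            nn.Linear(128, 2),
        )

    def forward(self, lidar, state_params):
        x = self.features(lidar.unsqueeze(1))        # (N, 180) -> (N, feat_dim)
        x = torch.cat([x, state_params], dim=1)      # fuse with the robot state
        out = self.fc(x)
        linear_v = torch.sigmoid(out[:, :1])         # forward velocity in [0, 1]
        angular_v = torch.tanh(out[:, 1:])           # angular velocity in [-1, 1]
        return torch.cat([linear_v, angular_v], dim=1)
```

The critic follows the same pattern, except that the output action is fused into the data stream and four fully connected layers produce a single Q value.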

#### 3.2.3. Reward Shaping

The reward function plays an important role in reinforcement learning: it greatly influences the navigation behavior of the robot and guides the learning process. For the single-robot mapless navigation task, the reward function shown in Equation (7) consists of three parts: the arrival award, the collision penalty and the approaching award. If the distance between the robot and the target point is less than the arrival threshold, the robot gets a positive reward *r*<sub>arrive</sub>. If the robot collides with an obstacle during navigation, in other words, if one of the lidar distance observations is less than the safety threshold, the robot gets a negative reward *r*<sub>collision</sub>. To encourage the robot to approach the target, we add an approaching award *k*(*ρ*<sub>*t*−1</sub> − *ρ*<sub>*t*</sub>), where the constant *k* adjusts the amplitude of the approaching reward and *ρ*<sub>*t*</sub> denotes the distance between the target and the mobile robot at timestep *t*. To unify the range of the different parameters, they are normalized before use:

$$r(s,a) = \begin{cases} r\_{\text{arrive}} & (\rho < d\_{\text{goal}}) \\ r\_{\text{collision}} & (\min(d\_1, d\_2, \dots, d\_{180}) < d\_{\text{collision}}) \\ k(\rho\_{t-1} - \rho\_t) & (\text{otherwise, at each timestep } t). \end{cases} \tag{7}$$
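A minimal sketch of Equation (7) in Python is given below; the threshold and reward magnitudes are placeholder values, not those used in the paper:

```python
def reward(rho_prev, rho, lidar_ranges,
           d_goal=0.3, d_collision=0.2, r_arrive=10.0, r_collision=-10.0, k=1.0):
    """Reward of Equation (7): arrival award, collision penalty,
    or dense approaching award k * (rho_{t-1} - rho_t)."""
    if rho < d_goal:                      # reached the target
        return r_arrive
    if min(lidar_ranges) < d_collision:   # any beam below the safety threshold
        return r_collision
    return k * (rho_prev - rho)           # progress toward the target this step
```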
