*3.1. Problem Formulation*

This paper aims to provide an end-to-end navigation method for multi-robot systems. We seek a learnable policy module π of the form:

$$a\_t^i = \pi(\mathbf{x}\_t^i, \phi\_t^i, a\_{t-1}^i), \tag{1}$$

where *x<sub>t</sub><sup>i</sup>* is the observation obtained from the raw lidar sensor information of robot *i* at timestep *t*, *ϕ<sub>t</sub><sup>i</sup>* contains the relative parameters of the target with respect to the robot, and *a<sub>t−1</sub><sup>i</sup>* is the control action taken at the previous timestep. In the multi-robot reinforcement learning system, these inputs can be regarded as the state of the whole system, *s<sub>t</sub>* = (*x<sub>t</sub>*, *ϕ<sub>t</sub>*). At each timestep *t*, a single robot observes the state *s<sub>t</sub>* ∈ S and selects an action *a<sub>t</sub>* ∈ A according to its policy *π*: S → P(A), which maps states to a probability distribution over the actions. The robot then receives the reward *r*(*s<sub>t</sub>*, *a<sub>t</sub>*) and transitions to the next state. The state-action value function describes the expected return of a state-action trajectory under *π*; it is commonly used to evaluate the quality of a policy, as defined in Equation (2):
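The mapping π: S → P(A) can be sketched as follows. This is a minimal illustration only, assuming a linear-softmax policy over a small discretized action set; the class name `StochasticPolicy` and the state/action dimensions are hypothetical, not taken from the paper:

```python
import numpy as np

rng = np.random.default_rng(0)

def softmax(z):
    # Numerically stable softmax: subtract the max before exponentiating.
    z = z - z.max()
    e = np.exp(z)
    return e / e.sum()

class StochasticPolicy:
    """Maps a state vector s to a probability distribution P(A | s)."""

    def __init__(self, state_dim, n_actions):
        # Random linear scoring weights stand in for a learned network.
        self.W = rng.normal(scale=0.1, size=(n_actions, state_dim))

    def distribution(self, s):
        return softmax(self.W @ s)          # P(A | s), sums to 1

    def act(self, s):
        p = self.distribution(s)
        return int(rng.choice(len(p), p=p))  # a_t ~ pi(. | s_t)

policy = StochasticPolicy(state_dim=4, n_actions=3)
s = np.array([0.5, -0.2, 1.0, 0.0])
p = policy.distribution(s)
a = policy.act(s)
```

Sampling from the distribution (rather than taking its argmax) is what makes π a stochastic policy, which matters for exploration during training.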

$$Q\_{\pi}(s, a) = \mathbb{E}\left[\sum\_{t=0}^{\infty} \gamma^{t} r(s\_{t}, a\_{t})\right], \text{ where } s\_{t} \sim p(\cdot \mid s\_{t-1}, a\_{t-1}),\ a\_{t} = \pi(s\_{t}). \tag{2}$$
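The discounted sum inside Equation (2) can be computed for a finite reward sequence as a simple sketch; the function name `discounted_return` is ours, not the paper's:

```python
def discounted_return(rewards, gamma=0.99):
    """Compute sum_t gamma^t * r_t for a finite trajectory.

    Iterating backwards turns the sum into the recursion
    G_t = r_t + gamma * G_{t+1}, matching the Bellman structure.
    """
    G = 0.0
    for r in reversed(rewards):
        G = r + gamma * G
    return G

# With gamma = 0.5 and three unit rewards: 1 + 0.5 + 0.25 = 1.75
G = discounted_return([1.0, 1.0, 1.0], gamma=0.5)
```

Averaging such returns over many trajectories sampled under π gives a Monte Carlo estimate of *Q<sub>π</sub>*(*s*, *a*) for the starting state-action pair.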

The recursive form of the state-action value function, known as the Bellman equation, is given in Equation (3). The policy module π directly maps the perceived state to the control law of the robots while taking collaboration into account. *E* denotes the environment, and the notation ∼ means that the variable on the left follows the distribution on the right:

$$Q^{\pi}(s\_{t}, a\_{t}) = \mathbb{E}\_{r\_{t}, s\_{t+1} \sim E}\left[r(s\_{t}, a\_{t}) + \gamma\, \mathbb{E}\_{a\_{t+1} \sim \pi}\left[Q^{\pi}(s\_{t+1}, a\_{t+1})\right]\right]. \tag{3}$$
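For a discrete action set, the inner expectation over *a<sub>t+1</sub>* ∼ π in Equation (3) is a weighted sum, so a single Bellman backup can be sketched as below. This is an illustrative one-step target (expected-SARSA style) under assumed names, not the paper's training procedure:

```python
import numpy as np

def bellman_backup(r, pi_next, q_next_row, gamma=0.99):
    """One-step Bellman target: r + gamma * E_{a' ~ pi}[Q(s', a')].

    pi_next    -- pi(. | s_{t+1}), probabilities over next actions
    q_next_row -- Q(s_{t+1}, .) for each next action
    """
    expected_q = float(np.dot(pi_next, q_next_row))  # E_{a'~pi}[Q(s', a')]
    return r + gamma * expected_q

q_next = np.array([1.0, 2.0])       # Q(s', a') for two actions
pi_next = np.array([0.5, 0.5])      # pi(. | s'), uniform here
target = bellman_backup(0.1, pi_next, q_next, gamma=0.9)
# 0.1 + 0.9 * 1.5 = 1.45
```

The outer expectation over *r<sub>t</sub>*, *s<sub>t+1</sub>* ∼ *E* is handled in practice by averaging such targets over sampled transitions.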

In order to obtain a good policy module π, the multi-robot system has to overcome several challenges. First, it is difficult to extract valuable information from the raw sensor data and to merge it properly with the relative parameters of the robot state. Second, based on the processed information, the policy module needs to learn a stable mapping from perceptions to decisions. Finally, the robots need to avoid collisions and to coordinate with one another in certain situations.
