## **3. Results**

### *3.1. Simulation of a Cooperative Task with Mobile Robots*

We perform a simulation in which the linear fuzzy parameterization is applied to a two-dimensional multi-agent cooperative problem with continuous states and discrete actions. Two agents of mass $m$ must be driven on a flat surface so that they reach the origin at the same time, in minimum time. The information available to the agents consists of the reward function and the transition function over states and joint actions.

In the simulation environment, the state $\mathbf{x} = [x\_1, x\_2, \ldots, x\_8]^T$ contains the two-dimensional coordinates of each agent, $s\_{ix}$ and $s\_{iy}$, and their two-dimensional velocities, $\dot{s}\_{ix}$ and $\dot{s}\_{iy}$, for $i = 1, 2$: $\mathbf{x} = [s\_{1x}, s\_{1y}, \dot{s}\_{1x}, \dot{s}\_{1y}, s\_{2x}, s\_{2y}, \dot{s}\_{2x}, \dot{s}\_{2y}]^T$. The continuous state space model of the simulated system is:

$$\begin{aligned} \ddot{s}\_{1x} &= -\eta \left( s\_{1x}, s\_{1y} \right) \frac{\dot{s}\_{1x}}{m\_1} + \frac{u\_{1x}}{m\_1} \\ \ddot{s}\_{1y} &= -\eta \left( s\_{1x}, s\_{1y} \right) \frac{\dot{s}\_{1y}}{m\_1} + \frac{u\_{1y}}{m\_1} \\ \ddot{s}\_{2x} &= -\eta \left( s\_{2x}, s\_{2y} \right) \frac{\dot{s}\_{2x}}{m\_2} + \frac{u\_{2x}}{m\_2} \\ \ddot{s}\_{2y} &= -\eta \left( s\_{2x}, s\_{2y} \right) \frac{\dot{s}\_{2y}}{m\_2} + \frac{u\_{2y}}{m\_2} \end{aligned} \tag{38}$$

where $\eta(s\_{ix}, s\_{iy})$ for $i = 1, 2$ is the friction function, which depends on the position of each agent; the control signal is $\mathbf{U} = [u\_{1x}, u\_{1y}, u\_{2x}, u\_{2y}]^T$, which is a force; and $m\_i$ for $i = 1, 2$ is the mass of each robot.

The system is discretized with a sampling period of $T = 0.4$ s, and the expressions that describe the dynamics of the system are integrated between sampling instants. In the task, we select the starting points randomly and carry out up to 50 training iterations; if 50 iterations are reached without accomplishing the final goal, the experiment is restarted.

The magnitudes of the state and action variables are bounded. To make the problem more tractable, $s\_{ix}, s\_{iy} \in [-6, 6]$ m and $\dot{s}\_{ix}, \dot{s}\_{iy} \in [-3, 3]$ m/s; the force is also bounded, $u\_{ix}, u\_{iy} \in [-2, 2]$ for $i = 1, 2$; the friction coefficient is taken constant, $\eta = 1$ kg/s; and the mass is taken equal for both agents, $m = 0.5$ kg.
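
To make the discretization concrete, the following is a minimal sketch of how the dynamics of Equation (38) could be integrated over one sampling period; forward-Euler sub-stepping and the clipping of states and forces to the bounds above are our assumptions, since the integration scheme is not specified.

```python
import numpy as np

T = 0.4     # sampling period (s)
ETA = 1.0   # friction coefficient (kg/s)
M = 0.5     # mass of each agent (kg)

def step(x, u, dt=0.01):
    """Integrate Equation (38) over one sampling period T.

    x: state [s1x, s1y, ds1x, ds1y, s2x, s2y, ds2x, ds2y]
    u: joint force [u1x, u1y, u2x, u2y], clipped to [-2, 2]
    (forward-Euler sub-stepping is an assumption, not the paper's scheme)
    """
    x = np.asarray(x, dtype=float).copy()
    u = np.clip(u, -2.0, 2.0)
    for _ in range(int(T / dt)):
        for i in range(2):                     # agent index
            p, v = 4 * i, 4 * i + 2            # position/velocity slices
            acc = (-ETA * x[v:v + 2] + u[2 * i:2 * i + 2]) / M
            x[p:p + 2] += dt * x[v:v + 2]
            x[v:v + 2] += dt * acc
        x[[0, 1, 4, 5]] = np.clip(x[[0, 1, 4, 5]], -6.0, 6.0)  # positions (m)
        x[[2, 3, 6, 7]] = np.clip(x[[2, 3, 6, 7]], -3.0, 3.0)  # velocities (m/s)
    return x
```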

The control actions for each agent are discrete with 25 elements, $U\_i = \{-2, -0.2, 0, 0.2, 2\} \times \{-2, -0.2, 0, 0.2, 2\}$ for $i = 1, 2$; they correspond to forces applied diagonally, left, right, up, down, and no force applied. The membership functions used for the position and velocity states are triangular, with the core of each membership function at $x\_d$. The cores of the membership functions for the position domain $s$ are centered at $[-6, -3, -0.3, -0.1, 0, 0.1, 0.3, 3, 6]$ and the cores for the velocity domain are $[-3, -1, 0, 1, 3]$, for each agent; this is shown in Figure 1. In this way, 50,625 pairs $(\mathbf{x}, \mathbf{u})$ are stored for each agent in the parameter vector $\phi$; this amount increases with the number of membership functions. An example of a triangular fuzzy partition is shown in Figure 1.
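
As an illustration of this construction, a sketch of the triangular membership functions and the discrete action set follows; the core vectors and force levels are those given above, while the helper names are ours.

```python
import numpy as np
from itertools import product

POS_CORES = [-6, -3, -0.3, -0.1, 0, 0.1, 0.3, 3, 6]  # position cores
VEL_CORES = [-3, -1, 0, 1, 3]                         # velocity cores
FORCES = [-2, -0.2, 0, 0.2, 2]
ACTIONS = list(product(FORCES, FORCES))               # 25 discrete actions per agent

def triangular_mf(value, cores):
    """Membership degrees of `value` in a triangular fuzzy partition:
    degree 1 at each core, decreasing linearly to 0 at the neighboring
    cores, so the degrees at any point sum to 1."""
    mu = np.zeros(len(cores))
    v = float(np.clip(value, cores[0], cores[-1]))
    for d in range(len(cores) - 1):
        lo, hi = cores[d], cores[d + 1]
        if lo <= v <= hi:
            w = (v - lo) / (hi - lo)
            mu[d], mu[d + 1] = 1.0 - w, w
            break
    return mu
```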

The partition of the state space $\mathbf{x}$ is determined by the product of the individual membership functions for each agent $i$:

$$\mu(\mathbf{x}) = \prod\_{i=1}^{2} \mu\_{s\_{ix}} \prod\_{i=1}^{2} \mu\_{s\_{iy}} \prod\_{i=1}^{2} \mu\_{\dot{s}\_{ix}} \prod\_{i=1}^{2} \mu\_{\dot{s}\_{iy}} \tag{39}$$
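
A sketch of this tensor product, reusing the hypothetical `triangular_mf` helper above (the flattened ordering of the joint cores is an implementation choice):

```python
def joint_membership(x):
    """Tensor product of the individual membership degrees, Equation (39):
    positions and velocities of both agents, flattened into one vector."""
    mu = np.ones(1)
    for i in range(2):                                    # agent index
        p, v = 4 * i, 4 * i + 2
        for value, cores in ((x[p], POS_CORES), (x[p + 1], POS_CORES),
                             (x[v], VEL_CORES), (x[v + 1], VEL_CORES)):
            mu = np.outer(mu, triangular_mf(value, cores)).ravel()
    return mu
```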

The final objective of arriving at the origin at the same time is expressed by a common reward function $\rho$:

$$
\rho\left(\mathbf{x}, \mathbf{u}\right) = \begin{cases} 5 & \text{if } \|\mathbf{x}\| < 0.1 \\ 0 & \text{otherwise} \end{cases} \tag{40}
$$
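
In code, this reward is a one-line test on the joint state norm (the threshold changes to 0.2 in the experimental task of Equation (42)):

```python
def reward(x, u, threshold=0.1):
    """Common reward of Equation (40): 5 when the joint state (positions
    and velocities of both agents) lies within `threshold` of the origin,
    0 otherwise."""
    return 5.0 if np.linalg.norm(x) < threshold else 0.0
```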

As regards the coordination problem, the agents accomplish implicit coordination: they learn by chance to prefer one solution among equally good solutions, and thereafter overlook the other options [41].

**Figure 1.** Triangular fuzzy partition for velocity state.

After the algorithm is performed and $\phi^\*$ is obtained, a policy can be derived by interpolation between the best local actions for each agent:

$$h\_i^\*(\mathbf{x}) = \sum\_{d=1}^{N} \phi\_{i,d}(\mathbf{x}) \, u\_{j\_{i,d}^\*} \quad \text{where } j\_{i,d}^\* \in \arg\max\_{j} \left[ F(\phi^\*) \right](x\_d, u\_j) \tag{41}$$
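
A hedged sketch of Equation (41) for a single agent, assuming the converged Q-parameters have been reshaped into a table `theta` with one row per state core and one column per discrete action (names and shapes are illustrative):

```python
def greedy_policy(mu, theta, actions=ACTIONS):
    """Interpolated greedy policy of Equation (41) for one agent.

    mu:    membership degrees of the current state over the N cores
    theta: (N, 25) array of converged parameters, one column per
           discrete action (shape is an assumption, not the paper's)
    """
    best = np.argmax(theta, axis=1)          # best local action per core
    u = np.zeros(2)
    for d in np.nonzero(mu)[0]:              # only a few cores are active
        u += mu[d] * np.asarray(actions[best[d]])
    return u
```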

For the simulation, the learning parameters were set to $\gamma = 0.96$ and $\xi = 0.05$, and the initial conditions for the experiment were set to $s\_0 = [-4, -6, -2, 2, 5, 3, 2, -1]$. The algorithm shows convergence after 15 iterations. Figure 2 shows the states and the control signal $U\_1 = [u\_{1x}, u\_{1y}]$ for Agent 1, and Figure 3 shows the states and the control signal $U\_2 = [u\_{2x}, u\_{2y}]$ for Agent 2. The final paths followed by both agents are shown in Figure 4.

**Figure 2.** States and control signal for Agent 1.

**Figure 3.** States and control signal for Agent 2.

**Figure 4.** Final paths of Agent 1 and Agent 2.

### *3.2. Experimental Set-up*

Two Khepera IV mobile robots are used to perform an experiment in a MAS [42]. They have five sensors placed around the robot, positioned and numbered as shown in Figure 5. These five sensors are ultrasonic devices composed of one transmitter and one receiver; they are used to detect physical features of the environment such as obstacles and other nearby agents.

The five Khepera sonar readings $l\_{a,c}$ are quantized into three degrees. They represent the closeness of the nearest obstacle or other agents: 0 indicates obstacles or agents that are near, 1 indicates obstacles or agents at a medium distance, and 2 indicates obstacles or agents relatively far from the sensors.
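
A minimal sketch of this three-degree quantization; the range thresholds are hypothetical, as the text does not give their values:

```python
def quantize_sonar(distance_m, near=0.1, medium=0.3):
    """Quantize a sonar range into the three degrees used by the robots:
    0 = near, 1 = medium distance, 2 = relatively far. The threshold
    values here are illustrative, not the paper's."""
    if distance_m < near:
        return 0
    if distance_m < medium:
        return 1
    return 2
```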

**Figure 5.** Position of the Khepera's ultrasonic sensors.

The parameters $d\_a$ (the distance to the target or goal) and $p\_a$ (the relative angle to the target or goal) are divided into discrete degrees (0–8), where 0 represents the smallest distance or angle and 8 represents the greatest relative distance or angle from the current Khepera position to the target or goal.
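
A sketch of this quantization under the assumption of uniform binning (the bin edges are not specified in the text):

```python
def quantize_level(value, max_value, n_levels=9):
    """Map a distance d_a or relative angle p_a onto the degrees 0-8,
    0 being the smallest and 8 the greatest. Uniform bins are assumed;
    the paper does not give the bin edges."""
    v = min(max(value, 0.0), max_value)
    return min(int(v / max_value * n_levels), n_levels - 1)
```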

The actions available for the Khepera robot are:


The ultrasonic sensors on the Khepera are used to help the mobile robots determine whether there are any obstacles in the environment. The experimental setup reveals that the reinforcement learning algorithm relies strongly on the sensor readings.

Sensor readings in the ideal simulation situation are based on mathematical calculations, which are accurate and consistent. In the experimental implementation the readings are inaccurate and fluctuating. During the application of the controller this effect is minimized by allowing a settling period after performing a joint action, which ensures that the sensor has a steady reading before it is recorded. In addition, by moving the robots at a relatively slow pace during the learning process, collisions with other objects or agents are reduced. The quantized readings are then sufficient to represent the current location and velocity while the robots are moving [43].
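
A hedged sketch of such a settling period; the wait duration, tolerance, and sensor interface are hypothetical:

```python
import time

def steady_reading(read_sensor, settle_s=0.2, tol=0.005):
    """Record a sonar value only once two consecutive readings agree
    within `tol`, filtering out the fluctuations seen on the real robot.
    `read_sensor` is a caller-supplied function returning meters."""
    prev = read_sensor()
    while True:
        time.sleep(settle_s)
        cur = read_sensor()
        if abs(cur - prev) <= tol:
            return cur
        prev = cur
```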

### *3.3. Experimental Results*

To validate the proposed algorithm, the linear fuzzy approximator of the joint Q-function is applied to a two-dimensional multi-agent cooperative task. Two Khepera IV mobile robots must be driven on a surface such that both agents reach the origin at the same time, in minimum time, as shown in Figure 6.

The fuzzy partition and the locations of the cores used for the states were the same as in the simulation section.

The goal of arriving at the origin at the same moment, in minimum time, is expressed by the common reward function $\rho$:

$$
\rho\left(\mathbf{x}, \mathbf{u}\right) = \begin{cases} 5 & \text{if } \|\mathbf{x}\| < 0.2 \\ 0 & \text{otherwise} \end{cases} \tag{42}
$$

For this experiment the learning parameters were set to $\gamma = 0.96$ and $\xi = 0.2$, and the initial conditions were set to $s\_0 = [-4, 5, 0, 0, 5, 3, 0, 0]$. The experiment shows convergence after 27 iterations. Figure 7 shows the states, the control signal $U\_1 = [u\_{1x}, u\_{1y}]$, and the rewards for Agent 1, and Figure 8 shows the states, the control signal $U\_2 = [u\_{2x}, u\_{2y}]$, and the rewards for Agent 2.

**Figure 6.** Experimental test.

**Figure 7.** States and rewards for Agent 1 in the experimental implementation.

The values of $\gamma$ and $\xi$ were set arbitrarily; the vector $\phi$ converged after 27 iterations, when the bound $\|\phi\_{l+1} - \phi\_l\| \leq \xi$ was reached. The final path is shown in Figure 9. This trajectory is evidently different from the optimal policy, which would drive both agents in a straight line toward the goal from any initial position. However, with the fuzzy quantization used in this implementation and the effect of the damping, the final path obtained is the best that can be accomplished with this fuzzy parameterization. The coordination problem was overcome using an indirect method, where the agents learn to choose a solution by chance.
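
A minimal sketch of this stopping rule; the update step itself is abstracted behind the hypothetical `q_iteration_step` function:

```python
import numpy as np

def run_until_converged(phi, q_iteration_step, xi=0.2, max_iter=50):
    """Iterate the parameter updates until ||phi_{l+1} - phi_l|| <= xi.
    `q_iteration_step` is a hypothetical stand-in for one sweep of the
    fuzzy Q-function update; Section 3.3 converges after 27 iterations."""
    for l in range(max_iter):
        phi_next = q_iteration_step(phi)
        if np.linalg.norm(phi_next - phi) <= xi:
            return phi_next, l + 1            # converged parameters, iterations
        phi = phi_next
    return phi, max_iter
```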

**Figure 8.** States and rewards for Agent 2 in the experimental implementation.

**Figure 9.** Final paths of Agent 1 and Agent 2 in the experimental implementation.

## **4. Comparison with CMOMMT Algorithm for Multi-Agent Systems**

There are other methods for MARL in continuous state spaces, but these proposals are restricted to a limited class of tasks. One of these methods is "cooperative multi-robot observation of multiple moving targets" (CMOMMT), given in [13], which relies on local information in order to learn cooperative behavior. It allows the application of reinforcement learning in continuous state spaces to multi-agent systems.

The kind of cooperation learned in this proposal is obtained by implicit techniques. The method is useful for reducing the representation of the state space through a mapping of the continuous state space into a finite state space, where every discretized state is considered a region of the continuous state space.

The action space is discrete, and the method uses a delayed reward, where positive or negative values are obtained at the end of training. Finally, an optimal joint policy of actions is obtained by a clustering technique in the discrete action space.
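
For contrast with the fuzzy partition, the following sketches the kind of region mapping described above, assuming a uniform grid per state dimension (the actual CMOMMT discretization in [13] may differ):

```python
def state_region(x, lo=-6.0, hi=6.0, bins=12):
    """Map a continuous joint state onto a finite region index by uniform
    gridding of each dimension (illustrative only; the discretization
    used by CMOMMT in [13] may differ)."""
    idx = 0
    for v in x:
        cell = min(int((min(max(v, lo), hi) - lo) / (hi - lo) * bins), bins - 1)
        idx = idx * bins + cell
    return idx
```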

We performed the same cooperative task of arriving at the origin at the same time for two agents as in the section above, using the same reward function and the same continuous state space. The initial condition was set to $s\_0 = [-4, 5, 0, 0, 5, -3, 0, 0]$.

The final path obtained by the CMOMMT algorithm is shown in Figure 10; the traced path is not straight enough, and a persistent oscillation appears near the origin. The states and control signals for Agent 1 and Agent 2 are shown in Figures 11 and 12, respectively.

**Figure 10.** Final path obtained by the CMOMMT algorithm.

**Figure 11.** States and control signal for Agent 1, CMOMMT algorithm.

**Figure 12.** States and control signal for Agent 2, CMOMMT algorithm.

Table 1 shows a comparison between our proposal and the CMOMMT algorithm, averaged over trials conducted under the same initial conditions.

**Table 1.** Comparison between fuzzy partition and CMOMMT.


One possible reason for the CMOMMT results could be that the Q-functions obtained by this method are less smooth than those produced by the fuzzy parameterization. In this sense, the method proposed in our paper shows better performance, in the form of a more accurate representation of the state space and fewer computational resources used.
