**1. Introduction**

Multi-agent systems (MASs) are finding application in a variety of fields where pre-programmed behaviors are not a suitable way to tackle the problems that arise. These fields include robotics, distributed control, resource management, collaborative decision making, and data mining [1]. A MAS comprises several intelligent agents in a shared environment, where each agent behaves independently and must coordinate with the others [2].

MASs have emerged as an alternative way to analyze and represent systems with centralized control: several intelligent agents perceive and modify an environment through sensors and actuators, respectively. At the same time, these agents can learn new behaviors to adapt themselves to new tasks and goals in the environment [3].

Mobile robotics is one of the fields where multi-agent systems have emerged. Most approaches are based on low-level control systems; in [4], for example, a visibility binary tree algorithm is used to generate mobile robot trajectories. This type of approach requires complete knowledge of the dynamics of the robotic system. In this article, we offer a proposal based on reinforcement learning, which results in high-level control actions.

Reinforcement learning (RL) is one of the most popular methods for learning in a MAS. The objective of multi-agent reinforcement learning (MARL) is to maximize a numerical reward while the agents interact with and modify the environment [5]. At each learning step, the agents choose an action, which drives the environment to a new state [6]. The reward function assesses the quality of this state transition [7]. In RL the agents are not told which actions to execute; instead they must discover which actions yield the best reward. Hence, the RL feedback is less informative than that of a supervised learning method [8].

MASs are affected by the curse of dimensionality, the term used to describe how the computational and memory requirements grow as the number of states or agents in an environment increases. Most approaches require an exact representation of the state-action pair values in the form of lookup tables, making the solution intractable; hence the application of these methods is limited to small or discrete tasks [9]. In real-life applications, the state variables can take a large number of possible values or even continuous values, so the problem is manageable only if the value functions are approximated [10].

Some MARL algorithms have been proposed to deal with this problem: using neural networks to generalize from a Q-table [11], applying function approximation for discrete and large state-action spaces [12], applying vector quantization for continuous states and actions [13], using experience replay for MASs [14], using Q-learning with a normalized Gaussian network as approximator [15], and making predictions in systems with heterogeneous agents [16]. In [17] a pair of neural networks is used to represent the value function and the controller; however, the proposed strategy relies on sufficient exploration, which is a function of the size of the neural networks' training data. Inverse neural networks have also been proposed to approximate the action policy, starting from an initial policy that is refined through reinforcement learning [18].

This article presents an approach for MARL in a cooperative problem. It is a modified version of the Q-learning algorithm proposed in [19], which uses a linear fuzzy approximator of the joint Q-function for a continuous state space. An implicit form of coordination is implemented to solve the coordination problem. An experiment with two robots solving a coordination task is conducted to verify the proposed algorithm. In addition, two theorems are presented to ensure the convergence of the proposed algorithm.

### **2. MARL with Linear Fuzzy Parameterization**

### *2.1. Single Agent Case*

In reinforcement learning (RL) for the single-agent case, let us define $x\_k$ as the current state of the environment at learning step $k$, and $u\_k$ as the action taken by the agent in $x\_k$.

The reward, or numerical feedback, $r\_k$ reflects how good the previous action $u\_{k-1}$ was in the state $x\_{k-1}$. The single-agent RL problem is a Markov decision process (MDP). The MDP for a deterministic case is:

$$\begin{array}{l} f: X \times \mathcal{U} \to X \\ \rho: X \times \mathcal{U} \to \mathbb{R} \end{array} \tag{1}$$

where $X$ is the state space, $U$ is the action space, $f$ is the state transition function, which can be known or unknown, and $\rho$ is the scalar reward function.

The action policy describes the agent's behavior by specifying the way in which the agent chooses an action from a state. If the action policy, $h : X \to U$, does not change over time, it is considered stationary [20].

The final goal is to find an action policy *h* such that the long-term return *R* is maximized:

$$R^h(\mathbf{x}) = E\left\{ \left. \sum\_{k=0}^{\infty} \gamma^k r\_{k+1} \right| \mathbf{x}\_0 = \mathbf{x}, h \right\} \tag{2}$$

where $\gamma \in [0, 1)$ is the discount factor. The policy $h$ is obtained from the state-action value function, called the Q-function.

The Q-function

$$Q^h: X \times \mathcal{U} \to \mathbb{R} \tag{3}$$

gives the expected return under the policy $h$:

$$Q^h(\mathbf{x}, \mathbf{u}) = E\left\{ \left. \sum\_{k=0}^{\infty} \gamma^k r\_{k+1} \right| \mathbf{x}\_0 = \mathbf{x}, \mathbf{u}\_0 = \mathbf{u}, h \right\} \tag{4}$$

The optimal Q-function *Q*∗ is defined as:

$$Q^\*\left(\mathbf{x},\mathbf{u}\right) = \max\_{h} Q^{h}\left(\mathbf{x},\mathbf{u}\right) \tag{5}$$

Once *Q*∗ is available, the optimal action policy is obtained by:

$$h^\*\left(\mathbf{x}\right) = \arg\max\_{\mathbf{u}} Q^\*\left(\mathbf{x}, \mathbf{u}\right) \tag{6}$$
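To make (6) concrete, here is a minimal Python sketch for a hypothetical discrete setting; `Q_star` merely stands in for a learned optimal Q-function and is not an output of the method above:

```python
import numpy as np

# Minimal sketch of Equation (6): Q_star has one row per state and one
# column per action; the greedy policy is a row-wise argmax.
num_states, num_actions = 4, 3
rng = np.random.default_rng(0)
Q_star = rng.random((num_states, num_actions))  # stand-in for Q*

def greedy_policy(Q):
    """h*(x) = argmax_u Q*(x, u)."""
    return np.argmax(Q, axis=1)

h_star = greedy_policy(Q_star)  # optimal action index for every state
```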

### *2.2. Multi-Agent System Case*

In the multi-agent case, there is a number of heterogeneous agents, each with its own set of actions and tasks in an environment. A stochastic game model describes this setting, in which the action performed at any state is a combination of the actions of all agents [21].

The model of a deterministic stochastic game is a tuple $(X, U\_1, U\_2, \ldots, U\_n, f, \rho\_1, \rho\_2, \ldots, \rho\_n)$, where $n$ is the number of agents in the environment, $X$ is the state space of the environment, $U\_i$, $i = 1, 2, \ldots, n$, are the sets of actions available to each agent, and the joint action set is $\mathbf{U} = U\_1 \times U\_2 \times \ldots \times U\_n$. The reward functions are $\rho\_i : X \times \mathbf{U} \to \mathbb{R}$, $i = 1, 2, \ldots, n$, and the state transition function is $f : X \times \mathbf{U} \to X$.

The joint action $\mathbf{u}\_k = [u\_{1,k}^T, u\_{2,k}^T, \ldots, u\_{n,k}^T]^T$, $\mathbf{u}\_k \in \mathbf{U}$, $u\_i \in U\_i$, taken in the state $x\_k$, changes the state to $x\_{k+1} = f(x\_k, \mathbf{u}\_k)$. A numerical reward $r\_{i,k+1} = \rho\_i(x\_k, \mathbf{u}\_k)$ is calculated for each joint action $\mathbf{u}\_k$. The actions are taken according to each agent's own policy $h\_i : X \to U\_i$, and together these form the joint policy $\mathbf{h}$. As in the single-agent case, the state space and action spaces can be continuous or discrete.

The long-term return depends on the joint policy, $R\_i^{\mathbf{h}}(x) = \sum\_{k=0}^{\infty} \gamma^k r\_{i,k+1}$, because the numerical feedback $r\_i$ of each agent depends on the joint action $\mathbf{u}\_k$. Thereby the Q-function of each agent relies on the joint action and the joint policy, $Q\_i^{\mathbf{h}} : X \times \mathbf{U} \to \mathbb{R}$, with $Q\_i^{\mathbf{h}}(x, \mathbf{u}) = E\left\{ \sum\_{k=0}^{\infty} \gamma^k r\_{i,k+1} \,\middle|\, x\_0 = x, \mathbf{u}\_0 = \mathbf{u}, \mathbf{h} \right\}$.

Each agent could have its own goals; however, in this article the agents seek a common goal, i.e., the task is fully cooperative. In this way the numerical feedback or reward in any state is the same for all agents, $\rho\_1 = \rho\_2 = \ldots = \rho\_n$; therefore the returns are also the same for all agents, $R\_1^{\mathbf{h}} = R\_2^{\mathbf{h}} = \ldots = R\_n^{\mathbf{h}}$. Hence the agents share the same goal, which is to maximize the common long-term performance (or return).

Determining an optimal joint policy $\mathbf{h}^\*$ in multi-agent systems is the equilibrium selection problem [22]. Although establishing an equilibrium is a difficult problem, the structure of the cooperative setting makes it manageable. Assuming that the agents know the structure of the game, in the form of the transition function $f$ and the reward functions $\rho\_i$, makes the search for the equilibrium point more tractable.

In a fully cooperative stochastic game, if the transition function $f$ and the reward function $\rho$ of each agent are known, the objective can be accomplished by learning the optimal joint-action values $Q^\*$ through the Bellman optimality equation, $Q^\*(x\_k, \mathbf{u}\_k) = \rho(x\_k, \mathbf{u}\_k) + \gamma \max\_{\mathbf{u}\_j} Q^\*(f(x\_k, \mathbf{u}\_k), \mathbf{u}\_j)$, and then using a greedy policy [23]. Once $Q^\*$ is available, a policy $h\_i^\*$ is:

$$h\_i^\*(\mathbf{x}) = \arg\max\_{u\_i} \max\_{u\_1, \ldots, u\_{i-1}, u\_{i+1}, \ldots, u\_n} Q^\*(\mathbf{x}, \mathbf{u}) \tag{7}$$
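As an illustration of (7), the following sketch, assuming $n = 2$ agents and a hypothetical joint Q-array, maximizes over the other agent's action axis before each agent's own $\arg\max$:

```python
import numpy as np

# Sketch of Equation (7) for n = 2 agents with M actions each; the
# joint Q-array Q_joint[x, u1, u2] is a hypothetical stand-in.
num_states, M = 5, 3
rng = np.random.default_rng(1)
Q_joint = rng.random((num_states, M, M))

def agent_policy(Q, agent):
    # maximize over the other agent's action axis, then argmax over own
    other_axis = 2 if agent == 0 else 1
    return np.argmax(Q.max(axis=other_axis), axis=1)

h1 = agent_policy(Q_joint, 0)  # agent 1's greedy action per state
h2 = agent_policy(Q_joint, 1)  # agent 2's greedy action per state
```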

When several joint actions are optimal, the agents could choose different components and degrade the performance of the search for the common goal. This problem can be addressed in three ways: coordination-free methods assume the optimal joint action is unique while learning a common Q-function [24]; coordination-based methods use coordination graphs, decomposing the global Q-function into local Q-functions [25]; and implicit coordination methods let the agents settle on one joint action by chance and then discard the others [26].

### *2.3. Mapping the Joint Q-Function for MAS*

In the deterministic case, the Q-function $Q$ is updated by the mapping $H$:

$$\begin{array}{rcl} Q\_{k+1} & = & H(Q\_k) \\ H(Q)(\mathbf{x}, \mathbf{u}) & = & \rho \left( \mathbf{x}, \mathbf{u} \right) + \gamma \max\_{\mathbf{u}\_j} Q \left( f(\mathbf{x}, \mathbf{u}), \mathbf{u}\_j \right) \end{array} \tag{8}$$

with an arbitrary initial value for $Q$. The iteration (8) attracts $Q$ to a unique fixed point [27]

$$Q^\* = H(Q^\*).$$

It is shown in [28] that the mapping $H : \mathbf{Q} \to \mathbf{Q}$ is a contraction with factor $\gamma < 1$: for any pair of Q-functions $Q\_1$ and $Q\_2$,

$$\left\| H(Q\_1) - H(Q\_2) \right\|\_\infty \le \gamma \left\| Q\_1 - Q\_2 \right\|\_\infty \tag{9}$$

and hence $H$ has a unique fixed point: $Q^\*$ satisfies $Q^\* = H(Q^\*)$, and the iteration converges to $Q^\*$ as $k \to \infty$. An optimal policy $h\_i^\*(x)$ can then be calculated from $Q^\*$ using (7). To perform this iteration, we need a model of the task in the form of the transition function $f$ and the reward functions $\rho\_i$.
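A minimal sketch of the iteration (8) for a small deterministic task is given below; the tables `f_table` and `rho_table` are hypothetical stand-ins for the known transition and reward functions:

```python
import numpy as np

# Sketch of model-based Q-iteration, Equation (8), on a tiny task.
S, U, gamma = 6, 2, 0.9
rng = np.random.default_rng(2)
f_table = rng.integers(0, S, size=(S, U))   # f(x, u) -> next state
rho_table = rng.random((S, U))              # rho(x, u) -> reward

Q = np.zeros((S, U))
for _ in range(500):
    # H(Q)(x, u) = rho(x, u) + gamma * max_u' Q(f(x, u), u')
    Q_new = rho_table + gamma * Q[f_table].max(axis=-1)
    if np.max(np.abs(Q_new - Q)) < 1e-8:    # reached Q* = H(Q*)
        Q = Q_new
        break
    Q = Q_new
```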

This kind of method, based on the Bellman optimality equation, needs to store and update the Q-values for each state-joint action pair. Consequently, only tasks with finite, discrete sets of states and actions are generally treated. Dimensionality problems arise as the number of agents involved in the task grows [29], which increases the computational complexity [30].

When the state space or action spaces are continuous, or even discrete with a great number of variables, the Q-functions must be represented in approximate form [31], because an exact representation of the Q-function can be impractical or intractable. We therefore propose an approximate linear fuzzy representation of the joint Q-function through a vector $\phi$.

### *2.4. Linear Fuzzy Parameterization of the Joint Q-Function*

In general, if there is no prior knowledge about the Q-function, the only way to have an exact representation is to store a distinct Q-value for each state-action pair. If the state space is continuous, an exact representation of the Q-function would require an infinite number of state-action values. For this reason, the only practical way to overcome this situation is to use an approximate representation of the joint Q-function.

In this section, we present a parameterized version of the Q-function through a linear fuzzy approximator, which consists of a vector $\phi \in \mathbb{R}^z$ that relies on a fuzzy partition of the continuous state space. The principal advantage of this proposal is that we only need to store the state-action pair Q-value at the center of every membership function.

There are *N* fuzzy sets, which are depicted by a membership function:

$$\mu\_d : X \to [0, 1] \quad d = 1, 2, \ldots, N \tag{10}$$

where $\mu\_d(x)$ describes the degree of membership of the state $x$ to the fuzzy set $d$; these membership functions can be viewed as basis functions or features [32]. The number of membership functions increases with the size of the state space, the number of agents, and the degree of resolution sought for the vector $\phi$.

Triangular fuzzy partitions are used in this paper since they attain their maximum value at a single point: for every $d$ there exists a unique $x\_d$ (the core of the membership function) such that $\mu\_d(x\_d) > \mu\_d(x)$ for all $x \neq x\_d$. Since all the other membership functions take the value zero at $x\_d$, i.e., $\mu\_{\hat{d}}(x\_d) = 0$ for all $\hat{d} \neq d$, we assume that $\mu\_d(x\_d) = 1$, which means the membership functions are normal. Other membership function shapes could make the iteration diverge when they take values that are too large at $x\_d$ [33].

We use $N$ triangular membership functions for each state variable $x\_e$, $e = 1, 2, \ldots, E$, with $\dim(X) = E$. An $E$-dimensional pyramid-shaped membership function results from the product of the single-dimensional membership functions in the fuzzy partition of the state space.
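The following sketch, with assumed core locations $x\_d$, illustrates a normal triangular partition and the product construction across dimensions:

```python
import numpy as np

# Sketch of a normal triangular fuzzy partition: each mu_d peaks at 1
# on its core x_d and neighbouring functions overlap. Cores are assumed.
cores = np.array([0.0, 0.5, 1.0])           # x_d for N = 3 sets per axis

def triangular_mu(x, cores):
    """Membership degrees of scalar x in every triangular set, Eq. (10).
    States outside [cores[0], cores[-1]] get zero membership here."""
    mu = np.zeros_like(cores)
    for d, c in enumerate(cores):
        left = cores[d - 1] if d > 0 else c
        right = cores[d + 1] if d < len(cores) - 1 else c
        if left < x <= c:
            mu[d] = (x - left) / (c - left)
        elif c <= x < right:
            mu[d] = (right - x) / (right - c)
        elif x == c:
            mu[d] = 1.0
    return mu

# E = 2: pyramid-shaped joint membership via the product across axes
x = (0.3, 0.7)
mu_joint = np.outer(triangular_mu(x[0], cores), triangular_mu(x[1], cores))
```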

We assume that the action space is discrete for all the agents and they have the same number of actions available:

$$U\_i = \left\{ u\_{ij} \mid i = 1, 2, \ldots, n, \; j = 1, 2, \ldots, M \right\} \tag{11}$$

The parameter vector $\phi$ is composed of $z = nNM$ elements to be stored; each membership function-action pair $(\mu\_d, u\_{ij})$ of each agent corresponds to an element $\phi\_{[i,d,j]}$ of the parameter vector. The approximator's elements $\phi\_{[i,d,j]}$ can be pictured as $n$ matrices of size $N \times M$, one per agent, where each of the $M$ columns holds the $N$ elements of one action. The elements of the $n$ matrices are stacked into the vector $\phi$ column by column for the first agent, followed by the next agent's columns, until the $n$ agents are completed.

Denoting the scalar indexes [*i*, *d*, *j*] of *φ* by:

$$[i, d, j] = d + (j - 1)N + (i - 1)NM \tag{12}$$

where $i = 1, 2, \ldots, n$ indexes the agent under analysis, $d = 1, 2, \ldots, N$ indexes the fuzzy partitions of each state variable $x\_e$, $e = 1, 2, \ldots, E$, with $\dim(X) = E$, and $j = 1, 2, \ldots, M$ indexes the actions, with $\dim(U\_i) = M$. In this way, the entry $\phi\_{[i,d,j]}$ of the parameter approximator is the approximate Q-value for the $d$-th membership function when agent $i$ performs its $j$-th available action.
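As a quick check of the index flattening (12), a small sketch with assumed sizes (1-based indices, as in the text):

```python
# Sketch of Equation (12): the (agent, membership, action) triple
# (i, d, j) maps to a single position in the flat vector phi.
def flat_index(i, d, j, N, M):
    """1-based (i, d, j) -> 1-based position in phi."""
    return d + (j - 1) * N + (i - 1) * N * M

# e.g. n = 2 agents, N = 3 membership functions, M = 2 actions:
# agent 2, membership 1, action 1 lands right after agent 1's block.
assert flat_index(2, 1, 1, N=3, M=2) == 7
```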

The fuzzy rule base takes the state $x$ as input and produces $M$ outputs for each agent, namely the Q-values corresponding to each action of every agent, $u\_{ij}$, $i = 1, 2, \ldots, n$, $j = 1, 2, \ldots, M$; these outputs are the elements $\phi\_{[i,d,j]}$. The proposed fuzzy rule base can be considered a zero-order Takagi-Sugeno rule base [34]:

$$\text{if } x \text{ is } \mu\_d(x) \text{ then } q\_{[i,1]} = \phi\_{[i,d,1]}; \dots; \ q\_{[i,M]} = \phi\_{[i,d,M]}$$

The approximate Q-value can be calculated by:

$$\tilde{Q}\left(\mathbf{x},\mathbf{u}\right) = \sum\_{i=1}^{n} \sum\_{d=1}^{N} \mu\_d\left(\mathbf{x}\right) \phi\_{[i,d,j]}\tag{13}$$

Expression (13) is a linear parameterized approximation: the Q-value of a specified state-action pair is estimated through a weighted sum, where the weights are generated by the membership functions [35]. This approximator can be denoted by an approximation mapping

$$F : \mathbb{R}^z \to \mathbf{Q} \tag{14}$$

where $\mathbb{R}^z$ is the parameter space; the parameter vector $\phi$ represents the approximation of the Q-function:

$$\tilde{Q}\left(\mathbf{x}, \mathbf{u}\right) = \left[F\left(\phi\right)\right]\left(\mathbf{x}, \mathbf{u}\right) \tag{15}$$

Thus we do not need to store a great number of Q-values, one for every pair $(\mathbf{x}, \mathbf{u})$; only the $z$ parameters in $\phi$ are needed. The approximation mapping $F$ only represents a subset of $\mathbf{Q}$ [36].

From the reinforcement learning point of view, linear parameterized approximators such as $F$ are preferred since they are more amenable to theoretical analysis. This is the reason we choose a linear parameterization $\phi$; in this setting, the normalized membership functions can be considered basis functions [37].
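A brief sketch of evaluating (13)/(15) follows; for readability the parameter vector is held as an assumed `(n, N, M)` array rather than the flat vector of (12):

```python
import numpy as np

# Sketch of the linear parameterization (13)/(15): the approximate
# Q-value of joint-action index j is a membership-weighted sum of the
# stored parameters. Shapes phi[i, d, j] and mu_x[d] are assumptions.
n, N, M = 2, 3, 2
rng = np.random.default_rng(3)
phi = rng.random((n, N, M))       # parameters, reshaped for clarity
mu_x = np.array([0.4, 0.6, 0.0])  # mu_d(x) for the current state

def q_approx(phi, mu_x, j):
    """Q~(x, u_j) = sum_i sum_d mu_d(x) * phi[i, d, j]."""
    return float(np.sum(mu_x[None, :, None] * phi, axis=(0, 1))[j])
```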

Expression (15) provides $\tilde{Q}$, an approximate Q-function, in place of the exact Q-function $Q$, and this approximation is supplied to the mapping $H$:

$$\bar{Q}\_{k+1}\left(\mathbf{x}, \mathbf{u}\right) = \left(H \circ F\right)\left(\phi\_k\right)\left(\mathbf{x}, \mathbf{u}\right) \tag{16}$$

Most of the time the Q-function $\bar{Q}$ cannot be stored explicitly [38]; as an alternative, it has to be represented in approximate form using a projection mapping $P : \mathbf{Q} \to \mathbb{R}^z$

$$\phi\_{k+1} = P(\bar{Q}\_{k+1}) \tag{17}$$

which ensures that $\tilde{Q}(\mathbf{x}, \mathbf{u}) = [F(\phi)](\mathbf{x}, \mathbf{u})$ is as close as possible to $\bar{Q}(\mathbf{x}, \mathbf{u})$ [39], in the least-squares sense:

$$P(Q) = \phi^\*, \quad \phi^\* = \arg\min\_{\phi} \sum\_{\lambda=1}^{z} \left( Q\left(\mathbf{x}\_{\lambda}, \mathbf{u}\_{\lambda}\right) - \left[F\left(\phi\right)\right]\left(\mathbf{x}\_{\lambda}, \mathbf{u}\_{\lambda}\right)\right)^2\tag{18}$$

where a set of state-joint action samples $(\mathbf{x}\_\lambda, \mathbf{u}\_\lambda)$ is used. Because of the triangular membership function shapes and the linear parameterized approximation mapping $F$, (18) is a convex quadratic optimization problem in which $z = nNM$ samples are used [40], so expression (18) reduces to an assignment of the form:

$$\phi\_{[i,d,j]} = P(Q)\_{[i,d,j]} = Q(\mathbf{x}\_d, u\_{ij}) \tag{19}$$

Recapitulating, the approximate linear fuzzy representation of the joint Q-function begins with an arbitrary value of the parameter vector $\phi$ and updates it in each iteration using the mapping:

$$\phi\_{k+1} = \left(P \circ H \circ F\right)\left(\phi\_k\right) \tag{20}$$

and stops when the difference between two consecutive parameter vectors $\phi$ falls below a threshold $\xi$:

$$\left\| \phi\_{k+1} - \phi\_{k} \right\|\_\infty \leq \xi \tag{21}$$
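At this level of abstraction, the iteration (20) with the stopping rule (21) can be sketched as below; `p_h_f` is a hypothetical stand-in for the composed mapping $P \circ H \circ F$:

```python
import numpy as np

# Sketch of the fixed-point iteration (20)-(21) on the parameter
# vector: repeat phi <- (P o H o F)(phi) until two consecutive vectors
# differ by less than xi in the infinity norm.
def iterate_to_fixed_point(p_h_f, phi0, xi=1e-6, max_iter=10_000):
    phi = phi0
    for _ in range(max_iter):
        phi_next = p_h_f(phi)
        if np.linalg.norm(phi_next - phi, ord=np.inf) <= xi:  # (21)
            return phi_next
        phi = phi_next
    return phi
```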

A greedy policy to control the system can be obtained from $\phi^\*$ (the parameter vector obtained as $k \to \infty$): for any state, the actions are calculated by interpolation between the best local actions of each agent at every membership function core $x\_d$:

$$h\_i^\*\left(\mathbf{x}\right) = \sum\_{d=1}^{N} \mu\_d\left(\mathbf{x}\right) u\_{i, j\_{i,d}^\*}, \quad \text{where } j\_{i,d}^\* \in \arg\max\_{j} \left[F(\phi^\*)\right](\mathbf{x}\_d, \mathbf{u}) \tag{22}$$

To implement the update (20), we propose a procedure using a modified version of the Q-learning algorithm [19] in which the linear parameterization is added; in this way the algorithm extends to multi-agent problems with continuous state space but discrete action space. The algorithm starts from an arbitrary $\phi$ (it can be $\phi = 0$) and iterates until the threshold $\xi$ is reached.

### *2.5. Reinforcement Learning Algorithm for Continuous State Space*

The linear fuzzy approximation depicted by (20) can be implemented by the following algorithm, which uses a modified version of the Q-learning algorithm. To set the correspondence between the algorithm and expression (20), the right-hand side of the update step can be seen as (16) followed by (17). Here the dynamics $f$, the reward function $\rho$, and the discount factor $\gamma$ are known in the form of batch sample data.

1. Let $\alpha \in (0, 1]$, $\varepsilon \in (0, 1]$, and set

$$\phi \longleftarrow 0 \tag{23}$$

where $(1 - \varepsilon)$ is the probability of choosing a greedy action in the state $x$, and $\varepsilon$ is the probability of choosing a random joint action in $\mathbf{U}$.

• For state $\mathbf{x}$, we select a joint action $\mathbf{u} \in \mathbf{U}$ with a suitable exploration: at each step a random action is chosen with probability $\varepsilon \in (0, 1)$.

• Applying the linear fuzzy parameterization with membership functions $\mu\_d$, $d = 1, \ldots, N$, discrete actions $u\_{ij}$, $j = 1, \ldots, M$, and threshold $\xi > 0$, update:

$$\begin{aligned} \phi\_{[i,d,j],k+1} & \longleftarrow \phi\_{[i,d,j],k} + \alpha\_k \left[r\_{k+1} + \beta - \Gamma\right] \Omega \\ \beta &= \gamma \max\_{j'} \sum\_{i=1}^{n} \sum\_{d'=1}^{N} \mu\_{d'} \left(f \left(\mathbf{x}\_{k}, \mathbf{u}\_k\right)\right) \phi\_{[i,d',j']} \end{aligned} \tag{24}$$

$$\Gamma = \sum\_{i=1}^{n} \sum\_{d=1}^{N} \mu\_d \left( f \left( \mathbf{x}\_{k}, \mathbf{u}\_k \right) \right) \phi\_{[i,d,j]} \tag{25}$$

$$\Omega = \sum\_{i=1}^{n} \sum\_{d=1}^{N} \mu\_d \left( f \left( \mathbf{x}\_{k}, \mathbf{u}\_k \right) \right) \tag{26}$$

• Until:

$$\left\| \phi\_{k+1} - \phi\_k \right\|\_\infty \leq \xi \tag{27}$$

• Output:

$$
\phi^\* = \phi\_{k+1} \tag{28}
$$

A greedy policy is obtained to control the system by:

$$h\_i^\* \left( \mathbf{x} \right) = \sum\_{d=1}^{N} \mu\_d \left( \mathbf{x} \right) u\_{i, j\_{i,d}^\*} \quad \text{where } j\_{i,d}^\* \in \arg\max\_{j} \left[ F(\phi^\*) \right] \left( \mathbf{x}\_d, \mathbf{u} \right) \tag{29}$$

where $j\_{i,d}^\*$ is the optimal action index at the core $x\_d$ for agent $i$.
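A hedged sketch of one parameter update following (24)-(26) is given below; the successor-state membership vector, the executed per-agent action indices, and the sampled reward are assumed to be available, as the batch setting above provides:

```python
import numpy as np

# Sketch of one update (24)-(26). phi has shape (n, N, M); mu_next is
# the membership vector mu_d(f(x_k, u_k)) over the N sets; j_taken[i]
# is the action index agent i executed; r is the sampled reward.
def update_phi(phi, mu_next, j_taken, r, alpha=0.5, gamma=0.9):
    phi = phi.copy()
    n, N, M = phi.shape
    # beta: discounted maximum approximate Q over next action index j'
    beta = gamma * np.max(np.sum(mu_next[None, :, None] * phi,
                                 axis=(0, 1)))
    # Gamma, Eq. (25): approximate Q-value of the executed joint action
    taken = phi[np.arange(n), :, j_taken]              # shape (n, N)
    Gamma = float(np.sum(mu_next[None, :] * taken))
    # Omega, Eq. (26): membership mass summed over agents and sets
    Omega = n * float(np.sum(mu_next))
    # Eq. (24): temporal-difference step on the executed entries
    phi[np.arange(n), :, j_taken] += alpha * (r + beta - Gamma) * Omega
    return phi
```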

**Theorem 1.** *The algorithm with linear fuzzy parameterization (20) converges to a fixed vector φ*<sup>∗</sup>*.*

**Proof of Theorem 1.** Consider the mapping given by $F$,

$$\left[F\left(\phi\right)\right](\mathbf{x},\mathbf{u}) = \sum\_{i=1}^{n}\sum\_{d=1}^{N}\mu\_{d}\left(\mathbf{x}\right)\,\phi\_{[i,d,j]}\tag{30}$$

The convergence of the algorithm is guaranteed by ensuring that the composite mapping $P \circ H \circ F$ is a contraction in the infinity norm. [28] shows that the mapping $H$ is a contraction, so it remains to demonstrate that $F$ and $P$ are non-expansions. The mapping given by $F$ is a weighted linear combination of membership functions:

$$\begin{aligned} & \left\| \left[F\left(\phi\right)\right]\left(\mathbf{x}, \mathbf{u}\right) - \left[F\left(\phi'\right)\right]\left(\mathbf{x}, \mathbf{u}\right) \right\|\_\infty \\ &= \left\| \sum\_{i=1}^{n} \sum\_{d=1}^{N} \mu\_{d}\left(\mathbf{x}\right) \phi\_{[i,d,j]} - \sum\_{i=1}^{n} \sum\_{d=1}^{N} \mu\_{d}\left(\mathbf{x}\right) \phi'\_{[i,d,j]} \right\|\_\infty \\ &= \left\| \sum\_{i=1}^{n} \sum\_{d=1}^{N} \mu\_{d}\left(\mathbf{x}\right) \left[ \phi\_{[i,d,j]} - \phi'\_{[i,d,j]} \right] \right\|\_\infty \\ &\leq \sum\_{i=1}^{n} \sum\_{d=1}^{N} \left| \mu\_{d}\left(\mathbf{x}\right) \right| \left\| \phi\_{[i,d,j]} - \phi'\_{[i,d,j]} \right\|\_\infty \\ &\leq \sum\_{i=1}^{n} \sum\_{d=1}^{N} \mu\_{d}\left(\mathbf{x}\right) \left\| \phi - \phi' \right\|\_\infty \\ &\leq \left\| \phi - \phi' \right\|\_{\infty} \end{aligned} \tag{31}$$

where the last step holds because the normalized membership functions $\mu\_d(\mathbf{x})$ sum to 1, and the product generated by each agent is also 1. This shows that the mapping $F$ is a non-expansion. The mapping $P$ is

$$P(Q)\_{[i,d,j]} = Q(\mathbf{x}\_d, u\_{ij}) \tag{32}$$

and the samples are the centers of the membership functions, where $\mu\_d(x\_d) = 1$, so the mapping $P$ is a non-expansion [33]. Since the mapping $H$ is a contraction with factor $\gamma < 1$, the composition $P \circ H \circ F$ is also a contraction with factor $\gamma$:

$$\left\| \left( P \circ H \circ F \right) \left( \phi \right) - \left( P \circ H \circ F \right) \left( \phi' \right) \right\|\_\infty \leq \gamma \left\| \phi - \phi' \right\|\_{\infty} \tag{33}$$

Hence $P \circ H \circ F$ has a unique fixed vector $\phi^\*$, and the algorithm above converges to this fixed point as $k \to \infty$.

**Theorem 2.** *For any choice of threshold $\xi > 0$ and any initial parameter vector $\phi\_0 \in \mathbb{R}^z$, the algorithm with linear fuzzy parameterization completes in finite time.*

**Proof of Theorem 2.** As shown in Theorem 1, the mapping $P \circ H \circ F$ is a contraction with factor $\gamma < 1$ and fixed vector $\phi^\*$:

$$\begin{aligned} \left\|\phi\_{k+1} - \phi^\*\right\|\_{\infty} &= \left\| \left( P \circ H \circ F \right) \left( \phi\_k \right) - \left( P \circ H \circ F \right) \left( \phi^\* \right) \right\|\_\infty \\ &\leq \gamma \left\| \phi\_k - \phi^\* \right\|\_{\infty} \end{aligned} \tag{34}$$

Since $\|\phi\_{k+1} - \phi^\*\|\_{\infty} \leq \gamma \|\phi\_k - \phi^\*\|\_{\infty}$, by induction $\|\phi\_k - \phi^\*\|\_{\infty} \leq \gamma^k \|\phi\_0 - \phi^\*\|\_{\infty}$ for $k > 0$. By the Banach fixed point theorem, $\phi^\*$ is bounded; since the vector where the iteration starts is bounded, $\|\phi\_0 - \phi^\*\|\_{\infty}$ is also bounded. Let $G\_0 = \|\phi\_0 - \phi^\*\|\_{\infty}$, which is bounded, so that $\|\phi\_k - \phi^\*\|\_{\infty} \leq \gamma^k G\_0$ for $k > 0$. Applying the triangle inequality:

$$\begin{aligned} \|\phi\_{k+1} - \phi\_k\|\_{\infty} &\leq \|\phi\_{k+1} - \phi^\*\|\_{\infty} + \|\phi\_k - \phi^\*\|\_{\infty} \\ &\leq \gamma^{k+1} G\_0 + \gamma^k G\_0 = \gamma^k G\_0 \left[\gamma + 1\right] \end{aligned} \tag{35}$$

Setting $\gamma^k G\_0 \left[\gamma + 1\right] = \xi$,

$$\gamma^k = \frac{\xi}{G\_0\left[\gamma + 1\right]}\tag{36}$$

Taking the base-$\gamma$ logarithm on both sides of the above expression,

$$k = \log\_{\gamma} \left[ \frac{\xi}{G\_0 \left[ \gamma + 1 \right]} \right] \tag{37}$$

Since $G\_0 = \|\phi\_0 - \phi^\*\|\_{\infty}$ is bounded and $\gamma < 1$, $k$ is finite. So the algorithm terminates in at most $k$ iterations.
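For a concrete feel of the bound (37), take illustrative values $\gamma = 0.9$, $\xi = 0.01$, and $G\_0 = 1$ (these numbers are assumptions for illustration, not taken from the experiments):

$$k = \log\_{0.9} \left[ \frac{0.01}{1 \cdot \left[ 0.9 + 1 \right]} \right] = \frac{\ln(0.01 / 1.9)}{\ln(0.9)} \approx 49.8$$

so the stopping criterion (21) would be met after roughly 50 iterations in this case.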
