*3.1. First Layer*

The first layer performs the model's fuzzification. It is responsible for fuzzifying the data set entries through a clustering process based on clouds, which are identified through data-density concepts. This procedure creates the Gaussian neurons of the first layer. These neurons have weights in the range [0, 1] that are defined according to separability criteria over the dimensions of the problem. For each input variable $x_{ij}$, $L$ neurons $A_{lj}$, $l = 1, \dots, L$, are defined, whose activation functions are composed of membership functions in the corresponding neurons. The Gaussian neurons created in this layer are expressed by:

$$a_{jl} = \mu_{A_l}, \quad j = 1, \dots, N, \ l = 1, \dots, L. \tag{1}$$

where $a_{jl}$ ($\mu_{A_l}$) represents the membership degree of the corresponding input submitted to the model, $N$ is the number of inputs (features), and $L$ represents the number of neurons per feature [38].

The respective weights of each Gaussian neuron formed by the fuzzification process are created using the feature-weight separability criterion developed in [39]. These weights are expressed by:

$$w_{il}, \quad i = 1, \dots, N, \ l = 1, \dots, L \tag{2}$$
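To make the role of these neurons concrete, the following is a minimal Python sketch; the Gaussian parameterization and all names here are illustrative assumptions, since in the actual model the centers and spreads come from the SODA clouds described next and the weights $w_{il}$ from the separability criterion of Section 3.1.2:

```python
import numpy as np

def gaussian_memberships(x, centers, sigmas):
    """Membership degrees a_jl in [0, 1] of one sample x (N features)
    to L Gaussian neurons per feature, in the spirit of Eq. (1)."""
    x = np.asarray(x, dtype=float)[:, None]   # shape (N, 1), broadcast over L
    return np.exp(-0.5 * ((x - centers) / sigmas) ** 2)

# toy usage: 2 features, 3 neurons per feature
centers = np.array([[0.1, 0.5, 0.9],
                    [0.1, 0.5, 0.9]])
sigmas = np.full((2, 3), 0.2)
a = gaussian_memberships([0.2, 0.7], centers, sigmas)
print(a.shape)   # (2, 3): one membership degree per feature/neuron pair
```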

The fuzzification process used by the model is described below.

#### 3.1.1. Self-Organizing Direction-Aware Data Partitioning

Self-Organizing Direction-Aware Data Partitioning (SODA) is a data partitioning approach based on empirical data analysis [40], which can identify peaks/modes of the data density and use them as focal points. Data clouds can be considered a particular type of cluster with distinctive properties: they are non-parametric, as they do not assume a pre-defined data distribution, which is commonly unknown. The technique combines a magnitude component, based on a traditional distance metric, with a directional/angular component based on cosine similarity [9]. This approach is suitable for the online processing of streaming data.

The definitions of the empirical data analysis [41] operators used in the fuzzification approach are listed below. SODA, adopted in this paper, updates the clouds as new information changes their data density. These modifications produce new cloud shapes and directly impact how the model forms its outputs. For the SODA approach, consider:


- $\{u\}_{\Psi_K} = \{u_1, u_2, \dots, u_{\Psi_K}\}$, the set of unique data samples of $\{x\}_K = \{x_1, x_2, \dots, x_K\}$, with corresponding numbers of occurrence $\{f_1, f_2, \dots, f_{\Psi_K}\}$, such that $\sum_{i=1}^{\Psi_K} f_i = K$.

Based on $u_1, u_2, \dots, u_{\Psi_K}$ and $f_1, f_2, \dots, f_{\Psi_K}$, it is possible to reconstruct the data set $x_1, x_2, \dots, x_K$ exactly if necessary, regardless of the order of arrival of the data points [41].

The first empirical data analysis operator is the cumulative proximity, defined as the sum of the squared distances between a given sample and all samples present in the model evaluation. The cumulative proximity is defined as follows [41]:

$$\pi\_K(\mathbf{x}\_i) = \sum\_{j=1}^K d^2(\mathbf{x}\_i, \mathbf{x}\_j); \quad i = 1, 2, \dots, K \tag{3}$$

where d(*xi*, *xj*) denotes the distance between *xi* and *xj*, which can be Euclidean or cosine, among others [42].
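As an illustration, Equation (3) with the Euclidean distance can be computed directly as below; this brute-force form is only a sketch, since the model relies on the recursive quantities introduced later:

```python
import numpy as np

def cumulative_proximity(X):
    """Cumulative proximity pi_K(x_i) = sum_j d^2(x_i, x_j), Eq. (3),
    with the squared Euclidean distance. X has shape (K, n)."""
    diff = X[:, None, :] - X[None, :, :]        # all pairwise differences
    return (diff ** 2).sum(axis=2).sum(axis=1)  # one value per sample
```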

The second operator for the determination of data clouds is the unimodal (or local) density, which in turn is determined by [41]:

$$D\_{K}(\mathbf{x}\_{i}) = \frac{\sum\_{l=1}^{K} \pi\_{K}(\mathbf{x}\_{l})}{2K\pi\_{K}(\mathbf{x}\_{i})} = \frac{\sum\_{l=1}^{K} \sum\_{j=1}^{K} d^{2}(\mathbf{x}\_{l}, \mathbf{x}\_{j})}{2K\sum\_{j=1}^{K} d^{2}(\mathbf{x}\_{i}, \mathbf{x}\_{j})}; \qquad i = 1, 2, \dots, K \tag{4}$$

where, for the Euclidean distance, $\sum_{l=1}^{K} d^2(x_i, x_l) = \sum_{l=1}^{K} \|x_i - x_l\|^2$ and $\sum_{l=1}^{K}\sum_{j=1}^{K} d^2(x_l, x_j) = \sum_{l=1}^{K}\sum_{j=1}^{K} \|x_l - x_j\|^2$. It can be simplified using the mean of $\{x\}_K$, denoted $\varphi_K$, and the average scalar product, $X_K$, as in [43]:

$$\sum_{l=1}^{K}\left\|\mathbf{x}_i - \mathbf{x}_l\right\|^2 = K\left(\left\|\mathbf{x}_i - \boldsymbol{\varphi}_K\right\|^2 + X_K - \left\|\boldsymbol{\varphi}_K\right\|^2\right) \tag{5}$$

$$\sum_{l=1}^{K}\sum_{j=1}^{K}\left\|\mathbf{x}_l - \mathbf{x}_j\right\|^2 = 2K^2\left(X_K - \left\|\boldsymbol{\varphi}_K\right\|^2\right) \tag{6}$$
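Equations (5) and (6) let the local density be computed from the mean and the average scalar product alone. A minimal sketch of Equation (4) using this simplification, so that no pairwise distances need to be stored:

```python
import numpy as np

def local_density(X):
    """Unimodal (local) density of Eq. (4), using the simplifications of
    Eqs. (5) and (6) so that no pairwise distances are ever computed."""
    K = len(X)
    mu = X.mean(axis=0)                    # mean of {x}_K
    Xk = (X ** 2).sum(axis=1).mean()       # average scalar product X_K
    pi = K * (((X - mu) ** 2).sum(axis=1) + Xk - mu @ mu)   # Eq. (5)
    total = 2 * K ** 2 * (Xk - mu @ mu)                     # Eq. (6)
    return total / (2 * K * pi)                             # Eq. (4)
```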

Finally, the third empirical data analysis operator is the global density ($D_K^G$). It weights the local density of each unique data sample by its number of occurrences in $f_1, f_2, \dots, f_{\Psi_K}$. It is defined for the unique data samples and their corresponding numbers of repetitions in the data set/stream, and is expressed in (7).

$$D_K^G(\mathbf{u}_k) = f_k\, D_K(\mathbf{u}_k) = \frac{f_k}{1 + \frac{\|\mathbf{u}_k - \boldsymbol{\varphi}_K\|^2}{X_K - \|\boldsymbol{\varphi}_K\|^2}} \tag{7}$$

The essence of an evolving model is to have its parameters evolve as new samples emerge with new information. The empirical data analysis operators can be updated recursively. This update can be seen below:

$$\boldsymbol{\varphi}_k = \frac{k-1}{k}\boldsymbol{\varphi}_{k-1} + \frac{1}{k}\mathbf{x}_k; \quad \boldsymbol{\varphi}_1 = \mathbf{x}_1; \quad k = 1, 2, \dots, K \tag{8}$$

$$X\_k = \frac{k-1}{k}X\_{k-1} + \frac{1}{k}||\mathbf{x}\_k||^2;\ X\_1 = ||\mathbf{x}\_1||^2;\ k = 1,2,\ldots,K\tag{9}$$

$$D_K(\mathbf{x}_k) = \frac{1}{1 + \frac{\|\mathbf{x}_k - \boldsymbol{\varphi}_K\|^2}{X_K - \|\boldsymbol{\varphi}_K\|^2}} \tag{10}$$
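A short sketch of this recursive scheme (Equations (8)-(10)); the initialization with the first sample follows Equations (8) and (9):

```python
import numpy as np

def recursive_update(mu, X_scalar, x_k, k):
    """One-step recursive update of the mean (Eq. (8)) and the average
    scalar product (Eq. (9)), then the local density of Eq. (10)."""
    mu = (k - 1) / k * mu + x_k / k
    X_scalar = (k - 1) / k * X_scalar + (x_k @ x_k) / k
    D = 1.0 / (1.0 + ((x_k - mu) @ (x_k - mu)) / (X_scalar - mu @ mu))
    return mu, X_scalar, D

# streaming usage: initialize with the first sample, as in Eqs. (8)-(9)
stream = [np.array([0.1, 0.2]), np.array([0.4, 0.1]), np.array([0.3, 0.3])]
mu, Xs = stream[0], stream[0] @ stream[0]
for k, x in enumerate(stream[1:], start=2):
    mu, Xs, D = recursive_update(mu, Xs, x, k)
```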

The most typical distance metric, the Euclidean distance, was adopted in SODA as the magnitude component; between $x_i$ and $x_j$ it can be represented by [9]:

$$d_M(\mathbf{x}_i, \mathbf{x}_j) = \left\|\mathbf{x}_i - \mathbf{x}_j\right\| = \sqrt{\sum_{k=1}^{m}(x_{ik} - x_{jk})^2}, \quad i, j = 1, 2, \dots, n \tag{11}$$

The angular component, which incorporates the ideas of cosine similarity into the SODA model, can be expressed as [9]:

$$d_A(\mathbf{x}_i, \mathbf{x}_j) = \sqrt{1 - \cos\left(\theta_{\mathbf{x}_i, \mathbf{x}_j}\right)}, \quad i, j = 1, 2, \dots, n \tag{12}$$

where $\cos(\theta_{\mathbf{x}_i, \mathbf{x}_j}) = \frac{\langle \mathbf{x}_i, \mathbf{x}_j\rangle}{\|\mathbf{x}_i\|\,\|\mathbf{x}_j\|}$ expresses the cosine of the angle between $\mathbf{x}_i$ and $\mathbf{x}_j$ [9].
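Both components can be sketched in a few lines (the clipping of tiny negative values is a numerical safeguard, not part of the definition):

```python
import numpy as np

def d_magnitude(xi, xj):
    """Magnitude (Euclidean) component, Eq. (11)."""
    return np.linalg.norm(xi - xj)

def d_angular(xi, xj):
    """Angular component from cosine similarity, Eq. (12)."""
    cos = (xi @ xj) / (np.linalg.norm(xi) * np.linalg.norm(xj))
    return np.sqrt(max(0.0, 1.0 - cos))  # clip tiny negatives from rounding
```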

The magnitude and angular components are applied jointly; in this way, the problem can be represented on a 2D plane, called the direction-aware plane [9]. The empirical data analytics operators [40] employed in this approach are the cumulative proximity (3), the local density (4), and the global density (7).

The recursive update of these parameters follows the same concepts as the empirical data analysis operators; the formulations are given in Equations (8) and (9). The angular component is updated similarly, as expressed below [9]:

$$\boldsymbol{\varphi}_n^A = \frac{n-1}{n}\boldsymbol{\varphi}_{n-1}^A + \frac{1}{n}\frac{\mathbf{x}_n}{\|\mathbf{x}_n\|}; \quad \boldsymbol{\varphi}_1^A = \frac{\mathbf{x}_1}{\|\mathbf{x}_1\|} \tag{13}$$

When the Euclidean distance is used for the local density in the SODA approach, it can be represented as follows [9]:

$$\sum_{j=1}^{n}\pi_n^M(\mathbf{x}_j) = 2n^2\left(X_n^M - \left\|\boldsymbol{\varphi}_n^M\right\|^2\right) \tag{14}$$

The initial phases of the SODA algorithm consist of, firstly, composing the different direction-aware planes of the recognized data samples using both the magnitude-based and angular-based densities; secondly, identifying focal points; and finally, using the focal points to partition the data space into data clouds [9]. The algorithm is executed in the following steps:

Stage 1—Preparation: Estimate the average values between every pair of data samples $\mathbf{x}_1, \mathbf{x}_2, \dots, \mathbf{x}_n$ for both the squared Euclidean components, $d_M^2$, and the squared angular components, $d_A^2$ [9].

$$\overline{d}_M^2 = \frac{\sum_{i=1}^{n}\sum_{j=1}^{n} d_M^2(\mathbf{x}_i, \mathbf{x}_j)}{n^2} = \frac{\sum_{i=1}^{n}\sum_{j=1}^{n}\left\|\mathbf{x}_i - \mathbf{x}_j\right\|^2}{n^2} = 2\left(X_n^M - \left\|\boldsymbol{\varphi}_n^M\right\|^2\right) \tag{15}$$

$$\overline{d}_A^2 = \frac{\sum_{i=1}^{n}\sum_{j=1}^{n} d_A^2(\mathbf{x}_i, \mathbf{x}_j)}{n^2} = \frac{\sum_{i=1}^{n}\sum_{j=1}^{n}\left\|\frac{\mathbf{x}_i}{\|\mathbf{x}_i\|} - \frac{\mathbf{x}_j}{\|\mathbf{x}_j\|}\right\|^2}{2n^2} = 1 - \left\|\boldsymbol{\varphi}_n^A\right\|^2 \tag{16}$$
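A sketch of Stage 1 that obtains both averages from the means alone, as Equations (15) and (16) allow (names are illustrative):

```python
import numpy as np

def stage1_averages(X):
    """Stage 1: average squared magnitude/angular distances of
    Eqs. (15)-(16), obtained from the means rather than all n^2 pairs."""
    mu_M = X.mean(axis=0)                                # magnitude mean
    X_M = (X ** 2).sum(axis=1).mean()                    # avg scalar product
    U = X / np.linalg.norm(X, axis=1, keepdims=True)     # unit-norm samples
    mu_A = U.mean(axis=0)                                # angular mean
    d2_M = 2.0 * (X_M - mu_M @ mu_M)                     # Eq. (15)
    d2_A = 1.0 - mu_A @ mu_A                             # Eq. (16)
    return d2_M, d2_A
```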

After computing the global density, SODA ranks the unique samples in descending order of global density and renames them as $\{\hat{\boldsymbol{\Phi}}_1, \hat{\boldsymbol{\Phi}}_2, \dots, \hat{\boldsymbol{\Phi}}_{\Psi_K}\}$ [9].

Stage 2—Direction-Aware Plane Projection: The direction-aware projection process starts with the unique data sample with the highest global density, namely $\hat{\boldsymbol{\Phi}}_1$. It is initially set as the first reference, $\boldsymbol{\varphi}_1 = \hat{\boldsymbol{\Phi}}_1$, which is also the origin point of the first direction-aware plane, denoted by $P_1$ ($\Psi = 1$, where $\Psi$ is the number of existing direction-aware planes in the data space). The following rule describes the second step of the algorithm [9]:

$$\textbf{Condition 1} \quad IF\left(\frac{d_M(\boldsymbol{\varphi}_l, \hat{\boldsymbol{\Phi}}_j)}{\overline{d}_M} < \frac{1}{\vartheta}\right) \ AND\left(\frac{d_A(\boldsymbol{\varphi}_l, \hat{\boldsymbol{\Phi}}_j)}{\overline{d}_A} < \frac{1}{\vartheta}\right) \ THEN\left(\hat{\boldsymbol{\Phi}}_j \text{ is assigned to } P_l\right) \tag{17}$$

where $\vartheta$ controls the granularity of the clouds. When several direction-aware planes satisfy the criteria of Equation (17) at the same time, $\hat{\boldsymbol{\Phi}}_j$ is assigned to the nearest one according to the following equation [9]:

$$l^{*} = \operatorname*{arg\,min}_{l=1,2,\dots,\Psi}\left(\frac{d_M(\boldsymbol{\varphi}_l, \hat{\boldsymbol{\Phi}}_j)}{\overline{d}_M} + \frac{d_A(\boldsymbol{\varphi}_l, \hat{\boldsymbol{\Phi}}_j)}{\overline{d}_A}\right) \tag{18}$$

In this step, the mean, the support (number of samples), and the sum of the global density of the winning plane ($\boldsymbol{\varphi}_i$, $S_i$, and $D_i$, respectively) are updated as follows [9]:

$$\boldsymbol{\varphi}_i = \frac{S_i}{S_i + 1}\boldsymbol{\varphi}_i + \frac{1}{S_i + 1}\hat{\boldsymbol{\Phi}}_j \tag{19}$$

$$S_i = S_i + 1 \tag{20}$$

$$D_i = D_i + D_n^G\left(\hat{\boldsymbol{\Phi}}_j\right) \tag{21}$$

Nonetheless, if Equation (17) is not fulfilled, a new direction-aware plane ($P_{\Psi+1}$) and new references are created, and the parameters involved in the analysis are updated as follows [9]:

$$\Psi = \Psi + 1 \tag{22}$$

$$\boldsymbol{\varphi}_{\Psi} = \hat{\boldsymbol{\Phi}}_j \tag{23}$$

$$S_{\Psi} = 1 \tag{24}$$

$$D_{\Psi} = D_n^G\left(\hat{\boldsymbol{\Phi}}_j\right) \tag{25}$$

This procedure continues until all samples are organized. In this situation, some data samples may be located on several direction-aware planes simultaneously, depending on the behavior of the problem. In this case, the final assignment of these samples is defined by the distances between them and the origin points of the competing direction-aware planes [9].
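The whole projection stage can be summarized in the following sketch. Function and variable names are illustrative rather than the authors' implementation; `theta` stands for the granularity parameter $\vartheta$, and the inputs are assumed to come from Stage 1 and the global-density ranking:

```python
import numpy as np

def stage2_projection(Phi, DG, d2_M, d2_A, theta=6.0):
    """Sketch of Stage 2: project the unique samples Phi (ranked by
    descending global density DG) onto direction-aware planes, following
    Condition 1 (Eq. (17)), Eq. (18) and the updates of Eqs. (19)-(25)."""
    dM_bar, dA_bar = np.sqrt(d2_M), np.sqrt(d2_A)
    origins, support, dens = [Phi[0]], [1], [DG[0]]
    for phi_j, dg in zip(Phi[1:], DG[1:]):
        scores, close = [], []
        for o in origins:
            dm = np.linalg.norm(phi_j - o) / dM_bar
            cos = (phi_j @ o) / (np.linalg.norm(phi_j) * np.linalg.norm(o))
            da = np.sqrt(max(0.0, 1.0 - cos)) / dA_bar
            close.append(dm < 1 / theta and da < 1 / theta)  # Condition 1
            scores.append(dm + da)                           # Eq. (18)
        if any(close):
            l = int(np.argmin(np.where(close, scores, np.inf)))
            s = support[l]
            origins[l] = s / (s + 1) * origins[l] + phi_j / (s + 1)  # (19)
            support[l] += 1                                          # (20)
            dens[l] += dg                                            # (21)
        else:                                   # Eqs. (22)-(25): new plane
            origins.append(phi_j); support.append(1); dens.append(dg)
    return origins, support, dens
```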

Stage 3—Identifying the Focal Points: For each direction-aware plane, denoted as $P_\beta$, find the neighboring direction-aware planes $\{P\}_\beta^n$, $l = 1, 2, \dots, \Psi$, $l \neq \beta$. The following rule determines the neighboring planes [9]:

$$\textbf{Condition 2} \quad IF\left(\frac{d_M(\boldsymbol{\varphi}_\beta, \boldsymbol{\varphi}_l)}{\overline{d}_M} \leq \frac{2}{\vartheta}\right) \ AND\left(\frac{d_A(\boldsymbol{\varphi}_\beta, \boldsymbol{\varphi}_l)}{\overline{d}_A} \leq \frac{2}{\vartheta}\right) \ THEN\left(\{P\}_\beta^n = \{P\}_\beta^n \cup \{P_l\}\right) \tag{26}$$

The direction-aware plane $P_\beta$ is selected as a central mode/peak of the data density if the following condition holds (assuming the corresponding $D$ values of $\{P\}_\beta^n$, $l = 1, 2, \dots, \Psi$, $l \neq \beta$, are expressed by $\{D\}_\beta^n$) [9]:

$$\textbf{Condition 3} \quad IF\left(D_\beta > \max\left(\{D\}_\beta^n\right)\right) \ THEN\left(P_\beta \text{ is a mode/peak of } D\right) \tag{27}$$

All peaks are found by applying Conditions 2 and 3 jointly.
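A sketch of this peak search, under the same illustrative assumptions as the previous snippet:

```python
import numpy as np

def stage3_peaks(origins, dens, d2_M, d2_A, theta=6.0):
    """Sketch of Stage 3: a plane is a density mode/peak (Condition 3,
    Eq. (27)) when its D value exceeds those of all neighbors selected
    by Condition 2 (Eq. (26))."""
    dM_bar, dA_bar = np.sqrt(d2_M), np.sqrt(d2_A)
    peaks = []
    for b, ob in enumerate(origins):
        neigh = []
        for l, ol in enumerate(origins):
            if l == b:
                continue
            dm = np.linalg.norm(ob - ol) / dM_bar
            cos = (ob @ ol) / (np.linalg.norm(ob) * np.linalg.norm(ol))
            da = np.sqrt(max(0.0, 1.0 - cos)) / dA_bar
            if dm <= 2 / theta and da <= 2 / theta:     # Condition 2
                neigh.append(dens[l])
        if not neigh or dens[b] > max(neigh):           # Condition 3
            peaks.append(b)                             # index of a peak plane
    return peaks
```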

Stage 4—Forming Data Clouds: After all the direction-aware planes containing the modes/peaks of the data density are appointed, SODA takes their origin points, represented by $\boldsymbol{\varphi}^o$, as the focal points and uses them to form data clouds according to a Voronoi tessellation [44]. The rule that forms the data clouds is as follows [9]:

$$\textbf{Condition 4} \quad IF\left(m = \operatorname*{arg\,min}_{j=1,2,\dots,\xi}\left(\frac{d_M(\mathbf{x}_i, \boldsymbol{\varphi}_j^o)}{\overline{d}_M} + \frac{d_A(\mathbf{x}_i, \boldsymbol{\varphi}_j^o)}{\overline{d}_A}\right)\right) \ THEN\left(\mathbf{x}_i \text{ is assigned to the } m\text{th data cloud}\right) \tag{28}$$

where $\xi$ is the number of focal points. The concept of clouds is similar to that of clusters. Nevertheless, there are distinctions, because clouds are non-parametric, do not have a typical shape, and can express any real data distribution following local density criteria [45].
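A sketch of the Voronoi-style assignment of Condition 4, reusing the averaged distances of Stage 1:

```python
import numpy as np

def stage4_assign(X, focal, d2_M, d2_A):
    """Sketch of Stage 4 (Condition 4, Eq. (28)): every sample joins the
    cloud of the nearest focal point, mixing both distance components."""
    dM_bar, dA_bar = np.sqrt(d2_M), np.sqrt(d2_A)
    labels = []
    for x in X:
        scores = []
        for f in focal:
            dm = np.linalg.norm(x - f) / dM_bar
            cos = (x @ f) / (np.linalg.norm(x) * np.linalg.norm(f))
            scores.append(dm + np.sqrt(max(0.0, 1.0 - cos)) / dA_bar)
        labels.append(int(np.argmin(scores)))   # Voronoi-style assignment
    return np.array(labels)
```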

This approach can also evolve its clouds, since their parameters can be updated recursively. For the evolving strategy used in SODA, some actions are essential for new samples to influence changes in the data clouds. Therefore, the following stages, which are related to the stages listed previously, describe the update of the SODA fuzzification method [9]:

Stage 5—Update Problem Parameters: For each new sample $\mathbf{x}_{n+1}$ introduced to training, the parameters $\boldsymbol{\varphi}_n^M$, $X_n^M$, and $\boldsymbol{\varphi}_n^A$ are updated by Equations (8), (9), and (13), respectively. Likewise, the Euclidean components and the angular components between the new sample and the established centers of the direction-aware planes are also updated for each new sample employing Equations (11) and (12); they are denoted by $d_M(\mathbf{x}_{n+1}, \boldsymbol{\varphi}_l)$ and $d_A(\mathbf{x}_{n+1}, \boldsymbol{\varphi}_l)$ for $l = 1, 2, \dots, \Psi$, respectively.

At this point in the procedure, the direction-aware plane projection stage is triggered: a joint analysis of Condition 1 and Equation (18) determines whether the new sample belongs to an existing data cloud (in which case the model parameters are updated based on Equations (19) and (20)) or whether it is a sample with different behavior, creating the demand to produce a new direction-aware plane (in which case the parameters are updated based on Equations (22), (23), and (24)) [9].

Stage 6—The Fusion of Overlapping Direction-Aware Planes: After Stage 5, a new condition is examined in the fuzzification approach to recognize strongly overlapping direction-aware planes [9]:

$$\textbf{Condition 5} \quad IF\left(\frac{d_M(\boldsymbol{\varphi}_i, \boldsymbol{\varphi}_j)}{\overline{d}_M} < \frac{1}{2\vartheta}\right) \ THEN\left(P_i \text{ and } P_j \text{ are strongly overlapping}\right) \tag{29}$$

This situation is not advisable, so it is resolved by merging the analyzed direction-aware planes into a new direction-aware plane centered on $P_j$. This merging criterion is described by [9]:

$$
\Psi = \Psi - 1 \tag{30}
$$

$$\boldsymbol{\varphi}_j = \frac{S_j}{S_j + S_i}\boldsymbol{\varphi}_j + \frac{S_i}{S_j + S_i}\boldsymbol{\varphi}_i \tag{31}$$

$$S_j = S_j + S_i \tag{32}$$

The procedure eliminates the parameters of $P_i$ and returns to Stage 5, repeating until no direction-aware planes remain strongly overlapping, that is, until Condition 5 is no longer satisfied; the process then flows to the final stage of the evolving fuzzification process [9]. This step is essential to the SODA technique, as it safeguards the interpretability of the results.
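A sketch of this merging loop under the same illustrative naming (for brevity, only the magnitude part of Condition 5 is checked, as in Equation (29)):

```python
import numpy as np

def stage6_merge(origins, support, d2_M, theta=6.0):
    """Sketch of Stage 6: merge strongly overlapping planes (Condition 5,
    Eq. (29)) with Eqs. (30)-(32) until no overlapping pair remains."""
    dM_bar = np.sqrt(d2_M)
    merged = True
    while merged:
        merged = False
        for i in range(len(origins)):
            for j in range(len(origins)):
                if i != j and np.linalg.norm(origins[i] - origins[j]) \
                        / dM_bar < 1 / (2 * theta):      # Condition 5
                    si, sj = support[i], support[j]
                    origins[j] = (sj * origins[j] + si * origins[i]) / (si + sj)  # Eq. (31)
                    support[j] = si + sj                                          # Eq. (32)
                    del origins[i], support[i]           # plane i removed, Eq. (30)
                    merged = True
                    break
            if merged:
                break
    return origins, support
```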

Stage 7—Forming Data Clouds: Once all samples are examined, SODA defines the focal points from the centers of the direction-aware planes resulting from Stage 6. First, the global density of the centers of the direction-aware planes is estimated through Equation (7), using the support of each direction-aware plane as the number of repetitions, which yields the global density $D_n^G(\boldsymbol{\varphi}_l)$ [9]. Second, each direction-aware plane is examined using Condition 2 to find its neighboring direction-aware planes. Then, Condition 3 is used to confirm whether the $D_n^G(\boldsymbol{\varphi}_l)$ under investigation is one of the local maxima of $D_n^G(\boldsymbol{\varphi}_l)$; finally, all the identified maxima and the centers of the corresponding direction-aware planes, together with Condition 4, establish the focal points and hence the data clouds [9].

Note that at the first moments of the model's training, the SODA algorithm behaves like the offline version of the algorithm (Stages 1–4). From the subsequent data presented to the model, SODA acts in its evolving state, concentrating primarily on Stages 5–7 of the algorithm [9].

Figure 4 presents an example of the construction of clouds and identification of the centers that guide the construction of the structures resulting from the fuzzification process.

#### 3.1.2. Incremental Feature Weights Learning

This paper adopts a different approach to defining the weights compared with various evolving fuzzy neural network models, which define them randomly. Here, the weights are defined by an online, incremental criterion that evaluates the relevance of the problem features in determining the best separation of the target classes. This proposal increases the reducibility of the fuzzy rules, as irrelevant weights can be discarded based on a visual and operational evaluation, and, at the same time, it brings more interpretability to the Gaussian neurons of the first layer of the model. It should be noted that this procedure has already been successfully applied in evolving fuzzy neural networks, as in [11,46].

**Figure 4.** Example of the SODA approach with 17 clouds.

The approach proposed in [39] uses the Dy–Brodley separability criterion [47] for the classes analyzed in the problem. The criterion is computed either on a single feature (using each feature, and only that one) or by discarding each feature from the complete set, in order to identify how well the feature in question defines the separation of the analyzed classes. This procedure assigns values closer to 1 to features that are more effective at separating the classes [11]. On the other hand, dimensions that are ineffective at identifying labels correctly are assigned values close to zero. Accordingly, if a dimension is nonessential to the model, it can be excluded to reduce interpretation complexity.

The class separability criterion can be represented as an extension of Fisher's separability criterion [11]:

$$J = \delta\left(S_w^{-1} S_b\right) \tag{33}$$

where $\delta(S_w^{-1} S_b)$ is the sum of the diagonal elements (trace) of the matrix $S_w^{-1} S_b$, $S_b$ denotes the between-class scatter matrix (which measures the dispersion of the class means around the total mean), and $S_w$ denotes the within-class scatter matrix (which measures the dispersion of the samples around their class means) [11].

This paper uses the leave-one-feature-out approach: it discards each feature from the complete set in turn and calculates Equation (33) for each resulting subset, obtaining $J_1, \dots, J_N$ for the $N$ features in the data set. The weights can then be expressed by [39]:

$$w_j = 1 - \frac{J_j - \min_{i=1,\dots,N} J_i}{\max_{i=1,\dots,N} J_i} \tag{34}$$

The lower the value of $J_i$ becomes, the more important feature $i$ is, because feature $i$ was discarded. Hence, seeking comparable importance among all features (relative to the most noteworthy feature, which should acquire a weight of 1), the feature weights are assigned by Equation (34) [39].

In particular, $S_b$ can be updated by updating the class-wise and overall means through an incremental mean formula, while recursive covariance update formulas update the covariance matrices per class with rank-1 changes, for more robust and faster convergence to the batch calculation [48]. For all formulas and details, see [39].
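For reference, the following is a batch (non-incremental) sketch of the whole weighting procedure of Equations (33) and (34); the incremental mean and rank-1 covariance updates of [39,48] would replace the batch scatter computations in an online setting, and all names here are illustrative:

```python
import numpy as np

def feature_weights(X, y):
    """Batch sketch of the leave-one-feature-out weights: the separability
    J = delta(Sw^-1 Sb) of Eq. (33) is computed with each feature removed,
    then mapped to [0, 1] by Eq. (34)."""
    def separability(Xs):
        mean_all = Xs.mean(axis=0)
        Sw = np.zeros((Xs.shape[1], Xs.shape[1]))
        Sb = np.zeros_like(Sw)
        for c in np.unique(y):
            Xc = Xs[y == c]
            mc = Xc.mean(axis=0)
            Sw += (Xc - mc).T @ (Xc - mc)        # within-class scatter
            d = (mc - mean_all)[:, None]
            Sb += len(Xc) * (d @ d.T)            # between-class scatter
        return np.trace(np.linalg.pinv(Sw) @ Sb)
    N = X.shape[1]
    J = np.array([separability(np.delete(X, i, axis=1)) for i in range(N)])
    return 1.0 - (J - J.min()) / J.max()         # Eq. (34)
```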
