1. Introduction
As a flexible, interpretable machine learning model, fuzzy neural networks (FNNs) have been widely used in various fields, such as image processing [1], fuzzy control [2,3], ranking challenges, risks and threats [4], practical classification and prediction [5,6,7,8], and so on. One of the most commonly used FNN structures is the Takagi-Sugeno-Kang (TSK) fuzzy system [9], also called the TSK neuro-fuzzy system because it can be represented as a neural network [10,11,12]. The most famous TSK neuro-fuzzy system is the adaptive-network-based fuzzy inference system (ANFIS) [10], which can be regarded as a first-order type-1 TSK FNN. This kind of first-order type-1 TSK FNN can bridge the gap between the linguistic and numerical representation of knowledge: its fuzzy logic component provides a linguistic representation of knowledge, whereas its neural network component provides a numerical one. Thus, it is a powerful tool for addressing practical and theoretical gaps in modeling complex systems. Its ability to handle uncertain and imprecise data, provide interpretable rules, and perform online learning makes it useful in various applications.
Training FNNs is a necessary task. Many training algorithms exist, such as backpropagation [5], the particle swarm algorithm [13], hybrid algorithms [14], and so on [15]. Although evolutionary and hybrid-type algorithms work well, they require considerable running time. As an efficient and commonly used algorithm, backpropagation has become a common scheme for optimizing TSK FNNs. Many convergence results for gradient-based backpropagation algorithms in FNNs [16,17] provide theoretical guarantees for a wide range of applications of this algorithm. Hence, the gradient-based backpropagation algorithm is used in this paper to train the FNN model.
For FNNs, the fuzzy partition of the input space is an important task in structure identification, which affects the number of generated fuzzy rules (FRs). There are two typical partitioning methods: grid-type partition [10] and clustering-based partition [17,18,19,20]. The grid-type partition divides the input space into multiple grids, each representing an FR. It is simple, but the number of generated rules increases dramatically with the number of dimensions. Compared with the grid-type method, the clustering-based partition reduces the number of FRs: it clusters the training vectors in the input space, providing more flexible partitioning. To make the meaning of each cluster center and the matching fuzzy term of each input transparent to users, the formed clusters can be projected onto each dimension of the input space. On each dimension, one input variable corresponds to a projected membership function (MF), and one cluster can be represented by the product of its projected MFs, as sketched below. The clustering-based partition commonly uses K-means or fuzzy c-means as the clustering algorithm; these require the number of clusters to be set in advance, usually equal to the number of classes, which makes the generated fuzzy rule base not expressive enough. If the right number of clusters is chosen, the clustering-based partition can greatly improve the performance of FNNs. Therefore, adaptively generating suitable clusters and thereby gaining a rich fuzzy rule base is important and significant work for FNNs.
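To illustrate this projection, the following minimal sketch (with hypothetical centers and widths of our own choosing) represents each cluster by the product of its per-dimension Gaussian MFs:

```python
import numpy as np

def gaussian_mf(x, center, sigma):
    """Membership degree of x in a Gaussian MF projected onto one dimension."""
    return np.exp(-((x - center) ** 2) / (2.0 * sigma ** 2))

# Hypothetical 2-D cluster centers and per-dimension standard deviations.
centers = np.array([[0.2, 0.8], [0.7, 0.3]])   # one row per cluster (= one rule)
sigmas = np.array([[0.10, 0.20], [0.15, 0.10]])

x = np.array([0.25, 0.75])                      # one input vector

# One cluster is represented by the product of its projected per-dimension MFs.
memberships = np.prod(gaussian_mf(x, centers, sigmas), axis=1)
print(memberships)  # membership of x in each cluster / rule
```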
Redundancy inevitably occurs when enough FRs are yielded in a rich fuzzy rule base. The interpretability of FNNs is mainly reflected in the fuzzy rule base, a collection of FRs in the form of IF-THEN statements. Since too many rules weaken the interpretability of FNNs, reducing the number of rules is necessary and meaningful. Various methods have been proposed to reduce the number of FRs, such as direct extraction techniques from numerical data [21,22], genetic algorithm-based approaches [23,24], and embedded neuro-fuzzy approaches [5,25,26,27]. Among these methods, the embedded neuro-fuzzy approaches perform rule extraction and model evaluation simultaneously, enhancing effectiveness and reducing the computational burden. In [5,27], gate functions are introduced and embedded into the neuro-fuzzy models for fuzzy rule selection, which is effective but requires introducing additional functions. In this paper, we explore another method that performs embedded rule selection without introducing an additional function.
The FNNs mentioned above are based upon type-1 fuzzy sets (FSs), which are precise; thus, these type-1 FNNs struggle to tackle the uncertainty of rules. To handle uncertainties (such as noisy measurements, semantic variations, and so on), type-2 FNNs were presented [28] and have shown excellent performance [29,30,31]. Unfortunately, the type-reduction procedure (such as the Karnik-Mendel iteration [32]) from type-2 to type-1 incurs significant additional computational costs for type-2 FNNs. Moreover, type-2 fuzzy sets can capture neither the notion of variability nor the variation (over time) in the opinions of individual experts and expert groups, called “intraexpert” and “interexpert” variability [33], respectively. To address these problems of type-2 FSs, the notion of nonstationary FSs (NFSs) was introduced in [33], together with nonstationary fuzzy inference systems (NFISs). Different from a type-2 FS, an NFS can be viewed as a collection of multiple type-1 fuzzy sets yielded by perturbing the MFs, without secondary MFs. Therefore, the NFIS has a fundamentally different inference mechanism from the type-2 fuzzy inference system. Combining neural networks with the NFIS, a zero-order nonstationary FNN (NFNN-0) was presented in [17], which can directly address uncertainties and model the “intraexpert” and “interexpert” variability. These previous works have demonstrated the validity and strong robustness of NFSs and nonstationary FNNs. However, the nonlinear mapping ability of NFNN-0 is weak due to its zero-order consequents, so improving its nonlinear representation ability is also valuable work.
In this paper, a first-order sparse TSK nonstationary fuzzy neural network is proposed based on the Mean Shift algorithm and Group Lasso regularization. It combines the learning strategy of a neural network with the logical inference and linguistic expression abilities of a first-order NFIS, making the neural network interpretable and realizing the self-learning of fuzzy rules/sets. It does not face the difficulty of the type-reduction procedures required by the type-2 fuzzy inference mechanism, because it is actually a repetition of a first-order type-1 FNN with slightly different MFs over time. Like NFNN-0, the proposed model can also model the “intraexpert” and “interexpert” variability as well as directly deal with uncertainties. The main contributions of this paper are summarized as follows:
- (i)
To improve the nonlinear representation ability of NFNN-0, we extend the nonstationary fuzzy neural network from zero-order to first-order, and propose a first-order sparse TSK nonstationary fuzzy neural network, called SNFNN-1. SNFNN-1 can significantly improve the performance of NFNN-0. Simulation experiments confirm the effectiveness and robustness of our model.
- (ii)
To adaptively generate a suitable number of clusters and FRs, the Mean Shift algorithm is used to partition the input space and construct the antecedent structure of our SNFNN-1 model. Compared with fuzzy partitions based on K-means or fuzzy c-means clustering, this partitioning method does not need the number of clusters to be set in advance and can provide more effective centers and a more suitable number of MFs. With the gained MFs, a rich fuzzy rule base is subsequently generated.
- (iii)
Considering the redundancy among rules of the rich fuzzy rule base, we add a regularization term, i.e., the Group Lasso term, to the objective function to penalize each rule, thus producing sparsity of rules in a grouped manner. Then, in the rule-consequent structure of SNFNN-1, combined with a rule selection method, the important rules are retained and the useless or inappropriate rules are deleted, so as to achieve rule reduction.
The rest of this paper is organized as follows. Section 2 describes the architecture of the proposed SNFNN-1 model. Section 3 presents supporting experiments. Section 4 draws conclusions and outlines directions for future work.
2. First-Order Sparse TSK Nonstationary Fuzzy Neural Network (SNFNN-1)
In this section, we elaborate on the construction of the first-order sparse TSK nonstationary fuzzy neural network, namely SNFNN-1. There is no need to preset the number of clusters in the fuzzy partition of our model due to the use of the Mean Shift algorithm. Additionally, rule selection is achieved by adding a Group Lasso regularization to the objective function.
Assume that $X = \{x_1, x_2, \ldots, x_n\} \subset \mathbb{R}^D$ is the sample set and $Y = \{y_1, y_2, \ldots, y_n\}$ is the set of corresponding desired outputs, where $x_i$ is the $i$-th sample, $D$ is the number of input dimensions, and $n$ is the number of samples.
Before constructing our model, the number of clusters, together with the corresponding centers and standard deviations, needs to be determined via the Mean Shift algorithm. The specific method is as follows. Given a bandwidth value $h$, take an unlabeled sample as an initial cluster center $c^0$, and then update this center by

$$c^{m+1} = \frac{1}{\left| N_h(c^m) \right|} \sum_{x_j \in N_h(c^m)} x_j, \qquad N_h(c^m) = \left\{ x_j \in X : \left\| x_j - c^m \right\| \le h \right\}, \tag{1}$$

where $m$ denotes the $m$-th iteration. Mark all samples that were once within the bandwidth until the iteration converges; we thereby gain a clustering center $c_s$. Repeat the above operation until all sample points are labeled. Eventually, we obtain an adaptive number of clusters, $S$, as well as $S$ clustering centers $\{c_s\}_{s=1}^{S}$ and a clustering sample set $\{C_s\}_{s=1}^{S}$. According to [34], each standard deviation is calculated by

$$\sigma_d^s = \sqrt{ \frac{1}{\left| C_s \right|} \sum_{x_j \in C_s} \left( x_{jd} - c_d^s \right)^2 }, \tag{2}$$

where $|C_s|$ is the number of samples in the $s$-th cluster, $d = 1, 2, \ldots, D$, and $s = 1, 2, \ldots, S$.
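For concreteness, a minimal sketch of this flat-kernel Mean Shift procedure is given below; the merging tolerance $h/2$ and the helper names are our own choices, not prescribed by the paper:

```python
import numpy as np

def mean_shift(X, h, tol=1e-6, max_iter=300):
    """Flat-kernel Mean Shift: returns cluster centers and per-dimension stds."""
    n = len(X)
    labels = np.full(n, -1)
    centers = []
    for i in range(n):
        if labels[i] != -1:          # sample already claimed by a cluster
            continue
        c = X[i].copy()
        visited = np.zeros(n, dtype=bool)
        for _ in range(max_iter):
            in_band = np.linalg.norm(X - c, axis=1) <= h
            visited |= in_band       # mark samples once within the bandwidth
            new_c = X[in_band].mean(axis=0)        # Equation (1)
            if np.linalg.norm(new_c - c) < tol:
                break
            c = new_c
        # Merge with a nearby existing center, otherwise create a new cluster.
        for s, c_s in enumerate(centers):
            if np.linalg.norm(c - c_s) < h / 2:
                labels[visited] = s
                break
        else:
            labels[visited] = len(centers)
            centers.append(c)
    centers = np.array(centers)
    # Equation (2): per-dimension standard deviation of each cluster.
    sigmas = np.array([X[labels == s].std(axis=0) for s in range(len(centers))])
    return centers, sigmas
```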
For brevity, the structure of a multi-input-single-output SNFNN-1 is constructed, as shown in Figure 1; it is easily extended to the case of multiple outputs. Actually, the proposed SNFNN-1 can be considered an integrated network of $T$ sub-networks, where $T$ is the number of repetitions with variation in center/width or noise [33]. In this paper, the variation in center is taken as an example. Each sub-network is a first-order type-1 FNN based on the Mean Shift algorithm and the Group Lasso, abbreviated as SFNN-1.
In FNNs with clustering-based fuzzy partition, the number of fuzzy rules is commonly the same as the number of clusters, that is, $R = S$. For an input vector $x_i = (x_{i1}, x_{i2}, \ldots, x_{iD})^T$, each first-order type-1 fuzzy rule of the $t$-th SFNN-1 is defined as:

$\mathrm{Rule}^{r,t}$: IF $x_{i1}$ is $A_1^{r,t}$ and $x_{i2}$ is $A_2^{r,t}$ and $\ldots$ and $x_{iD}$ is $A_D^{r,t}$, THEN $y^{r,t}(x_i) = w_0^{r,t} + w_1^{r,t} x_{i1} + \cdots + w_D^{r,t} x_{iD}$,

where $r = 1, 2, \ldots, R$ indexes the $r$-th rule, $R$ represents the total number of rules, and $t = 1, 2, \ldots, T$ indexes the $t$-th sub-network (i.e., SFNN-1). $A_d^{r,t}$ is the fuzzy set associated with the $d$-th feature in the $r$-th rule of the $t$-th sub-network, whose membership function may be triangular, trapezoidal, Gaussian, and so on. In this paper, the widely used Gaussian functions are adopted. The clustering centers generated by the Mean Shift algorithm and the corresponding standard deviations are regarded as the centers and widths of the membership functions, respectively. $w^{r,t} = (w_0^{r,t}, w_1^{r,t}, \ldots, w_D^{r,t})^T \in \mathbb{R}^{D+1}$ is the consequent parameter vector of the $r$-th rule, and $W^t = (w^{1,t}, w^{2,t}, \ldots, w^{R,t})$ collects all consequent parameters; $\mathbb{R}^{D+1}$ denotes the $(D+1)$-dimensional real vector space.
A simple SNFNN-1 model has seven layers; each layer is presented in detail as follows:
Layer 1 (Input layer): each node in this layer indicates an input variable (a crisp variable). All input variables are then fed to the next layer.
Layer 2 (Membership/Fuzzification layer): in this layer, Gaussian membership functions (GMFs) are adopted to produce the membership degrees of the input variables. In order to avoid differentiating a denominator, thereby reducing the amount of calculation and the difficulty of theoretical analysis, we take the reciprocals of the widths of the GMFs as the independent variables, referring to [16]. Hence, each membership value with variation in center is defined as

$$\mu_d^{r,t}(x_{id}) = \exp\left( -\left( b_d^r \right)^2 \left( x_{id} - c_d^r(t) \right)^2 \right), \tag{3}$$

where $d = 1, 2, \ldots, D$, $r = 1, 2, \ldots, R$, $t = 1, 2, \ldots, T$, and $x_{id}$ is the input variable of the $i$-th sample on the $d$-th dimension. $c_d^r(t) = c_d^r + \alpha \sin\left( 2 \pi \beta t / T \right)$ means the center yielded by the perturbation function [33] for the $t$-th SFNN-1, where $c_d^r$ is the benchmark center. $T$, $\alpha$ and $\beta$ denote three hyper-parameters used to cause tiny periodic perturbations in center for simulating the variation in the opinions of expert groups. $b_d^r(t) = b_d^r$ is the reciprocal of width without variation, where $b_d^r$ means the benchmark reciprocal of width. Figure 2 shows an instantiation of a nonstationary fuzzy set (NFS) based on Gaussian functions.
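A small sketch of the nonstationary GMF in Equation (3), assuming the sinusoidal center perturbation written above (the concrete perturbation function is a design choice following [33]; the hyper-parameter values here are hypothetical):

```python
import numpy as np

def nonstationary_gmf(x_d, c, b, t, T, alpha, beta):
    """Membership degree with periodic variation in center, Equation (3)."""
    c_t = c + alpha * np.sin(2.0 * np.pi * beta * t / T)  # perturbed center c(t)
    return np.exp(-(b * (x_d - c_t)) ** 2)

# Example: membership of x_d = 0.5 across the T sub-networks t = 1..T.
T, alpha, beta = 10, 0.05, 1.0
vals = [nonstationary_gmf(0.5, c=0.4, b=2.0, t=t, T=T, alpha=alpha, beta=beta)
        for t in range(1, T + 1)]
```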
Layer 3 (Rule layer): each node of this layer is a rule node representing a fuzzy rule. For the $r$-th rule of the $t$-th SFNN-1, the firing strength based on the product T-norm is calculated by:

$$f^{r,t}(x_i) = \prod_{d=1}^{D} \mu_d^{r,t}(x_{id}). \tag{4}$$
Layer 4 (Normalization layer): for each rule, the normalized firing strength is calculated in this layer. The corresponding strength of the $r$-th rule in the $t$-th SFNN-1, $\bar{f}^{r,t}(x_i)$, is given as

$$\bar{f}^{r,t}(x_i) = \frac{f^{r,t}(x_i)}{\sum_{k=1}^{R} f^{k,t}(x_i)}. \tag{5}$$
Layer 5 (Defuzzification layer): the defuzzification operation is applied in this layer. It multiplies each output conclusion of the first-order fuzzy rules by the corresponding normalized firing strength. The output of the $r$-th node in the $t$-th SFNN-1 is $\bar{f}^{r,t}(x_i)\, y^{r,t}(x_i)$, where $w^{r,t}$ is called the consequent parameter of the $r$-th rule in the $t$-th SFNN-1.
Layer 6 (Summation layer): by summing the outputs of Layer 5, the actual output of each SFNN-1 is yielded in this layer:

$$O^t(x_i) = \sum_{r=1}^{R} \bar{f}^{r,t}(x_i)\, y^{r,t}(x_i). \tag{6}$$
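Combining Layers 2-6, the forward pass of one SFNN-1 sub-network can be sketched as follows; the array shapes and names are ours, with `C` and `B` holding the (possibly perturbed) centers and reciprocal widths of this sub-network:

```python
import numpy as np

def sfnn1_forward(x, C, B, W):
    """Forward pass of one SFNN-1 sub-network.
    x: (D,) input; C, B: (R, D) centers / reciprocal widths; W: (R, D+1) consequents.
    """
    mu = np.exp(-(B * (x - C)) ** 2)        # Layer 2: membership degrees, Eq. (3)
    f = mu.prod(axis=1)                      # Layer 3: firing strengths, Eq. (4)
    f_bar = f / f.sum()                      # Layer 4: normalization, Eq. (5)
    y_r = W @ np.concatenate(([1.0], x))     # first-order consequents y^{r,t}(x)
    return (f_bar * y_r).sum()               # Layers 5-6: defuzzify and sum, Eq. (6)
```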
Layer 7 (Output/Integration layer): through ensemble learning (such as an averaging or voting mechanism), this layer synthetically considers all the outputs of the $T$ sub-networks. Due to the varying centers of the GMFs over time, SNFNN-1 is in fact a set of shifted versions of a type-1 SFNN-1.
Remark 1. The constructed nonstationary seven-layer network can be regarded as a special kind of ensemble network. Thus, it performs better than a simple type-1 FNN, especially in terms of robustness, as illustrated by the experiments of Section 3.3.

To realize the rule selection (RS), the Group Lasso regularization is added to the objective function. Compared with other Lasso regularizations [35,36,37], it can induce row or column sparsity, thus producing sparsity of rules in a grouped manner, which provides the possibility for rule selection [38]. Therefore, the following objective function of each SFNN-1 contains two parts, that is, the mean square error (MSE) and the Group Lasso penalty term:

$$E(\mathbf{w}) = \frac{1}{2n} \sum_{i=1}^{n} \left( O^t(x_i) - y_i \right)^2 + \lambda \sum_{r=1}^{R} \left\| w^{r,t} \right\|, \tag{7}$$

where $\mathbf{w} = (\mathbf{c}, \mathbf{b}, W^t)$ is the weight vector containing all parameters, $\mathbf{c} = (c_d^r)_{D \times R}$ collects all centers, $\mathbf{b} = (b_d^r)_{D \times R}$ collects all reciprocals of widths, $W^t$ collects all consequent parameters, $\| \cdot \|$ denotes the Euclidean norm, and $\lambda > 0$ is the hyper-parameter of the penalty term.
The gradients of Equation (7) with respect to $\mathbf{w}$ are calculated by

$$\frac{\partial E}{\partial w^{r,t}} = \frac{1}{n} \sum_{i=1}^{n} e_i\, \bar{f}^{r,t}(x_i)\, \tilde{x}_i + \lambda\, \frac{w^{r,t}}{\left\| w^{r,t} \right\|}, \tag{8}$$

$$\frac{\partial E}{\partial c_d^{r}} = \frac{2}{n} \sum_{i=1}^{n} e_i \left( y^{r,t}(x_i) - O^t(x_i) \right) \bar{f}^{r,t}(x_i) \left( b_d^{r} \right)^2 \left( x_{id} - c_d^{r}(t) \right), \tag{9}$$

$$\frac{\partial E}{\partial b_d^{r}} = -\frac{2}{n} \sum_{i=1}^{n} e_i \left( y^{r,t}(x_i) - O^t(x_i) \right) \bar{f}^{r,t}(x_i)\, b_d^{r} \left( x_{id} - c_d^{r}(t) \right)^2, \tag{10}$$

where $e_i = O^t(x_i) - y_i$ and $\tilde{x}_i = (1, x_{i1}, \ldots, x_{iD})^T$, $i = 1, 2, \ldots, n$; the subgradient $w^{r,t} / \| w^{r,t} \|$ is taken as $\mathbf{0}$ when $w^{r,t} = \mathbf{0}$.
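As a sketch, the objective (7) and its subgradient with respect to the consequent parameters, Equation (8), can be computed as below; the antecedent gradients (9) and (10) are analogous and omitted here:

```python
import numpy as np

def objective_and_grad_W(Xs, ys, C, B, W, lam, eps=1e-12):
    """Objective (7) and its subgradient w.r.t. the consequents W, Eq. (8)."""
    n = len(Xs)
    grad = np.zeros_like(W)
    sq_err = 0.0
    for x, y in zip(Xs, ys):
        mu = np.exp(-(B * (x - C)) ** 2)          # Eq. (3)
        f_bar = mu.prod(axis=1)
        f_bar = f_bar / f_bar.sum()               # Eq. (5)
        x_tilde = np.concatenate(([1.0], x))
        e = f_bar @ (W @ x_tilde) - y             # O(x_i) - y_i
        sq_err += e ** 2
        grad += e * np.outer(f_bar, x_tilde)      # MSE part of Eq. (8)
    norms = np.linalg.norm(W, axis=1)
    obj = sq_err / (2 * n) + lam * norms.sum()    # Eq. (7)
    grad = grad / n + lam * W / np.maximum(norms, eps)[:, None]  # Group Lasso part
    return obj, grad
```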
For an initial weight vector $\mathbf{w}^0$, the updating formula of SNFNN-1 based on the typical gradient descent method is as follows:

$$\mathbf{w}^{m+1} = \mathbf{w}^m - \eta\, \nabla E(\mathbf{w}^m), \quad m = 0, 1, 2, \ldots, \tag{11}$$

where $m$ denotes the $m$-th iteration, and $\eta > 0$ is the learning rate.
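Together with the stopping criteria listed in Algorithm 1 (maximum iterations $M$ and threshold $\varepsilon$), the update (11) amounts to the following loop; `grad_fn` is a stand-in for the gradients (8)-(10) stacked into one vector:

```python
import numpy as np

def train(w0, grad_fn, eta, M, eps):
    """Plain gradient descent, Equation (11), with the stop criteria of Algorithm 1."""
    w = w0
    for m in range(M):
        g = grad_fn(w)
        w = w - eta * g                 # Equation (11)
        if np.linalg.norm(g) < eps:
            break                        # converged
    return w
```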
To select the appropriate rules, we introduce a threshold $\theta$. The rules are selected in the following way:

$$\mathrm{Rule}^r \text{ is } \begin{cases} \text{retained}, & \text{if } \left\| w^{r,t} \right\| \ge \theta, \\ \text{deleted}, & \text{if } \left\| w^{r,t} \right\| < \theta, \end{cases} \tag{12}$$

where

$$\theta = \rho \max_{1 \le k \le R} \left\| w^{k,t} \right\|, \tag{13}$$

and $\rho \in (0, 1)$ is the hyper-parameter for rule selection.
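On the learned consequent matrix `W` (one row per rule), the selection step (12)-(13), in the thresholded form given above, reduces to a sketch like:

```python
import numpy as np

def select_rules(W, rho):
    """Keep rules whose consequent-vector norm reaches a fraction rho of the maximum."""
    norms = np.linalg.norm(W, axis=1)   # group norm of each rule, ||w^r||
    theta = rho * norms.max()           # Eq. (13)
    return norms >= theta               # Eq. (12): boolean mask of retained rules

# Usage: keep = select_rules(W, rho); W, C, B = W[keep], C[keep], B[keep]
# before retraining (Step 8 of Algorithm 1).
```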
To concisely construct the SNFNN-1 model, we first train an SFNN-1 network according to the above updating and rule selection methods. This trained network is then used as the baseline network to generate $T$ sub-networks according to the perturbation function, taking the variation in center as an example in this paper. Next, the consequent parameters of each sub-network are fine-tuned, while the centers and the reciprocals of widths are not retrained. Finally, we gain the outputs of all sub-networks and comprehensively consider these outputs to yield the final output. Algorithm 1 summarizes the training procedure of SNFNN-1. Note that the fine-tuning process of Step 12 can be parallelized to save time.
Algorithm 1: Training procedure of SNFNN-1

- Input: The training sample set $X$ and its labels $Y$; the bandwidth $h$ of the Mean Shift algorithm; the initial consequent parameters $W^0$; the learning rate $\eta$; the penalty parameter $\lambda$; the maximum number of iterations $M$ and the stop threshold $\varepsilon$ for the model to converge; the hyper-parameter $\rho$ for rule selection; the hyper-parameters $T$, $\alpha$, and $\beta$ of the periodic perturbation function.
- Output: The SNFNN-1 model and the final results.
- 1: Adopt the Mean Shift algorithm to partition the input space and generate $S$ cluster centers as the initial centers of the membership functions, $c_s$, where $s = 1, 2, \ldots, S$.
- 2: Calculate all widths $\sigma_d^s$ via Equation (2), where $d = 1, 2, \ldots, D$. Then gain the initial reciprocals of widths $b_d^s = 1 / \sigma_d^s$.
- 3: Define the initial weight vector $\mathbf{w}^0$, including all initial centers, reciprocals of widths, and consequent parameters $W^0$.
- 4: Calculate the outputs of each layer of an SFNN-1 sub-network in turn by using Equations (3)–(6). Here, we can think of $R = S$.
- 5: According to Equation (7), calculate the objective function of this SFNN-1.
- 6: Train this model based on the gradient descent method (11) until it converges.
- 7: By (13), calculate the threshold $\theta$. Use (12) to select the important rules.
- 8: Retrain the SFNN-1 by using the selected rules until it converges.
- 9: Use this retrained SFNN-1 as the benchmark sub-network of SNFNN-1. The retrained centers and reciprocals of widths of this base model are regarded as the benchmark centers $c_d^r$ and the benchmark reciprocals of widths $b_d^r$, respectively. The retrained consequent parameters are denoted by $W^*$.
- 10: By Equation (3), we get $T$ nonstationary MFs of SNFNN-1. Here, the perturbation function adopts the variation in center [33] to generate $T$ various centers and $T$ identical reciprocals of widths, separately denoted as $c_d^r(t)$ and $b_d^r(t) = b_d^r$, where $t = 1, 2, \ldots, T$.
- 11: Let $W^t = W^*$, where $t = 1, 2, \ldots, T$. Then, according to the $T$ nonstationary MFs and $W^t$, construct the whole SNFNN-1 model, which owns $T$ SFNN-1 sub-networks. For the various sub-networks of SNFNN-1, calculate their outputs of each layer based on the various centers, the identical reciprocals of widths, and the identical consequent parameters.
- 12: For each SFNN-1 sub-network, only fine-tune the consequent parameters by using the gradient descent method, while the centers and widths are no longer trained.
- 13: Via ensemble learning, generate the final result of SNFNN-1 after comprehensively considering all outputs of the $T$ sub-networks.
- 14: Return the whole SNFNN-1 model and its final results.
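Finally, the ensemble of Step 13 (Layer 7) can be sketched by reusing the `sfnn1_forward` sketch above, with averaging as one of the mechanisms mentioned in the text:

```python
import numpy as np

def snfnn1_predict(x, Cs, B, Ws):
    """Average the outputs of the T SFNN-1 sub-networks (Layer 7 / Step 13).
    Cs: T perturbed center matrices; Ws: T fine-tuned consequent matrices.
    Assumes sfnn1_forward from the earlier sketch is in scope.
    """
    outs = [sfnn1_forward(x, C_t, B, W_t) for C_t, W_t in zip(Cs, Ws)]
    return float(np.mean(outs))  # ensemble via averaging
```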