**1. Introduction**

With the rapid development of powerful computing environments and rich data sources, artificial intelligence (AI) technologies such as neural networks [1–3], adaptive filtering [4–6] and evolutionary algorithms [7–9] have become increasingly applicable to forecasting problems in various scenarios, such as medicine [10–12], economics [13–15] and electronic engineering [16–18]. These methods have acquired high reputations due to their great approximation abilities.

Although AI methods perform well when solving real-world problems, most corresponding models adopt the mean squared error (MSE) as the criterion for training hidden nodes or building the cost functions, assuming that the data satisfy a Gaussian distribution. Moreover, the MSE is a global similarity measure in which all the samples in the joint space have the same contribution [19]. Therefore, the MSE is likely to be badly affected by the noise and outliers hiding in the samples, which commonly occur in applications such as speech signals, images, real-time traffic signals and electronic signals from ill-conditioned devices [20–22]. Consequently, MSE-based models are likely to perform poorly in real-world applications.

To conquer the weaknesses of least mean squares (LMS), over the past decades, a number of studies have proposed methods to improve the robustness of the model against the noise and outliers contained in the data [23–27]. Among the existing technologies, M-estimators have been the focus of many academic studies. By detecting the potential outliers during training, the M-estimator can eliminate the negative influences on the output weights that adversely affect the predictions [28]. Using these advantages, Zhou et al. [29] proposed a novel data-driven standard least-squares support vector regression (LSSVR) applying the M-estimator, which reduces the interference of outliers and enhances the robustness. However, it is difficult to access clean learning data without noise, so the application of M-estimator-based forecasting models is limited.

Recently, information theoretic learning (ITL) has drawn considerable attention due to its good performance in avoiding the effect of noise and outliers [30–35], and it has become an effective alternative to the MSE criterion. In [36], the authors presented a novel training criterion based on the minimum error entropy (MEE) to replace the MSE. By taking advantage of the higher-order description provided by entropy, MEE has become superior for non-Gaussian signal processing compared with traditional training criteria. Inspired by entropy and the Parzen kernel estimator, Liu et al. [37] proposed an extended definition of the correlation function for random processes using a generalized correlation function, known as correntropy. Different from global measurements such as the mean squared error (MSE), the correntropy is regarded as a local similarity measurement whose value is primarily determined by the kernel function along the x = y line [38], leading to high robustness against noise and outliers. Moreover, the correntropy has many useful properties such as symmetry, nonnegativity and boundedness. Most of all, it is easy to form convex cost functions based on the correntropy, which is very convenient for training the models [39–42]. Therefore, the correntropy has been widely used in forming robust models [43–45].

To enhance the forecasting ability of the model, in [46], the correntropy was introduced into the affine projection (AP) algorithm to overcome the degradation of the identification performance in impulsive noise environments. The simulation results verify that the proposed algorithm achieves better performance than other methods. Another approach to improving robustness via the correntropy is enhancing the feature selection efficiency [47–49]. In [50], kernel modal regression and gradient-based variable identification were integrated using the maximum correntropy criterion, which guarantees the robustness of the algorithm. Additionally, in [51], a novel principal component analysis (PCA) based on the correntropy, known as the correntropy-optimized temporal PCA (CTPCA), was adopted to enhance the robustness of outlier rejection, which improves the model training in the simulation experiments. In addition to providing feature extraction for neural networks and filtering methods, the correntropy turns out to be a powerful tool for developing robust training methods that generate and adjust the weights in the model. In [52], Wang et al. introduced a feedback mechanism using the kernel recursive maximum correntropy to provide a novel kernel adaptive filter known as the kernel recursive maximum correntropy with multiple feedback (KRMC-MF). The experiments show that the generated filters have high robustness against outliers. In [53], Ahmad et al. proposed the correntropy-based conjugate gradient backpropagation (CCG-BP), which achieves high robustness in environments with both impulsive noise and heavy-tailed noise distributions. Unfortunately, most neural networks have to adjust the weights of each node during each training iteration, which wastes time during the training process.

Recently, forecasting models with parameters that are free from adjustments have gained increasingly more attention due to their fast training speeds [54–56]. Combined with the correntropy, these algorithms have shown great potential in real-world applications. For example, Guo et al. [57] developed a novel training method for echo state networks (ESNs) based on a correntropy induced loss function (CLF), which provides robust predictions for time-series signals. Similar to ESNs, extreme learning machines (ELMs) have received great attention for their fast learning due to the random assignment of the hidden layer and their simple structure as single-hidden-layer feedforward networks (SLFNs) [58–60]. It has been proven that the hidden nodes can be assigned with any continuous probability distribution while the model retains universal approximation and classification capacity [61]. In particular, the extreme learning machine has been applied and has received a high reputation for predicting production processes [62,63], system anomalies [64], etc. [65]. In [66], the authors first developed a correntropy-based ELM that uses the regularized correntropy criterion in place of the MSE with half-quadratic (HQ) optimization, called the regularized correntropy criterion extreme learning machine (RCC-ELM). Later, Chen et al. [67] extended the dimensions of the correntropy by combining two kinds of correntropy to enhance the flexibility of the model and generate a more robust ELM, called the maximum mixture correntropy criterion ELM (MMCC-ELM). The experimental results show that this learning method performs better than the conventional maximum correntropy method. Although the RCC-ELM and MMCC-ELM possess high robustness compared with other ELM methods, the corresponding correntropy is constrained to no more than two kernels. Moreover, the kernel bandwidths must be assigned by users in advance, which is likely to degrade the model when the resulting correntropy improperly describes the probability distribution of the signal.

To conquer the weaknesses of the existing correntropy-based ELMs, this paper focuses on providing a more robust prediction model with adaptive generation based on a multi-kernel correntropy, which brings an accurate description of the current errors of the ELM. This study develops a more flexible and robust forecasting ELM based on a newly developed adaptive multi-dimension correntropy using evolving cooperation. In the proposed method, the output weights of the ELM are trained based on the maximum multi-dimension correntropy with no constraints on the number of kernels. To achieve the most appropriate assignment of the parameters of each kernel in the correntropy, a novel evolving cooperation method is developed to concurrently optimize the bandwidths and the corresponding influence coefficients to achieve the best estimation of the residual errors of the model. Furthermore, the training approach is developed based on the properties of the multi-dimension correntropy. The main contributions of the paper can be summarized as follows.


The experiments compare the performance of the proposed method and several state-of-the-art methods using both simulated data and real-world data, which show that the proposed method obtains more robust predictions than the other methods. Finally, the proposed method is incorporated into the forecasting model for the current transfer ratio (CTR) signals of optical couplers, where it achieves high accuracy and robustness.

The rest of the paper is organized as follows. The next section introduces the framework of the proposed method and the multi-dimension correntropy. Section 3 describes the evolved cooperation for the kernels of the multi-dimension correntropy and Section 4 provides the training procedures of the forecasting model. Then, Section 5 evaluates the performance of the proposed method using both simulation data and real-world applications. Finally, the conclusion is drawn in Section 6.

## **2. The Framework of the Proposed Method**

The structure of the prediction model that is built using the proposed method is similar to those of other ELM-based methods. Figure 1 shows the basic structure of the method. Generally, the network includes one input layer, one hidden layer and one output layer. The hidden output is calculated using the given input vectors and the weights and the biases of the hidden nodes which are randomly assigned [54]:

$$h = f(\mathbf{zw} + \mathbf{b}) \tag{1}$$

where *f*(·) is the activation function and (*w*, *b*) are the weights and biases of the hidden nodes.

With the hidden layer, the network can approximate any kind of function by generating the output weights with the least mean squares (LMS). The cost function is calculated as follows [58]:

$$\mathbf{J}\_{\rm LS} = \|\mathbf{Y} - \mathbf{T}\|\tag{2}$$

where **T** is the expected output and **Y** is the predicted output of the model. **Y** is calculated with the hidden outputs **h** and the output weights β as follows:

$$\mathbf{Y} = \beta \mathbf{h} \tag{3}$$

Therefore, the output weights are calculated as follows:

$$\beta = \left(H^T H\right)^{-1} H^T T \tag{4}$$

Further, to constrain the output weights, they are calculated as follows:

$$\beta = \left(H^T H + \lambda I\right)^{-1} H^T T \tag{5}$$

where λ is the regularization coefficient.
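As a point of reference for the later robust variants, a minimal NumPy sketch of this baseline ELM (random hidden layer, Equation (1); predictions, Equation (3); ridge-regularized output weights, Equation (5)) might look as follows; the sigmoid activation and all names and defaults are illustrative assumptions rather than the authors' implementation:

```python
import numpy as np

def train_elm(X, T, n_hidden=50, lam=1e-3, rng=np.random.default_rng(0)):
    """Baseline ELM sketch: random hidden layer (Eq. (1)) and
    ridge-regularized least-squares output weights (Eq. (5))."""
    W = rng.standard_normal((X.shape[1], n_hidden))   # random input weights
    b = rng.standard_normal(n_hidden)                 # random biases
    H = 1.0 / (1.0 + np.exp(-(X @ W + b)))            # sigmoid hidden outputs, Eq. (1)
    # Eq. (5): beta = (H^T H + lam*I)^(-1) H^T T
    beta = np.linalg.solve(H.T @ H + lam * np.eye(n_hidden), H.T @ T)
    return W, b, beta

def predict_elm(X, W, b, beta):
    """Predicted output of the model, Eq. (3)."""
    H = 1.0 / (1.0 + np.exp(-(X @ W + b)))
    return H @ beta
```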

Although the output weights calculated by Equation (4) or Equation (5) can provide good predictions on the training data, the model suffers from the outliers and noise in the data, which negatively affect the predictions. To overcome this problem, the correntropy, as a high-order similarity measurement, has been used in some recently developed methods.

In [66], the cost function is built using the correntropy as follows:

$$J_{\rm RCC} = \max_{\beta} \left[ \sum_{p=1}^{N} G(t_p - h_p\beta) - \lambda \|\beta\| \right] \tag{6}$$

where G(t<sub>p</sub> − h<sub>p</sub>β) is the Gaussian kernel, which is calculated as follows:

$$G(t_p - h_p\beta) = \exp\left(-\frac{(t_p - h_p\beta)^2}{2\sigma^2}\right) \tag{7}$$

where σ is the bandwidth of the kernel.

Therefore, the output weights are calculated as follows:

$$\beta = \left(H^T \Lambda H - \lambda I\right)^{-1} H^T \Lambda T \tag{8}$$

where Λ is the diagonal matrix formed by the local optimal solutions of the half-quadratic optimization, which are calculated as follows:

$$\alpha_p^{\tau+1} = -G(t_p - h_p\beta) \tag{9}$$

To further improve the flexibility of the correntropy, the cost function with a mixed correntropy is defined in [67] as follows:

$$\mathbf{J}\_{\rm MMCC} = 1 - \frac{1}{\mathbf{N}} \sum\_{\mathbf{i}=1}^{\mathbf{N}} \left[ \alpha \mathbf{G}\_{\sigma\_1}(\mathbf{e}\_{\mathbf{i}}) + (1 - \alpha) \mathbf{G}\_{\sigma\_2}(\mathbf{e}\_{\mathbf{i}}) \right] + \lambda \left\| \boldsymbol{\beta} \right\|\tag{10}$$

Therefore, the output weights are calculated as follows:

$$\beta = \left(H^T \Lambda H + \lambda' I\right)^{-1} H^T \Lambda T \tag{11}$$

where λ' = 2Nλ and Λ is the diagonal matrix with elements calculated as follows:

$$\Lambda_{ii} = \frac{\alpha}{\sigma_1^2}\, G_{\sigma_1}(e_i) + \frac{1 - \alpha}{\sigma_2^2}\, G_{\sigma_2}(e_i) \tag{12}$$
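To make this half-quadratic style reweighting concrete, the following sketch computes diagonal weights in the spirit of Equation (12) and the weighted ridge solution of Equation (11); it assumes a one-dimensional target vector and is an illustrative sketch, not the authors' implementation:

```python
import numpy as np

def gaussian_kernel(e, sigma):
    """G_sigma(e) = exp(-e^2 / (2 sigma^2)), Eq. (7)."""
    return np.exp(-(e ** 2) / (2.0 * sigma ** 2))

def mmcc_weighted_solution(H, T, beta, sigma1, sigma2, alpha, lam):
    """One reweighting step: samples with large residuals get small weights."""
    e = T - H @ beta                                              # current residuals
    w = (alpha / sigma1 ** 2) * gaussian_kernel(e, sigma1) \
        + ((1 - alpha) / sigma2 ** 2) * gaussian_kernel(e, sigma2)  # Eq. (12)
    Lam = np.diag(w)
    lam_prime = 2 * len(T) * lam                                  # lambda' = 2*N*lambda
    # Eq. (11): beta = (H^T Lam H + lambda' I)^(-1) H^T Lam T
    return np.linalg.solve(H.T @ Lam @ H + lam_prime * np.eye(H.shape[1]),
                           H.T @ Lam @ T)
```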


**Figure 1.** The structure of the prediction model.

With two kernels, Equation (10) gives a more accurate estimation of the cost of the output layer, leading to higher robustness of the model. Although Equations (6) and (10) can acquire better local similarity measurements compared with Equation (2), both criteria limit the correntropy to at most two kernels, leading to an inappropriate description of the probability distribution of the data. Additionally, the bandwidths and the coefficients must be assigned by users, thus limiting the performance of the corresponding model in real-world applications, which can be badly affected when the bandwidths are not suitable for the estimation of the correntropy. To provide a more flexible criterion for the training strategy with a more appropriate description of the probability distribution of the data, the proposed method develops a multi-kernel correntropy criterion that is calculated as follows:

$$\mathbf{k}(\mathbf{T} - \beta \mathbf{H}) = \sum\_{i=1}^{K} \alpha\_{i} \mathbf{G}\_{\sigma\_{i}}(\mathbf{T} - \beta \mathbf{H}) \tag{13}$$

where α<sub>i</sub> is the influence coefficient controlling the weight of each kernel. By using multiple kernels to construct the correntropy, the proposed method brings a more accurate approximation of the probability distribution of the samples, leading to a high prediction performance of the model. Based on the correntropy in Equation (13), the proposed method builds a convex cost function for training the output weights, which is analyzed in Section 4. For the suitable assignment of the parameters in Equation (13), a novel generation strategy using an evolved cooperating process based on SDPSO with the MIE is developed to generate the parameters adaptively. Therefore, the framework of the proposed method can be summarized in Figure 2. The proposed method develops an evolved-cooperation strategy to generate the optimized solution of the influence coefficients and the bandwidths, which suits the distribution of the prediction errors. To achieve an accurate estimation, the bandwidths are generated based on switching delayed particle swarm optimization (SDPSO) [68] and the influence coefficients are calculated based on the cost function for estimating the probability distribution function of the errors.
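As a small illustration of Equation (13), the multi-kernel correntropy of a residual vector can be estimated empirically as follows; the averaging over samples is an assumption about the empirical estimator, and all names are illustrative:

```python
import numpy as np

def multi_kernel_correntropy(e, sigmas, alphas):
    """Empirical multi-kernel correntropy (Eq. (13)): a weighted sum of
    Gaussian kernels with bandwidths `sigmas` and influence coefficients `alphas`."""
    e = np.asarray(e, dtype=float)
    value = 0.0
    for sigma, alpha in zip(sigmas, alphas):
        value += alpha * np.mean(np.exp(-(e ** 2) / (2.0 * sigma ** 2)))
    return value

# Small residuals give a value close to the sum of the influence coefficients.
print(multi_kernel_correntropy([0.01, -0.02, 0.0], sigmas=[0.5, 2.0], alphas=[0.6, 0.4]))
```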

The basic procedures of the method are as follows. Supposing that the input vector of the samples is represented as *x* = {x1, x2, ... , xN}, calculate the output of the hidden nodes with randomly assigned weights and biases using Equation (1). Then, adopt the cooperating evolution technology for training the output weights. For each iteration of the evolution, the output of the prediction model is generated using Equation (3). Comparing the predicted outputs with the actual outputs yields the current error *e* of the model. Based on the current error *e*, the proposed method makes the best assignment of the bandwidths in the correntropy with SDPSO and obtains the optimal coefficients based on the MIE, as shown in the next section. Using the generated correntropy, the diagonal kernel weights can be calculated, which drive the update of the output layer to reach higher accuracy, as presented in Section 4. The process stops when the cost function of the model is stable.

**Figure 2.** The framework of the proposed method.

More details are presented in the next section.

#### **3. The Cooperating Evolution Process for the Bandwidth and Influence Coefficients of the Kernel**

For the correntropy defined by Equation (13), the bandwidths and the influence coefficients are crucial for the similarity measurement, since the bandwidths act as the zoom lens of the measurement and the coefficients determine the effect that each kernel has on the estimation of the correntropy according to the assigned bandwidth. They are defined as follows:

$$\boldsymbol{\sigma} = \{\sigma\_1, \sigma\_2, \sigma\_3, \dots, \sigma\_M\} \tag{14}$$

$$A = \left\{ \alpha_1, \alpha_2, \alpha_3, \dots, \alpha_M \right\} \tag{15}$$

Therefore, the bandwidths and the influence coefficients should be carefully assigned to match the probability distribution of the samples in order to achieve the best effect of the correntropy on generating the output weights of the prediction model. Since the correntropy depicts the probability distribution of the distance between the actual output and the model response, the bandwidths and the coefficients are able to form the probability density function (pdf) as follows:

$$\hat{f}(e) = \sum_{i=1}^{M} \alpha_i\, G_{\sigma_i}(e) \tag{16}$$

In applications, the real joint probability distribution is unknown. Therefore, the pdf can only be estimated from a finite number of samples {(t<sub>i</sub>, y<sub>i</sub>)}, i = 1, 2, ... , N:

$$f(e) = \frac{1}{N}\, g\left(\left\{ (t_k, y_k) \,:\, |t_k - y_k| = e \right\}\right) \tag{17}$$

where g(**S**) is the cardinal number of the set **S**.

Using the contrast between the pdf estimated with the assigned parameters and the pdf estimated from the data, the mean integrated error (MIE) can be calculated as follows:

$$\text{MIE} = \text{E}(\int (\hat{f}(e) - f(e))^2 de) \tag{18}$$

Based on the MIE, the performance of the bandwidth and coefficients can be estimated using the contrasts with the pdf from the data. Therefore, the optimization of these parameters can be transformed to finding the solution with the minimum MIE.

In the proposed method, the switching delayed particle swarm optimization is adopted to search for the best bandwidths. To achieve this, the particles are initialized with a list of potential bandwidth settings σ<sub>c</sub> = {σ<sub>c,1</sub>, σ<sub>c,2</sub>, ... , σ<sub>c,N</sub>}. With respect to each bandwidth of the particle, the velocities for the evolution of the particles are defined as follows:

$$v\sigma_c = \left\{ v\sigma_{c,1}, v\sigma_{c,2}, \dots, v\sigma_{c,N} \right\} \tag{19}$$

Meanwhile, the influence coefficients of particle c are denoted as the vector *A*<sub>c</sub>:

$$A\_{\mathfrak{c}} = \left\{ \alpha\_{\mathfrak{c},1}, \alpha\_{\mathfrak{c},2}, \alpha\_{\mathfrak{c},3}, \dots, \alpha\_{\mathfrak{c},M} \right\} \tag{20}$$

where α<sub>c,i</sub> is the influence coefficient corresponding to σ<sub>c,i</sub>.

Since the samples provide discrete values of the outputs, the pdf from the data is estimated using the discrete version of Equation (17):

$$F = \{ \mathbf{f}(m\_1), \mathbf{f}(m\_2), \dots, \mathbf{f}(m\_k) \} \tag{21}$$

$$f(m) = \frac{1}{N}\, g\left(\left\{ (t_k, y_k) \,:\, m - \varepsilon \le |t_k - y_k| \le m + \varepsilon \right\}\right) \tag{22}$$

where the vector **m** = {m<sub>1</sub>, m<sub>2</sub>, ... , m<sub>k</sub>} is a list of values that satisfy m<sub>1</sub> < m<sub>2</sub> < ... < m<sub>k</sub> and |m<sub>i</sub> − m<sub>i−1</sub>| = ε, where ε is the step length of the estimation.

Accordingly, the values of Equation (16) with respect to **m** are equivalent to the following set:

$$\hat{F} = \left\{ \hat{f}(m_1), \hat{f}(m_2), \dots, \hat{f}(m_k) \right\} \tag{23}$$

They can be calculated as:

$$\hat{F} = AK \tag{24}$$

where *K* is the kernel matrix, which is as follows:

$$\mathbf{K} = \begin{bmatrix} \mathbf{G}\_{\sigma\_1}(\mathbf{e}\_1) & \mathbf{G}\_{\sigma\_1}(\mathbf{e}\_2) & \dots & \mathbf{G}\_{\sigma\_1}(\mathbf{e}\_N) \\ \mathbf{G}\_{\sigma\_2}(\mathbf{e}\_1) & \mathbf{G}\_{\sigma\_2}(\mathbf{e}\_2) & \dots & \mathbf{G}\_{\sigma\_2}(\mathbf{e}\_N) \\ \vdots & \vdots & \ddots & \vdots \\ \mathbf{G}\_{\sigma\_M}(\mathbf{e}\_1) & \mathbf{G}\_{\sigma\_M}(\mathbf{e}\_2) & \dots & \mathbf{G}\_{\sigma\_M}(\mathbf{e}\_N) \end{bmatrix} \tag{25}$$

By inserting Equations (21) and (24) into Equation (18), the following cost function can be obtained:

$$\text{MIE} = (\mathbf{A}\mathbf{K} - \mathbf{F})(\mathbf{A}\mathbf{K} - \mathbf{F})^{\text{T}} \tag{26}$$

Then, setting the derivative with respect to **A** to zero gives the following:

$$2(AK - F)K^T = 0 \tag{27}$$

Therefore, the coefficients can be calculated for the assigned bandwidths as follows:

$$\mathbf{A} = \mathbf{F} \mathbf{K}^T (\mathbf{K} \mathbf{K}^T)^{-1} \tag{28}$$
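The following sketch puts Equations (22), (25), (26) and (28) together for a fixed set of bandwidths. It assumes, for shape consistency, that both the kernel matrix and the empirical pdf are evaluated at the residual points themselves; the function names, the grid step and the small ridge added for numerical stability are illustrative assumptions:

```python
import numpy as np

def empirical_pdf(abs_err, points, eps):
    """Discrete empirical pdf (Eq. (22)): fraction of samples whose absolute
    error lies within eps of each evaluation point."""
    return np.array([np.mean(np.abs(abs_err - m) <= eps) for m in points])

def fit_influence_coefficients(errors, sigmas, eps=0.05):
    """Closed-form influence coefficients (Eq. (28)) and the resulting MIE (Eq. (26))."""
    e = np.asarray(errors, dtype=float)
    F = empirical_pdf(np.abs(e), np.abs(e), eps)                           # empirical pdf
    K = np.vstack([np.exp(-(e ** 2) / (2.0 * s ** 2)) for s in sigmas])    # Eq. (25), M x N
    A = F @ K.T @ np.linalg.inv(K @ K.T + 1e-10 * np.eye(len(sigmas)))     # Eq. (28)
    mie = float((A @ K - F) @ (A @ K - F))                                 # Eq. (26)
    return A, mie
```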

Since each particle contains one solution for the kernel parameters, the personal best solution pσ and the global best solution gσ are updated by minimizing the cost. Then, the particles are updated as follows:

$$v\sigma_c(k+1) = w\, v\sigma_c(k) + c_1(k)\, r_1 \left(p\sigma(k - \tau_1(k)) - \sigma_c(k)\right) + c_2(k)\, r_2 \left(g\sigma(k - \tau_2(k)) - \sigma_c(k)\right) \tag{29}$$

$$\sigma_c(k+1) = \sigma_c(k) + v\sigma_c(k+1) \tag{30}$$

where c<sub>1</sub>(k) and c<sub>2</sub>(k) are the acceleration coefficients and τ<sub>1</sub>(k) and τ<sub>2</sub>(k) are the time delays. All the parameters are adjusted based on the evolution factor E<sub>f</sub>, which determines the evolutionary state and is calculated as follows:

$$E_f = (d_g - d_{\min}) / (d_{\max} - d_{\min}) \tag{31}$$

where d<sub>g</sub> is the mean distance of the global best particle to the other particles, and d<sub>min</sub> and d<sub>max</sub> denote the minimum and maximum of the mean distances over all particles. d<sub>g</sub> is calculated as:

$$d_g = \frac{1}{N} \sum_{i=1}^{N} \left\| \sigma_{c,i} - g\sigma \right\| \tag{32}$$

With the estimate of E<sub>f</sub>, the parameters can be selected as shown in Table 1.


**Table 1.** The strategies for selecting the parameters.

In summary, the cooperative evolution process is shown in Algorithm 1. First, the bandwidths and the corresponding velocities of each particle are randomly assigned. Then, for each iteration of the process, the influence coefficients are evolved from the bandwidths based on the MIE and the particles are updated using the cost function. Finally, the algorithm finds the best solutions for the bandwidths and the influence coefficients, from which the kernel depicts the pdf of the data. The final solution of the bandwidths and the influence coefficients is the solution that minimizes the cost during the evolution procedure. Based on the generated kernel, the correntropy can lead to a model with good robustness.

**Algorithm 1** Evolved cooperation for the kernel parameters

**Input:** the samples {x<sub>i</sub>, t<sub>i</sub>}, i = 1, 2, ... , N
**Output:** the vector of bandwidths σ and the vector of influence coefficients *A*
**Parameters:** the step length ε and the number of iterations L
**Initialization:** Set the cost function of the best solution MIE<sub>best</sub> to ∞ and randomly assign the bandwidths of the kernels σ<sub>c</sub> = {σ<sub>c,1</sub>, σ<sub>c,2</sub>, ... , σ<sub>c,N</sub>} and the corresponding velocities **v**σ<sub>c</sub> = {vσ<sub>c,1</sub>, vσ<sub>c,2</sub>, ... , vσ<sub>c,N</sub>}.
1: **for** k = 1, 2, ... , L **do**
2: Generate the best influence coefficients *A*<sub>c</sub> for each particle using Equation (28)
3: Calculate the value of the cost function MIE<sub>c</sub> for each particle based on Equation (26)
4: Update the personal best solution *p*σ and the global best solution *g*σ by minimizing the cost function
5: Calculate the E<sub>f</sub> of the iteration with Equation (31)
6: Access the parameters for evolution based on Table 1
7: Update the swarm with Equations (29) and (30)
8: **end for**
9: Return the global best bandwidths *g*σ and the corresponding influence coefficients
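As a rough illustration of Algorithm 1, the sketch below uses a plain PSO in place of SDPSO (the delayed terms of Equation (29) and the Table 1 parameter switching are omitted for brevity), with the closed-form coefficients of Equation (28) and the MIE of Equation (26) as the fitness. It assumes `F` holds one empirical-pdf value per residual (for example, produced as in the previous sketch); all names and defaults are illustrative:

```python
import numpy as np

def evolve_kernel_parameters(errors, F, n_particles=20, n_kernels=3, n_iter=50,
                             w=0.7, c1=1.5, c2=1.5, rng=np.random.default_rng(0)):
    """Simplified cooperative evolution: PSO over bandwidths, closed-form coefficients."""
    e = np.asarray(errors, dtype=float)

    def fitness(sigmas):
        K = np.vstack([np.exp(-(e ** 2) / (2.0 * s ** 2)) for s in sigmas])
        A = F @ K.T @ np.linalg.inv(K @ K.T + 1e-10 * np.eye(len(sigmas)))  # Eq. (28)
        return float((A @ K - F) @ (A @ K - F)), A                          # Eq. (26)

    pos = rng.uniform(0.05, 5.0, size=(n_particles, n_kernels))   # candidate bandwidths
    vel = np.zeros_like(pos)
    pbest = pos.copy()
    pbest_cost = np.array([fitness(p)[0] for p in pos])
    g_idx = int(np.argmin(pbest_cost))
    gbest, gbest_cost = pbest[g_idx].copy(), pbest_cost[g_idx]

    for _ in range(n_iter):
        r1, r2 = rng.random(pos.shape), rng.random(pos.shape)
        vel = w * vel + c1 * r1 * (pbest - pos) + c2 * r2 * (gbest - pos)  # Eq. (29), no delays
        pos = np.clip(pos + vel, 1e-3, None)                               # Eq. (30)
        for i, p in enumerate(pos):
            cost, _ = fitness(p)
            if cost < pbest_cost[i]:
                pbest[i], pbest_cost[i] = p.copy(), cost
                if cost < gbest_cost:
                    gbest, gbest_cost = p.copy(), cost
    return gbest, fitness(gbest)[1]    # best bandwidths and their coefficients
```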

#### **4. Training the Extreme Learning Machine Using the Multi-Dimension Correntropy**

To improve the robustness of the extreme learning machine, in the proposed method, the training procedure of the output layer in Equation (5) is replaced by a calculation based on the multi-kernel correntropy generated using the evolved kernel from Section 3. The loss function for the output layer is developed according to the following properties.

**Property 1.** *K(Y,T) is symmetric, which means the following: K(Y,T)* = *K(T,Y).*

**Property 2.** *K(T,Y) is positive and bounded, which means the following: 0 < K(Y,T) ≤ 1, and K(T,Y) = 1 if and only if T = Y.*

**Property 3.** *K(T,Y) involves all the even moments of e, which means the following:*

$$K(T, Y) = \sum_{n=0}^{\infty} \frac{(-1)^n}{2^n\, n!} \sum_{l=1}^{M} \frac{\alpha_l}{\sigma_l^{2n}}\, E\left[e^{2n}\right] \tag{33}$$

**Property 4.** *When the bandwidths are large enough, the following holds:*

$$K(T, Y) \approx \sum_{i=1}^{M} \alpha_i - \frac{1}{2}\sum_{i=1}^{M} \frac{\alpha_i}{\sigma_i^2}\, E\left[e^2\right] \tag{34}$$

**Proof.** Since exp(x) ≈ 1 + x when x → 0, supposing that the bandwidths are large enough, K(**T**,**Y**) can be approximated as follows:

$$\begin{array}{l} K(T, Y) = \alpha_1 G_{\sigma_1}(e) + \alpha_2 G_{\sigma_2}(e) + \dots + \alpha_M G_{\sigma_M}(e) \\[4pt] \quad\approx \alpha_1\left(1 - \dfrac{e^2}{2\sigma_1^2}\right) + \alpha_2\left(1 - \dfrac{e^2}{2\sigma_2^2}\right) + \dots + \alpha_M\left(1 - \dfrac{e^2}{2\sigma_M^2}\right) \\[4pt] \quad= \sum\limits_{i=1}^{M} \alpha_i - \dfrac{1}{2}\sum\limits_{i=1}^{M} \dfrac{\alpha_i}{\sigma_i^2}\, E\left[e^2\right] \end{array} \tag{35}$$

This completes the proof. □

**Remark 1.** *Based on Property 4, the mixed C-loss is defined as L(T,Y)* = *1* − *K(T,Y), which is approximately equivalent to the mean squared error (MSE) when the bandwidths are large enough.*
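As a quick numerical illustration of Remark 1 (under the assumptions that the influence coefficients sum to one and that both bandwidths are large, with all values chosen purely for illustration), the mixed C-loss is close to the scaled MSE obtained from the first-order expansion:

```python
import numpy as np

# Remark 1 check: with very large bandwidths, 1 - K(T, Y) behaves like a scaled MSE.
rng = np.random.default_rng(0)
e = rng.normal(scale=0.1, size=1000)          # small residuals
sigmas = np.array([50.0, 100.0])              # assumed large bandwidths
alphas = np.array([0.5, 0.5])                 # assumed coefficients, summing to one

c_loss = 1.0 - sum(a * np.mean(np.exp(-e**2 / (2*s**2))) for a, s in zip(alphas, sigmas))
scaled_mse = sum(a / (2*s**2) for a, s in zip(alphas, sigmas)) * np.mean(e**2)
print(c_loss, scaled_mse)                     # the two values are nearly equal
```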

**Property 5.** *The empirical mixed C-loss L(e) that is a function of e is convex at any point satisfying* ||*e*||∞ = *max*|*ei*| ≤ σ*1.*

**Proof.** Build the Hessian matrix of the C-loss function L(e) with respect to e as follows:

$$H_{L(e)} = \left[ \frac{\partial^2 L(e)}{\partial e_i\, \partial e_j} \right] = \mathrm{diag}(\xi_1, \xi_2, \dots, \xi_N) \tag{36}$$

The diagonal elements ξ<sub>i</sub> are calculated as follows:

$$\xi_i = \sum_{j=1}^{M} \alpha_j \frac{\sigma_j^2 - e_i^2}{N \sigma_j^4}\, G_{\sigma_j}(e_i) \tag{37}$$

Since ||*e*||∞ ≤ σ<sub>1</sub>, each ξ<sub>i</sub> is positive. Therefore, L(e) is convex. □
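A small numerical check of Property 5, using the per-kernel curvature form of Equation (37) with assumed bandwidths and coefficients (all values illustrative), can be sketched as follows:

```python
import numpy as np

# Property 5 check: the diagonal Hessian entries xi_i are positive whenever
# every |e_i| does not exceed the smallest bandwidth sigma_1.
sigmas = np.array([0.8, 1.5, 3.0])          # assumed bandwidths, sigma_1 smallest
alphas = np.array([0.3, 0.3, 0.4])          # assumed influence coefficients
e = np.linspace(-0.8, 0.8, 9)               # residuals with |e_i| <= sigma_1
N = len(e)
xi = sum(a * (s**2 - e**2) / (N * s**4) * np.exp(-e**2 / (2*s**2))
         for a, s in zip(alphas, sigmas))   # Eq. (37)
print(np.all(xi > 0))                        # True
```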

**Remark 2.** *Using Property 4 and Property 5, the loss function of the output weights is based on the empirical mixed C-loss L(e) from the data observations, which can be defined as follows:*

$$J = L(T, Y) + \lambda \|\beta\|^2 = 1 - \frac{1}{N} \sum_{i=1}^{N} \sum_{j=1}^{M} \alpha_j\, G_{\sigma_j}(e_i) + \lambda \|\beta\|^2 \tag{38}$$

*Based on Equation (38), the training criterion is generated for improvement on the robustness of the model.*

Taking the derivative of the loss function, it is easy to get the following:

$$\begin{array}{l} \dfrac{\partial J}{\partial \beta} = 0 \\[6pt] -\dfrac{1}{N}\sum\limits_{i=1}^{N} \left[\sum\limits_{j=1}^{M} \dfrac{\alpha_j}{\sigma_j^2}\, G_{\sigma_j}(e_i)\right] e_i h_i^T + 2\lambda\beta = 0 \\[6pt] \sum\limits_{i=1}^{N} \left(\varphi(e_i)\, h_i^T h_i \beta - \varphi(e_i)\, t_i h_i^T\right) + \lambda' \beta = 0 \\[6pt] \sum\limits_{i=1}^{N} \varphi(e_i)\, h_i^T h_i \beta + \lambda' \beta = \sum\limits_{i=1}^{N} \varphi(e_i)\, t_i h_i^T \\[6pt] \beta = \left(H^T \Lambda H + \lambda' I\right)^{-1} H^T \Lambda T \end{array} \tag{39}$$

where λ' = 2Nλ, ϕ(e<sub>i</sub>) = $\sum_{j=1}^{M} \frac{\alpha_j}{\sigma_j^2} G_{\sigma_j}(e_i)$ and Λ is a diagonal matrix with diagonal elements Λ<sub>ii</sub> = ϕ(e<sub>i</sub>), which provides the local similarity measurements between the predicted outputs and the actual outputs. When the training data contain large noise or many outliers, the corresponding diagonal elements are relatively low, which reduces the effects of such samples. Therefore, the algorithm can achieve high robustness against noise and outliers in the signals.

Since Equation (39) is a fixed-point equation, because the diagonal matrix Λ depends on the weight vector, the optimal solution is obtained by iterating Equation (39) together with the evolved cooperation of Section 3.
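A minimal sketch of this fixed-point iteration is given below, assuming a one-dimensional target vector and kernel parameters held fixed for illustration (in the proposed method they are re-evolved by Algorithm 1 at each outer iteration); names and defaults are illustrative:

```python
import numpy as np

def update_output_weights(H, T, sigmas, alphas, lam, beta, n_iter=10):
    """Fixed-point iteration for Eq. (39): recompute residuals, the weights
    phi(e_i) = sum_j alpha_j / sigma_j^2 * G_{sigma_j}(e_i), and the weighted
    ridge solution until beta stabilizes."""
    n_hidden = H.shape[1]
    lam_prime = 2 * len(T) * lam                       # lambda' = 2*N*lambda
    for _ in range(n_iter):
        e = T - H @ beta
        phi = sum(a / s**2 * np.exp(-e**2 / (2*s**2)) for a, s in zip(alphas, sigmas))
        Lam = np.diag(phi)
        beta = np.linalg.solve(H.T @ Lam @ H + lam_prime * np.eye(n_hidden),
                               H.T @ Lam @ T)          # Eq. (39)
    return beta
```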

Therefore, combined with the kernel optimization in Section 3, the whole training process can be summarized in Algorithm 2, which is referred to as the ECC-ELM algorithm in this paper.

## **Algorithm 2** ECC-ELM

**Input:** the samples {xi,ti}, i = 1, 2, ... , N

**Output:** the output weights β

**Parameters:** the number of hidden nodes M, the number of inner iterations L (Algorithm 1), the number of outer iterations T and the termination tolerance ε

**Initialization:** Randomly set the weights and bias of the hidden nodes and initialize the output weights β using Equation (5)

1: **for** t = 1, 2, ... , T **do**

2: Calculate the residual error: ei = ti − *h*iβ, i = 1, 2, ... , N

3: Calculate the kernel parameters {<sup>σ</sup>, **A**} using Algorithm 1

4: Calculate the diagonal matrix Λ**:** Λ<sub>ii</sub> = ϕ(e<sub>i</sub>) = $\sum_{j=1}^{M} \frac{\alpha_j}{\sigma_j^2} G_{\sigma_j}(e_i)$

5: Update the output weights β using Equation (39)

6: If the change of the cost function in Equation (38) is smaller than ε, stop
7: **end for**

#### **5. Analysis on Time Complexity and Space Complexity of ECC-ELM**

In this section, the time complexity of the proposed method is analyzed and compared with those of the other algorithms. The main time complexity of the ECC-ELM comes from the cooperating evolution process and the training process of the model. The cooperative evolution contains the calculation of the influence coefficients and the particle updates, with a time complexity of $O(I_t N K^2)$, where $I_t$ is the number of iterations, N is the number of particles and K is the number of discrete values of the outputs. To train the ELM, the procedure shares the same time complexity as the RCC-ELM and MMCC-ELM, which is $O(I_h N_l (5M + M^2))$, where $I_h$ is the number of training iterations, $N_l$ is the number of training data and M is the number of hidden nodes. Therefore, the time complexity of ECC-ELM is $O(I_h N_l (5M + M^2 + I_t N K^2))$, which is slightly higher than those of the RCC-ELM and MMCC-ELM but satisfies the requirements of most applications.

With respect to the space complexity, the ECC-ELM has the same complexity as the prediction models using the RCC-ELM, which is $O(N + (N+2)M + N_l^2)$. Additionally, the space complexity consumed by the evolving process is $O(2N + K)$. Therefore, the space complexity of ECC-ELM is $O(N + (N+2)M + N_l^2 + 2N + K)$, which has the same order as those of the RCC-ELM and MMCC-ELM.

In summary, the time complexity and space complexity are practical for most applications.
