Article

Kernel Risk-Sensitive Mean p-Power Error Algorithms for Robust Learning

Tao Zhang, Shiyuan Wang, Haonan Zhang, Kui Xiong and Lin Wang
1 College of Electronic and Information Engineering, Southwest University, Chongqing 400715, China
2 Chongqing Key Laboratory of Nonlinear Circuits and Intelligent Information Processing, Chongqing 400715, China
* Author to whom correspondence should be addressed.
Entropy 2019, 21(6), 588; https://doi.org/10.3390/e21060588
Submission received: 8 May 2019 / Revised: 11 June 2019 / Accepted: 12 June 2019 / Published: 13 June 2019
(This article belongs to the Special Issue Information Theoretic Learning and Kernel Methods)

Abstract

As a nonlinear similarity measure defined in the reproducing kernel Hilbert space (RKHS), the correntropic loss (C-Loss) has been widely applied in robust learning and signal processing. However, the highly non-convex nature of the C-Loss results in performance degradation. To address this issue, the convex kernel risk-sensitive loss (KRL), defined as the expectation of an exponential function of the squared estimation error in the RKHS, was proposed to measure similarity. In this paper, a novel nonlinear similarity measure, namely the kernel risk-sensitive mean p-power error (KRP), is proposed by incorporating the mean p-power error into the KRL; it is a generalization of the KRL measure. The KRP with p = 2 reduces to the KRL, and it can outperform the KRL when an appropriate p is configured in robust learning. Some properties of the KRP are presented and discussed. To improve the robustness of the kernel recursive least squares (KRLS) algorithm and reduce its network size, two robust recursive kernel adaptive filters, namely the recursive minimum kernel risk-sensitive mean p-power error (RMKRP) algorithm and its quantized version (QRMKRP), are proposed in the RKHS under the minimum kernel risk-sensitive mean p-power error (MKRP) criterion. Monte Carlo simulations are conducted to confirm the superiority of the proposed RMKRP and its quantized version.

1. Introduction

Online kernel-based learning extends kernel methods to online settings in which the data arrive sequentially, and it has been widely applied in signal processing thanks to its excellent performance on nonlinear problems [1]. The development of kernel methods is of great significance for practical applications. In kernel methods, the input data are transformed from the original space into the reproducing kernel Hilbert space (RKHS) using the kernel trick [2]. As representatives of the kernel methods, kernel adaptive filters (KAFs) provide an effective way to transform a nonlinear problem into a linear one and have been widely applied to system identification and time-series prediction [3,4,5]. Generally, KAFs are designed for either Gaussian or non-Gaussian noise through the choice of cost function.
For Gaussian noises, second-order similarity measures of the errors are generally used as the cost function of KAFs to achieve desirable filtering accuracy. In the Gaussian noise environment, KAFs based on second-order similarity measures are mainly divided into three categories, i.e., the kernel least mean square (KLMS) algorithm [6], the kernel affine projection algorithm (KAPA) [7], and the kernel recursive least squares (KRLS) algorithm [8]. However, the network size of KAFs increases linearly with the length of training, leading to large computational and storage burdens. To curb this structural growth, many sparsification methods have been developed, such as the surprise criterion (SC) [9], novelty criterion (NC) [10], coherence criterion [11], and approximate linear dependency (ALD) criterion [8]. However, these sparsification methods simply discard the redundant data, leading to a reduction in filtering accuracy. Unlike the aforementioned sparsification methods, vector quantization (VQ) utilizes the redundant data to update the weights, thereby improving accuracy. Combining VQ with KAFs yields quantized KAFs, e.g., the quantized kernel least mean square algorithm (QKLMS) [12] and the quantized kernel recursive least squares algorithm (QKRLS) [13].
However, the second-order similarity measures used in the aforementioned algorithms only involve the second-order statistics of the errors and therefore cannot handle non-Gaussian noises or outliers efficiently [14]. Thus, it is very important to design cost functions beyond second-order error statistics for combating non-Gaussian noises. Such non-second-order similarity measures can be divided into three categories, i.e., the mean p-power error (MPE) criterion [15], information theoretic learning (ITL) [14], and risk-sensitive loss (RL) based criteria [16,17]. The MPE criterion, based on the pth absolute moment of the error, can deal with non-Gaussian data efficiently when a proper p-value is chosen. In general, MPE is robust to large outliers when p < 2 [15], generating robust adaptive filters such as the kernel least mean p-power (KLMP) algorithm [18] and the kernel recursive least mean p-power (KRLP) algorithm [18]. ITL can incorporate the complete distribution of the errors into the learning process, resulting in improved filtering precision and robustness to outliers. The most widely used ITL criterion is the maximum correntropy criterion (MCC) [19,20,21,22,23,24]. As a local similarity measure defined as a generalized correlation in the RKHS, the correntropy used in MCC can leverage higher-order statistics of the data to combat outliers [25]. However, the performance surface of the correntropic loss (C-Loss) is highly non-convex, which may lead to poor convergence performance. In the RL-based criteria, e.g., the minimum risk-sensitive loss [16] and the minimum kernel risk-sensitive loss (MKRL) [17,26], the risk-sensitive loss defined in the RKHS has a much more convex performance surface, which makes it more efficient for combating non-Gaussian noises or outliers than MCC [17,26]. However, since the MKRL uses the stochastic gradient descent (SGD) method to update its weights, desirable filtering performance cannot be achieved for some complex nonlinear problems. Recursive update rules with excellent tracking ability can improve the filtering performance of adaptive filtering algorithms [21]; for example, KRLS, which is based on a recursive update rule, significantly improves on the filtering performance of the SGD-based KLMS. To the best of our knowledge, however, a recursive MKRL-type algorithm in the RKHS has not yet been proposed.
In this paper, to inherit the advantages of both the KRL and the MPE for robustness improvement, we propose the risk-sensitive mean p-power error (RP), defined as the expectation of an exponential function of the pth absolute moment of the estimation error, and its kernel version, the kernel risk-sensitive mean p-power error (KRP). The KRP can outperform the KRL by setting an appropriate p-value in robust learning, and the KRP with p = 2 reduces to the KRL. The proposed KRP criterion is then used to derive a novel recursive minimum kernel risk-sensitive mean p-power error (RMKRP) algorithm, which achieves desirable filtering performance by combining the weighted output information. Furthermore, to curb the growth of the network size of the RMKRP, VQ is combined with the RMKRP to generate the quantized RMKRP (QRMKRP).
The rest of this paper is organized as follows. In Section 2, we define the KRP and give some basic properties. In Section 3, the KRP criterion is used to develop a recursive adaptive algorithm, called the RMKRP algorithm, by combining the weighted output information; to further reduce the network size of the RMKRP, the vector quantization method is applied to the RMKRP, generating the quantized RMKRP (QRMKRP). In Section 4, Monte Carlo simulations are conducted to validate the superiority of the proposed algorithms on nonlinear examples. Section 5 concludes the paper.

2. Kernel Risk-Sensitive Mean p-Power Error

2.1. Definition

According to [17], the risk-sensitive loss defined in the RKHS is called the kernel risk-sensitive loss (KRL). Given two arbitrary scalar random variables $X, Y \in \mathbb{R}$, the KRL is defined by
$$
L_{\lambda}(X,Y) = \frac{1}{\lambda}\mathbf{E}\left[\exp\left(\lambda\,\frac{1}{2}\left\|\varphi(X)-\varphi(Y)\right\|_{\mathbb{F}}^{2}\right)\right]
= \frac{1}{\lambda}\mathbf{E}\left[\exp\left(\lambda\,\frac{1}{2}\left\langle \varphi(X)-\varphi(Y),\,\varphi(X)-\varphi(Y)\right\rangle_{\mathbb{F}}\right)\right]
= \frac{1}{\lambda}\mathbf{E}\left[\exp\left(\lambda\,\frac{1}{2}\left(\left\langle \varphi(X),\varphi(X)\right\rangle_{\mathbb{F}}+\left\langle \varphi(Y),\varphi(Y)\right\rangle_{\mathbb{F}}-2\left\langle \varphi(X),\varphi(Y)\right\rangle_{\mathbb{F}}\right)\right)\right]
= \frac{1}{\lambda}\mathbf{E}\left[\exp\bigl(\lambda\bigl(1-\kappa_{\sigma}(X-Y)\bigr)\bigr)\right]
= \frac{1}{\lambda}\int \exp\bigl(\lambda\bigl(1-\kappa_{\sigma}(x-y)\bigr)\bigr)\,\mathrm{d}F_{XY}(x,y), \tag{1}
$$
where $\lambda > 0$ is a risk-sensitive scalar parameter; $\varphi(X) = \kappa_{\sigma}(X,\cdot)$ is a nonlinear mapping induced by a Mercer kernel $\kappa_{\sigma}(\cdot)$, which transforms the data from the original space into the RKHS $\mathbb{F}$ equipped with an inner product $\langle \cdot,\cdot \rangle_{\mathbb{F}}$ satisfying $\langle \varphi(X),\varphi(Y)\rangle_{\mathbb{F}} = \varphi^{T}(X)\varphi(Y) = \kappa_{\sigma}(X-Y)$; $\mathbf{E}$ denotes the mathematical expectation; $\|\cdot\|_{\mathbb{F}}$ denotes the norm in the RKHS $\mathbb{F}$; and $F_{XY}(x,y)$ denotes the joint distribution function of $(X,Y)$. A shift-invariant Gaussian kernel $\kappa_{\sigma}(\cdot)$ with bandwidth $\sigma$ is given as follows:
$$
\kappa_{\sigma}(x,y) = \kappa_{\sigma}(x-y) = \exp\left(-\frac{(x-y)^{2}}{2\sigma^{2}}\right). \tag{2}
$$
However, the joint distribution of $(X,Y)$ is usually unknown, and only $N$ samples $\{x(i),y(i)\}_{i=1}^{N}$ are available. Hence, the nonparametric estimate of $L_{\lambda}(X,Y)$ is obtained by applying the Parzen windows [19] as $\hat{L}_{\lambda}(X,Y) = \frac{1}{N\lambda}\sum_{i=1}^{N}\exp\bigl(\lambda\bigl(1-\kappa_{\sigma}(x(i)-y(i))\bigr)\bigr)$. Note that the inner product in the RKHS for the same input is calculated by using the kernel trick and (2), i.e., $\varphi^{T}(X)\varphi(X) = \exp\left(-\frac{(X-X)^{2}}{2\sigma^{2}}\right) = 1$.
In this paper, we define a new non-second order similarity measure in the RKHS, i.e., the kernel risk-sensitive mean p-power error (KRP) loss. Given two random variables X and Y, the KRP loss is defined by
$$
L_{\lambda,p}(X,Y) = \frac{1}{\lambda}\mathbf{E}\left[\exp\left(\lambda\, 2^{-p/2}\left\|\varphi(X)-\varphi(Y)\right\|_{\mathbb{F}}^{p}\right)\right]
= \frac{1}{\lambda}\mathbf{E}\left[\exp\left(\lambda\, 2^{-p/2}\left(\left\|\varphi(X)-\varphi(Y)\right\|_{\mathbb{F}}^{2}\right)^{p/2}\right)\right]
= \frac{1}{\lambda}\mathbf{E}\left[\exp\left(\lambda\, 2^{-p/2}\bigl(2-2\kappa_{\sigma}(X-Y)\bigr)^{p/2}\right)\right]
= \frac{1}{\lambda}\mathbf{E}\left[\exp\left(\lambda\bigl(1-\kappa_{\sigma}(X-Y)\bigr)^{p/2}\right)\right]
= \frac{1}{\lambda}\int \exp\left(\lambda\bigl(1-\kappa_{\sigma}(x-y)\bigr)^{p/2}\right)\mathrm{d}F_{X,Y}(x,y), \tag{3}
$$
where p > 0 is the power parameter. Note that the KRL can be regarded as a special case of the KRP with p = 2 .
However, the joint distribution of X and Y is usually unknown in practice. Hence, the empirical KRP is defined as follows:
$$
\hat{L}_{\lambda,p}(X,Y) = \frac{1}{N\lambda}\sum_{i=1}^{N}\exp\left(\lambda\bigl(1-\kappa_{\sigma}(x(i)-y(i))\bigr)^{p/2}\right), \tag{4}
$$
where $\{x(i),y(i)\}_{i=1}^{N}$ denotes the available finite set of samples. The empirical KRP can be regarded as a distance between the vectors $\mathbf{X}=[x(1),x(2),\ldots,x(N)]^{T}$ and $\mathbf{Y}=[y(1),y(2),\ldots,y(N)]^{T}$.
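To make the definition concrete, the following sketch (our own illustration, not code from the paper; the function name and NumPy usage are assumptions) evaluates the empirical KRP of (4); setting p = 2 recovers the empirical KRL.

```python
import numpy as np

def empirical_krp(x, y, sigma=1.0, lam=1.0, p=4.0):
    """Empirical kernel risk-sensitive mean p-power error, Eq. (4).

    x, y  : 1-D arrays of samples x(i), y(i)
    sigma : Gaussian kernel bandwidth
    lam   : risk-sensitive parameter lambda > 0
    p     : power parameter p > 0 (p = 2 gives the empirical KRL)
    """
    e = np.asarray(x) - np.asarray(y)              # errors e(i) = x(i) - y(i)
    kappa = np.exp(-e**2 / (2.0 * sigma**2))       # Gaussian kernel kappa_sigma(e)
    return np.mean(np.exp(lam * (1.0 - kappa)**(p / 2))) / lam

# The loss saturates for large outliers, which is the source of its robustness.
rng = np.random.default_rng(0)
x = rng.standard_normal(1000)
y = x + 0.1 * rng.standard_normal(1000)
y[::100] += 50.0                                   # inject sparse large outliers
print(empirical_krp(x, y, sigma=1.0, lam=1.0, p=4.0))
```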

2.2. Properties

In the following, we give some important properties of the proposed KRP.
Property 1.
$L_{\lambda,p}(X,Y)$ is symmetric, that is, $L_{\lambda,p}(X,Y) = L_{\lambda,p}(Y,X)$.
Proof. 
Straightforward since $\kappa_{\sigma}(X-Y) = \kappa_{\sigma}(Y-X)$. □
Property 2.
$L_{\lambda,p}(X,Y)$ is positive and bounded, i.e., $\frac{1}{\lambda} \le L_{\lambda,p}(X,Y) \le \frac{1}{\lambda}\exp(\lambda)$, and it reaches its minimum if $X = Y$.
Proof. 
Straightforward since $0 < \kappa_{\sigma}(X-Y) \le 1$, and $\kappa_{\sigma}(X-Y) = 1$ if $X = Y$. □
Property 3.
When $\lambda$ is small enough, it holds that $L_{\lambda,p}(X,Y) \approx \frac{1}{\lambda} + \mathbf{E}\left[\bigl(1-\kappa_{\sigma}(X-Y)\bigr)^{p/2}\right]$.
Proof. 
For a small enough $\lambda$, we have $\lambda\bigl(1-\kappa_{\sigma}(X-Y)\bigr)^{p/2} \approx 0$, i.e.,
$$
\exp\left(\lambda\bigl(1-\kappa_{\sigma}(X-Y)\bigr)^{p/2}\right) \approx 1 + \lambda\bigl(1-\kappa_{\sigma}(X-Y)\bigr)^{p/2}. \tag{5}
$$
Therefore, we can obtain
$$
L_{\lambda,p}(X,Y) = \frac{1}{\lambda}\mathbf{E}\left[\exp\left(\lambda\bigl(1-\kappa_{\sigma}(X-Y)\bigr)^{p/2}\right)\right]
\overset{(5)}{\approx} \frac{1}{\lambda}\mathbf{E}\left[1+\lambda\bigl(1-\kappa_{\sigma}(X-Y)\bigr)^{p/2}\right]
= \frac{1}{\lambda} + \mathbf{E}\left[\bigl(1-\kappa_{\sigma}(X-Y)\bigr)^{p/2}\right]. \tag{6}
$$
 □
Property 4.
When $\sigma$ is large enough, it holds that $L_{\lambda,p}(X,Y) \approx \frac{1}{\lambda} + (2\sigma^{2})^{-p/2}\,\mathbf{E}\left[|X-Y|^{p}\right]$.
Proof. 
Since $\exp(x)$ is approximated by $1+x$ for a small enough $x$, and $\frac{(X-Y)^{2}}{2\sigma^{2}} \approx 0$ for a large enough $\sigma$, we can obtain the approximation
$$
\exp\left(-\frac{(X-Y)^{2}}{2\sigma^{2}}\right) \approx 1 - \frac{(X-Y)^{2}}{2\sigma^{2}}. \tag{7}
$$
Similarly, since $\lambda\left(\frac{(X-Y)^{2}}{2\sigma^{2}}\right)^{p/2} \approx 0$ for a large enough $\sigma$, we can also obtain the approximation
$$
\exp\left(\lambda\left(\frac{(X-Y)^{2}}{2\sigma^{2}}\right)^{p/2}\right) \approx 1 + \lambda\left(\frac{(X-Y)^{2}}{2\sigma^{2}}\right)^{p/2}. \tag{8}
$$
According to (7) and (8), we have
$$
L_{\lambda,p}(X,Y) = \frac{1}{\lambda}\mathbf{E}\left[\exp\left(\lambda\left(1-\exp\left(-\frac{(X-Y)^{2}}{2\sigma^{2}}\right)\right)^{p/2}\right)\right]
\overset{(7)}{\approx} \frac{1}{\lambda}\mathbf{E}\left[\exp\left(\lambda\left(\frac{(X-Y)^{2}}{2\sigma^{2}}\right)^{p/2}\right)\right]
\overset{(8)}{\approx} \frac{1}{\lambda}\mathbf{E}\left[1+\lambda\left(\frac{(X-Y)^{2}}{2\sigma^{2}}\right)^{p/2}\right]
= \frac{1}{\lambda} + (2\sigma^{2})^{-p/2}\,\mathbf{E}\left[|X-Y|^{p}\right]. \tag{9}
$$
 □
Remark 1.
According to Properties 3 and 4, the KRP is approximately equivalent to the KMPE [27] when $\lambda$ is small enough, and approximately equivalent to the MPE [15] when $\sigma$ is large enough. Thus, the KMPE and MPE can be viewed as two extreme cases of the KRP.
Property 5.
When $p$ is small enough, it holds that $L_{\lambda,p}(X,Y) \approx \frac{1}{\lambda}\exp\left(\lambda\left(1+\frac{p}{2}\mathbf{E}\left[\log\bigl(1-\kappa_{\sigma}(X-Y)\bigr)\right]\right)\right) \approx \frac{1}{\lambda}\exp(\lambda)$.
Proof. 
Property 5 holds because $\bigl(1-\kappa_{\sigma}(X-Y)\bigr)^{p/2} \approx 1 + \frac{p}{2}\log\bigl(1-\kappa_{\sigma}(X-Y)\bigr) \approx 1$ when $p$ is small enough, and hence $\mathbf{E}\left[\bigl(1-\kappa_{\sigma}(X-Y)\bigr)^{p/2}\right] \approx 1 + \frac{p}{2}\mathbf{E}\left[\log\bigl(1-\kappa_{\sigma}(X-Y)\bigr)\right] \approx 1$. □
Property 6.
Let $\mathbf{e} = \mathbf{X} - \mathbf{Y} = [e(1),e(2),\ldots,e(N)]^{T}$, where $e(i) = x(i) - y(i)$. The empirical KRP $\hat{L}_{\lambda,p}(X,Y)$ as a function of $\mathbf{e}$ is convex at any point satisfying $\|\mathbf{e}\|_{\infty} = \max_{i=1,2,\ldots,N}|e(i)| \le \sigma$ and $p \ge 2$. When $\|\mathbf{e}\|_{\infty} > \sigma$, the empirical KRP $\hat{L}_{\lambda,p}(X,Y)$ is also convex if the risk-sensitive parameter $\lambda > 0$ and the power parameter $p \ge \max_{i=1,2,\ldots,N}\frac{2(e^{2}(i)-\sigma^{2})(1-\kappa_{\sigma}(e(i)))}{e^{2}(i)\kappa_{\sigma}(e(i))} + 2$.
Proof. 
Since $\hat{L}_{\lambda,p}(X,Y) = \frac{1}{N\lambda}\sum_{i=1}^{N}\exp\left(\lambda\bigl(1-\kappa_{\sigma}(e(i))\bigr)^{p/2}\right)$, the Hessian matrix of $\hat{L}_{\lambda,p}(X,Y)$ with respect to $\mathbf{e}$ can be derived as
$$
\mathbf{H}_{\hat{L}_{\lambda,p}(X,Y)}(\mathbf{e}) = \left[\frac{\partial^{2}\hat{L}_{\lambda,p}(X,Y)}{\partial e(i)\,\partial e(j)}\right] = \mathrm{diag}\left[\gamma_{1},\gamma_{2},\ldots,\gamma_{N}\right], \tag{10}
$$
where
$$
\gamma_{i} = \zeta_{i}\left[\frac{p\lambda}{2\sigma^{2}}\bigl(1-\kappa_{\sigma}(e(i))\bigr)^{\frac{p-2}{2}}\exp\left(-\frac{e^{2}(i)}{2\sigma^{2}}\right)e^{2}(i) + 1 + \frac{p-2}{2}\bigl(1-\kappa_{\sigma}(e(i))\bigr)^{-1}\exp\left(-\frac{e^{2}(i)}{2\sigma^{2}}\right)\frac{e^{2}(i)}{\sigma^{2}} - \frac{e^{2}(i)}{\sigma^{2}}\right], \tag{11}
$$
with $\zeta_{i} = \frac{p}{2N\sigma^{2}}\exp\left(\lambda\bigl(1-\kappa_{\sigma}(e(i))\bigr)^{p/2}\right)\bigl(1-\kappa_{\sigma}(e(i))\bigr)^{\frac{p-2}{2}}\exp\left(-\frac{e^{2}(i)}{2\sigma^{2}}\right) > 0$, $i=1,2,\ldots,N$. From (11), if $|e(i)| \le \sigma$ and $p \ge 2$, or if $|e(i)| > \sigma$, $\lambda > 0$, and $p \ge \frac{2(e^{2}(i)-\sigma^{2})(1-\kappa_{\sigma}(e(i)))}{e^{2}(i)\kappa_{\sigma}(e(i))} + 2$, we have $\gamma_{i} \ge 0$. Therefore, $\mathbf{H}_{\hat{L}_{\lambda,p}(X,Y)}(\mathbf{e}) \ge 0$ when $p \ge 2$ and $\max_{i=1,2,\ldots,N}|e(i)| \le \sigma$, or when $\lambda > 0$ and
$$
p \ge \max_{i=1,2,\ldots,N}\frac{2(e^{2}(i)-\sigma^{2})(1-\kappa_{\sigma}(e(i)))}{e^{2}(i)\kappa_{\sigma}(e(i))} + 2. \tag{12}
$$
□
Remark 2.
According to Property 6, the empirical KRP as a function of $\mathbf{e}$ is convex at any point satisfying $\|\mathbf{e}\|_{\infty} \le \sigma$ and $p \ge 2$. For the case $\|\mathbf{e}\|_{\infty} > \sigma$, the empirical KRP can still be convex at a point if the risk-sensitive parameter $\lambda > 0$ and the power parameter $p \ge \max_{i=1,2,\ldots,N}\frac{2(e^{2}(i)-\sigma^{2})(1-\kappa_{\sigma}(e(i)))}{e^{2}(i)\kappa_{\sigma}(e(i))} + 2$.
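As a quick numerical companion to Property 6 and Remark 2 (our own sketch, not part of the paper), the smallest power parameter that the property requires for convexity at a given error vector can be evaluated directly:

```python
import numpy as np

def min_power_for_convexity(e, sigma=1.0):
    """Smallest p guaranteed by Property 6 to make the empirical KRP convex at e:
    p = 2 suffices when max|e(i)| <= sigma; otherwise
    p >= max_i 2(e(i)^2 - sigma^2)(1 - kappa(e(i))) / (e(i)^2 kappa(e(i))) + 2."""
    e = np.asarray(e, dtype=float)
    if np.max(np.abs(e)) <= sigma:
        return 2.0
    kappa = np.exp(-e**2 / (2.0 * sigma**2))
    return np.max(2.0 * (e**2 - sigma**2) * (1.0 - kappa) / (e**2 * kappa)) + 2.0

print(min_power_for_convexity([0.2, -0.4, 0.9], sigma=1.0))   # 2.0 (all |e(i)| <= sigma)
print(min_power_for_convexity([0.2, -0.4, 2.5], sigma=1.0))   # larger than 2
```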
Property 7.
As $\sigma \to \infty$ or $x(i) \to 0$, $i=1,2,\ldots,N$, it holds that
$$
\hat{L}_{\lambda,p}(\mathbf{X},\mathbf{0}) \approx \frac{1}{\lambda} + \frac{1}{N(\sqrt{2}\sigma)^{p}}\left\|\mathbf{X}\right\|_{p}^{p}, \tag{13}
$$
where 0 denotes an N-dimensional zero vector.
Proof. 
$$
\hat{L}_{\lambda,p}(\mathbf{X},\mathbf{0}) = \frac{1}{N\lambda}\sum_{i=1}^{N}\exp\left(\lambda\bigl(1-\kappa_{\sigma}(x(i))\bigr)^{p/2}\right)
\approx \frac{1}{N\lambda}\sum_{i=1}^{N}\exp\left(\lambda\left(1-\left(1-\frac{x^{2}(i)}{2\sigma^{2}}\right)\right)^{p/2}\right)
= \frac{1}{N\lambda}\sum_{i=1}^{N}\exp\left(\lambda\left(\frac{x^{2}(i)}{2\sigma^{2}}\right)^{p/2}\right)
\approx \frac{1}{N\lambda}\sum_{i=1}^{N}\left(1+\lambda\left(\frac{x^{2}(i)}{2\sigma^{2}}\right)^{p/2}\right)
= \frac{1}{\lambda} + \frac{1}{N(\sqrt{2}\sigma)^{p}}\sum_{i=1}^{N}|x(i)|^{p}
= \frac{1}{\lambda} + \frac{1}{N(\sqrt{2}\sigma)^{p}}\left\|\mathbf{X}\right\|_{p}^{p}. \tag{14}
$$
 □
Remark 3.
According to Property 7, the empirical KRP $\hat{L}_{\lambda,p}(\mathbf{X},\mathbf{0})$ behaves like an $L_{p}$ norm of $\mathbf{X}$ when the kernel bandwidth $\sigma$ is large enough.

3. Application to Adaptive Filtering

In this section, to combat non-Gaussian noises, two robust recursive adaptive filtering algorithms are developed in the RKHS under the proposed KRP criterion, using the kernel trick and the vector quantization technique, respectively.

3.1. RMKRP

A recursive strategy is introduced under the KRP loss function, yielding the recursive minimum kernel risk-sensitive mean p-power error (RMKRP) algorithm. The offline solution minimizing the KRP loss is first obtained; based on this offline solution, the recursive (online) solution is then derived using some matrix operations, which generates the RMKRP algorithm. The details of the RMKRP are as follows.
Consider the prediction of a continuous input-output mapping $f: \mathbb{U} \to \mathbb{R}$ based on the adaptive filtering scheme shown in Figure 1, where $\mathbf{u}(i) \in \mathbb{U} \subset \mathbb{R}^{D}$ is the $i$th $D$-dimensional input vector and $d(i) \in \mathbb{R}$ is the $i$th scalar desired output contaminated by a noise $v(i)$, i.e., $d(i) = f(\mathbf{u}(i)) + v(i)$. A sequence of training samples $\{\mathbf{u}(j), d(j)\}_{j=1}^{i}$ is used to perform the prediction of $f(\cdot)$ in an adaptive filter. The nonlinear mapping $\varphi(\mathbf{u}(j))$ of the input $\mathbf{u}(j)$ is denoted by $\varphi(j)$ for simplicity. Hence, in the RKHS $\mathbb{F}$, the training samples become $\{\mathbf{\Phi}(i), \mathbf{d}(i)\}$, where the desired output vector is $\mathbf{d}(i) = [d(1), d(2), \ldots, d(i)]^{T}$ and the input kernel mapping matrix is $\mathbf{\Phi}(i) = [\varphi(1), \varphi(2), \ldots, \varphi(i)]$. The prediction, denoted by $\hat{f}(\cdot)$, is therefore given in the RKHS as $\hat{f}(\cdot) = \varphi^{T}(\cdot)\mathbf{\Omega}$, where $\mathbf{\Omega} \in \mathbb{F}$ is the weight vector in the high-dimensional feature space $\mathbb{F}$.
An exponentially weighted loss function is used here to put more emphasis on recent data and to de-emphasize data from the remote past [28]. When $\{\mathbf{\Phi}(i), \mathbf{d}(i)\}$ is available, the weight vector $\mathbf{\Omega}(i)$ is obtained as the offline solution minimizing the following weighted cost function:
$$
J(\mathbf{\Omega}(i)) = \sum_{j=1}^{i}\rho^{\,i-j}\,\frac{1}{\lambda}\exp\left(\lambda\, z(j)^{\frac{p}{2}}\right) + \frac{1}{2}\rho^{\,i}\zeta\left\|\mathbf{\Omega}(i)\right\|^{2}, \tag{15}
$$
where $\rho$ denotes the forgetting factor in the interval $[0,1]$, $\zeta$ is the regularization factor, $z(j) = 1 - \exp\left(-\frac{e^{2}(j)}{2\sigma^{2}}\right)$, and $e(j) = d(j) - \varphi^{T}(j)\mathbf{\Omega}(i)$ denotes the $j$th estimation error. The second term is a norm-penalizing term, which guarantees the existence of the inverse of the input data autocorrelation matrix, especially during the initial update stages. In addition, the regularization term is weighted by $\rho^{\,i}$, which de-emphasizes regularization as time progresses. According to Property 6, the empirical KRP as a function of $\mathbf{e}$ is convex at any point satisfying $\max_{j=1,2,\ldots,i}|e(j)| \le \sigma$, $\lambda > 0$, and $p \ge 2$. To obtain the minimum of (15), its gradient is calculated, i.e.,
$$
\frac{\partial J(\mathbf{\Omega}(i))}{\partial \mathbf{\Omega}(i)} = -\frac{p}{2\sigma^{2}}\sum_{j=1}^{i}\varphi(j)\,\rho^{\,i-j}\exp\left(\lambda z(j)^{\frac{p}{2}}\right)z(j)^{\frac{p-2}{2}}\bigl(d(j)-\varphi^{T}(j)\mathbf{\Omega}(i)\bigr)\bigl(1-z(j)\bigr) + \rho^{\,i}\zeta\,\mathbf{\Omega}(i)
= -\frac{p}{2\sigma^{2}}\sum_{j=1}^{i}\varphi(j)\,\rho^{\,i-j}\exp\left(\lambda z(j)^{\frac{p}{2}}\right)z(j)^{\frac{p-2}{2}}\bigl(1-z(j)\bigr)d(j) + \frac{p}{2\sigma^{2}}\sum_{j=1}^{i}\varphi(j)\,\rho^{\,i-j}\exp\left(\lambda z(j)^{\frac{p}{2}}\right)z(j)^{\frac{p-2}{2}}\bigl(1-z(j)\bigr)\varphi^{T}(j)\mathbf{\Omega}(i) + \rho^{\,i}\zeta\,\mathbf{\Omega}(i). \tag{16}
$$
Setting (16) to zero, i.e., $\frac{\partial J(\mathbf{\Omega})}{\partial \mathbf{\Omega}} = \mathbf{0}$, we can obtain the offline solution minimizing (15) as follows:
$$
\mathbf{\Omega}(i) = \left(\mathbf{\Phi}(i)\mathbf{H}(i)\mathbf{\Phi}^{T}(i) + \rho^{\,i}\zeta\,\frac{2\sigma^{2}}{p}\,\mathbf{I}\right)^{-1}\mathbf{\Phi}(i)\mathbf{H}(i)\mathbf{d}(i), \tag{17}
$$
where $\mathbf{H}(i) = \mathrm{diag}\left[H_{1}(i), H_{2}(i), \ldots, H_{i}(i)\right]$ with $H_{j}(i) = \rho^{\,i-j}\exp\left(\lambda z(j)^{\frac{p}{2}}\right)z(j)^{\frac{p-2}{2}}\bigl(1-z(j)\bigr)$, $j=1,2,\ldots,i$, and $\mathbf{I}$ denotes an identity matrix with an appropriate dimension.
To obtain an efficient recursive solution to the minimization of (15), a Mercer kernel is used to construct the RKHS. Here, the Gaussian kernel, denoted as $\kappa_{\sigma_{1}}(\cdot)$ with $\sigma_{1}$ being the kernel width, is used as the Mercer kernel. The inner product in the RKHS can then be calculated efficiently by the kernel trick [28], i.e., $\kappa_{\sigma_{1}}(\mathbf{u}(i),\mathbf{u}(j)) = \kappa_{\sigma_{1}}(\mathbf{u}(i)-\mathbf{u}(j)) = \varphi^{T}(\mathbf{u}(i))\varphi(\mathbf{u}(j)) = \varphi^{T}(i)\varphi(j)$, which avoids the direct calculation of the nonlinear mapping $\varphi(\cdot)$.
Since the matrix inversion lemma [28] is described by $(\mathbf{A}+\mathbf{B}\mathbf{C}\mathbf{D})^{-1} = \mathbf{A}^{-1} - \mathbf{A}^{-1}\mathbf{B}\left(\mathbf{C}^{-1}+\mathbf{D}\mathbf{A}^{-1}\mathbf{B}\right)^{-1}\mathbf{D}\mathbf{A}^{-1}$, by letting $\mathbf{A} = \rho^{\,i}\zeta\,\frac{2\sigma^{2}}{p}\,\mathbf{I}$, $\mathbf{B} = \mathbf{\Phi}(i)$, $\mathbf{C} = \mathbf{H}(i)$, and $\mathbf{D} = \mathbf{\Phi}^{T}(i)$, we rewrite (17) as
$$
\left(\mathbf{\Phi}(i)\mathbf{H}(i)\mathbf{\Phi}^{T}(i) + \rho^{\,i}\zeta\,\frac{2\sigma^{2}}{p}\,\mathbf{I}\right)^{-1}\mathbf{\Phi}(i)\mathbf{H}(i) = \mathbf{\Phi}(i)\left(\mathbf{\Phi}^{T}(i)\mathbf{\Phi}(i) + \rho^{\,i}\zeta\,\frac{2\sigma^{2}}{p}\,\mathbf{H}(i)^{-1}\right)^{-1}. \tag{18}
$$
Substituting (18) into (17) yields
$$
\mathbf{\Omega}(i) = \mathbf{\Phi}(i)\left(\mathbf{\Phi}^{T}(i)\mathbf{\Phi}(i) + \rho^{\,i}\zeta\,\frac{2\sigma^{2}}{p}\,\mathbf{H}(i)^{-1}\right)^{-1}\mathbf{d}(i). \tag{19}
$$
Note that $\mathbf{\Phi}^{T}(i)\mathbf{\Phi}(i)$ in (19) can be computed efficiently by the kernel trick. The weight vector $\mathbf{\Omega}(i)$ is therefore described explicitly as a linear combination of the input data in the RKHS, i.e.,
$$
\mathbf{\Omega}(i) = \mathbf{\Phi}(i)\boldsymbol{\alpha}(i), \tag{20}
$$
where α ( i ) denotes the coefficients vector.
It can be seen from (20) that finding the recursive form of $\mathbf{\Omega}(i)$ reduces to finding that of $\boldsymbol{\alpha}(i)$. Hence, in the following, the key to obtaining a recursive solution to the minimization of (15) is to derive the recursive form of $\boldsymbol{\alpha}(i)$.
The coefficients vector α ( i ) is calculated using the kernel trick as
$$
\boldsymbol{\alpha}(i) = \left(\mathbf{\Phi}^{T}(i)\mathbf{\Phi}(i) + \rho^{\,i}\zeta\,\frac{2\sigma^{2}}{p}\,\mathbf{H}(i)^{-1}\right)^{-1}\mathbf{d}(i). \tag{21}
$$
For simplicity, we obtain the update form of α ( i ) indirectly by defining Λ ( i ) as
$$
\mathbf{\Lambda}(i) = \left(\mathbf{\Phi}^{T}(i)\mathbf{\Phi}(i) + \rho^{\,i}\zeta\,\frac{2\sigma^{2}}{p}\,\mathbf{H}(i)^{-1}\right)^{-1}, \tag{22}
$$
where $\mathbf{\Phi}(i) = [\mathbf{\Phi}(i-1), \varphi(i)]$. Then, the update form of $\mathbf{\Lambda}(i)$ can be further obtained as
$$
\mathbf{\Lambda}(i) = \begin{bmatrix} \mathbf{\Phi}^{T}(i-1)\mathbf{\Phi}(i-1) + \rho^{\,i}\zeta\,\frac{2\sigma^{2}}{p}\,\mathbf{H}(i-1)^{-1} & \mathbf{\Phi}^{T}(i-1)\varphi(i) \\ \varphi^{T}(i)\mathbf{\Phi}(i-1) & \varphi^{T}(i)\varphi(i) + \rho^{\,i}\zeta\,\frac{2\sigma^{2}}{p}\,\nu(i) \end{bmatrix}^{-1}, \tag{23}
$$
where $\nu(i) = \left[\exp\left(\lambda z(i)^{\frac{p}{2}}\right)z(i)^{\frac{p-2}{2}}\bigl(1-z(i)\bigr)\right]^{-1}$. By using some matrix operations, we further simplify (23) as
$$
\mathbf{\Lambda}(i)^{-1} = \begin{bmatrix} \mathbf{\Lambda}(i-1)^{-1} & \boldsymbol{\xi}(i) \\ \boldsymbol{\xi}^{T}(i) & \rho^{\,i}\zeta\,\frac{2\sigma^{2}}{p}\,\nu(i) + \varphi^{T}(i)\varphi(i) \end{bmatrix}, \tag{24}
$$
where $\boldsymbol{\xi}(i) = \mathbf{\Phi}^{T}(i-1)\varphi(i)$. By using the following block matrix inversion identity [18,21,28]
$$
\begin{bmatrix} \mathbf{A} & \mathbf{B} \\ \mathbf{C} & \mathbf{D} \end{bmatrix}^{-1} = \begin{bmatrix} \left(\mathbf{A}-\mathbf{B}\mathbf{D}^{-1}\mathbf{C}\right)^{-1} & -\mathbf{A}^{-1}\mathbf{B}\left(\mathbf{D}-\mathbf{C}\mathbf{A}^{-1}\mathbf{B}\right)^{-1} \\ -\mathbf{D}^{-1}\mathbf{C}\left(\mathbf{A}-\mathbf{B}\mathbf{D}^{-1}\mathbf{C}\right)^{-1} & \left(\mathbf{D}-\mathbf{C}\mathbf{A}^{-1}\mathbf{B}\right)^{-1} \end{bmatrix}, \tag{25}
$$
then, we can obtain the update equation for the inverse of the growing matrix in (24) as
$$
\mathbf{\Lambda}(i) = r^{-1}(i)\begin{bmatrix} \mathbf{\Lambda}(i-1)\,r(i) + \boldsymbol{\theta}(i)\boldsymbol{\theta}^{T}(i) & -\boldsymbol{\theta}(i) \\ -\boldsymbol{\theta}^{T}(i) & 1 \end{bmatrix}, \tag{26}
$$
where $\boldsymbol{\theta}(i) = \mathbf{\Lambda}(i-1)\boldsymbol{\xi}(i)$ and $r(i) = \rho^{\,i}\zeta\,\frac{2\sigma^{2}}{p}\,\nu(i) + \varphi^{T}(i)\varphi(i) - \boldsymbol{\theta}^{T}(i)\boldsymbol{\xi}(i)$. Combining (21) with (26), the coefficient vector $\boldsymbol{\alpha}(i)$ of the weight vector $\mathbf{\Omega}(i)$ is obtained as follows:
$$
\boldsymbol{\alpha}(i) = \mathbf{\Lambda}(i)\mathbf{d}(i) = \begin{bmatrix} \mathbf{\Lambda}(i-1) + \boldsymbol{\theta}(i)\boldsymbol{\theta}^{T}(i)r^{-1}(i) & -\boldsymbol{\theta}(i)r^{-1}(i) \\ -\boldsymbol{\theta}^{T}(i)r^{-1}(i) & r^{-1}(i) \end{bmatrix}\begin{bmatrix} \mathbf{d}(i-1) \\ d(i) \end{bmatrix} = \begin{bmatrix} \boldsymbol{\alpha}(i-1) - \boldsymbol{\theta}(i)r^{-1}(i)e(i) \\ r^{-1}(i)e(i) \end{bmatrix}, \tag{27}
$$
where $e(i) = d(i) - \hat{f}(i)$ denotes the difference between the desired output $d(i)$ and the filter output $\hat{f}(i) = \boldsymbol{\xi}^{T}(i)\boldsymbol{\alpha}(i-1) = \sum_{j=1}^{i-1}\alpha_{j}(i-1)\kappa_{\sigma_{1}}(\mathbf{u}(j),\mathbf{u}(i))$, with $\alpha_{j}(i-1)$ being the $j$th element of $\boldsymbol{\alpha}(i-1)$ and all the previous data serving as the centers. The coefficient vector $\boldsymbol{\alpha}(i-1)$ and all the previous data should be stored at each iteration. Finally, the RMKRP algorithm is summarized in Algorithm 1.
Algorithm 1: The RMKRP Algorithm.
Initialization:
$\{\mathbf{u}(i), d(i)\}$, $i = 1, 2, \ldots$
$p, \lambda, \rho, \sigma, \sigma_{1} > 0$, $\zeta \in [0,1]$, $z(1) = 1 - \exp\left(-\frac{d^{2}(1)}{2\sigma^{2}}\right)$.
$H_{1}(1) = \exp\left(\lambda z(1)^{\frac{p}{2}}\right)z(1)^{\frac{p-2}{2}}\bigl(1-z(1)\bigr)$.
$\mathbf{\Lambda}(1) = \left(\zeta\rho\,\frac{2\sigma^{2}}{p}\big/H_{1}(1) + \kappa(\mathbf{u}(1),\mathbf{u}(1))\right)^{-1}$, $\boldsymbol{\alpha}(1) = \mathbf{\Lambda}(1)d(1)$.
Computation:
While { u ( i ) , d ( i ) } ( i > 1 ) available do
  1) $\boldsymbol{\xi}(i) = [\kappa(\mathbf{u}(1),\mathbf{u}(i)), \ldots, \kappa(\mathbf{u}(i-1),\mathbf{u}(i))]^{T}$
  2) $e(i) = d(i) - \boldsymbol{\xi}^{T}(i)\boldsymbol{\alpha}(i-1)$
  3) $\boldsymbol{\theta}(i) = \mathbf{\Lambda}(i-1)\boldsymbol{\xi}(i)$
  4) $z(i) = 1 - \exp\left(-\frac{e^{2}(i)}{2\sigma^{2}}\right)$
  5) $\nu(i) = \left[\exp\left(\lambda z(i)^{\frac{p}{2}}\right)z(i)^{\frac{p-2}{2}}\bigl(1-z(i)\bigr)\right]^{-1}$
  6) $r(i) = \rho^{\,i}\zeta\,\frac{2\sigma^{2}}{p}\,\nu(i) + \kappa(\mathbf{u}(i),\mathbf{u}(i)) - \boldsymbol{\theta}^{T}(i)\boldsymbol{\xi}(i)$
  7) $\mathbf{\Lambda}(i) = r^{-1}(i)\begin{bmatrix} \mathbf{\Lambda}(i-1)r(i) + \boldsymbol{\theta}(i)\boldsymbol{\theta}^{T}(i) & -\boldsymbol{\theta}(i) \\ -\boldsymbol{\theta}^{T}(i) & 1 \end{bmatrix}$
  8) $\boldsymbol{\alpha}(i) = \begin{bmatrix} \boldsymbol{\alpha}(i-1) - \boldsymbol{\theta}(i)r^{-1}(i)e(i) \\ r^{-1}(i)e(i) \end{bmatrix}$.
end while
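For readers who prefer runnable code, the sketch below mirrors Algorithm 1 in NumPy. It is a minimal, unoptimized illustration under our own naming conventions rather than the authors' reference implementation; the growing matrix Λ(i) is rebuilt with np.block at every step, and a tiny epsilon guards the divisions when an error happens to be exactly zero.

```python
import numpy as np

def gauss_kernel(u, v, sigma1):
    """Gaussian kernel between input vectors u and v."""
    diff = np.atleast_1d(u) - np.atleast_1d(v)
    return np.exp(-np.dot(diff, diff) / (2.0 * sigma1**2))

def rmkrp(U, d, p=4.0, lam=1.0, rho=1.0, zeta=0.1, sigma=1.0, sigma1=1.0, eps=1e-12):
    """Recursive minimum kernel risk-sensitive mean p-power error (Algorithm 1).

    U : (N, D) array of inputs u(i);  d : (N,) array of desired outputs d(i).
    Returns the coefficient vector alpha and the list of centers."""
    c = 2.0 * sigma**2 / p                                   # constant 2*sigma^2/p
    z1 = 1.0 - np.exp(-d[0]**2 / (2.0 * sigma**2))
    H1 = np.exp(lam * z1**(p / 2)) * z1**((p - 2) / 2) * (1.0 - z1)
    Lam = np.array([[1.0 / (zeta * rho * c / (H1 + eps) + gauss_kernel(U[0], U[0], sigma1))]])
    alpha = Lam @ np.array([d[0]])
    centers = [U[0]]
    for i in range(1, len(d)):
        xi = np.array([gauss_kernel(cj, U[i], sigma1) for cj in centers])   # xi(i)
        e = d[i] - xi @ alpha                                               # prediction error
        theta = Lam @ xi
        z = 1.0 - np.exp(-e**2 / (2.0 * sigma**2))
        nu = 1.0 / (np.exp(lam * z**(p / 2)) * z**((p - 2) / 2) * (1.0 - z) + eps)
        r = rho**(i + 1) * zeta * c * nu + gauss_kernel(U[i], U[i], sigma1) - theta @ xi
        Lam = (1.0 / r) * np.block([[Lam * r + np.outer(theta, theta), -theta[:, None]],
                                    [-theta[None, :],                  np.array([[1.0]])]])
        alpha = np.concatenate([alpha - theta * (e / r), [e / r]])
        centers.append(U[i])
    return alpha, centers

def predict(alpha, centers, u, sigma1=1.0):
    """Filter output f_hat(u) = sum_j alpha_j * kappa_sigma1(center_j, u)."""
    return sum(a * gauss_kernel(cj, u, sigma1) for a, cj in zip(alpha, centers))
```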

3.2. QRMKRP

The RMKRP algorithm generates a linearly growing network owing to the kernel trick. The online vector quantization (VQ) method [12] has been successfully applied in KAFs to curb network growth efficiently. Thus, we incorporate the online VQ method into the RMKRP to develop the quantized RMKRP (QRMKRP) algorithm, as follows.
Suppose that the dictionary $\mathbf{C}(i)$ contains $L$ vectors at discrete time $i$, i.e., $\mathbf{C}(i) = \{\mathbf{C}_{k}(i)\}_{k=1}^{L}$, $k \in I_{d} = \{1,2,\ldots,L\}$, which means that there are $L$ distinct quantization regions. In the RKHS, the prediction $\hat{f}(i)$ is therefore expressed as $\hat{f}(i) = \varphi^{T}(\mathbf{C}_{k}(i))\hat{\mathbf{\Omega}}$, where $\hat{\mathbf{\Omega}} \in \mathbb{F}$ is the weight vector in the RKHS $\mathbb{F}$. The cost function of the QRMKRP based on $\mathbf{C}(i)$ is denoted as
$$
\sum_{k=1}^{L}\sum_{n=1}^{|D_{k}|}\rho^{\,i-k}\,\frac{1}{\lambda}\exp\left(\lambda\left(1-\exp\left(-\frac{\bigl(d_{k}^{n}(i)-\varphi^{T}(\mathbf{C}_{k}(i))\hat{\mathbf{\Omega}}(i)\bigr)^{2}}{2\sigma^{2}}\right)\right)^{\frac{p}{2}}\right) + \frac{1}{2}\rho^{\,i}\zeta\left\|\hat{\mathbf{\Omega}}(i)\right\|^{2}, \tag{28}
$$
where $|D_{k}|$ denotes the number of input data that lie in the $k$th quantization region of $\mathbf{C}(i)$, satisfying $\sum_{k \in I_{d}}|D_{k}| = i$ and $|D_{k}| \ge 1$, and $d_{k}^{n}(i)$ is the desired output $d(i)$ corresponding to the $n$th element within the $k$th quantization region.
The offline solution to the minimization of (28) can be described by
$$
\hat{\mathbf{\Omega}}(i) = \left(\hat{\mathbf{\Phi}}(i)\hat{\mathbf{H}}(i)\hat{\mathbf{\Phi}}^{T}(i) + \rho^{\,i}\zeta\,\frac{2\sigma^{2}}{p}\,\mathbf{I}\right)^{-1}\hat{\mathbf{\Phi}}(i)\hat{\mathbf{d}}(i), \tag{29}
$$
where $\hat{\mathbf{\Phi}}(i) = [\varphi(\mathbf{C}_{1}(i)), \varphi(\mathbf{C}_{2}(i)), \ldots, \varphi(\mathbf{C}_{L}(i))]$ with $L \le i$ elements; $\hat{\mathbf{H}}(i) = \mathrm{diag}\left[\sum_{n=1}^{|D_{1}|}H_{1}^{n}(i), \sum_{n=1}^{|D_{2}|}H_{2}^{n}(i), \ldots, \sum_{n=1}^{|D_{L}|}H_{L}^{n}(i)\right]$ denotes an accumulated diagonal matrix; $\hat{\mathbf{d}}(i) = \left[\sum_{n=1}^{|D_{1}|}H_{1}^{n}(i)d_{1}^{n}(i), \sum_{n=1}^{|D_{2}|}H_{2}^{n}(i)d_{2}^{n}(i), \ldots, \sum_{n=1}^{|D_{L}|}H_{L}^{n}(i)d_{L}^{n}(i)\right]^{T}$ denotes an accumulated weighted output vector; $H_{k}^{n}(i)$ denotes $H_{i}(i)$ corresponding to the $n$th entry of the $k$th quantization region; and $\mathbf{I}$ denotes an identity matrix with an appropriate dimension. Since (29) has a similar form to (17), we simplify (29) as
$$
\hat{\mathbf{\Omega}}(i) = \hat{\mathbf{\Phi}}(i)\left(\hat{\mathbf{H}}(i)\hat{\mathbf{K}}(i) + \rho^{\,i}\zeta\,\frac{2\sigma^{2}}{p}\,\mathbf{I}\right)^{-1}\hat{\mathbf{d}}(i) = \hat{\mathbf{\Phi}}(i)\hat{\mathbf{Q}}(i)\hat{\mathbf{d}}(i) = \hat{\mathbf{\Phi}}(i)\hat{\boldsymbol{\alpha}}(i), \tag{30}
$$
where $\hat{\mathbf{K}}(i) = \hat{\mathbf{\Phi}}^{T}(i)\hat{\mathbf{\Phi}}(i)$. To obtain the recursive solution to the minimization of (28), we let $\hat{\boldsymbol{\alpha}}(i) = \hat{\mathbf{Q}}(i)\hat{\mathbf{d}}(i)$ and denote
$$
\hat{\mathbf{Q}}(i) = \left(\hat{\mathbf{H}}(i)\hat{\mathbf{K}}(i) + \rho^{\,i}\zeta\,\frac{2\sigma^{2}}{p}\,\mathbf{I}\right)^{-1}. \tag{31}
$$
To update Ω ^ ( i ) in (30) recursively, two cases are therefore considered.
(1) Case 1: $\mathrm{dis}(\mathbf{u}(i), \mathbf{C}(i-1)) \le \epsilon$. In this case, we have $\mathbf{C}(i) = \mathbf{C}(i-1)$ and $\hat{\mathbf{K}}(i) = \hat{\mathbf{K}}(i-1)$, and the input $\mathbf{u}(i)$ is quantized to the $k^{*}$th element of the dictionary $\mathbf{C}(i-1)$, where $k^{*} = \arg\min_{1 \le k \le |\mathbf{C}(i-1)|}\left\|\mathbf{u}(i) - \mathbf{C}_{k}(i-1)\right\|_{2}$. The matrix $\hat{\mathbf{H}}(i)$ and the vector $\hat{\mathbf{d}}(i)$ have a similar form to [13] and can be written as
$$
\hat{\mathbf{H}}(i) = \hat{\mathbf{H}}(i-1) + H_{i}(i)\,\boldsymbol{\tau}_{k^{*}}\boldsymbol{\tau}_{k^{*}}^{T}, \qquad \hat{\mathbf{d}}(i) = \hat{\mathbf{d}}(i-1) + H_{i}(i)\,d(i)\,\boldsymbol{\tau}_{k^{*}}, \tag{32}
$$
where $\boldsymbol{\tau}_{k^{*}}$ is an $|\mathbf{C}(i-1)|$-dimensional column vector whose $k^{*}$th element is 1 and all other elements are 0. Substituting (32) into (31), the matrix $\hat{\mathbf{Q}}(i)$ can be expressed as $\hat{\mathbf{Q}}(i) = \left[\hat{\mathbf{Q}}(i-1)^{-1} + H_{i}(i)\,\boldsymbol{\tau}_{k^{*}}\boldsymbol{\tau}_{k^{*}}^{T}\hat{\mathbf{K}}(i-1)\right]^{-1}$. By using the matrix inversion lemma [28], we obtain
$$
\hat{\mathbf{Q}}(i) = \hat{\mathbf{Q}}(i-1) - \frac{\hat{\mathbf{Q}}_{k^{*}}(i-1)\hat{\mathbf{K}}_{k^{*}}^{T}(i-1)\hat{\mathbf{Q}}(i-1)}{H_{i}^{-1}(i) + \hat{\mathbf{K}}_{k^{*}}^{T}(i-1)\hat{\mathbf{Q}}_{k^{*}}(i-1)}, \tag{33}
$$
where $\hat{\mathbf{Q}}_{k^{*}}(i-1)$ and $\hat{\mathbf{K}}_{k^{*}}(i-1)$ represent the $k^{*}$th columns of the matrices $\hat{\mathbf{Q}}(i-1)$ and $\hat{\mathbf{K}}(i-1)$, respectively. Therefore, $\hat{\boldsymbol{\alpha}}(i)$ in (30) can be calculated as
$$
\hat{\boldsymbol{\alpha}}(i) = \hat{\mathbf{Q}}(i)\hat{\mathbf{d}}(i) = \hat{\boldsymbol{\alpha}}(i-1) + \frac{\bigl(d(i) - \hat{\mathbf{K}}_{k^{*}}^{T}(i-1)\hat{\boldsymbol{\alpha}}(i-1)\bigr)\hat{\mathbf{Q}}_{k^{*}}(i-1)}{H_{i}^{-1}(i) + \hat{\mathbf{K}}_{k^{*}}^{T}(i-1)\hat{\mathbf{Q}}_{k^{*}}(i-1)}. \tag{34}
$$
(2) Case 2: $\mathrm{dis}(\mathbf{u}(i), \mathbf{C}(i-1)) > \epsilon$. In this case, we have $\mathbf{C}(i) = \{\mathbf{C}(i-1), \mathbf{u}(i)\}$ and $\hat{\mathbf{\Phi}}(i) = [\hat{\mathbf{\Phi}}(i-1), \varphi(\mathbf{u}(i))]$, and therefore
$$
\hat{\mathbf{H}}(i) = \begin{bmatrix} \hat{\mathbf{H}}(i-1) & \mathbf{0} \\ \mathbf{0}^{T} & H_{i}(i) \end{bmatrix}, \qquad \hat{\mathbf{K}}(i) = \begin{bmatrix} \hat{\mathbf{K}}(i-1) & \hat{\mathbf{h}}(i) \\ \hat{\mathbf{h}}(i)^{T} & \kappa_{ii} \end{bmatrix}, \tag{35}
$$
where $\mathbf{0}$ is the null column vector with a compatible dimension, $\hat{\mathbf{h}}(i) = \hat{\mathbf{\Phi}}(i-1)^{T}\varphi(\mathbf{u}(i))$, and $\kappa_{ii} = \kappa_{\sigma_{1}}(\mathbf{u}(i),\mathbf{u}(i))$. Combining (31), (35), $\hat{\mathbf{d}}(i) = [\hat{\mathbf{d}}(i-1)^{T}, H_{i}(i)d(i)]^{T}$, and the block matrix inversion identity [28], we obtain
$$
\hat{\mathbf{Q}}(i) = \begin{bmatrix} \hat{\mathbf{Q}}(i-1) + \hat{r}(i)^{-1}H_{i}(i)\,\hat{\mathbf{z}}_{\hat{H}}(i)\hat{\mathbf{z}}(i)^{T} & -\hat{r}(i)^{-1}\hat{\mathbf{z}}_{\hat{H}}(i) \\ -\hat{r}(i)^{-1}H_{i}(i)\,\hat{\mathbf{z}}(i)^{T} & \hat{r}(i)^{-1} \end{bmatrix}, \tag{36}
$$
where
$$
\hat{\mathbf{z}}_{\hat{H}}(i) = \hat{\mathbf{Q}}(i-1)\hat{\mathbf{H}}(i-1)\hat{\mathbf{h}}(i), \qquad \hat{\mathbf{z}}(i) = \hat{\mathbf{Q}}(i-1)^{T}\hat{\mathbf{h}}(i), \qquad \hat{r}(i) = \kappa_{\sigma_{1}}(\mathbf{u}(i),\mathbf{u}(i))H_{i}(i) + \rho^{\,i}\zeta\,\frac{2\sigma^{2}}{p} - H_{i}(i)\,\hat{\mathbf{h}}(i)^{T}\hat{\mathbf{z}}_{\hat{H}}(i). \tag{37}
$$
Furthermore, due to $\hat{\mathbf{d}}(i) = [\hat{\mathbf{d}}(i-1)^{T}, H_{i}(i)d(i)]^{T}$, we obtain
$$
\hat{\boldsymbol{\alpha}}(i) = \hat{\mathbf{Q}}(i)\hat{\mathbf{d}}(i) = \begin{bmatrix} \hat{\boldsymbol{\alpha}}(i-1) - \hat{r}(i)^{-1}H_{i}(i)\,\hat{\mathbf{z}}_{\hat{H}}(i)\bigl(d(i) - \hat{\mathbf{h}}(i)^{T}\hat{\boldsymbol{\alpha}}(i-1)\bigr) \\ \hat{r}(i)^{-1}H_{i}(i)\bigl(d(i) - \hat{\mathbf{h}}(i)^{T}\hat{\boldsymbol{\alpha}}(i-1)\bigr) \end{bmatrix}. \tag{38}
$$
The QRMKRP algorithm is summarized in Algorithm 2, where L denotes the dictionary size.
Algorithm 2: The QRMKRP algorithm.
Initialization:
{ u ( i ) , d ( i ) } , i = 1 , 2 ,
σ , σ 1 , p , λ > 0 , L = 1 , C ( 1 ) = { u ( 1 ) } ,
$z(1) = 1 - \exp\left(-\frac{d^{2}(1)}{2\sigma^{2}}\right)$,
$H_{1}(1) = \exp\left(\lambda z(1)^{\frac{p}{2}}\right)z(1)^{\frac{p-2}{2}}\bigl(1-z(1)\bigr)$, $\hat{\mathbf{H}}(1) = [H_{1}(1)]$.
$\hat{\mathbf{Q}}(1) = \left(\zeta\rho\,\frac{2\sigma^{2}}{p} + H_{1}(1)\kappa_{11}\right)^{-1}$, $\hat{\boldsymbol{\alpha}}(1) = \hat{\mathbf{Q}}(1)H_{1}(1)d(1)$,
$\epsilon > 0$, $\rho > 0$, $\zeta \in [0,1]$.
Computation:
While { u ( i ) , d ( i ) } ( i > 1 ) available do
  1) Compute the distance between $\mathbf{u}(i)$ and $\mathbf{C}(i-1)$:
   $\mathrm{dis}(\mathbf{u}(i), \mathbf{C}(i-1)) = \min_{1 \le k \le |\mathbf{C}(i-1)|}\left\|\mathbf{u}(i) - \mathbf{C}_{k}(i-1)\right\|_{2}$,
   where $k^{*} = \arg\min_{1 \le k \le |\mathbf{C}(i-1)|}\left\|\mathbf{u}(i) - \mathbf{C}_{k}(i-1)\right\|_{2}$.
  2) If $\mathrm{dis}(\mathbf{u}(i), \mathbf{C}(i-1)) \le \epsilon$:
   Keep the dictionary unchanged: $\mathbf{C}(i) = \mathbf{C}(i-1)$, $L \leftarrow L$,
   Update H ^ ( i ) by (32), Q ^ ( i ) by (33), α ^ ( i ) by (34).
  3) Otherwise:
   The dictionary changes: $\mathbf{C}(i) = [\mathbf{C}(i-1), \mathbf{u}(i)]$, $L \leftarrow L + 1$,
   Update H ^ ( i ) by (35), Q ^ ( i ) by (36), α ^ ( i ) by (38).
end while
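The only structural difference from Algorithm 1 is the dictionary test in step 1). A minimal sketch of that test (our own code, with a hypothetical `centers` list holding the current dictionary) is given below; when `is_new` is False, the weights attached to center `k_star` are updated through (32)-(34), and otherwise u(i) is appended to the dictionary and (35)-(38) are applied.

```python
import numpy as np

def quantize(u, centers, eps):
    """Return (k_star, is_new): index of the nearest dictionary element and whether
    u should be added as a new center (its distance exceeds the threshold eps)."""
    dists = [np.linalg.norm(np.atleast_1d(u) - np.atleast_1d(c)) for c in centers]
    k_star = int(np.argmin(dists))
    return k_star, dists[k_star] > eps
```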

4. Simulation

In this section, two examples, i.e., Mackey–Glass (MG) chaotic time series prediction and nonlinear system identification, are used to validate the performance superiority of the proposed RMKRP algorithm and its quantized version.
In both examples, the considered noise environment is impulsive noise, which is modeled by a combination of two independent noise processes [17], i.e.,
$$
v(i) = (1-b(i))\,v_{1}(i) + b(i)\,v_{2}(i), \tag{39}
$$
where $v_{1}(i)$ is an ordinary noise disturbance with small variance and $v_{2}(i)$ represents large outliers with large variance; $b(i)$ is a binary random process over $\{0,1\}$ with $\mathrm{Prob}\{b(i)=1\} = c$ and $\mathrm{Prob}\{b(i)=0\} = 1-c$ ($0 \le c \le 1$ is an occurrence probability). Here, we select $c = 0.1$. The distribution of $v_{1}(i)$ is taken as a binary distribution over $\{-0.5, 0.5\}$ with probability mass $\mathrm{Prob}\{x=-0.5\} = \mathrm{Prob}\{x=0.5\} = 0.5$. In addition, $v_{2}(i)$ is modeled by the $\alpha$-stable process, owing to its heavy-tailed probability density function. The $\alpha$-stable process is described by the following characteristic function [29]:
$$
f(t) = \exp\left\{\,j\delta t - \gamma|t|^{\alpha}\bigl[1 + j\beta\,\mathrm{sgn}(t)\,S(t,\alpha)\bigr]\right\}, \tag{40}
$$
where
$$
S(t,\alpha) = \begin{cases} \tan\dfrac{\alpha\pi}{2}, & \text{if } \alpha \neq 1, \\ \dfrac{2}{\pi}\log|t|, & \text{if } \alpha = 1, \end{cases} \tag{41}
$$
with $\alpha \in (0,2]$ being the characteristic factor, $\beta \in [-1,1]$ the symmetry parameter, $\gamma > 0$ the dispersion parameter, $\mathrm{sgn}(\cdot)$ the sign function, $j = \sqrt{-1}$, and $-\infty < \delta < \infty$ the location parameter. Generally, a smaller $\alpha$ generates a heavier tail, and a smaller $\gamma$ generates fewer large outliers. The characteristic function, denoted as $V_{\alpha\text{-stable}}(\alpha, \beta, \gamma, \delta)$, is chosen as $V_{\alpha\text{-stable}}(0.8, 0, 0.1, 0)$ to model the impulsive noise in the simulations.
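For reproducibility, the mixture noise of (39) can be sampled as in the sketch below (our own code; it relies on SciPy's levy_stable distribution, whose parameterization may differ slightly from the characteristic function (40), so it should be read as an approximation of the noise model rather than the authors' exact generator).

```python
import numpy as np
from scipy.stats import levy_stable

def impulsive_noise(n, c=0.1, alpha=0.8, beta=0.0, gamma=0.1, delta=0.0, seed=0):
    """Sample v(i) = (1 - b(i)) v1(i) + b(i) v2(i) as in Eq. (39)."""
    rng = np.random.default_rng(seed)
    b = rng.random(n) < c                              # Bernoulli switch, Prob{b(i)=1} = c
    v1 = rng.choice([-0.5, 0.5], size=n)               # binary inner disturbance
    v2 = levy_stable.rvs(alpha, beta, loc=delta, scale=gamma,
                         size=n, random_state=seed)    # alpha-stable outliers
    return np.where(b, v2, v1)

v = impulsive_noise(2000)   # additive noise for the 2000 training samples
```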

4.1. Chaotic Time Series Prediction

The MG chaotic time series is generated from the following differential equation [9]:
$$
\frac{\mathrm{d}x(t)}{\mathrm{d}t} = \frac{\beta\, x(t-\tau)}{1 + x(t-\tau)^{n}} - \gamma\, x(t), \tag{42}
$$
where $\beta, \gamma, n > 0$. Here, we set $\beta = 0.2$, $\gamma = 0.1$, and $\tau = 30$. The time series is discretized at a sampling period of six seconds. The training set includes a segment of 2000 samples corrupted by the additive noise given in (39), and another 200 samples without noise are used as the testing set. The kernel size $\sigma_{1}$ of the Gaussian kernel is set to 1. The filter length is set to $L = 7$, which means that $[x_{t}, x_{t-1}, \ldots, x_{t-6}]$ is used to predict $x_{t+1}$.
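One way to generate the series, assuming a simple Euler integration of (42) with a fine internal step and the exponent n = 10 (a common choice for the MG benchmark; n is not specified in the text), is sketched below; the result is sampled every six seconds as stated above.

```python
import numpy as np

def mackey_glass(n_samples, beta=0.2, gamma=0.1, n=10, tau=30,
                 sample_period=6.0, dt=0.1, x0=1.2, discard=500):
    """Euler integration of the Mackey-Glass delay equation (42)."""
    steps_per_sample = int(round(sample_period / dt))
    delay = int(round(tau / dt))
    total = discard + n_samples * steps_per_sample + delay
    x = np.full(total, x0)
    for t in range(delay, total - 1):
        x_tau = x[t - delay]
        x[t + 1] = x[t] + dt * (beta * x_tau / (1.0 + x_tau**n) - gamma * x[t])
    return x[delay + discard::steps_per_sample][:n_samples]

series = mackey_glass(2200)   # 2000 noisy training samples + 200 clean testing samples
```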
To evaluate the filtering accuracy, the testing mean square error (MSE) is defined as follows:
$$
\mathrm{MSE\,(dB)} = 10\log_{10}\left(\frac{1}{N}\sum_{i=1}^{N}\bigl(d(i) - \hat{f}(i)\bigr)^{2}\right), \tag{43}
$$
where $\hat{f}(i)$ is the estimate of $d(i)$, and $N$ is the length of the testing data.
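Equivalently, in code (our own helper, consistent with (43)):

```python
import numpy as np

def testing_mse_db(d, f_hat):
    """Testing MSE in dB as defined in Eq. (43)."""
    d, f_hat = np.asarray(d), np.asarray(f_hat)
    return 10.0 * np.log10(np.mean((d - f_hat)**2))
```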
The KLMS [6], KMCC [22], MKRL [26], KRMC [21], and KRLS [8] algorithms are chosen for performance comparison with RMKRP thanks to their excellent filtering performance. The other sparsification algorithms, i.e., the QKLMS [12], QKMCC [30], QMKRL [26], QKRLS [13], and KRMC-NC [21] algorithms are used for performance comparison with QRMKRP owing to their modest space complexities and excellent performance. All simulation results are averaged over 50 independent Monte Carlo runs.
Since the power parameter $p$, the risk-sensitive parameter $\lambda$, and the kernel width $\sigma$ are crucial parameters of the proposed RMKRP and QRMKRP algorithms, the influence of these parameters on the performance is first discussed. In the simulations, we take 12 points evenly in the closed intervals $p \in [1,6]$ and $\sigma \in [0.17, 5]$, respectively. The influence of $p$ on the steady-state performance of RMKRP is shown in Figure 2a, where the steady-state MSEs are obtained as averages over the last 100 iterations. The parameters are set as follows: $p$ is varied within $[1,6]$; the risk-sensitive parameter $\lambda$ in the KRP is set to 1; $\zeta = 0.1$ and $\rho = 1$; the kernel size $\sigma$ in the KRP is set to 1. As can be seen from Figure 2a, the filtering accuracy of RMKRP is highest when $p = 4$ and decreases gradually when $p$ is either too small or too large. Then, the influence of $\sigma$ on the filtering performance of RMKRP with $p = 4$ is shown in Figure 2b, where the steady-state MSEs are again averages over the last 100 iterations; the risk-sensitive parameter $\lambda$ is fixed at 1, and $\sigma$ lies in $[0.17, 5]$. From Figure 2b, we see that RMKRP achieves the highest filtering accuracy when $\sigma$ is about 1. This is reasonable, since RMKRP is sensitive to outliers when the kernel width is large and loses error-correction ability when the kernel width is small. Finally, the influence of $\lambda$ on the filtering performance of RMKRP with $\sigma = 1$ and $p = 4$ is shown in Figure 2c, where the steady-state MSEs are averages over the last 100 iterations and $\lambda$ is selected from $\{0.001, 0.01, 0.1, 0.2, 0.3, 0.4, 0.5, 0.6, 1, 2, 3, 4\}$. From Figure 2c, we see that $\lambda$ has only a slight influence on the filtering accuracy when it is small, whereas a large $\lambda$ noticeably increases the steady-state MSE. Therefore, following Figure 2, the parameters of RMKRP can be chosen by trials to obtain the best performance in practice. The parameters of QRMKRP can be chosen by the same method as that for RMKRP.
The performance comparison of QKLMS, QKMCC, QMKRL, KLMS, KMCC, MKRL, KRLS, KRMC, KRMC-NC, and QKRLS is conducted in the same environments as in (39). The parameters of the proposed algorithms are selected by trials to achieve desirable performance, and the parameters of the compared algorithms are chosen such that they have almost the same convergence rate: $\lambda = 1$, $p = 4$, and $\sigma = 1$ for RMKRP; $\lambda = 1$, $p = 4$, $\sigma = 1$, and $\epsilon = 0.2$ for QRMKRP; $\eta = 0.1$ for KLMS; $\eta = 0.09$ and $\sigma = 3.5$ for KMCC; $\eta = 0.09$, $\sigma = 1$, and $\lambda = 1$ for MKRL; $\eta = 0.1$ and $\epsilon = 0.2$ for QKLMS; $\eta = 0.09$, $\epsilon = 0.2$, and $\sigma = 3.5$ for QKMCC; $\eta = 0.09$, $\epsilon = 0.2$, $\sigma = 1$, and $\lambda = 1$ for QMKRL; $\zeta = 0.1$, $\rho = 1$, and $\sigma = 3.5$ for KRMC; the novelty criterion thresholds $\delta_{1} = 0.15$, $\delta_{2} = 0.1$, $\zeta = 0.1$, $\rho = 1$, and $\sigma = 3.5$ for KRMC-NC; $\zeta = 0.1$ for KRLS; and $\zeta = 0.1$ and $\epsilon = 0.2$ for QKRLS. Figure 3 shows the MSEs of RMKRP, QRMKRP, and the compared algorithms. As can be seen from Figure 3, RMKRP achieves better filtering accuracy than KRLS, KRMC, KLMS, KMCC, and MKRL, and QRMKRP achieves a better steady-state testing MSE than the sparsification algorithms QKRLS, KRMC-NC, QKLMS, QKMCC, and QMKRL. We also see from Figure 3 that the proposed algorithms provide good robustness to impulsive noises. For a detailed comparison, the dictionary sizes, consumed time, and steady-state MSEs corresponding to Figure 3 are listed in Table 1. Note that the steady-state MSEs of KLMS, QKLMS, KRLS, and QKRLS are not shown in Table 1 since they cannot converge in such an impulsive noise environment. From Table 1, we see that RMKRP has similar consumed time to KRLS and KRMC but provides better filtering accuracy. In addition, QRMKRP provides the highest filtering accuracy among all the compared sparsification algorithms and approaches the filtering accuracy of RMKRP with a significantly smaller network size.

4.2. Nonlinear System Identification

To further validate the performance superiority of the proposed RMKRP and QRMKRP algorithms, a nonlinear system identification problem is considered. Here, the nonlinear system is of the following form [31]:
$$
s(t) = s(t-1)\bigl(0.8 - 0.5\exp(-s^{2}(t-1))\bigr) - \bigl(0.3 + 0.9\exp(-s^{2}(t-1))\bigr)s(t-2) + 0.1\sin\bigl(s(t-1)\pi\bigr), \tag{44}
$$
where $s(t)$ denotes the output at discrete time $t$, with initial values $s(1) = 0.1$ and $s(2) = 0.1$. The two previous outputs $\mathbf{u}(t) = [s(t-1), s(t-2)]^{T}$ are utilized as the input to estimate the current output $s(t)$. The training set includes a segment of 2000 samples corrupted by the additive noise given in (39), and another 200 samples without noise are used as the testing set. The kernel width $\sigma_{1}$ of the Gaussian kernel is set to 1. All simulation results are averaged over 50 independent Monte Carlo runs.
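The identification data can be generated as in the following sketch (our own code for Equation (44) with the stated initial conditions; in the experiments, the impulsive noise of (39) is then added to the desired outputs of the training segment).

```python
import numpy as np

def nonlinear_system(n_samples):
    """Generate input/output pairs for the benchmark system in Eq. (44)."""
    s = np.zeros(n_samples + 2)
    s[0], s[1] = 0.1, 0.1                      # initial conditions s(1), s(2)
    for t in range(2, n_samples + 2):
        s[t] = (s[t - 1] * (0.8 - 0.5 * np.exp(-s[t - 1]**2))
                - (0.3 + 0.9 * np.exp(-s[t - 1]**2)) * s[t - 2]
                + 0.1 * np.sin(np.pi * s[t - 1]))
    U = np.column_stack([s[1:-1], s[:-2]])     # u(t) = [s(t-1), s(t-2)]^T
    d = s[2:]                                  # desired output s(t)
    return U, d

U, d = nonlinear_system(2200)   # 2000 training + 200 testing samples
```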
Similar to MG chaotic time series prediction, the influence of power parameter p, risk-sensitive parameter λ , and kernel width σ on the performance of RMKRP is also discussed in nonlinear system identification. The influence of p on the steady-state performance of RMKRP is shown in Figure 4a, where the steady-state MSEs are obtained as averages over the last 100 iterations. The parameters are set as: p is set within [ 1 , 6 ] ; λ is set as 0.1 ; ζ = 0.1 and ρ = 1 ; kernel size σ in the KRP is set as 1. The influence of σ on the filtering performance of RMKRP is shown in Figure 4b, where risk-sensitive parameter λ is fixed at 0.1 ; σ lies in [ 0.17 , 5 ] ; p is set as 4. The influence of λ on the filtering performance of RMKRP is shown in Figure 4c, where the range of λ is selected as λ { 0.001 , 0.01 , 0.1 , 0.2 , 0.3 , 0.4 , 0.5 , 0.6 , 1 , 2 , 3 , 4 } ; σ is set as 1; p is set as 4. As can be seen from Figure 4, we can obtain the same conclusions as those in Figure 2.
We compare the filtering performance of QKLMS, QKMCC, QMKRL, KLMS, KMCC, MKRL, KRLS, KRMC, KRMC-NC, and QKRLS in the same environments as in (39). The parameters of the proposed algorithms are selected by trials to achieve desirable performance, and the parameters of compared algorithms are chosen such that they have almost the same convergence rate. λ = 0.1 , p = 4 , and σ = 1 are set for RMKRP; λ = 0.1 , p = 4 , σ = 1 , and ϵ = 0.2 for QRMKRP; η = 0.1 for KLMS; η = 0.09 and σ = 3.5 for KMCC; η = 0.09 , σ = 1 , and λ = 2 for MKRL; η = 0.1 and ϵ = 0.2 for QKLMS; η = 0.09 , ϵ = 0.2 , and σ = 3.5 for QKMCC; η = 0.09 , ϵ = 0.2 , σ = 1 , and λ = 2 for QMKRL; ζ = 0.1 , ρ = 1 , and σ = 3.5 for KRMC; the novelty criterion thresholds δ 1 = 0.01 , δ 2 = 0.1 , ζ = 0.1 , ρ = 1 , and σ = 3.5 for KRMC-NC; ζ = 0.1 for KRLS; ζ = 0.1 and ϵ = 0.2 for QKRLS. Figure 5 shows the compared MSEs of RMKRP, QRMKRP, and the compared algorithms. For detailed comparison, the dictionary size, consumed time, and steady-state MSEs in Figure 5 are also shown in Table 2, where the steady-state MSEs of KLMS, QKLMS, KRLS, and QKRLS are not shown since they cannot converge in such impulsive noise environments. From Figure 5 and Table 2, we can obtain the same conclusions as those in Figure 3 and Table 1.

5. Conclusions

In this paper, the kernel risk-sensitive mean p-power error (KRP) criterion has been proposed by incorporating the mean p-power error (MPE) into the kernel risk-sensitive loss (KRL) in the RKHS, and some basic properties have been presented. The KRP criterion, with its power parameter p, is more flexible than the KRL for handling signals corrupted by impulsive noises. Two kernel recursive adaptive algorithms, namely the recursive minimum KRP (RMKRP) and the quantized RMKRP (QRMKRP), have been derived under the minimum KRP (MKRP) criterion to obtain desirable filtering accuracy. The RMKRP achieves higher accuracy with almost the same computational complexity as the KRLS and KRMC. The vector quantization method is introduced into the RMKRP to generate the QRMKRP, which effectively reduces the network size while maintaining the filtering accuracy. Simulations on Mackey–Glass (MG) chaotic time series prediction and nonlinear system identification under impulsive noises illustrate the superiority of the RMKRP and QRMKRP in terms of robustness and filtering accuracy.

Author Contributions

Conceptualization, T.Z. and S.W.; methodology, T.Z. and H.Z.; software, T.Z. and L.W.; validation, S.W. and T.Z.; formal analysis, T.Z. and H.Z.; investigation, T.Z. and K.X.; resources, S.W.; data curation, S.W. and T.Z.; writing—original draft preparation, T.Z.; writing—review and editing, S.W. and H.Z.; visualization, T.Z.; supervision, S.W.; project administration, S.W. and L.W.; funding acquisition, S.W. and K.X.

Funding

This work was supported by the National Natural Science Foundation of China (61671389), Fundamental Research Funds for the Central Universities (XDJK2019B011), and the Research Fund for Science and Technology Commission Foundation of Chongqing (cstc2017rgzn-zdyfX0002).

Conflicts of Interest

The authors declare no conflict of interest.

References

  1. Kivinen, J.; Smola, A.J.; Williamson, R.C. Online learning with kernels. IEEE Trans. Signal Process. 2004, 52, 2165–2176. [Google Scholar] [CrossRef]
  2. Chen, B.; Li, L.; Liu, W.; Príncipe, J.C. Nonlinear adaptive filtering in kernel spaces. In Springer Handbook of Bio-/Neuroinformatics; Springer: Berlin, Germany, 2014; pp. 715–734. [Google Scholar]
  3. Nakajima, Y.; Yukawa, M. Nonlinear channel equalization by multi-kernel adaptive filter. In Proceedings of the IEEE 13th International Workshop on Signal Processing Advances in Wireless Communications, Cesme, Turkey, 17–20 June 2012; pp. 384–388. [Google Scholar]
  4. Jiang, S.; Gu, Y. Block-sparsity-induced adaptive filter for multi-clustering system identification. IEEE Trans. Signal Process. 2015, 63, 5318–5330. [Google Scholar] [CrossRef]
  5. Zheng, Y.; Wang, S.; Feng, J.; Tse, C.K. A modified quantized kernel least mean square algorithm for prediction of chaotic time series. Digital Signal Process. 2016, 48, 130–136. [Google Scholar] [CrossRef]
  6. Liu, W.; Pokharel, P.P.; Príncipe, J.C. The kernel least mean square algorithm. IEEE Trans. Signal Process. 2008, 56, 543–554. [Google Scholar] [CrossRef]
  7. Liu, W.; Príncipe, J.C. Kernel affine projection algorithms. IEEE Trans. Signal Process. 2004, 52, 2275–2285. [Google Scholar]
  8. Engel, Y.; Mannor, S.; Meir, R. The kernel recursive least-squares algorithm. IEEE Trans. Signal Process. 2004, 52, 2275–2285. [Google Scholar] [CrossRef]
  9. Liu, W.; Park, I.; Príncipe, J.C. An information theoretic approach of designing sparse kernel adaptive filters. IEEE Trans. Neural Netw. 2009, 20, 1950–1961. [Google Scholar] [CrossRef]
  10. Platt, J. A resource-allocating network for function interpolation. Neural Comput. 1991, 3, 213–225. [Google Scholar] [CrossRef]
  11. Richard, C.; Bermudez, J.C.M.; Honeine, P. Online prediction of time series data with kernels. IEEE Trans. Signal Process. 2009, 57, 1058–1067. [Google Scholar] [CrossRef]
  12. Chen, B.; Zhao, S.; Zhu, P.; Príncipe, J.C. Quantized kernel least mean square algorithm. IEEE Trans. Neural Netw. Learn. Syst. 2012, 23, 22–32. [Google Scholar] [CrossRef]
  13. Chen, B.; Zhao, S.; Zhu, P.; Príncipe, J.C. Quantized kernel recursive least squares algorithm. IEEE Trans. Neural Netw. Learn. Syst. 2013, 24, 1484–1491. [Google Scholar] [CrossRef] [PubMed]
  14. Príncipe, J.C. Information Theoretic Learning: Renyi’s Entropy and Kernel Perspectives; Springer: New York, NY, USA, 2010. [Google Scholar]
  15. Pei, S.-C.; Tseng, C.-C. Least mean p-power error criterion for adaptive FIR filter. IEEE J. Sel. Areas Commun. 1994, 12, 1540–1547. [Google Scholar]
  16. Boel, R.K.; James, M.R.; Petersen, I.R. Robustness and risk sensitive filtering. IEEE Trans. Autom. Control 2002, 47, 451–461. [Google Scholar] [CrossRef]
  17. Chen, B.; Xing, L.; Xu, B.; Zhao, H.; Zheng, N.; Príncipe, J.C. Kernel risk-sensitive loss: Definition, properties and application to robust adaptive filtering. IEEE Trans. Signal Process. 2017, 65, 2888–2901. [Google Scholar] [CrossRef]
  18. Ma, W.; Duan, J.; Man, W.; Zhao, H.; Chen, B. Robust kernel adaptive filters based on mean p-power error for noisy chaotic time series prediction. Eng. Appl. Artif. Intell. 2017, 58, 101–110. [Google Scholar] [CrossRef]
  19. Liu, W.; Pokharel, P.P.; Príncipe, J.C. Correntropy: Properties and applications in non-gaussian signal processing. IEEE Trans. Signal Process. 2007, 55, 5286–5298. [Google Scholar] [CrossRef]
  20. Chen, B.; Xing, L.; Liang, J.; Zheng, N.; Príncipe, J.C. Steady-state mean-square error analysis for adaptive filtering under the maximum correntropy criterion. IEEE Signal Process. Lett. 2014, 21, 880–884. [Google Scholar]
  21. Wu, Z.; Shi, J.; Zhang, X.; Ma, W.; Chen, B. Kernel recursive maximum correntropy. Signal Process. 2015, 117, 11–16. [Google Scholar] [CrossRef]
  22. Zhao, S.; Chen, B.; Príncipe, J.C. Kernel adaptive filtering with maximum correntropy criterion. In Proceedings of the International Joint Conference on Neural Network, San Jose, CA, USA, 31 July–5 August 2011; Volume 31, pp. 2012–2017. [Google Scholar]
  23. He, R.; Hu, B.; Zheng, W.; Kong, X. Robust principal component analysis based on maximum correntropy criterion. IEEE Trans. Image Process. 2011, 20, 1485–1494. [Google Scholar]
  24. He, R.; Zheng, W.; Hu, B. Maximum correntropy criterion for robust face recognition. IEEE Trans. Patt. Anal. Mach. Intell. 2011, 33, 1561–1576. [Google Scholar]
  25. Santamaría, I.; Pokharel, P.P.; Príncipe, J.C. Generalized correlation function: Definition, properties, and application to blind equalization. IEEE Trans. Signal Process. 2006, 54, 2187–2197. [Google Scholar] [CrossRef]
  26. Luo, X.; Deng, J.; Wang, W.; Wang, J.-H.; Zhao, W. A Quantized Kernel Learning Algorithm Using a Minimum Kernel Risk-Sensitive Loss Criterion and Bilateral Gradient Technique. Entropy 2017, 19, 365. [Google Scholar] [CrossRef]
  27. Chen, B.; Xing, L.; Wang, X.; Qin, J.; Zheng, N. Robust learning with kernel mean p-power error loss. IEEE Trans. Cybern. 2017, 48, 2101–2113. [Google Scholar] [CrossRef] [PubMed]
  28. Liu, W.; Príncipe, J.C.; Haykin, S. Kernel Adaptive Filtering: A Comprehensive Introduction; Wiley: New York, NY, USA, 2010. [Google Scholar]
  29. Weng, B.; Barner, K.E. Nonlinear system identification in impulsive environments. IEEE Trans. Signal Process. 2005, 53, 2588–2594. [Google Scholar] [CrossRef]
  30. Wang, S.; Zheng, Y.; Duan, S.; Wang, L.; Tan, H. Quantized kernel maximum correntropy and its mean square convergence analysis. Dig. Signal Process. 2017, 63, 164–176. [Google Scholar] [CrossRef]
  31. Fan, H.; Song, Q. A linear recurrent kernel online learning algorithm with sparse updates. Neural Netw. 2014, 50, 142–153. [Google Scholar] [CrossRef] [PubMed]
Figure 1. Block diagram of adaptive filtering.
Figure 2. Steady-state MSE of RMKRP with different p in MG time series prediction (a); steady-state MSE of RMKRP with different σ in MG time series prediction (b); steady-state MSE of RMKRP with different λ in MG time series prediction (c).
Figure 3. Comparison of the MSEs of KLMS, KMCC, MKRL, KRLS, KRMC, and RMKRP in MG time series prediction (a); comparison of the MSEs of QKLMS, QKMCC, QMKRL, QKRLS, KRMC-NC, and QRMKRP in MG time series prediction (b).
Figure 4. Steady-state MSE of RMKRP with different p in nonlinear system identification (a); steady-state MSE of RMKRP with different σ in nonlinear system identification (b); steady-state MSE of RMKRP with different λ in nonlinear system identification (c).
Figure 5. Comparison of the MSEs of KLMS, KMCC, MKRL, KRLS, KRMC, and RMKRP in nonlinear system identification (a); comparison of the MSEs of QKLMS, QKMCC, QMKRL, QKRLS, KRMC-NC, and QRMKRP in nonlinear system identification (b).
Table 1. Simulation results of QKLMS, QKMCC, QMKRL, QKRLS, KRMC-NC, KLMS, KMCC, MKRL, KRLS, KRMC, RMKRP, and QRMKRP in MG time series prediction.

Algorithm | Size | Time (s) | MSE (dB)
KLMS [6] | 2000 | 30.9501 | N/A
QKLMS [12] | 28 | 2.1011 | N/A
KRLS [8] | 2000 | 58.5358 | N/A
QKRLS [13] | 28 | 2.3374 | N/A
KMCC [22] | 2000 | 30.8285 | −18.5063
QKMCC [30] | 28 | 2.0995 | −17.8707
MKRL [26] | 2000 | 30.9117 | −18.7312
QMKRL [26] | 28 | 2.1063 | −18.1037
KRMC [21] | 2000 | 58.1229 | −25.1618
KRMC-NC [21] | 462 | 2.8045 | −21.5183
QRMKRP | 28 | 2.3443 | −24.9326
RMKRP | 2000 | 58.2196 | −28.1802
Table 2. Simulation results of QKLMS, QKMCC, QMKRL, QKRLS, KRMC-NC, KLMS, KMCC, MKRL, KRLS, KRMC, RMKRP, and QRMKRP in nonlinear system identification.

Algorithm | Size | Time (s) | MSE (dB)
KLMS [6] | 2000 | 21.2447 | N/A
QKLMS [12] | 14 | 1.7284 | N/A
KRLS [8] | 2000 | 48.6055 | N/A
QKRLS [13] | 14 | 1.9643 | N/A
KMCC [22] | 2000 | 21.1328 | −19.233
QKMCC [30] | 14 | 1.763 | −17.9723
MKRL [26] | 2000 | 21.0313 | −19.5390
QMKRL [26] | 14 | 1.7243 | −18.5748
KRMC [21] | 2000 | 48.7601 | −28.7583
KRMC-NC [21] | 496 | 2.6874 | −23.671
QRMKRP | 14 | 1.9681 | −27.3128
RMKRP | 2000 | 48.6101 | −34.0790
