1. Introduction
Bayesian filtering is a probabilistic approach that recursively estimates an unknown probability density function over time based on a mathematical model and observation process [1,2,3]. It involves updating the prior distribution to obtain the posterior distribution, which is the essence of Bayesian statistics. As a fundamental problem in probabilistic inference and sequential estimation, Bayesian filtering focuses on inferring the state of a dynamic system over time from noisy and incomplete observations while integrating prior knowledge about the system's behavior. This problem finds wide-ranging applications in fields such as signal processing [4], robotics [5], finance [6,7], and more.
Traditional Bayesian filtering methods often encounter limitations when dealing with high-dimensional or complex data spaces. Kernel methods have emerged as a powerful tool for generalizing linear statistical methods to nonlinear settings, which is achieved by embedding samples into a high-dimensional feature space known as a reproducing kernel Hilbert space (RKHS). Kernel mean embedding plays a crucial role in this context, representing probability distributions as expectations of features in the RKHS [8,9,10].
By embedding distributions into the RKHS, kernel mean embedding facilitates various statistical analyses and machine learning tasks. The work of Smola et al. in 2007 [9] emphasized the significance of kernel mean embedding in Bayesian updates. The use of kernel means in characteristic RKHSs has proven successful in a number of statistical tasks, including two-sample problems [11], independence tests [12], and conditional independence tests [13]. Notably, these tests are applicable to any domain where kernels can be defined, showcasing the versatility of the kernel approach.
In the context of Bayesian filtering, kernel mean embedding enables the representation of probability distributions in a space where operations such as inner products and norms can be easily computed. This not only enhances the scalability of Bayesian filtering algorithms but also allows for the incorporation of complex, structured data into the estimation process. Moreover, kernel mean embedding provides a framework for integrating diverse sources of information, including prior knowledge and domain expertise, into the filtering process.
The kernel Bayesian filtering procedure introduced in [14] is model-free and produces sample-derived estimates of the posterior embedding. This Bayesian filtering method constructs a posterior distribution over the current state based on the sequence of noisy observations up to the present time, without making explicit modeling assumptions about the underlying dynamics. Unfortunately, it may cause instability during the calculation. Fukumizu et al. [14] proposed a formulation that requires an unconventional form of regularization in the second stage, which adversely impacts the attainable convergence rates. Boots et al. [15] introduced a KBF in which only a simple form of Tikhonov regularization is applied. Unfortunately, this method requires the inversion of a matrix that is often indefinite, necessitating high regularization constants, which can degrade performance. Xu et al. [16] introduced an approach called importance-weighted KBF to avoid the instability problem. The importance weight is based on the density ratio. However, the calculation of the density ratio may exhibit high variance.
In this paper, we explore an innovative extension of the importance-weighted kernel Bayesian filtering method proposed in [16], known as Iteratively Re-weighted importance Kernel Bayesian Filtering (IRe-KBF). The concept of iterated re-weighting, which has been successfully applied in other contexts (e.g., [17]), is integrated into the filtering process, as explained in Section 3. This integration brings about a significant enhancement by making the filter more robust and efficient in high-dimensional systems.
The IRe-KBF method learns kernel mean representations directly from training data, eliminating the need for explicit specification of prior and likelihood distributions. By harnessing importance weights and iterative re-weighting, our filter exhibits enhanced robustness and efficiency in the context of high-dimensional datasets.
The remainder of this paper is organized as follows. In Section 2, we review the basic concepts of kernel mean embeddings and the kernel Bayes rule. In Section 3, we first introduce the importance-weighted kernel Bayes rule proposed in [16]; then, the iteratively re-weighted importance kernel Bayes rule (IRe-KBR) is proposed. Section 4 introduces the filtering problem and applies the IRe-KBR to it. Experiments are reported in Section 5.
2. Background and Preliminaries
In this section, we review some basic concepts.
Kernel mean embeddings provide a way to represent distributions in reproducing kernel Hilbert spaces (RKHSs) using positive definite kernels. Given random variables $(X, Y)$ on $\mathcal{X} \times \mathcal{Y}$ with a joint distribution $P$ and density function $p(x, y)$, let $k_{\mathcal{X}}$ and $k_{\mathcal{Y}}$ be measurable positive definite kernels corresponding to scalar-valued RKHSs $\mathcal{H}_{\mathcal{X}}$ and $\mathcal{H}_{\mathcal{Y}}$, respectively. The feature maps are denoted as $\phi(x) = k_{\mathcal{X}}(\cdot, x)$ and $\psi(y) = k_{\mathcal{Y}}(\cdot, y)$, the RKHS norm is denoted as $\|\cdot\|_{\mathcal{H}}$, and the corresponding Gram matrices are $G_X = \bigl(k_{\mathcal{X}}(x_i, x_j)\bigr)_{ij}$ and $G_Y = \bigl(k_{\mathcal{Y}}(y_i, y_j)\bigr)_{ij}$.
The kernel mean embedding of the marginal distribution $P_X$ is denoted by $\mu_X$ and defined as follows:
$$\mu_X := \mathbb{E}\bigl[\phi(X)\bigr] = \int_{\mathcal{X}} k_{\mathcal{X}}(\cdot, x)\,dP_X(x) \in \mathcal{H}_{\mathcal{X}}.$$
It always exists for kernels that are bounded. From the reproducing property, we derive the following for all $f \in \mathcal{H}_{\mathcal{X}}$:
$$\langle f, \mu_X\rangle_{\mathcal{H}_{\mathcal{X}}} = \mathbb{E}\bigl[f(X)\bigr],$$
which is advantageous for estimating expectations of functions. Moreover, if a kernel $k_{\mathcal{X}}$, such as the Gaussian kernel, is characteristic, the embedding uniquely determines the probability distribution, which means that $\mu_P = \mu_{P'}$ implies $P = P'$. Additionally, we introduce the following (uncentered) kernel covariance operators:
$$C_{XX} := \mathbb{E}\bigl[\phi(X)\otimes\phi(X)\bigr], \qquad C_{YX} := \mathbb{E}\bigl[\psi(Y)\otimes\phi(X)\bigr] = C_{XY}^{*}.$$
In this context, ⊗ denotes the tensor product, and ∗ denotes the adjoint of the operator. Covariance operators extend the concept of finite-dimensional covariance matrices to the realm of infinite kernel feature spaces, and they are always defined for bounded kernels.
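To make the empirical side of these definitions concrete, the following minimal Python sketch (our illustration, not code from the paper) forms the empirical mean embedding $\hat{\mu}_X = \frac{1}{n}\sum_i \phi(x_i)$ with a Gaussian kernel and uses the reproducing property to estimate $\mathbb{E}[f(X)]$ for a function $f$ expanded over the sample; the data and bandwidth are hypothetical.

```python
# A minimal sketch (not code from the paper) of the empirical kernel mean
# embedding mu_hat = (1/n) * sum_i k(., x_i) with a Gaussian kernel, and of
# the reproducing property <f, mu_hat> = (1/n) * sum_i f(x_i).
import numpy as np

def gaussian_gram(A, B, bw):
    """Gram matrix K[i, j] = exp(-||a_i - b_j||^2 / (2 * bw^2))."""
    sq = np.sum(A**2, 1)[:, None] + np.sum(B**2, 1)[None, :] - 2 * A @ B.T
    return np.exp(-sq / (2 * bw**2))

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 1))        # i.i.d. sample from P_X
f_w = rng.normal(size=200)           # f = sum_i f_w[i] * k(., x_i) in the RKHS

K = gaussian_gram(X, X, bw=1.0)
# <f, mu_hat> = (1/n) * sum_{i,j} f_w[i] * k(x_i, x_j) estimates E[f(X)]
print(f_w @ K.mean(axis=1))
```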
The kernel conditional mean embedding [10] extends the kernel mean embedding to conditional probability distributions. It is defined as follows:
$$\mu_{X|y} := \mathbb{E}\bigl[\phi(X)\mid Y = y\bigr].$$
This embedding captures the conditional expectation of the feature-space representation of X given $Y = y$. Under the assumption of regularity, which requires that $\mathbb{E}[f(X)\mid Y = \cdot\,] \in \mathcal{H}_{\mathcal{Y}}$ for all $f \in \mathcal{H}_{\mathcal{X}}$, an operator ($C_{X|Y} : \mathcal{H}_{\mathcal{Y}} \to \mathcal{H}_{\mathcal{X}}$) can be identified such that
$$\mu_{X|y} = C_{X|Y}\,\psi(y).$$
The conditional operator ($C_{X|Y}$) can be obtained by minimizing the loss function ($L$), which is
$$L(C) = \mathbb{E}_{XY}\bigl\|\phi(X) - C\,\psi(Y)\bigr\|_{\mathcal{H}_{\mathcal{X}}}^{2}.$$
The minimization of this loss function leads to an analytical solution in the form of the following closed-form expression:
$$C_{X|Y} = C_{XY}\,C_{YY}^{-1}.$$
The solution provides a means to estimate the conditional operator ($C_{X|Y}$), which is crucial for various machine learning tasks that involve conditional inference or prediction. Obtaining an empirical estimate of the conditional mean embedding is a straightforward procedure. Given i.i.d. samples $\{(x_i, y_i)\}_{i=1}^{n}$, we seek to minimize the sample estimate of the loss function, which is defined as
$$\hat{L}(C) = \frac{1}{n}\sum_{i=1}^{n}\bigl\|\phi(x_i) - C\,\psi(y_i)\bigr\|_{\mathcal{H}_{\mathcal{X}}}^{2} + \lambda\,\|C\|_{\mathrm{HS}}^{2}.$$
This is a sample estimate of $L(C)$ that incorporates a Tikhonov regularization parameter ($\lambda$) to mitigate overfitting. The solution to the minimization problem can be expressed as follows:
$$\hat{C}_{X|Y} = \hat{C}_{XY}\bigl(\hat{C}_{YY} + \lambda I\bigr)^{-1},$$
where $\hat{C}_{XY}$ and $\hat{C}_{YY}$ are the empirical estimates of the covariance operators, i.e.,
$$\hat{C}_{XY} = \frac{1}{n}\sum_{i=1}^{n}\phi(x_i)\otimes\psi(y_i), \qquad \hat{C}_{YY} = \frac{1}{n}\sum_{i=1}^{n}\psi(y_i)\otimes\psi(y_i).$$
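As an illustration of this estimator, the sketch below uses the standard Gram-matrix form (an assumption of this sketch, not a quote from the paper): the embedding of $P(X\mid Y=y)$ is written as $\sum_i \beta_i\,\phi(x_i)$ with $\beta = (G_Y + n\lambda I)^{-1}\mathbf{k}_Y(y)$. The kernels, toy data, and conditioning point are illustrative.

```python
# A minimal sketch of the empirical conditional mean embedding in its
# standard Gram-matrix form: the embedding of P(X | Y = y) is
# sum_i beta_i * phi(x_i) with beta = (G_Y + n * lam * I)^{-1} k_Y(y).
import numpy as np

def gaussian_gram(A, B, bw):
    sq = np.sum(A**2, 1)[:, None] + np.sum(B**2, 1)[None, :] - 2 * A @ B.T
    return np.exp(-sq / (2 * bw**2))

rng = np.random.default_rng(0)
n = 300
X = rng.normal(size=(n, 1))
Y = np.sin(X) + 0.1 * rng.normal(size=(n, 1))    # toy joint sample (x_i, y_i)

G_Y = gaussian_gram(Y, Y, bw=0.3)
lam = 1e-3
y_star = np.array([[0.5]])                        # conditioning point y
k_y = gaussian_gram(Y, y_star, bw=0.3)[:, 0]      # vector k_Y(y)
beta = np.linalg.solve(G_Y + n * lam * np.eye(n), k_y)

# With a rich enough feature map, E[X | Y = y] is estimated by sum_i beta_i x_i
print(beta @ X[:, 0])
```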
With respect to the kernel Bayes rule (KBR), in the context of Bayes' theorem, our objective is to update the prior distribution ($\Pi$) with the density function $\pi(x)$ to the posterior distribution ($Q$) of the state ($X$) given an observation ($y$) with the corresponding density function
$$q(x\mid y) = \frac{p(y\mid x)\,\pi(x)}{\int p(y\mid x)\,\pi(x)\,dx},$$
where $p(y\mid x)$ is the density function of the likelihood function and $\int p(y\mid x)\,\pi(x)\,dx$ is the marginal probability density function.
The KBR aims to update the prior mean embedding ($\mu_{\Pi}$) to the posterior mean embedding ($\mu_{Q}$) according to the Bayes rule. Unlike traditional methods that require the explicit form of the likelihood function ($p(y\mid x)$), the KBR learns the relations between the latent variable (X) and the observable variable (Y) directly from the data. This is achieved by analyzing a dataset $\{(x_i, y_i)\}_{i=1}^{n}$ drawn from a distribution $P$ whose density shares the same likelihood $p(y\mid x)$ as the Bayesian model. In general, the marginal density of the data is not equal to the prior density $\pi(x)$. One of the advantages of the KBR approach is its ability to perform Bayesian inference even in the absence of a specific parametric model or an analytic form for the prior and likelihood densities. By making sufficient observations of the system, the KBR allows for the estimation of probabilities and the construction of probabilistic models, thereby enabling a more flexible approach to statistical analysis.
Similarly, the conditional operator $C^{Q}_{X|Y}$ satisfying $\mu_{Q|y} = C^{Q}_{X|Y}\,\psi(y)$ is a minimizer of the corresponding loss function. However, since the posterior distribution (Q) is unknown, we cannot sample a dataset from the posterior distribution (Q) directly. We can still utilize the analytical form of $C^{Q}_{X|Y}$ according to [14], which is given by
$$C^{Q}_{X|Y} = C^{\Pi}_{XY}\bigl(C^{\Pi}_{YY}\bigr)^{-1}, \quad (2)$$
where $C^{\Pi}_{XY}$ and $C^{\Pi}_{YY}$ are covariance operators taken with respect to the joint density $p(y\mid x)\,\pi(x)$.
In the context of vector-valued kernel regression, the covariance operators are replaced by empirical estimators as follows:
$$\hat{C}^{\Pi}_{YY} = \sum_{i=1}^{n}\hat{\alpha}_i\,\psi(y_i)\otimes\psi(y_i), \qquad \hat{C}^{\Pi}_{XY} = \sum_{i=1}^{n}\hat{\alpha}_i\,\phi(x_i)\otimes\psi(y_i),$$
where each coefficient ($\hat{\alpha}_i$) is given by
$$\hat{\alpha} = (G_X + n\epsilon I)^{-1} G_{X\tilde{X}}\,\gamma. \quad (3)$$
In Equation (3), $\epsilon$ is another Tikhonov regularization parameter, and the prior mean is given as $\hat{\mu}_{\Pi} = \sum_{j=1}^{\ell}\gamma_j\,\phi(\tilde{x}_j)$, where $\gamma$ represents the weights and $(G_{X\tilde{X}})_{ij} = k_{\mathcal{X}}(x_i, \tilde{x}_j)$.
However, since the coefficients ($\hat{\alpha}_i$) may not necessarily be positive, $\hat{C}^{\Pi}_{YY}$ may fail to be positive semi-definite. Consequently, calculating Equation (2) causes instabilities when inverting the operator ($\hat{C}^{\Pi}_{YY}$). To address this issue, Fukumizu et al. [14] proposed an alternative formulation of the posterior embedding, which is expressed as follows:
$$\hat{\mu}_{Q|y} = \hat{C}^{\Pi}_{XY}\Bigl(\bigl(\hat{C}^{\Pi}_{YY}\bigr)^{2} + \delta I\Bigr)^{-1}\hat{C}^{\Pi}_{YY}\,\psi(y). \quad (4)$$
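A hedged sketch of this square-regularized update in Gram-matrix form is given below; for simplicity, the prior embedding is assumed to be supported on the training points themselves, and the constants follow [14] only up to scaling.

```python
# A hedged sketch of the square-regularized KBR update of Fukumizu et al. [14]
# in Gram-matrix form. For simplicity, the prior embedding is assumed to be
# supported on the training points, mu_pi = sum_j gamma_j * phi(x_j).
import numpy as np

def gaussian_gram(A, B, bw):
    sq = np.sum(A**2, 1)[:, None] + np.sum(B**2, 1)[None, :] - 2 * A @ B.T
    return np.exp(-sq / (2 * bw**2))

def kbr_posterior_weights(G_X, G_Y, k_y, gamma, eps, delta):
    n = G_X.shape[0]
    # Coefficients of the prior-weighted covariance estimators (cf. Eq. (3))
    alpha = np.linalg.solve(G_X + n * eps * np.eye(n), G_X @ gamma)
    L = np.diag(alpha) @ G_Y
    # Squaring L avoids inverting an indefinite matrix (cf. Eq. (4))
    return L @ np.linalg.solve(L @ L + delta * n * np.eye(n), alpha * k_y)

rng = np.random.default_rng(0)
n = 200
X = rng.normal(size=(n, 1))
Y = X + 0.2 * rng.normal(size=(n, 1))
G_X, G_Y = gaussian_gram(X, X, 1.0), gaussian_gram(Y, Y, 1.0)
gamma = np.full(n, 1.0 / n)                          # uniform prior weights
k_y = gaussian_gram(Y, np.array([[0.8]]), 1.0)[:, 0]
w = kbr_posterior_weights(G_X, G_Y, k_y, gamma, eps=1e-2, delta=1e-2)
print(w @ X[:, 0])                                   # posterior mean of X given y
```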
In the next section, we introduce the iteratively re-weighted importance kernel Bayes rule for application to the filtering problem, a novel design for a KBR that does not require the problematic second-stage regularization. The essential idea is to use multiple weight functions.
4. The Iteratively Re-Weighted Importance Kernel Bayes Filter
In this section, we describe the iteratively re-weighted importance kernel Bayes filter. Kernel Bayesian inference is a well-founded approach to non-parametric inference in probabilistic graphical models, where probabilistic relationships between variables are learned from data in a non-parametric manner.
In the filtering problem, the states evolve according to a Markov process determined by the state transition model $p(x_{t+1}\mid x_t)$, describing the conditional probability of the next state ($x_{t+1}$) given the current state ($x_t$). The observation at time t is generated depending only on the corresponding state ($x_t$), following the observation model $p(y_t\mid x_t)$. When applying the kernel Bayes rule, we do not need to assume the conditional probabilities $p(x_{t+1}\mid x_t)$ and $p(y_t\mid x_t)$ to be known explicitly, nor do we estimate them with simple parametric models. Rather, we assume that a sample $\{(x_i, y_i)\}_{i=1}^{n}$ is given for both the observable and hidden variables in the training phase.
The aim of the filtering method is to probabilistically estimate the state ($x_t$) at each time ($t$) using the new observation sequence $(\tilde{y}_1, \dots, \tilde{y}_t)$, i.e., to estimate $p(x_t\mid \tilde{y}_1, \dots, \tilde{y}_t)$. The sequential estimate for the kernel mean of this posterior can be derived by employing the iteratively re-weighted kernel Bayes rule. This can be obtained by iterating the following two steps.
Prediction step
Assume that we have a posterior embedding $\hat{\mu}_{x_t\mid \tilde{y}_{1:t}}$ at time t. Then, we can compute the embedding of the forward prediction ($\hat{\mu}_{x_{t+1}\mid \tilde{y}_{1:t}}$) as follows:
$$\hat{\mu}_{x_{t+1}\mid \tilde{y}_{1:t}} = C_{X_{t+1}\mid X_t}\,\hat{\mu}_{x_t\mid \tilde{y}_{1:t}},$$
where $C_{X_{t+1}\mid X_t}$ is the conditional operator for the transition model $p(x_{t+1}\mid x_t)$. Empirically, this is estimated based on the training transition pairs $\{(x_i, x_{i+1})\}$ as follows:
$$\hat{\mu}_{x_{t+1}\mid \tilde{y}_{1:t}} = \hat{C}_{X_{t+1}X_t}\bigl(\hat{C}_{X_t X_t} + \lambda' I\bigr)^{-1}\hat{\mu}_{x_t\mid \tilde{y}_{1:t}},$$
where $\lambda'$ is another regularizing coefficient and the empirical covariance operators are computed over the transition pairs.
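In Gram-matrix form, this prediction step reduces to a single linear solve. The sketch below (our illustration) assumes the current embedding is written as $\sum_i \alpha_i\,\phi(x_i)$ over training transition pairs $(x_i, x_{i+1})$, so the predicted weights live on the successor points.

```python
# A minimal sketch of the prediction step, using the standard identity
# C_{X+ X}(C_{XX} + lam I)^{-1} Phi_X = Phi_{X+}(G_X + n lam I)^{-1} G_X:
# if the current embedding is Phi_X @ alpha, the predicted embedding is
# Phi_{X+} @ predict_weights(G_X, alpha, lam).
import numpy as np

def predict_weights(G_X, alpha, lam):
    """Predicted weights on phi(x_{i+1}) from current weights alpha on phi(x_i)."""
    n = G_X.shape[0]
    return np.linalg.solve(G_X + n * lam * np.eye(n), G_X @ alpha)
```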
Update step
When a new observation ($\tilde{y}_{t+1}$) is obtained, the mean embedding of the posterior distribution is computed by applying the iteratively re-weighted importance kernel Bayes rule of Section 3, with the predicted embedding $\hat{\mu}_{x_{t+1}\mid \tilde{y}_{1:t}}$ serving as the prior mean embedding.
During the calculation, it is preferable to work with Gram matrices instead of covariance matrices. This approach simplifies the calculation and can improve computational efficiency. In general, we need to express the mean embedding in the form of a weighted sum of feature maps at the latent training points as follows:
$$\hat{\mu}_{x_t\mid \tilde{y}_{1:t}} = \sum_{i=1}^{n}\alpha_i^{(t)}\,\phi(x_i),$$
where $\alpha^{(t)}$ is the coefficient vector at time $t$. Notably, there is a standard identity that can be employed to rewrite expressions involving covariance operators in terms of Gram matrices. Hence, the coefficient update can be written entirely in terms of the Gram matrices $G_X = \bigl(k_{\mathcal{X}}(x_i, x_j)\bigr)_{ij}$ and $G_Y = \bigl(k_{\mathcal{Y}}(y_i, y_j)\bigr)_{ij}$.
. The algorithm can be summarized as follows (see Algorithm 1).
In this algorithm, we use a trick to calculate the weight: in the iteration step, the exact conditioning embedding is replaced by its sample-based counterpart, so that the Gram matrix can be calculated instead of the covariance matrix.
Kernel methods have some limitations because they rely on predefined features from the RKHS, which may not work well with complex or high-dimensional data. To address this, adaptive neural network features [16] refer to the features generated by neural networks that can automatically adjust and learn from the data during the training process. Unlike fixed features, these adaptive features evolve to better capture the underlying patterns in the data, especially in complex or high-dimensional scenarios. This adaptability makes them particularly useful in situations where traditional methods, like kernel methods with predefined feature maps, may struggle to represent the data effectively. Here, we briefly introduce this concept. We rewrite the feature map ($\phi$) as $\phi_{\theta}$, an adaptive feature map represented by a neural network parameterized by $\theta$. The optimal $\theta$ is obtained by minimizing a regularized regression objective.
The Gram matrix in Equation (16) is then defined through the learned features, i.e., $(G_X)_{ij} = \phi_{\theta}(x_i)^{\top}\phi_{\theta}(x_j)$.
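As a purely illustrative sketch (the architecture and dimensions below are our assumptions, not the authors' design), an adaptive feature map can be realized as a small neural network whose outputs define a linear-kernel Gram matrix; $\theta$ would be fit by gradient descent on the regression objective above.

```python
# An illustrative adaptive feature map phi_theta realized by a small neural
# network; the architecture and dimensions are hypothetical.
import torch
import torch.nn as nn

class AdaptiveFeatureMap(nn.Module):
    def __init__(self, in_dim, feat_dim=64):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(in_dim, 128), nn.ReLU(),
            nn.Linear(128, feat_dim),
        )

    def forward(self, x):
        return self.net(x)

phi = AdaptiveFeatureMap(in_dim=4)
X = torch.randn(32, 4)
F = phi(X)        # learned features phi_theta(x)
G = F @ F.T       # linear-kernel Gram matrix on the learned features
```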
In the following experiment, we use the linear kernel on the learned adaptive feature ($\phi_{\theta}$) and a finite-dimensional random Fourier feature approximation of the Gaussian kernel to calculate the coefficient [23]. We use a linear kernel on the latent features to estimate the posterior mean in the latent space. This approach does not necessitate the same feature map used to compute the weight function, as demonstrated in [14].
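For reference, a minimal random Fourier feature construction [23] for the Gaussian kernel looks as follows; the feature dimension D and bandwidth are illustrative.

```python
# Random Fourier features for the Gaussian kernel [23]:
# z(x) = sqrt(2/D) * cos(x @ W + b) with W ~ N(0, I / bw^2), b ~ U[0, 2*pi],
# so that z(x)^T z(x') approximates k(x, x').
import numpy as np

rng = np.random.default_rng(0)
d, D, bw = 3, 500, 1.0
W = rng.normal(scale=1.0 / bw, size=(d, D))   # shared random frequencies
b = rng.uniform(0, 2 * np.pi, size=D)         # shared random phases

def rff(X):
    return np.sqrt(2.0 / D) * np.cos(X @ W + b)

X = rng.normal(size=(5, d))
print(rff(X) @ rff(X).T)   # approximates exp(-||x - x'||^2 / (2 * bw^2))
```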
Algorithm 1: Iteratively Re-weighted Importance Kernel Bayes Filter Algorithm
Input: Training dataset, regularization parameters, and test sequence.
Initialize the coefficient vector and compute the Gram matrices.
For t = 1 to T, do:
1. Compute the prior embedding at the conditioning point using the training dataset.
2. Compute the density ratio.
3. For k = 1 to K: compute the weight; update the embedding; if the change falls below the tolerance, stop.
4. Compute the posterior coefficient using the diagonal weight matrices.
5. Compute the predicted coefficient for the next time step.
End For
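For orientation, the following high-level Python skeleton mirrors the structure of Algorithm 1. It is a sketch under simplifying assumptions: the `reweight` callback stands in for the iterative re-weighting of Section 3, and the update shown is a generic symmetric Tikhonov-regularized KBR step, not the paper's exact Equation (12).

```python
# High-level skeleton of the filtering loop (a sketch, not the authors'
# implementation). `reweight` is a hypothetical callback for the iterative
# re-weighting step of Section 3.
import numpy as np

def ire_kbf_loop(G_X, G_Y, k_Y_of, y_test, lam, eps, delta, reweight):
    n = G_X.shape[0]
    alpha = np.full(n, 1.0 / n)                   # initial prior weights
    posteriors = []
    for y in y_test:
        gamma = reweight(alpha)                   # iterative re-weighting step
        mu = np.linalg.solve(G_X + n * eps * np.eye(n), G_X @ gamma)
        # Symmetric regularized update with weights clipped to be nonnegative
        D_half = np.diag(np.sqrt(np.maximum(mu, 0.0)))
        A = D_half @ G_Y @ D_half + n * delta * np.eye(n)
        w = D_half @ np.linalg.solve(A, D_half @ k_Y_of(y))
        posteriors.append(w)                      # posterior weights at time t
        # Prediction: weights move to the successor points x_{i+1}
        alpha = np.linalg.solve(G_X + n * lam * np.eye(n), G_X @ w)
    return posteriors
```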
5. Numerical Illustration
In this section, we explore the effectiveness of our method by applying it to two nonlinear dynamical models, with the code written in Python.
A Synthetic Problem:
Inspired by Fukumizu et al. [14] and Xu et al. [16], a simple synthetic nonlinear dynamical system is introduced to illustrate the kernel Bayesian filtering methods. The latent state is two-dimensional, $(u_t, v_t)$, and evolves according to a nonlinear transition map with given parameters, driven by an independent noise process; the observation is generated from the latent state with additive independent noise. In the following experiments, the parameters of the dynamics and the two noise levels are fixed. The kernel and regularization parameters of the density-ratio estimator are tuned by the kernel-based unconstrained least squares importance fitting (KuLSIF) leave-one-out cross-validation procedure. The bandwidth of the Gaussian kernel employed for all KBF methods is given by the median of pairwise distances among the training set.
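A common implementation of this median heuristic (our sketch; the paper does not provide code) is:

```python
# Median heuristic: set the Gaussian bandwidth to the median of all
# pairwise Euclidean distances in the training set.
import numpy as np
from scipy.spatial.distance import pdist

def median_bandwidth(X):
    return np.median(pdist(X))

rng = np.random.default_rng(0)
print(median_bandwidth(rng.normal(size=(100, 2))))
```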
In the following figures, "original" denotes the performance of the original KBF estimator using the operator in Equation (4). "IW" and "re-weight" denote the importance-weighted KBF using the operator in Equation (7) and the iteratively re-weighted importance kernel Bayes filter using the operator in Equation (12), respectively.
Figure 1 illustrates the posterior approximation of the latent variables u and v. The red line represents the given latent test sequence. The blue, yellow, and green lines are the three KBF methods used to approximate the latent variables. This experiment was conducted with a training sequence length of 100 and a fixed regularization parameter. The ℓ-induced weight function was employed in the IRe-KBF method in this experiment.
As observed in Figure 1, the KBF method utilizing iterative re-weighting yields the closest approximation to the actual latent test values, indicating that this method outperforms the other two techniques in this context. However, the advantage is not always visually apparent. Hence, we calculate the MSE, the mean squared error between the posterior estimation of the latent variables and the latent test values, to evaluate the performance of the methods, as shown in Figure 2.
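The evaluation metric is the standard empirical mean squared error; for completeness:

```python
# MSE between the posterior estimates and the latent test sequence.
import numpy as np

def mse(estimate, truth):
    return np.mean((np.asarray(estimate) - np.asarray(truth)) ** 2)
```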
First, we compared the kernel Bayes filter with the three different weight functions described in Section 3 to find the optimal weight function. In this experiment, fixed parameters were used in the ℓs-induced weight function and in the Tukey biweight function. The essential regularization is obtained by the hyperparameter tuning method.
Figure 3 summarizes the mean squared error (MSE) over 10 runs when the conditioning points are sampled. The length of the test sequence is set to 50. Figure 3 shows that the IRe-KBF with the ℓ-induced weight function leads to the smallest error between the posterior estimation and the test sequence. Hence, in the following, we apply the ℓ-induced weight function within the IRe-KBF method.
The MSE with varying training lengths is summarized in Figure 2. As depicted in the figure, the MSE decreases as the training length increases, which is expected. Furthermore, the "re-weight" method demonstrates comparable or superior performance compared to the other two KBF methods.
We also conducted an experiment employing the ensemble Kalman filter (EnKF), which yielded a mean squared error (MSE) of 0.049. This value is notably lower than the MSE obtained by the three kernel-based filter methods when the training length is less than 200. This result is reasonable, since the ensemble Kalman filter is applied under an explicit model in a relatively low-dimensional system.
Lorenz 96 model:
The Lorenz 96 model is a simplified mathematical model used to study atmospheric and climate dynamics. It was proposed by Edward Lorenz in 1996 to capture key nonlinear characteristics of such systems in a more manageable form.
The standard form of the Lorenz 96 model is given by the following set of differential equations:
$$\frac{dx_i}{dt} = (x_{i+1} - x_{i-2})\,x_{i-1} - x_i + F, \qquad i = 1, \dots, N,$$
where $x_i$ represents the ith variable of the system; F is an external forcing term that represents the energy input into the system; and the indices are treated cyclically, meaning $x_{-1} = x_{N-1}$, $x_0 = x_N$, and $x_{N+1} = x_1$.
If the Lorenz 96 model has four variables (i.e., $N = 4$), the system of equations becomes
$$\begin{aligned} \frac{dx_1}{dt} &= (x_2 - x_3)\,x_4 - x_1 + F, & \frac{dx_2}{dt} &= (x_3 - x_4)\,x_1 - x_2 + F,\\ \frac{dx_3}{dt} &= (x_4 - x_1)\,x_2 - x_3 + F, & \frac{dx_4}{dt} &= (x_1 - x_2)\,x_3 - x_4 + F. \end{aligned}$$
This system is a four-dimensional nonlinear system, since it involves four state variables ($x_1, x_2, x_3, x_4$), and each variable's evolution is influenced by nonlinear interactions with the other variables. The Lorenz 96 model can be generalized to any number of variables (N), but in this case, we are considering the specific case where $N = 4$.
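A simple Euler-discretized simulation of this four-variable system can be sketched as follows; the forcing value and time step here are illustrative, not the experiment's exact settings.

```python
# Euler-discretized simulation of the four-variable Lorenz 96 system.
# F and dt are illustrative placeholders, not the experiment's settings.
import numpy as np

def lorenz96_step(x, F=8.0, dt=0.01):
    # dx_i/dt = (x_{i+1} - x_{i-2}) * x_{i-1} - x_i + F, indices cyclic
    dx = (np.roll(x, -1) - np.roll(x, 2)) * np.roll(x, 1) - x + F
    return x + dt * dx

x = np.array([8.0, 8.01, 8.0, 8.0])   # small perturbation of the fixed point
traj = [x]
for _ in range(1000):
    x = lorenz96_step(x)
    traj.append(x)
traj = np.array(traj)                  # latent trajectory x_t
```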
The observation $y_t$ is expressed as follows:
$$y_t = H x_t + \zeta_t,$$
where $\zeta_t$ is independent noise and $x_t$ is the state at time t. In this experiment, the observation matrix (H) is fixed in advance.
In the filtering problem, we have the given parameters F and N, and the dynamical system is discretized with a fixed time step. The latent variable noise is set to zero, while the observation noise level is fixed. The test size for this experiment was set to 70. Additionally, in the filtering problem, the kernel and regularization parameters of the density-ratio estimator are tuned by a kernel-based unconstrained least squares importance fitting (KuLSIF) leave-one-out cross-validation procedure. The essential regularization is obtained by the hyperparameter tuning method, and the ℓ-induced weight function was employed in this experiment. The results presented in Figure 4 depict, over the whole filtering process, the trajectories of $x_1$, $x_2$, and $x_3$, the first three variables in the latent variable vector (X). It is clear that the KBF with "re-weighting" yields a better approximation of the actual test latent values, which are given by the dynamics, than the other two methods.
We also computed the MSE between the posterior estimation and the test latent sequence. The results are depicted in Figure 5. As illustrated in Figure 5, the "re-weight" method consistently performs better than the other two methods. The original kernel Bayesian filter exhibits significant instability in performance. However, it is worth noting that the obtained results are sensitive to the parameter setting. In Figure 6, we observe that the performance of IRe-KBF declines significantly when this parameter is small.
Furthermore, we conducted an experiment using the extended Kalman filter (EKF), resulting in an MSE of 375.21. This value is notably larger when compared to the performance of the three kernel-based filtering methods. The limitations of the EKF are apparent, particularly in handling complex nonlinear systems. Conversely, the iteratively re-weighted importance kernel Bayes filter (IRe-KBF) demonstrates its ability to capture intricate nonlinear dependencies between observations and latent states.