1. Introduction
Bayesian filtering is a probabilistic approach that recursively estimates an unknown probability density function over time based on a mathematical model and observation process [1,2,3]. It involves updating the prior distribution to obtain the posterior distribution, which is the essence of Bayesian statistics. As a fundamental problem in probabilistic inference and sequential estimation, Bayesian filtering focuses on inferring the state of a dynamic system over time from noisy and incomplete observations while integrating prior knowledge about the system's behavior. This problem finds wide-ranging applications in fields such as signal processing [4], robotics [5], finance [6,7], and more.
Traditional Bayesian filtering methods often encounter limitations when dealing with high-dimensional or complex data spaces. Kernel methods have emerged as a powerful tool for generalizing linear statistical methods to nonlinear settings, which is achieved by embedding samples into a high-dimensional feature space known as a reproducing kernel Hilbert space (RKHS). Kernel mean embedding plays a crucial role in this context, representing probability distributions as expectations of features in the RKHS [8,9,10].
By embedding distributions into the RKHS, kernel mean embedding facilitates various statistical analyses and machine learning tasks. The work of Smola et al. in 2007 [9] emphasized the significance of kernel mean embedding in Bayesian updates. The use of kernel means in characteristic RKHSs has proven successful in a number of statistical tasks, including two-sample problems [11], independence tests [12], and conditional independence tests [13]. Notably, these tests are applicable to any domain where kernels can be defined, showcasing the versatility of the kernel approach.
In the context of Bayesian filtering, kernel mean embedding enables the representation of probability distributions in a space where operations such as inner products and norms can be easily computed. This not only enhances the scalability of Bayesian filtering algorithms but also allows for the incorporation of complex, structured data into the estimation process. Moreover, kernel mean embedding provides a framework for integrating diverse sources of information, including prior knowledge and domain expertise, into the filtering process.
The kernel Bayesian filtering procedure introduced in [14] is model-free and produces sample-derived estimates of the posterior embedding. This Bayesian filtering method constructs a posterior distribution over the current state based on the sequence of noisy observations up to the present time, without making explicit modeling assumptions about the underlying dynamics. Unfortunately, it may cause instability during the calculation. Fukumizu et al. [14] proposed a formulation that requires an unconventional form of regularization in the second stage, which adversely impacts the attainable convergence rates. Boots et al. [15] introduced a KBF in which only a simple form of Tikhonov regularization is applied. Unfortunately, this method requires the inversion of a matrix that is often indefinite, necessitating high regularization constants, which can degrade performance. Xu et al. [16] introduced an approach called importance-weighted KBF to avoid the instability problem. The importance weight is based on the density ratio. However, the calculation of the density ratio may exhibit high variance.
In this paper, we explore an innovative extension of the importance-weighted kernel Bayesian filtering method proposed in [16], known as Iteratively Re-weighted importance Kernel Bayesian Filtering (IRe-KBF). The concept of iterated re-weighting, which has been successfully applied in other contexts (e.g., [17]), is integrated into the filtering process, as explained in Section 3. This integration brings about a significant enhancement by making the filter more robust and efficient in high-dimensional systems.
The IRe-KBF method learns kernel mean representations directly from training data, eliminating the need for explicit specification of prior and likelihood distributions. By harnessing importance weights and iterative re-weighting, our filter exhibits enhanced robustness and efficiency in the context of high-dimensional datasets.
The remainder of this paper is organized as follows. In Section 2, we review the basic concepts of kernel mean embeddings and the kernel Bayes rule. In Section 3, we first introduce the importance-weighted kernel Bayes rule proposed in [16]; then, the iteratively re-weighted importance kernel Bayes rule (IRe-KBR) is proposed. Section 4 introduces the filtering problem and applies the IRe-KBR to it. Experiments are reported in Section 5.
2. Background and Preliminaries
In this section, we review some basic concepts.
Kernel mean embeddings provide a way to represent distributions in reproducing kernel Hilbert spaces (RKHSs) using positive definite kernels. Given random variables $(X, Y)$ on $\mathcal{X} \times \mathcal{Y}$ with a joint distribution $P$ and density function $p(x, y)$, let $k_{\mathcal{X}}$ and $k_{\mathcal{Y}}$ be measurable positive definite kernels corresponding to scalar-valued RKHSs $\mathcal{H}_{\mathcal{X}}$ and $\mathcal{H}_{\mathcal{Y}}$, respectively. The feature maps are denoted as $\phi(x) = k_{\mathcal{X}}(\cdot, x)$ and $\psi(y) = k_{\mathcal{Y}}(\cdot, y)$, the RKHS norm is denoted as $\|\cdot\|_{\mathcal{H}}$, and the corresponding Gram matrices are $G_X = \bigl(k_{\mathcal{X}}(x_i, x_j)\bigr)_{ij}$ and $G_Y = \bigl(k_{\mathcal{Y}}(y_i, y_j)\bigr)_{ij}$.
The kernel mean embedding of the marginal distribution $P_X$ is denoted by $\mu_X$ and defined as follows:
$$\mu_X := \mathbb{E}\bigl[\phi(X)\bigr] = \int_{\mathcal{X}} k_{\mathcal{X}}(\cdot, x)\,dP_X(x) \in \mathcal{H}_{\mathcal{X}}.$$
It always exists for kernels that are bounded. From the reproducing property, we derive the following for all $f \in \mathcal{H}_{\mathcal{X}}$:
$$\langle f, \mu_X\rangle_{\mathcal{H}_{\mathcal{X}}} = \mathbb{E}\bigl[f(X)\bigr],$$
which is advantageous for estimating expectations of functions. Moreover, if a kernel $k_{\mathcal{X}}$, such as the Gaussian kernel, is characteristic, the embedding uniquely determines the probability distribution, which means that $\mu_P = \mu_{P'}$ implies $P = P'$. Additionally, we introduce the following (uncentered) kernel covariance operators:
$$C_{XX} := \mathbb{E}\bigl[\phi(X)\otimes\phi(X)\bigr], \qquad C_{YX} := \mathbb{E}\bigl[\psi(Y)\otimes\phi(X)\bigr] = C_{XY}^{*}.$$
In this context, ⊗ denotes the tensor product, and ∗ denotes the adjoint of the operator. Covariance operators extend the concept of finite-dimensional covariance matrices to the realm of infinite kernel feature spaces, and they are always defined for bounded kernels.
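To make the empirical side of these definitions concrete, the following minimal Python sketch (our illustration, not code from the paper) forms the empirical mean embedding $\hat{\mu}_X = \frac{1}{n}\sum_i \phi(x_i)$ with a Gaussian kernel and uses the reproducing property to estimate $\mathbb{E}[f(X)]$ for a function $f$ expanded over the sample; the data and bandwidth are hypothetical.

```python
# A minimal sketch (not code from the paper) of the empirical kernel mean
# embedding mu_hat = (1/n) * sum_i k(., x_i) with a Gaussian kernel, and of
# the reproducing property <f, mu_hat> = (1/n) * sum_i f(x_i).
import numpy as np

def gaussian_gram(A, B, bw):
    """Gram matrix K[i, j] = exp(-||a_i - b_j||^2 / (2 * bw^2))."""
    sq = np.sum(A**2, 1)[:, None] + np.sum(B**2, 1)[None, :] - 2 * A @ B.T
    return np.exp(-sq / (2 * bw**2))

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 1))        # i.i.d. sample from P_X
f_w = rng.normal(size=200)           # f = sum_i f_w[i] * k(., x_i) in the RKHS

K = gaussian_gram(X, X, bw=1.0)
# <f, mu_hat> = (1/n) * sum_{i,j} f_w[i] * k(x_i, x_j) estimates E[f(X)]
print(f_w @ K.mean(axis=1))
```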
The kernel conditional mean embedding [10] extends the kernel mean embedding to conditional probability distributions. It is defined as follows:
$$\mu_{X|y} := \mathbb{E}\bigl[\phi(X)\mid Y = y\bigr].$$
This embedding captures the conditional expectation of the feature-space representation of X given $Y = y$. Under the assumption of regularity, which requires that $\mathbb{E}[f(X)\mid Y = \cdot\,] \in \mathcal{H}_{\mathcal{Y}}$ for all $f \in \mathcal{H}_{\mathcal{X}}$, an operator ($C_{X|Y} : \mathcal{H}_{\mathcal{Y}} \to \mathcal{H}_{\mathcal{X}}$) can be identified such that
$$\mu_{X|y} = C_{X|Y}\,\psi(y).$$
The conditional operator ($C_{X|Y}$) can be obtained by minimizing the loss function ($L$), which is
$$L(C) = \mathbb{E}_{XY}\bigl\|\phi(X) - C\,\psi(Y)\bigr\|_{\mathcal{H}_{\mathcal{X}}}^{2}.$$
The minimization of this loss function leads to an analytical solution in the form of the following closed-form expression:
$$C_{X|Y} = C_{XY}\,C_{YY}^{-1}.$$
The solution provides a means to estimate the conditional operator ($C_{X|Y}$), which is crucial for various machine learning tasks that involve conditional inference or prediction. Obtaining an empirical estimate of the conditional mean embedding is a straightforward procedure. Given i.i.d. samples $\{(x_i, y_i)\}_{i=1}^{n}$, we seek to minimize the sample estimate of the loss function, which is defined as
$$\hat{L}(C) = \frac{1}{n}\sum_{i=1}^{n}\bigl\|\phi(x_i) - C\,\psi(y_i)\bigr\|_{\mathcal{H}_{\mathcal{X}}}^{2} + \lambda\,\|C\|_{\mathrm{HS}}^{2}.$$
This is a sample estimate of $L(C)$ that incorporates a Tikhonov regularization parameter ($\lambda$) to mitigate overfitting. The solution to the minimization problem can be expressed as follows:
$$\hat{C}_{X|Y} = \hat{C}_{XY}\bigl(\hat{C}_{YY} + \lambda I\bigr)^{-1},$$
where $\hat{C}_{XY}$ and $\hat{C}_{YY}$ are the empirical estimates of the covariance operators, i.e.,
$$\hat{C}_{XY} = \frac{1}{n}\sum_{i=1}^{n}\phi(x_i)\otimes\psi(y_i), \qquad \hat{C}_{YY} = \frac{1}{n}\sum_{i=1}^{n}\psi(y_i)\otimes\psi(y_i).$$
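As an illustration of this estimator, the sketch below uses the standard Gram-matrix form (an assumption of this sketch, not a quote from the paper): the embedding of $P(X\mid Y=y)$ is written as $\sum_i \beta_i\,\phi(x_i)$ with $\beta = (G_Y + n\lambda I)^{-1}\mathbf{k}_Y(y)$. The kernels, toy data, and conditioning point are illustrative.

```python
# A minimal sketch of the empirical conditional mean embedding in its
# standard Gram-matrix form: the embedding of P(X | Y = y) is
# sum_i beta_i * phi(x_i) with beta = (G_Y + n * lam * I)^{-1} k_Y(y).
import numpy as np

def gaussian_gram(A, B, bw):
    sq = np.sum(A**2, 1)[:, None] + np.sum(B**2, 1)[None, :] - 2 * A @ B.T
    return np.exp(-sq / (2 * bw**2))

rng = np.random.default_rng(0)
n = 300
X = rng.normal(size=(n, 1))
Y = np.sin(X) + 0.1 * rng.normal(size=(n, 1))    # toy joint sample (x_i, y_i)

G_Y = gaussian_gram(Y, Y, bw=0.3)
lam = 1e-3
y_star = np.array([[0.5]])                        # conditioning point y
k_y = gaussian_gram(Y, y_star, bw=0.3)[:, 0]      # vector k_Y(y)
beta = np.linalg.solve(G_Y + n * lam * np.eye(n), k_y)

# With a rich enough feature map, E[X | Y = y] is estimated by sum_i beta_i x_i
print(beta @ X[:, 0])
```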
With respect to the kernel Bayes rule (KBR), in the context of Bayes' theorem, our objective is to update the prior distribution ($\Pi$) with the density function $\pi(x)$ to the posterior distribution ($Q$) of the state ($X$) given an observation ($y$) with the corresponding density function
$$q(x\mid y) = \frac{p(y\mid x)\,\pi(x)}{\int p(y\mid x)\,\pi(x)\,dx},$$
where $p(y\mid x)$ is the density function of the likelihood function and $\int p(y\mid x)\,\pi(x)\,dx$ is the marginal probability density function.
The KBR aims to update the prior mean embedding ($\mu_{\Pi}$) to the posterior mean embedding ($\mu_{Q}$) according to the Bayes rule. Unlike traditional methods that require the explicit form of the likelihood function ($p(y\mid x)$), the KBR learns the relations between the latent variable (X) and the observable variable (Y) directly from the data. This is achieved by analyzing a dataset $\{(x_i, y_i)\}_{i=1}^{n}$ drawn from a distribution $P$ whose density shares the same likelihood $p(y\mid x)$ as the Bayesian model. In general, the marginal density of the data is not equal to the prior density $\pi(x)$. One of the advantages of the KBR approach is its ability to perform Bayesian inference even in the absence of a specific parametric model or an analytic form for the prior and likelihood densities. By making sufficient observations of the system, the KBR allows for the estimation of probabilities and the construction of probabilistic models, thereby enabling a more flexible approach to statistical analysis.
Similarly, the conditional operator $C^{Q}_{X|Y}$ satisfying $\mu_{Q|y} = C^{Q}_{X|Y}\,\psi(y)$ is a minimizer of the corresponding loss function. However, since the posterior distribution (Q) is unknown, we cannot sample a dataset from the posterior distribution (Q) directly. We can still utilize the analytical form of $C^{Q}_{X|Y}$ according to [14], which is given by
$$C^{Q}_{X|Y} = C^{\Pi}_{XY}\bigl(C^{\Pi}_{YY}\bigr)^{-1}, \quad (2)$$
where $C^{\Pi}_{XY}$ and $C^{\Pi}_{YY}$ are covariance operators taken with respect to the joint density $p(y\mid x)\,\pi(x)$.
In the context of vector-valued kernel regression, the covariance operators are replaced by empirical estimators as follows:
$$\hat{C}^{\Pi}_{YY} = \sum_{i=1}^{n}\hat{\alpha}_i\,\psi(y_i)\otimes\psi(y_i), \qquad \hat{C}^{\Pi}_{XY} = \sum_{i=1}^{n}\hat{\alpha}_i\,\phi(x_i)\otimes\psi(y_i),$$
where each coefficient ($\hat{\alpha}_i$) is given by
$$\hat{\alpha} = (G_X + n\epsilon I)^{-1} G_{X\tilde{X}}\,\gamma. \quad (3)$$
In Equation (3), $\epsilon$ is another Tikhonov regularization parameter, and the prior mean is given as $\hat{\mu}_{\Pi} = \sum_{j=1}^{\ell}\gamma_j\,\phi(\tilde{x}_j)$, where $\gamma$ represents the weights and $(G_{X\tilde{X}})_{ij} = k_{\mathcal{X}}(x_i, \tilde{x}_j)$.
However, since the coefficients ($\hat{\alpha}_i$) may not necessarily be positive, $\hat{C}^{\Pi}_{YY}$ may fail to be positive semi-definite. Consequently, calculating Equation (2) causes instabilities when inverting the operator ($\hat{C}^{\Pi}_{YY}$). To address this issue, Fukumizu et al. [14] proposed an alternative formulation of the posterior embedding, which is expressed as follows:
$$\hat{\mu}_{Q|y} = \hat{C}^{\Pi}_{XY}\Bigl(\bigl(\hat{C}^{\Pi}_{YY}\bigr)^{2} + \delta I\Bigr)^{-1}\hat{C}^{\Pi}_{YY}\,\psi(y). \quad (4)$$
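A hedged sketch of this square-regularized update in Gram-matrix form is given below; for simplicity, the prior embedding is assumed to be supported on the training points themselves, and the constants follow [14] only up to scaling.

```python
# A hedged sketch of the square-regularized KBR update of Fukumizu et al. [14]
# in Gram-matrix form. For simplicity, the prior embedding is assumed to be
# supported on the training points, mu_pi = sum_j gamma_j * phi(x_j).
import numpy as np

def gaussian_gram(A, B, bw):
    sq = np.sum(A**2, 1)[:, None] + np.sum(B**2, 1)[None, :] - 2 * A @ B.T
    return np.exp(-sq / (2 * bw**2))

def kbr_posterior_weights(G_X, G_Y, k_y, gamma, eps, delta):
    n = G_X.shape[0]
    # Coefficients of the prior-weighted covariance estimators (cf. Eq. (3))
    alpha = np.linalg.solve(G_X + n * eps * np.eye(n), G_X @ gamma)
    L = np.diag(alpha) @ G_Y
    # Squaring L avoids inverting an indefinite matrix (cf. Eq. (4))
    return L @ np.linalg.solve(L @ L + delta * n * np.eye(n), alpha * k_y)

rng = np.random.default_rng(0)
n = 200
X = rng.normal(size=(n, 1))
Y = X + 0.2 * rng.normal(size=(n, 1))
G_X, G_Y = gaussian_gram(X, X, 1.0), gaussian_gram(Y, Y, 1.0)
gamma = np.full(n, 1.0 / n)                          # uniform prior weights
k_y = gaussian_gram(Y, np.array([[0.8]]), 1.0)[:, 0]
w = kbr_posterior_weights(G_X, G_Y, k_y, gamma, eps=1e-2, delta=1e-2)
print(w @ X[:, 0])                                   # posterior mean of X given y
```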
In the next section, we introduce the iteratively re-weighted importance kernel Bayes rule for application to the filtering problem, a novel design for a KBR that does not require the problematic second-stage regularization. The essential idea is to use multiple weight functions.
4. The Iteratively Re-Weighted Importance Kernel Bayes Filter
In this section, we describe the iteratively re-weighted importance kernel Bayes filter. Kernel Bayesian inference is a well-founded approach to non-parametric inference in probabilistic graphical models, where probabilistic relationships between variables are learned from data in a non-parametric manner.
In the filtering problem, the states evolve according to a Markov process determined by the state transition model $p(x_{t+1}\mid x_t)$, describing the conditional probability of the next state ($x_{t+1}$) given the current state ($x_t$). The observation at time t is generated depending only on the corresponding state ($x_t$), following the observation model $p(y_t\mid x_t)$. When applying the kernel Bayes rule, we do not need to assume the conditional probabilities $p(x_{t+1}\mid x_t)$ and $p(y_t\mid x_t)$ to be known explicitly, nor do we estimate them with simple parametric models. Rather, we assume that a sample $\{(x_i, y_i)\}_{i=1}^{n}$ is given for both the observable and hidden variables in the training phase.
The aim of the filtering method is to probabilistically estimate the state ($x_t$) at each time ($t$) using the new observation sequence $(\tilde{y}_1, \dots, \tilde{y}_t)$, i.e., to estimate $p(x_t\mid \tilde{y}_1, \dots, \tilde{y}_t)$. The sequential estimate for the kernel mean of this posterior can be derived by employing the iteratively re-weighted kernel Bayes rule. This can be obtained by iterating the following two steps.
Prediction step
Assume that we have a posterior embedding $\hat{\mu}_{x_t\mid \tilde{y}_{1:t}}$ at time t. Then, we can compute the embedding of the forward prediction ($\hat{\mu}_{x_{t+1}\mid \tilde{y}_{1:t}}$) as follows:
$$\hat{\mu}_{x_{t+1}\mid \tilde{y}_{1:t}} = C_{X_{t+1}\mid X_t}\,\hat{\mu}_{x_t\mid \tilde{y}_{1:t}},$$
where $C_{X_{t+1}\mid X_t}$ is the conditional operator for the transition model $p(x_{t+1}\mid x_t)$. Empirically, this is estimated based on the training transition pairs $\{(x_i, x_{i+1})\}$ as follows:
$$\hat{\mu}_{x_{t+1}\mid \tilde{y}_{1:t}} = \hat{C}_{X_{t+1}X_t}\bigl(\hat{C}_{X_t X_t} + \lambda' I\bigr)^{-1}\hat{\mu}_{x_t\mid \tilde{y}_{1:t}},$$
where $\lambda'$ is another regularizing coefficient and the empirical covariance operators are computed over the transition pairs.
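In Gram-matrix form, this prediction step reduces to a single linear solve. The sketch below (our illustration) assumes the current embedding is written as $\sum_i \alpha_i\,\phi(x_i)$ over training transition pairs $(x_i, x_{i+1})$, so the predicted weights live on the successor points.

```python
# A minimal sketch of the prediction step, using the standard identity
# C_{X+ X}(C_{XX} + lam I)^{-1} Phi_X = Phi_{X+}(G_X + n lam I)^{-1} G_X:
# if the current embedding is Phi_X @ alpha, the predicted embedding is
# Phi_{X+} @ predict_weights(G_X, alpha, lam).
import numpy as np

def predict_weights(G_X, alpha, lam):
    """Predicted weights on phi(x_{i+1}) from current weights alpha on phi(x_i)."""
    n = G_X.shape[0]
    return np.linalg.solve(G_X + n * lam * np.eye(n), G_X @ alpha)
```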
Update step
When a new observation ($\tilde{y}_{t+1}$) is obtained, the mean embedding of the posterior distribution is computed by applying the iteratively re-weighted importance kernel Bayes rule of Section 3, with the predicted embedding $\hat{\mu}_{x_{t+1}\mid \tilde{y}_{1:t}}$ serving as the prior mean embedding.
During the calculation, it is preferable to work with Gram matrices instead of covariance matrices. This approach simplifies the calculation and can improve computational efficiency. In general, we need to express the mean embedding in the form of a weighted sum of feature maps at the latent training points as follows:
$$\hat{\mu}_{x_t\mid \tilde{y}_{1:t}} = \sum_{i=1}^{n}\alpha_i^{(t)}\,\phi(x_i),$$
where $\alpha^{(t)}$ is the coefficient vector at time $t$. Notably, there is a standard identity that can be employed to rewrite expressions involving covariance operators in terms of Gram matrices. Hence, the coefficient update can be written entirely in terms of the Gram matrices $G_X = \bigl(k_{\mathcal{X}}(x_i, x_j)\bigr)_{ij}$ and $G_Y = \bigl(k_{\mathcal{Y}}(y_i, y_j)\bigr)_{ij}$.
. The algorithm can be summarized as follows (see Algorithm 1).
In this algorithm, we use a trick to calculate the weight: in the iteration step, the exact conditioning embedding is replaced by its sample-based counterpart, so that the Gram matrix can be calculated instead of the covariance matrix.
Kernel methods have some limitations because they rely on predefined features from the RKHS, which may not work well with complex or high-dimensional data. To address this, adaptive neural network features [16] refer to the features generated by neural networks that can automatically adjust and learn from the data during the training process. Unlike fixed features, these adaptive features evolve to better capture the underlying patterns in the data, especially in complex or high-dimensional scenarios. This adaptability makes them particularly useful in situations where traditional methods, like kernel methods with predefined feature maps, may struggle to represent the data effectively. Here, we briefly introduce this concept. We rewrite the feature map ($\phi$) as $\phi_{\theta}$, an adaptive feature map represented by a neural network parameterized by $\theta$. The optimal $\theta$ is obtained by minimizing a regularized regression objective.
The Gram matrix in Equation (16) is then defined through the learned features, i.e., $(G_X)_{ij} = \phi_{\theta}(x_i)^{\top}\phi_{\theta}(x_j)$.
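As a purely illustrative sketch (the architecture and dimensions below are our assumptions, not the authors' design), an adaptive feature map can be realized as a small neural network whose outputs define a linear-kernel Gram matrix; $\theta$ would be fit by gradient descent on the regression objective above.

```python
# An illustrative adaptive feature map phi_theta realized by a small neural
# network; the architecture and dimensions are hypothetical.
import torch
import torch.nn as nn

class AdaptiveFeatureMap(nn.Module):
    def __init__(self, in_dim, feat_dim=64):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(in_dim, 128), nn.ReLU(),
            nn.Linear(128, feat_dim),
        )

    def forward(self, x):
        return self.net(x)

phi = AdaptiveFeatureMap(in_dim=4)
X = torch.randn(32, 4)
F = phi(X)        # learned features phi_theta(x)
G = F @ F.T       # linear-kernel Gram matrix on the learned features
```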
In the following experiment, we use the linear kernel on the learned adaptive feature ($\phi_{\theta}$) and a finite-dimensional random Fourier feature approximation of the Gaussian kernel to calculate the coefficient [23]. We use a linear kernel on the latent features to estimate the posterior mean in the latent space. This approach does not necessitate the same feature map used to compute the weight function, as demonstrated in [14].
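For reference, a minimal random Fourier feature construction [23] for the Gaussian kernel looks as follows; the feature dimension D and bandwidth are illustrative.

```python
# Random Fourier features for the Gaussian kernel [23]:
# z(x) = sqrt(2/D) * cos(x @ W + b) with W ~ N(0, I / bw^2), b ~ U[0, 2*pi],
# so that z(x)^T z(x') approximates k(x, x').
import numpy as np

rng = np.random.default_rng(0)
d, D, bw = 3, 500, 1.0
W = rng.normal(scale=1.0 / bw, size=(d, D))   # shared random frequencies
b = rng.uniform(0, 2 * np.pi, size=D)         # shared random phases

def rff(X):
    return np.sqrt(2.0 / D) * np.cos(X @ W + b)

X = rng.normal(size=(5, d))
print(rff(X) @ rff(X).T)   # approximates exp(-||x - x'||^2 / (2 * bw^2))
```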
Algorithm 1: Iteratively Re-weighted Importance Kernel Bayes Filter Algorithm
Input: Training dataset, regularization parameters, and test sequence.
Initialize the coefficient vector and compute the Gram matrices.
For t = 1 to T, do:
1. Compute the prior embedding at the conditioning point using the training dataset.
2. Compute the density ratio.
3. For k = 1 to K: compute the weight; update the embedding; if the change falls below the tolerance, stop.
4. Compute the posterior coefficient using the diagonal weight matrices.
5. Compute the predicted coefficient for the next time step.
End For
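For orientation, the following high-level Python skeleton mirrors the structure of Algorithm 1. It is a sketch under simplifying assumptions: the `reweight` callback stands in for the iterative re-weighting of Section 3, and the update shown is a generic symmetric Tikhonov-regularized KBR step, not the paper's exact Equation (12).

```python
# High-level skeleton of the filtering loop (a sketch, not the authors'
# implementation). `reweight` is a hypothetical callback for the iterative
# re-weighting step of Section 3.
import numpy as np

def ire_kbf_loop(G_X, G_Y, k_Y_of, y_test, lam, eps, delta, reweight):
    n = G_X.shape[0]
    alpha = np.full(n, 1.0 / n)                   # initial prior weights
    posteriors = []
    for y in y_test:
        gamma = reweight(alpha)                   # iterative re-weighting step
        mu = np.linalg.solve(G_X + n * eps * np.eye(n), G_X @ gamma)
        # Symmetric regularized update with weights clipped to be nonnegative
        D_half = np.diag(np.sqrt(np.maximum(mu, 0.0)))
        A = D_half @ G_Y @ D_half + n * delta * np.eye(n)
        w = D_half @ np.linalg.solve(A, D_half @ k_Y_of(y))
        posteriors.append(w)                      # posterior weights at time t
        # Prediction: weights move to the successor points x_{i+1}
        alpha = np.linalg.solve(G_X + n * lam * np.eye(n), G_X @ w)
    return posteriors
```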
5. Numerical Illustration
In this section, we explore the effectiveness of our method by applying it to two nonlinear dynamical models, with the code written in Python.
A Synthetic Problem:
Inspired by Fukumizu et al. [14] and Xu et al. [16], a simple synthetic nonlinear dynamical system is introduced to illustrate the kernel Bayesian filtering methods. The latent state is two-dimensional, $(u_t, v_t)$, and evolves according to a nonlinear transition map with given parameters, driven by an independent noise process; the observation is generated from the latent state with additive independent noise. In the following experiments, the parameters of the dynamics and the two noise levels are fixed. The kernel and regularization parameters of the density-ratio estimator are tuned by the kernel-based unconstrained least squares importance fitting (KuLSIF) leave-one-out cross-validation procedure. The bandwidth of the Gaussian kernel employed for all KBF methods is given by the median of pairwise distances among the training set.
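A common implementation of this median heuristic (our sketch; the paper does not provide code) is:

```python
# Median heuristic: set the Gaussian bandwidth to the median of all
# pairwise Euclidean distances in the training set.
import numpy as np
from scipy.spatial.distance import pdist

def median_bandwidth(X):
    return np.median(pdist(X))

rng = np.random.default_rng(0)
print(median_bandwidth(rng.normal(size=(100, 2))))
```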
In the following figures, "original" denotes the performance of the original KBF estimator using the operator in Equation (4). "IW" and "re-weight" denote the importance-weighted KBF using the operator in Equation (7) and the iteratively re-weighted importance kernel Bayes filter using the operator in Equation (12), respectively.
Figure 1 illustrates the posterior approximation of the latent variables u and v. The red line represents the given latent test sequence. The blue, yellow, and green lines are the three KBF methods used to approximate the latent variables. This experiment was conducted with a training sequence length of 100 and a fixed regularization parameter. The ℓ-induced weight function was employed in the IRe-KBF method in this experiment.
As observed in Figure 1, the KBF method utilizing iterative re-weighting yields the closest approximation to the actual latent test values, indicating that this method outperforms the other two techniques in this context. However, the advantage is not always visually apparent. Hence, we calculate the MSE, the mean squared error between the posterior estimation of the latent variables and the latent test values, to evaluate the performance of the methods, as shown in Figure 2.
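The evaluation metric is the standard empirical mean squared error; for completeness:

```python
# MSE between the posterior estimates and the latent test sequence.
import numpy as np

def mse(estimate, truth):
    return np.mean((np.asarray(estimate) - np.asarray(truth)) ** 2)
```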
First, we compared the kernel Bayes filter with the three different weight functions described in Section 3 to find the optimal weight function. In this experiment, fixed parameters were used in the ℓs-induced weight function and in the Tukey biweight function. The essential regularization is obtained by the hyperparameter tuning method.
Figure 3 summarizes the mean squared error (MSE) over 10 runs when the conditioning points are sampled. The length of the test sequence is set to 50. Figure 3 shows that the IRe-KBF with the ℓ-induced weight function leads to the smallest error between the posterior estimation and the test sequence. Hence, in the following, we apply the ℓ-induced weight function within the IRe-KBF method.
The MSE with varying training lengths is summarized in Figure 2. As depicted in the figure, the MSE decreases as the training length increases, which is expected. Furthermore, the "re-weight" method demonstrates comparable or superior performance compared to the other two KBF methods.
We also conducted an experiment employing the ensemble Kalman filter (EnKF), which yielded a mean squared error (MSE) of 0.049. This value is notably lower than the MSE obtained by the three kernel-based filter methods when the training length is less than 200. This result is reasonable, since the ensemble Kalman filter is applied under an explicit model in a relatively low-dimensional system.
Lorenz 96 model:
The Lorenz 96 model is a simplified mathematical model used to study atmospheric and climate dynamics. It was proposed by Edward Lorenz in 1996 to capture key nonlinear characteristics of such systems in a more manageable form.
The standard form of the Lorenz 96 model is given by the following set of differential equations:
$$\frac{dx_i}{dt} = (x_{i+1} - x_{i-2})\,x_{i-1} - x_i + F, \qquad i = 1, \dots, N,$$
where $x_i$ represents the ith variable of the system; F is an external forcing term that represents the energy input into the system; and the indices are treated cyclically, meaning $x_{-1} = x_{N-1}$, $x_0 = x_N$, and $x_{N+1} = x_1$.
If the Lorenz 96 model has four variables (i.e., $N = 4$), the system of equations becomes
$$\begin{aligned} \frac{dx_1}{dt} &= (x_2 - x_3)\,x_4 - x_1 + F, & \frac{dx_2}{dt} &= (x_3 - x_4)\,x_1 - x_2 + F,\\ \frac{dx_3}{dt} &= (x_4 - x_1)\,x_2 - x_3 + F, & \frac{dx_4}{dt} &= (x_1 - x_2)\,x_3 - x_4 + F. \end{aligned}$$
This system is a four-dimensional nonlinear system, since it involves four state variables ($x_1, x_2, x_3, x_4$), and each variable's evolution is influenced by nonlinear interactions with the other variables. The Lorenz 96 model can be generalized to any number of variables (N), but in this case, we are considering the specific case where $N = 4$.
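A simple Euler-discretized simulation of this four-variable system can be sketched as follows; the forcing value and time step here are illustrative, not the experiment's exact settings.

```python
# Euler-discretized simulation of the four-variable Lorenz 96 system.
# F and dt are illustrative placeholders, not the experiment's settings.
import numpy as np

def lorenz96_step(x, F=8.0, dt=0.01):
    # dx_i/dt = (x_{i+1} - x_{i-2}) * x_{i-1} - x_i + F, indices cyclic
    dx = (np.roll(x, -1) - np.roll(x, 2)) * np.roll(x, 1) - x + F
    return x + dt * dx

x = np.array([8.0, 8.01, 8.0, 8.0])   # small perturbation of the fixed point
traj = [x]
for _ in range(1000):
    x = lorenz96_step(x)
    traj.append(x)
traj = np.array(traj)                  # latent trajectory x_t
```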
The observation $y_t$ is expressed as follows:
$$y_t = H x_t + \zeta_t,$$
where $\zeta_t$ is independent noise and $x_t$ is the state at time t. In this experiment, the observation matrix (H) is fixed in advance.
In the filtering problem, we have the given parameters F and N, and the dynamical system is discretized with a fixed time step. The latent variable noise is set to zero, while the observation noise level is fixed. The test size for this experiment was set to 70. Additionally, in the filtering problem, the kernel and regularization parameters of the density-ratio estimator are tuned by a kernel-based unconstrained least squares importance fitting (KuLSIF) leave-one-out cross-validation procedure. The essential regularization is obtained by the hyperparameter tuning method, and the ℓ-induced weight function was employed in this experiment. The results presented in Figure 4 depict, over the whole filtering process, the trajectories of $x_1$, $x_2$, and $x_3$, the first three variables in the latent variable vector (X). It is clear that the KBF with "re-weighting" yields a better approximation of the actual test latent values, which are given by the dynamics, than the other two methods.
We also computed the MSE between the posterior estimation and the test latent sequence. The results are depicted in Figure 5. As illustrated in Figure 5, the "re-weight" method consistently performs better than the other two methods. The original kernel Bayesian filter exhibits significant instability in performance. However, it is worth noting that the obtained results are sensitive to the parameter setting. In Figure 6, we observe that the performance of IRe-KBF declines significantly when this parameter is small.
Furthermore, we conducted an experiment using the extended Kalman filter (EKF), resulting in an MSE of 375.21. This value is notably larger when compared to the performance of the three kernel-based filtering methods. The limitations of the EKF are apparent, particularly in handling complex nonlinear systems. Conversely, the iteratively re-weighted importance kernel Bayes filter (IRe-KBF) demonstrates its ability to capture intricate nonlinear dependencies between observations and latent states.