Article

An Iteratively Reweighted Importance Kernel Bayesian Filtering Approach for High-Dimensional Data Processing

Mathematical Institute, Peking University, Beijing 100871, China
Mathematics 2024, 12(19), 2962; https://doi.org/10.3390/math12192962
Submission received: 31 July 2024 / Revised: 18 September 2024 / Accepted: 22 September 2024 / Published: 24 September 2024

Abstract
This paper proposes an iteratively re-weighted importance kernel Bayes filter (IRe-KBF) method for handling high-dimensional or complex data in Bayesian filtering problems. This approach incorporates importance weights and an iterative re-weighting scheme inspired by iteratively re-weighted least squares (IRLS) to enhance the robustness and accuracy of Bayesian inference. The proposed method does not require explicit specification of prior and likelihood distributions; instead, it learns the kernel mean representations from training data. Experimental results demonstrate the superior performance of this method over traditional KBF methods on high-dimensional datasets.
MSC:
62C10; 68T05; 93B99

1. Introduction

Bayesian filtering is a probabilistic approach that recursively estimates an unknown probability density function over time based on a mathematical model and observation process [1,2,3]. It involves updating the prior distribution to obtain the posterior distribution, which is the essence of Bayesian statistics. As a fundamental problem in probabilistic inference and sequential estimation, Bayesian filtering focuses on inferring the state of a dynamic system over time from noisy and incomplete observations while integrating prior knowledge about the system’s behavior. This problem finds wide-ranging applications in fields such as signal processing [4], robotics [5], finance [6,7], and more.
Traditional Bayesian filtering methods often encounter limitations when dealing with high-dimensional or complex data spaces. Kernel methods have emerged as a powerful tool for generalizing linear statistical methods to nonlinear settings, which is achieved by embedding samples into a high-dimensional feature space known as a reproducing kernel Hilbert space (RKHS). Kernel mean embedding plays a crucial role in this context, representing probability distributions as expectations of features in the RKHS [8,9,10].
By embedding distributions into the RKHS, kernel mean embedding facilitates various statistical analyses and machine learning tasks. The work of Smola et al. in 2007 [9] emphasized the significance of kernel mean embedding in Bayesian updates. The use of kernel means in characteristic RKHS has been widely proven successful in a number of statistical tasks, including two-sample problems [11], independence tests [12], and conditional independence tests [13]. Notably, these tests are applicable to any domain where kernels can be defined, showcasing the versatility of the kernel approach.
In the context of Bayesian filtering, kernel mean embedding enables the representation of probability distributions in a space where operations such as inner products and norms can be easily computed. This not only enhances the scalability of Bayesian filtering algorithms but also allows for the incorporation of complex, structured data into the estimation process. Moreover, kernel mean embedding provides a framework for integrating diverse sources of information, including prior knowledge and domain expertise, into the filtering process.
The kernel Bayesian filtering procedure introduced in [14] is model-free and produces sample-based estimates of the posterior embedding. This Bayesian filtering method constructs a posterior distribution over the current state from the sequence of noisy observations up to the present time, without making explicit modeling assumptions about the underlying dynamics. Unfortunately, it may suffer from instability during computation. Fukumizu et al. [14] proposed a formulation that requires an unconventional form of regularization in the second stage, which adversely impacts the attainable convergence rates. Boots et al. [15] introduced a KBF in which only a simple form of Tikhonov regularization is applied; unfortunately, this method requires the inversion of a matrix that is often indefinite, necessitating large regularization constants, which can degrade performance. Xu et al. [16] introduced an importance-weighted KBF to avoid the instability problem, in which the importance weight is based on a density ratio. However, the calculation of the density ratio may cause high variance.
In this paper, we explore an extension of the importance-weighted kernel Bayesian filtering method proposed in [16], which we call Iteratively Re-weighted importance Kernel Bayesian Filtering (IRe-KBF). The concept of iterative re-weighting, which has been successfully applied in other contexts (e.g., [17]), is integrated into the filtering process, as explained in Section 3. This integration brings a significant enhancement by making the filter more robust and efficient in high-dimensional systems.
The IRe-KBF method learns kernel mean representations directly from training data, eliminating the need for explicit specification of prior and likelihood distributions. By harnessing importance weights and iterative re-weighting, our filter exhibits enhanced robustness and efficiency in the context of high-dimensional datasets.
The remainder of this paper is organized as follows. In Section 2, we review the basic concepts of kernel mean embeddings and the kernel Bayes rule. In Section 3, we first introduce the importance-weighted kernel Bayes rule proposed in [16]; then the iteratively re-weighted importance kernel Bayes rule (IRe-KBR) is proposed. We introduce the filtering problem in Section 4, which applies the IRe-KBR to the filtering problem. Experiments are reported in Section 5.

2. Background and Preliminaries

In this section, we review some basic concepts.
Kernel mean embeddings provide a way to represent distributions in reproducing kernel Hilbert spaces (RKHSs) using positive definite kernels. Given random variables $(X, Y)$ on $\mathcal{X} \times \mathcal{Y}$ with joint distribution $P$ and density $p(x, y)$, let $k_x: \mathcal{X} \times \mathcal{X} \to \mathbb{R}$ and $k_y: \mathcal{Y} \times \mathcal{Y} \to \mathbb{R}$ be measurable positive definite kernels corresponding to scalar-valued RKHSs $\mathcal{H}_x$ and $\mathcal{H}_y$, respectively. The feature maps are denoted by $\phi(x) = k_x(x, \cdot)$ and $\psi(y) = k_y(y, \cdot)$, the RKHS norm by $\|\cdot\|$, and the corresponding Gram matrices by $(G_X)_{ij} = k_x(x_i, x_j)$ and $(G_Y)_{ij} = k_y(y_i, y_j)$.
The kernel mean embedding of the marginal distribution $P(x)$ is denoted by $m_{P(x)}$ and defined as follows:
$$m_{P(x)} = \mathbb{E}_P[\phi(x)] \in \mathcal{H}_x.$$
It always exists for bounded kernels. From the reproducing property, we have, for all $f \in \mathcal{H}_x$,
$$\langle f, m_{P(x)} \rangle = \mathbb{E}_P[f(x)],$$
which is advantageous for estimating expectations of functions. Moreover, if the kernel $k_x$ is characteristic (as the Gaussian kernel is), the embedding uniquely determines the probability distribution; that is, $m_{P(x)} = m_{Q(x)}$ implies $P = Q$. Additionally, we introduce the following (uncentered) kernel covariance operators:
$$C_{XX}^P = \mathbb{E}_P[\phi(X) \otimes \phi(X)], \qquad C_{XY}^P = \mathbb{E}_P[\phi(X) \otimes \psi(Y)],$$
$$C_{YY}^P = \mathbb{E}_P[\psi(Y) \otimes \psi(Y)], \qquad C_{YX}^P = (C_{XY}^P)^{*}.$$
In this context, $\otimes$ denotes the tensor product, and $*$ denotes the adjoint of an operator. Covariance operators extend the concept of finite-dimensional covariance matrices to infinite-dimensional kernel feature spaces, and they are always defined for bounded kernels.
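As a small illustration of these definitions (not taken from the paper), the following Python sketch forms the empirical kernel mean embedding with a Gaussian kernel and numerically checks the reproducing property; the bandwidth, sample size, and query point are arbitrary choices.
```python
import numpy as np

def gaussian_kernel(A, B, sigma=1.0):
    # Gram matrix with entries k(a_i, b_j) = exp(-||a_i - b_j||^2 / (2 sigma^2)).
    sq = ((A[:, None, :] - B[None, :, :]) ** 2).sum(-1)
    return np.exp(-sq / (2.0 * sigma ** 2))

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 2))          # samples from P

# Empirical kernel mean embedding: m_hat = (1/n) sum_i phi(x_i).
# Evaluating it at a point x' gives <m_hat, phi(x')> = (1/n) sum_i k(x_i, x').
x_new = np.array([[0.3, -0.1]])
lhs = gaussian_kernel(X, x_new).mean(axis=0)[0]

# Reproducing property with f = phi(x_new): <f, m_hat> equals the
# empirical mean of f(x_i) = k(x_new, x_i).
rhs = gaussian_kernel(x_new, X).mean()
print(np.isclose(lhs, rhs))            # True
```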
The kernel conditional mean embedding [10] extends the kernel mean embedding to conditional probability distributions. It is defined as follows:
$$m_{P(X|Y)}(y) = \mathbb{E}_P[\phi(X) \mid Y = y] \in \mathcal{H}_x.$$
This embedding captures the conditional expectation of the feature-space representation of $X$ given $Y = y$. Under a regularity assumption, which requires that $\mathbb{E}_P[f(X) \mid Y = \cdot] \in \mathcal{H}_y$ for all $f \in \mathcal{H}_x$, an operator $E_P \in \mathcal{H}_x \otimes \mathcal{H}_y$ can be identified such that
$$m_{P(X|Y)}(y) = E_P^{*}\,\psi(y).$$
The conditional operator $E_P$ can be obtained by minimizing the loss function $\mathcal{L}_P$:
$$\mathcal{L}_P(E) = \mathbb{E}_P\big[\|\phi(X) - E^{*}\psi(Y)\|^2\big].$$
Minimizing this loss function yields the following closed-form solution:
$$E_P = (C_{YY}^P)^{-1} C_{XY}^P.$$
This solution provides a means to estimate the conditional operator $E_P$, which is crucial for various machine learning tasks involving conditional inference or prediction. Obtaining an empirical estimate of the conditional mean embedding is straightforward. Given i.i.d. samples $\{(x_i, y_i)\}_{i=1}^{n} \sim P$, we minimize the sample estimate of the loss function, $\hat{\mathcal{L}}_{P,\lambda}$, defined as
$$\hat{\mathcal{L}}_{P,\lambda}(E) = \frac{1}{n}\sum_{i=1}^{n}\|\phi(x_i) - E^{*}\psi(y_i)\|^2 + \lambda\|E\|^2.$$
This is a sample estimate of $\mathcal{L}_P$ that incorporates a Tikhonov regularization parameter $\lambda$ to mitigate overfitting. The solution to the minimization problem can be expressed as follows:
$$\hat{E}_{P,\lambda} = (\hat{C}_{YY}^P + \lambda I)^{-1}\hat{C}_{XY}^P,$$
where $\hat{C}$ denotes the empirical estimate of the covariance operators, i.e.,
$$\hat{C}_{YY}^P = \frac{1}{n}\sum_{i=1}^{n}\psi(y_i)\otimes\psi(y_i), \qquad \hat{C}_{XY}^P = \frac{1}{n}\sum_{i=1}^{n}\phi(x_i)\otimes\psi(y_i).$$
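As a sketch of this empirical estimate in Gram-matrix form: applying $\hat{E}_{P,\lambda}^{*}\psi(y)$ to a query point $y$ yields the coefficients $\beta(y) = (G_Y + n\lambda I)^{-1} k_Y(y)$ over the feature maps $\phi(x_i)$. The toy data, kernel, and bandwidth below are arbitrary and only illustrate the computation.
```python
import numpy as np

def gaussian_kernel(A, B, sigma=1.0):
    sq = ((A[:, None, :] - B[None, :, :]) ** 2).sum(-1)
    return np.exp(-sq / (2.0 * sigma ** 2))

rng = np.random.default_rng(1)
n, lam = 300, 1e-3
X = rng.normal(size=(n, 1))
Y = np.sin(X) + 0.1 * rng.normal(size=(n, 1))   # toy joint sample from P

G_Y = gaussian_kernel(Y, Y)
y_query = np.array([[0.5]])
k_y = gaussian_kernel(Y, y_query)[:, 0]

# Coefficients of the conditional mean embedding m_hat(X | Y = y_query)
# expressed over the feature maps phi(x_i): m_hat = sum_i beta_i phi(x_i).
beta = np.linalg.solve(G_Y + n * lam * np.eye(n), k_y)

# With the linear feature phi(x) = x this gives an estimate of E[X | Y = y_query].
print(float(beta @ X[:, 0]))
```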
With respect to the kernel Bayes rule (KBR), in the context of Bayes' theorem, our objective is to update the prior distribution $\Pi$ with density $\pi(x)$ to the posterior distribution $Q$ of the state $X$ given an observation $Y$, with corresponding density $q(x \mid y)$:
$$q(x \mid y) = \frac{p(y \mid x)\,\pi(x)}{q_Y(y)}, \qquad q_Y(y) = \int p(y \mid u)\,\pi(u)\,du,$$
where $p(y \mid x)$ is the likelihood density and $q_Y$ is the marginal probability density function of the observation.
The KBR aims to update the prior mean embedding $m_\Pi$ to the posterior mean embedding $m_Q$ according to the Bayes rule. Unlike traditional methods that require the explicit form of the likelihood $p(y \mid x)$, the KBR learns the relation between the latent variable $X$ and the observable variable $Y$ directly from data. This is achieved by analyzing a dataset $\{(x_i, y_i)\}_{i=1}^{n} \sim P$, where the data distribution $P$ shares the same likelihood $p(y \mid x)$ but has marginal density $p(x)$, which in general is not equal to $\pi(x)$. One of the advantages of the KBR approach is its ability to perform Bayesian inference even in the absence of a specific parametric model or an analytic form for the prior and likelihood densities. By making sufficient observations of the system, the KBR allows for the estimation of probabilities and the construction of probabilistic models, thereby enabling a more flexible approach to statistical analysis.
Similarly, the conditional operator $E_Q: \mathcal{H}_x \to \mathcal{H}_y$ satisfying $m_{Q(X|Y)}(y) = E_Q^{*}\psi(y)$ is a minimizer of the loss function
$$\mathcal{L}_Q(E) = \mathbb{E}_Q\big[\|\phi(X) - E^{*}\psi(Y)\|^2\big].$$
However, since the posterior distribution $Q$ is unknown, we cannot sample a dataset $\{(x_i, y_i)\}_{i=1}^{n}$ from $Q$ directly. We can nevertheless use the analytical form of $E_Q$ given in [14]:
$$E_Q = (C_{YY}^Q)^{-1} C_{XY}^Q.$$
In the context of vector-valued kernel regression, the covariance operators are replaced by the following empirical estimators:
$$\hat{C}_{XY}^Q = \sum_{i=1}^{n}\mu_i\,\phi(x_i)\otimes\psi(y_i), \qquad \hat{C}_{YY}^Q = \sum_{i=1}^{n}\mu_i\,\psi(y_i)\otimes\psi(y_i),$$
where each coefficient $\mu_i$ is given by
$$\mu_i = \big\langle \phi(x_i),\, (\hat{C}_{XX}^P + \eta I)^{-1}\hat{m}_\Pi \big\rangle.$$
In Equation (3), $\eta$ is another Tikhonov regularization parameter, and $\hat{m}_\Pi$ is the prior mean embedding, given as $\sum_{i=1}^{n}\xi_i\,\phi(x_i)$, where $\xi_i$ are the weights.
However, since the coefficients $\mu_i$ are not necessarily positive, $\hat{C}_{YY}^Q$ may fail to be positive semi-definite. Consequently, evaluating Equation (2) causes instabilities when inverting the operator $C_{YY}^Q$. To address this issue, Fukumizu et al. [14] proposed an alternative formulation of $E_Q$, which is expressed as follows:
$$\hat{E}_{Q,\lambda} = \hat{C}_{YY}^Q\big((\hat{C}_{YY}^Q)^2 + \lambda I\big)^{-1}\hat{C}_{XY}^Q.$$
In the next section, we introduce the iteratively re-weighted importance kernel Bayes rule for application to the filtering problem, a novel design for a KBR that does not require the problematic second-stage regularization. The essential idea is to use multiple weight functions.

3. Iteratively Re-Weighted Importance Kernel Bayes Rule

3.1. Importance Weighted Kernel Bayes Rule with Density Ratio

Here, we review the importance-weighted kernel Bayes rule proposed in [16]. The method attempts to minimize the loss function $\mathcal{L}_Q$, which is estimated through importance sampling using the density ratio $r(x) = \pi(x)/p(x)$. The loss function can be reformulated as
$$\mathcal{L}_Q(E) = \mathbb{E}_P\big[r(X)\,\|\phi(X) - E^{*}\psi(Y)\|^2\big].$$
The empirical loss function with Tikhonov regularization can be constructed as follows:
$$\hat{\mathcal{L}}_{Q,\lambda}(E) = \frac{1}{n}\sum_{i=1}^{n}\hat{r}(x_i)\,\|\phi(x_i) - E^{*}\psi(y_i)\|^2 + \lambda\|E\|^2.$$
According to the kernel-based unconstrained least-squares importance fitting (KuLSIF) estimator proposed in [16,18], $\hat{r}(x_i)$ is
$$\hat{r}(x_i) = \max(0, \mu_i),$$
where $\mu_i$ is the same as in Equation (3).
The minimizer of $\hat{\mathcal{L}}_{Q,\lambda}(E)$ can be obtained analytically as
$$\hat{E}_{Q,\lambda} = (\hat{C}_{YY}^Q + \lambda I)^{-1}\hat{C}_{XY}^Q,$$
where
$$\hat{C}_{XY}^Q = \frac{1}{n}\sum_{i=1}^{n}\hat{r}_i\,\phi(x_i)\otimes\psi(y_i), \qquad \hat{C}_{YY}^Q = \frac{1}{n}\sum_{i=1}^{n}\hat{r}_i\,\psi(y_i)\otimes\psi(y_i).$$
The importance sampling technique enables the estimation of expectations under the posterior distribution using samples from the prior distribution, which is crucial for Bayesian filtering in high-dimensional spaces. The density ratio $r(x)$ serves as the importance weight, ensuring proper weighting of the samples from the prior distribution when estimating posterior expectations. By incorporating these weights into the loss function, the importance-weighted kernel Bayes rule is able to accurately estimate the posterior distribution from the available prior samples. This approach results in superior numerical stability relative to the existing approach to KBR.
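In Gram-matrix form (the form used later in Algorithm 1), this rule amounts to a clipped density-ratio estimate followed by a weighted regularized solve. A minimal Python sketch follows; the prior weights, kernel choice, and hyperparameter defaults are placeholders, not values taken from the paper.
```python
import numpy as np

def iw_kbr_coefficients(G_X, G_Y, k_y_query, xi_prior, eta=1e-2, lam=1e-3):
    """Importance-weighted kernel Bayes rule in Gram form (sketch of Section 3.1).

    G_X, G_Y:   Gram matrices over the training latent states and observations.
    k_y_query:  vector k_y(y_i, y_hat) for the conditioning point y_hat.
    xi_prior:   weights of the prior embedding m_Pi = sum_i xi_i phi(x_i).
    Returns posterior coefficients xi_post with m_Q ~= sum_i xi_post_i phi(x_i).
    """
    n = G_X.shape[0]
    # KuLSIF-style density-ratio estimate, clipped at zero: r_hat = max(0, mu).
    mu = n * np.linalg.solve(G_X + n * eta * np.eye(n), G_X @ xi_prior)
    r_hat = np.maximum(0.0, mu)
    D_r = np.diag(r_hat)
    # Weighted regularized solve corresponding to (C_YY^Q + lam I)^{-1} C_XY^Q.
    return np.linalg.solve(D_r @ G_Y + n * lam * np.eye(n), D_r @ k_y_query)
```
The clipping at zero keeps the weighted covariance estimate positive semi-definite, which is the source of the improved numerical stability relative to the second-stage regularization of Equation (4).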

3.2. Re-Weighted Importance Kernel Bayes Rule

The re-weighted method is a technique designed to enhance the robustness of the estimation process. In traditional least squares regression, which aims to minimize the sum of squared differences between observed and fitted values, outliers can significantly influence the results, as a few extreme data points can disproportionately affect the outcome. The re-weighted scheme addresses this sensitivity by assigning different weights to the residuals, leading to a more robust estimation that is less susceptible to the influence of outliers. Equation (5) can be regarded as a kind of least squares regression and can therefore employ the re-weighted method to enhance the robustness of the approach.
Similarly, the loss function with multiple weights is constructed as follows:
$$\mathcal{L}_Q(E) = \mathbb{E}_P\Big[\omega\big(\|\phi(X) - E^{*}\psi(Y)\|\big)\, r(X)\,\|\phi(X) - E^{*}\psi(Y)\|^2\Big],$$
where $\omega$ is a weight function of the residual $\phi(X) - E^{*}\psi(Y)$. Additionally, the empirical loss function, augmented with Tikhonov regularization, can be recast as follows:
$$\hat{\mathcal{L}}_{Q,\lambda}(E) = \frac{1}{n}\sum_{i=1}^{n}\omega_i\,\hat{r}_i\,\|\phi(x_i) - E^{*}\psi(y_i)\|^2 + \lambda\|E\|^2,$$
where $\omega_i = \omega(\|\phi(x_i) - E^{*}\psi(y_i)\|)$.
Iteratively re-weighted least squares (IRLS) is a common robust learning paradigm [17] that applies multiple weighting steps for better performance. It can be expressed as a sequence of successive minimizations of the weighted, regularized empirical risk. The $(k+1)$-th iteration can be written as follows:
$$\hat{E}_{Q,\lambda,\omega}^{(k+1)} = \arg\min_{E}\; \frac{1}{n}\sum_{i=1}^{n}\omega\big(\|\phi(x_i) - (\hat{E}_{Q,\lambda,\omega}^{(k)})^{*}\psi(y_i)\|\big)\,\hat{r}_i\,\|\phi(x_i) - E^{*}\psi(y_i)\|^2 + \lambda\|E\|^2.$$
In the following, we write $\hat{E}_{Q,\lambda,\omega}$ as $\tilde{E}_Q$. The weight function $\omega: \mathbb{R} \to [0,1]$ should satisfy the following three conditions:
1. $\omega(x)$ should be Borel measurable;
2. $\omega(x)$ should be an even function; and
3. $\omega(x)$ should be differentiable, with $\omega'(x) < 0$ for $x > 0$.
$\tilde{E}_{Q,k+1}$ can be obtained analytically as
$$\tilde{E}_{Q,k+1} = \big(\omega(\tilde{E}_{Q,k})\,\hat{C}_{YY}^Q + \lambda I\big)^{-1}\omega(\tilde{E}_{Q,k})\,\hat{C}_{XY}^Q,$$
where $\omega(\tilde{E}_{Q,k}) = \mathrm{diag}\big(\omega_i(\tilde{E}_{Q,k})\big)$ with $\omega_i(\tilde{E}_{Q,k}) = \omega\big(\|\phi(x_i) - \tilde{E}_{Q,k}^{*}\psi(y_i)\|\big)$, and $\hat{C}_{YY}^Q$ and $\hat{C}_{XY}^Q$ are the same as in Equation (8). Note that the fixed point satisfies
$$\tilde{E}_Q = \big(\omega(\tilde{E}_Q)\,\hat{C}_{YY}^Q + \lambda I\big)^{-1}\omega(\tilde{E}_Q)\,\hat{C}_{XY}^Q.$$
According to Theorem 1 in [17], the iterative process converges, i.e., $\tilde{E}_{Q,k+1} \to \tilde{E}_Q$, and the robustness of the re-weighted scheme has already been established in [17]. This iterative re-weighted scheme not only reduces the impact of outliers but also adaptively refines the estimate based on the current residuals, leading to improved performance in handling complex, high-dimensional data.
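A sketch of this fixed-point iteration in the Gram-matrix form used later in Algorithm 1 is given below. The residual norms are evaluated through Gram matrices only; the function names, the stopping tolerance, and the hyperparameter defaults are our own placeholders, and the weight function is assumed to be one of those described in the next subsection.
```python
import numpy as np

def irls_kbr_coefficients(G_X, G_Y, k_y_query, r_hat, weight_fn,
                          lam=1e-3, n_iter=10, tol=1e-6):
    """Iteratively re-weighted importance KBR in Gram form (sketch).

    r_hat:     non-negative density-ratio estimates (importance weights).
    weight_fn: maps a residual norm to a weight in [0, 1].
    Returns posterior coefficients for the conditioning point with kernel
    column k_y_query, i.e. m_Q ~= sum_i xi_i phi(x_i).
    """
    n = G_X.shape[0]
    omega = np.ones(n)                      # start with uniform residual weights
    A_prev = None
    for _ in range(n_iter):
        W = np.diag(omega * r_hat)
        # A[:, i] are the coefficients of E_k psi(y_i) over {phi(x_j)}.
        A = np.linalg.solve(W @ G_Y + n * lam * np.eye(n), W @ G_Y)
        # Residual norms ||phi(x_i) - E_k psi(y_i)|| via Gram matrices.
        res_sq = (np.diag(G_X) - 2.0 * np.einsum('ij,ji->i', G_X, A)
                  + np.einsum('ji,jl,li->i', A, G_X, A))
        omega = weight_fn(np.sqrt(np.maximum(res_sq, 0.0)))
        if A_prev is not None and np.linalg.norm(A - A_prev) < tol:
            break
        A_prev = A
    W = np.diag(omega * r_hat)
    return np.linalg.solve(W @ G_Y + n * lam * np.eye(n), W @ k_y_query)
```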
Theorem 1.
(Convergence analysis). Suppose that the kernels $k_x$ and $k_y$ are continuous and bounded, and assume that the density ratio $r_0$ and the conditional operator $E_Q$ are smooth. Given data $\{(x_i, y_i)\}_{i=1}^{n} \sim P$ and estimated covariance operators such that $\|\hat{C}_{YY}^Q - C_{YY}^Q\| \in O_p(n^{-\alpha})$ and $\|\hat{C}_{XY}^Q - C_{XY}^Q\| \in O_p(n^{-\alpha})$, by setting $\lambda = O_p\big(n^{-\frac{\alpha}{\beta+1}}\big)$ we have
$$\|\tilde{E}_Q - E_Q\| \in O_p\big(n^{-\frac{\alpha\beta}{\beta+1}}\big),$$
where $E_Q$ is given by Equation (2). The proof is presented in Appendix A.
The Choice of Weight Function
Many weight functions have been described in the literature, e.g., the Huber weight function [19] and the Hampel weight function [20], which do not satisfy the convergence conditions. In the following, we introduce three types of weight functions that do satisfy the convergence conditions (1)–(3).
Logistic function: Debruyne et al. [17] introduced the following logistic weight function:
$$\omega(x) = \frac{\tanh(x)}{x}.$$
This function is a variant of the standard logistic function, which is commonly employed to model probabilities in logistic regression. The logistic weight function is symmetric around zero and non-negative, with a derivative that becomes negative as the residual increases, effectively down-weighting observations with larger residuals.
The s-induced weight function: Dong and Yang [21] introduced the following weight function:
$$\omega(x) = \frac{a}{2x}\left(\frac{1}{1+\exp(-ax)} - \frac{1}{2}\right),$$
where a is a constant parameter. The s-induced weight function is also symmetric around zero and non-negative, and its derivative becomes negative as the residual increases, resulting in a progressive decrease in the weight assigned to larger residuals.
Tukey Biweight function: Tukey’s Biweight function, often used in robust statistics, is a weighting function that assigns weights to data points [22]. It is also known as Tukey’s bisquare function. It is defined as follows:
$$\omega(x) = \begin{cases} \left(1 - \left(\dfrac{x}{c}\right)^2\right)^2 & \text{if } |x| < c, \\[4pt] 0 & \text{if } |x| \ge c, \end{cases}$$
where c is a constant that determines the range of the function. The function assigns higher weights to data points closer to zero and rapidly decreases the weights for data points farther away. This makes it less sensitive to outliers than a simple average.
These weight functions are designed to assign lower weights to larger residuals, which effectively reduces their influence on operator estimation. This strategy contributes to the development of robust regression techniques that are less sensitive to outliers.
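For concreteness, the three weight functions can be written as the Python functions below. This is a sketch: the s-induced form follows the expression given above and should be checked against [21], while the defaults $a = 1$ and $c = 1$ match the values used in the experiments of Section 5.
```python
import numpy as np

def logistic_weight(x):
    # tanh(x) / x, with the limiting value 1 at x = 0.
    x = np.asarray(x, dtype=float)
    return np.where(x == 0.0, 1.0, np.tanh(x) / np.where(x == 0.0, 1.0, x))

def s_induced_weight(x, a=1.0):
    # Reconstruction of the s-induced weight of Dong and Yang [21] (see text).
    x = np.asarray(x, dtype=float)
    safe = np.where(x == 0.0, 1.0, x)
    w = (a / (2.0 * safe)) * (1.0 / (1.0 + np.exp(-a * safe)) - 0.5)
    return np.where(x == 0.0, a ** 2 / 8.0, w)   # limit value at x = 0

def tukey_biweight(x, c=1.0):
    # (1 - (x/c)^2)^2 inside |x| < c, zero outside.
    x = np.asarray(x, dtype=float)
    w = (1.0 - (x / c) ** 2) ** 2
    return np.where(np.abs(x) < c, w, 0.0)
```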

4. The Iteratively Re-Weighted Importance Kernel Bayes Filter

In this section, we describe the iteratively re-weighted importance kernel Bayes filter. Kernel Bayesian inference is a well-founded approach to non-parametric inference in probabilistic graphical models, where the probabilistic relationships between variables are learned from data in a non-parametric manner.
In the filtering problem, the states evolve according to a Markov process determined by the state transition model ( p ( x t + 1 | x t ) ) describing the conditional probability of the next state ( x t + 1 ) given the current state ( x t ). Observation y t at time t is generated depending only on the corresponding state ( x t ) following the observation model ( p ( y t | x t ) ). When applying the kernel Bayes rule, we do not need to assume the conditional probabilities ( p ( y t | x t ) and p ( x t + 1 | x t ) ) to be known explicitly, nor do we estimate them with simple parametric models. Rather, we assume that a sample { x 1 , , x T , y 1 , , y T } is given for both the observable and hidden variables in the training phase.
The aim of the filtering method is to probabilistically estimate the state $x_{t+1}$ at each time $t+1$ using the new observation sequence $\hat{y}_1, \ldots, \hat{y}_{t+1}$, i.e., to estimate $p(x_{t+1} \mid \hat{y}_1, \ldots, \hat{y}_{t+1})$. The sequential estimate for the kernel mean of $p(x_{t+1} \mid \hat{y}_1, \ldots, \hat{y}_{t+1})$ can be derived by employing the iteratively re-weighted kernel Bayes rule and is obtained by iterating the following two steps.
Prediction step
Assume that we have the posterior embedding $m_{x_t \mid \hat{y}_{1:t}}$ at time $t$. Then, we can compute the embedding of the forward prediction $m_{x_{t+1} \mid \hat{y}_{1:t}}$ as follows:
$$m_{x_{t+1} \mid \hat{y}_{1:t}} = \big(E_{x_{t+1} \mid x_t}\big)^{*}\, m_{x_t \mid \hat{y}_{1:t}},$$
where $E_{x_{t+1} \mid x_t}$ is the conditional operator for $P(x_{t+1} \mid x_t)$. Empirically, this is estimated from the pairs $\{x_t, x_{t+1}\}$ as follows:
$$\hat{E}_{x_{t+1} \mid x_t} = \big(\hat{C}_{x_{T-1}, x_{T-1}} + \lambda_1 I\big)^{-1}\hat{C}_{x_{T-1}, x_T},$$
where $\lambda_1$ is another regularization coefficient and
$$\hat{C}_{x_{T-1}, x_{T-1}} = \frac{1}{T-1}\sum_{i=1}^{T-1}\phi(x_i)\otimes\phi(x_i), \qquad \hat{C}_{x_{T-1}, x_T} = \frac{1}{T-1}\sum_{i=1}^{T-1}\phi(x_i)\otimes\phi(x_{i+1}).$$
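Once everything is expressed through Gram matrices (cf. step 5 of Algorithm 1 below), this prediction step reduces to one regularized solve on the coefficient vector of the posterior embedding. A sketch, assuming the coefficient representation introduced in the update step that follows:
```python
import numpy as np

def predict_coefficients(xi_tt, G_X1, G_X1_tilde, lam1=1e-3):
    """Prediction step in Gram form (cf. Algorithm 1, step 5).

    xi_tt:       posterior coefficients at time t (length T).
    G_X1:        (T-1) x (T-1) Gram matrix over x_1, ..., x_{T-1}.
    G_X1_tilde:  (T-1) x T Gram matrix between x_1, ..., x_{T-1} and x_1, ..., x_T.
    Returns the predictive coefficients over x_1, ..., x_T, with a leading
    zero for x_1 as in the algorithm.
    """
    T1 = G_X1.shape[0]
    xi_pred_tail = np.linalg.solve(G_X1 + T1 * lam1 * np.eye(T1),
                                   G_X1_tilde @ xi_tt)
    return np.concatenate(([0.0], xi_pred_tail))
```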
Update step
When a new observation $\hat{y}_{t+1}$ is obtained, the mean embedding of the posterior distribution is expressed as follows:
$$m_{x_{t+1} \mid \hat{y}_{1:t+1}} = \big(\tilde{E}_Q\big)^{*}\psi(\hat{y}_{t+1}),$$
where $\tilde{E}_Q = \big(\omega(\tilde{E}_Q)\,\hat{C}_{YY}^Q + \lambda I\big)^{-1}\omega(\tilde{E}_Q)\,\hat{C}_{XY}^Q$.
During the calculation, it is preferable to work with Gram matrices rather than covariance operators; this simplifies the computation and can improve computational efficiency. In general, we express the mean embedding in terms of the feature maps at the latent training points as follows:
$$m_{x_{t+1} \mid \hat{y}_{1:t+1}} = \sum_{i=1}^{T}\xi_i^{t+1,t+1}\,\phi(x_i),$$
where $\xi^{t+1,t+1}$ is the coefficient vector at time $t+1$. Notably, the rule $(a \otimes b)\,c = a\,\langle b, c\rangle$ can be employed to turn the covariance operators into Gram matrices. Hence, the coefficient vector is
$$\xi = \big(\omega\,\hat{r}\,G_Y + T\lambda I\big)^{-1}\omega\,\hat{r}\,\hat{G}_Y,$$
where the Gram matrices are $(G_Y)_{ij} = k_y(y_i, y_j)$ and $(\hat{G}_Y)_{ij} = k_y(y_i, \hat{y}_j)$, $\omega = \mathrm{diag}(\omega_i)$, and $\hat{r}$ is the diagonal matrix with entries $(\hat{r})_{ii} = \big\langle \phi(x_i),\, (\hat{C}_{XX}^P + \eta I)^{-1}\hat{m}_\Pi \big\rangle = \big[\,T(G_X + T\eta I)^{-1} g_\Pi\,\big]_i$, where $(g_\Pi)_i = \langle \hat{m}_\Pi, \phi(x_i)\rangle$. The algorithm is summarized below (see Algorithm 1).
In this algorithm, we use a trick to calculate the weights: in the iteration step, $E_{k+1}$ would be $(\omega_k\,\hat{r}\,G_Y + T\lambda I)^{-1}(\omega_k\,\hat{r}\,C_{XY})$, but we instead work with $E_{k+1}\psi(y_i)$, so that only the Gram matrix $G_Y$ is needed rather than the covariance operator $C_{XY}$.
Kernel methods have some limitations because they rely on predefined features from the RKHS, which may not work well with complex or high-dimensional data. To address this, adaptive neural network features [16] refer to the features generated by neural networks that can automatically adjust and learn from the data during the training process. Unlike fixed features, these adaptive features evolve to better capture the underlying patterns in the data, especially in complex or high-dimensional scenarios. This adaptability makes them particularly useful in situations where traditional methods, like kernel methods with predefined feature maps, may struggle to represent the data effectively. Here, we briefly introduce this concept. We rewrite the feature map ( ψ ) as ψ θ , an adaptive feature map represented by a neural network parameterized by θ . The optimal θ is obtained by minimizing the function as follows:
$$\mathcal{L}(\theta) = \mathrm{tr}\Big[\,G_X\big(\hat{r} - \hat{r}\,\Psi_\theta^{T}\big(\Psi_\theta\,\hat{r}\,\Psi_\theta^{T} + \lambda I\big)^{-1}\Psi_\theta\,\hat{r}\big)\Big],$$
where $\Psi_\theta = [\psi_\theta(y_1), \ldots, \psi_\theta(y_T)]$ and $\hat{r} = \mathrm{diag}(\hat{r}_i)$.
The Gram matrix in Equation (16) is then given by $(G_Y)_{ij} = k_y(y_i, y_j) = \psi_\theta(y_i)^{T}\psi_\theta(y_j)$.
In the following experiments, we use the linear kernel on the learned adaptive features $\psi_\theta(y)$ and a finite-dimensional random Fourier feature approximation of $\phi(x)$ to calculate the coefficients $\xi$ [23]. We use the linear feature $\phi(x) = x$ to estimate the posterior mean $\hat{m}_Q = \sum_i \xi_i x_i$ in the latent space. This approach does not necessitate using the same feature map as the one used to compute the weight function, as demonstrated in [14].
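The random Fourier feature approximation mentioned here can be sketched as follows for a Gaussian kernel [23]; the feature dimension, bandwidth, and test points are arbitrary and not the exact configuration used in the experiments.
```python
import numpy as np

def random_fourier_features(X, n_features=200, sigma=1.0, seed=0):
    """Random Fourier features approximating a Gaussian kernel [23]:
    z(x)^T z(x') ~= exp(-||x - x'||^2 / (2 sigma^2))."""
    rng = np.random.default_rng(seed)
    d = X.shape[1]
    W = rng.normal(scale=1.0 / sigma, size=(d, n_features))
    b = rng.uniform(0.0, 2.0 * np.pi, size=n_features)
    return np.sqrt(2.0 / n_features) * np.cos(X @ W + b)

# Quick check that inner products of the features approximate the kernel.
X = np.random.default_rng(1).normal(size=(5, 2))
Z = random_fourier_features(X, n_features=5000)
approx = Z @ Z.T
exact = np.exp(-((X[:, None, :] - X[None, :, :]) ** 2).sum(-1) / 2.0)
print(np.max(np.abs(approx - exact)))   # small approximation error
```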
Algorithm 1: Iteratively Re-weighted Importance Kernel Bayes Filter Algorithm
Input: Training dataset $\{(x_t, y_t)\}_{t=1}^{T}$, regularization parameters $\eta$, $\lambda_1$, $\lambda$,
   and test sequence $\hat{y}_1, \ldots, \hat{y}_{\tilde{T}}$.
Initialize $\xi^{0,1}$.
Compute Gram matrices $G_X, G_Y \in \mathbb{R}^{T \times T}$:
    $(G_X)_{ij} = k_x(x_i, x_j)$, $(G_Y)_{ij} = k_y(y_i, y_j)$.
For $t = 1$ to $\tilde{T}$ do:
      1. Compute the prior embedding $\hat{m}_\Pi$ at the conditioning point $\hat{y}_t$ using the training dataset
          $\{(x_t, y_t)\}_{t=1}^{T}$:    $\hat{m}_\Pi = \sum_{i=1}^{T} \xi_i^{t-1,t}\,\phi(x_i)$.
      2. Compute the density ratio $\hat{r} = (\hat{r}_1, \ldots, \hat{r}_T) \in \mathbb{R}^{T}$ as
          $\hat{r} = \max\big(0,\; T (G_X + T\eta I)^{-1} g_\Pi\big)$,
         where $(g_\Pi)_i = \langle \hat{m}_\Pi, \phi(x_i) \rangle$.
      3. For $k = 1$ to $K$:
            Compute the weights $\omega_k$, where $\omega_{k,i} = \omega\big(\|\phi(x_i) - E_k \psi(y_i)\|\big)$.
            Update the embedding $E_{k+1}\psi(y_i)$ as
             $E_{k+1}\psi(y_i) = (\omega_k\,\hat{r}\,G_Y + T\lambda I)^{-1} (\omega_k\,\hat{r}\,G_Y)\,\phi(x_i)$;
            if $\|E_{k+1}\psi(y_i) - E_k\psi(y_i)\| \le \epsilon$, stop.
      4. Compute $\xi^{t,t}$ as
          $\xi^{t,t} = (\omega\,\hat{r}\,G_Y + T\lambda I)^{-1} (\omega\,\hat{r}\,\hat{G}_Y)$,
         where $\omega = \mathrm{diag}(\omega_K)$, $\hat{r} = \mathrm{diag}(\hat{r})$, and $(\hat{G}_Y)_i = k_y(y_i, \hat{y}_t)$.
      5. Compute $\xi^{t,t+1}$ as
          $\xi_1^{t,t+1} = 0$ and $\xi_{2:T}^{t,t+1} = (G_{X_1} + (T-1)\lambda_1 I)^{-1} \tilde{G}_{X_1}\, \xi^{t,t}$,
         where $(G_{X_1})_{ij} = k_x(x_i, x_j) \in \mathbb{R}^{(T-1) \times (T-1)}$ and $(\tilde{G}_{X_1})_{ij} = k_x(x_i, x_j) \in \mathbb{R}^{(T-1) \times T}$.
End For
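Putting the pieces together, the following compact Python sketch mirrors the filtering loop of Algorithm 1. It reuses the hypothetical helpers sketched earlier in this article (gaussian_kernel, irls_kbr_coefficients, predict_coefficients, and one of the weight functions); these names, the uniform initialization, and the hyperparameter defaults are ours, not the paper's.
```python
import numpy as np

def ire_kbf(X_train, Y_train, Y_test, weight_fn,
            eta=1e-2, lam=1e-3, lam1=1e-3, sigma_x=1.0, sigma_y=1.0):
    """Iteratively re-weighted importance kernel Bayes filter (sketch of Algorithm 1).
    Returns a posterior mean estimate of the latent state for each test observation,
    using the linear feature phi(x) = x on the latent side as in Section 4."""
    T = X_train.shape[0]
    G_X = gaussian_kernel(X_train, X_train, sigma_x)
    G_Y = gaussian_kernel(Y_train, Y_train, sigma_y)
    G_X1 = G_X[:T - 1, :T - 1]          # (T-1) x (T-1) block, Algorithm 1, step 5
    G_X1_tilde = G_X[:T - 1, :]         # (T-1) x T block, Algorithm 1, step 5

    xi_pred = np.full(T, 1.0 / T)       # placeholder initial prior coefficients
    estimates = []
    for y_hat in Y_test:
        # Step 2: density-ratio weights from the current prior embedding
        # (KuLSIF form, with g_Pi = G_X xi_pred).
        mu = T * np.linalg.solve(G_X + T * eta * np.eye(T), G_X @ xi_pred)
        r_hat = np.maximum(0.0, mu)
        # Steps 3-4: update via the iteratively re-weighted importance KBR.
        k_y = gaussian_kernel(Y_train, y_hat[None, :], sigma_y)[:, 0]
        xi_post = irls_kbr_coefficients(G_X, G_Y, k_y, r_hat, weight_fn, lam=lam)
        estimates.append(xi_post @ X_train)      # posterior mean with phi(x) = x
        # Step 5: prediction for the next time index.
        xi_pred = predict_coefficients(xi_post, G_X1, G_X1_tilde, lam1=lam1)
    return np.array(estimates)
```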

5. Numerical Illustration

In this section, we explore the effectiveness of our method by applying it to two nonlinear dynamical models, with the code written in Python.
A Synthetic Problem:
Inspired by Fukumizu et al. [14] and Xu et al. [16], a simple synthetic nonlinear dynamical system is introduced to illustrate the kernel Bayesian filtering methods. With a latent variable $X_t = (u_t, v_t)^{T}$, the synthetic dynamics are described as
$$\begin{pmatrix} u_{t+1} \\ v_{t+1} \end{pmatrix} = \big(1 + \beta \sin(M\theta_t)\big)\begin{pmatrix} \cos(\theta_t + \omega) \\ \sin(\theta_t + \omega) \end{pmatrix} + \epsilon_X,$$
where $\theta_t \in [0, 2\pi]$ and
$$\cos(\theta_t) = \frac{u_t}{\sqrt{u_t^2 + v_t^2}}, \qquad \sin(\theta_t) = \frac{v_t}{\sqrt{u_t^2 + v_t^2}},$$
for given parameters $(\beta, M, \omega)$ and $\epsilon_X \sim N(0, \sigma_X^2 I)$. The noise $\epsilon_X$ is an independent process. Observation $Y_t$ is expressed as
$$Y_t = X_t + \epsilon_Y, \qquad \epsilon_Y \sim N(0, \sigma_Y^2 I),$$
where $\epsilon_Y$ is independent noise. In the following experiments, the parameters of the dynamics are set to $\omega = 0.2$, $\beta = 0.2$, and $M = 8$, and the noise levels are set to $\sigma_X = 0.2$ and $\sigma_Y = 0.2$. The $\eta$ and $\lambda_1$ parameters are tuned by the kernel-based unconstrained least-squares importance fitting (KuLSIF) leave-one-out cross-validation procedure. The bandwidth of the Gaussian kernel employed for all KBF methods is given by the median of pairwise distances among the training set.
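To make the setup concrete, the following sketch shows how such a training sequence and the median-heuristic bandwidth could be generated under the stated parameters; the sequence length, initial state, and random seed are arbitrary choices not specified in the paper.
```python
import numpy as np

def simulate_synthetic(T=100, beta=0.2, M=8, omega=0.2,
                       sigma_x=0.2, sigma_y=0.2, seed=0):
    """Rotation-type synthetic dynamics of Section 5 with additive Gaussian noise."""
    rng = np.random.default_rng(seed)
    X = np.zeros((T, 2))
    X[0] = [1.0, 0.0]                              # arbitrary initial state
    for t in range(T - 1):
        u, v = X[t]
        theta = np.arctan2(v, u)                   # cos(theta)=u/r, sin(theta)=v/r
        radius = 1.0 + beta * np.sin(M * theta)
        X[t + 1] = radius * np.array([np.cos(theta + omega),
                                      np.sin(theta + omega)])
        X[t + 1] += sigma_x * rng.normal(size=2)   # process noise eps_X
    Y = X + sigma_y * rng.normal(size=X.shape)     # noisy observations Y_t
    return X, Y

def median_bandwidth(Z):
    """Median of pairwise distances, used as the Gaussian kernel bandwidth."""
    d = np.sqrt(((Z[:, None, :] - Z[None, :, :]) ** 2).sum(-1))
    return np.median(d[np.triu_indices_from(d, k=1)])

X_train, Y_train = simulate_synthetic()
print(median_bandwidth(Y_train))
```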
In the following figures, "original" denotes the performance of the original KBF estimator using the $\hat{E}_{Q,\lambda}$ operator in Equation (4). "IW" and "re-weight" denote the importance-weighted KBF using the $\hat{E}_{Q,\lambda}$ operator in Equation (7) and the iteratively re-weighted importance kernel Bayes filter using the $\tilde{E}_Q$ operator in Equation (12), respectively.
Figure 1 illustrates the posterior approximation of the latent variables u and v. The red line represents the given latent test sequence, and the blue, yellow, and green lines correspond to the three KBF methods used to approximate the latent variables. This experiment was conducted with a training sequence length of 100 and the regularization parameter $\lambda$ set to $10^{-3}$. The s-induced weight function was employed in the IRe-KBF method in this experiment.
As observed in Figure 1, the KBF method utilizing iterative re-weighting yields the closest approximation to the actual latent test values, indicating that this method outperforms the other two techniques in this setting. However, such a clear-cut outcome is not always obtained visually. We therefore compute the mean squared error (MSE) between the posterior estimates of the latent variables and the latent test values to evaluate the performance of the methods, as shown in Figure 2.
First, we compared the kernel Bayes filter with the three different weight functions described in Section 3 to find the best-performing weight function. In this experiment, the parameters were $a = 1$ in the s-induced weight function and $c = 1$ in the Tukey biweight function. The essential regularization parameter $\lambda$ was obtained by hyperparameter tuning. Figure 3 summarizes the mean squared error (MSE) over 10 runs when the conditioning points $\hat{y}_t$ are sampled. The length of the test sequence is set to 50.
Figure 3 shows that the IRe-KBF with the s-induced weight function yields the smallest error between the posterior estimates and the test sequence. Hence, in the following, we apply the s-induced weight function within the IRe-KBF method.
The MSE with varying training lengths is summarized in Figure 2. As depicted in the figure, the MSE decreases as the training length increases, which is expected. Furthermore, the "re-weight" method demonstrates performance comparable or superior to that of the other two KBF methods.
We also conducted an experiment employing the ensemble Kalman filter (EnKF), which yielded a mean squared error (MSE) of 0.049. This value is notably lower than the MSE obtained by the three kernel-based filtering methods when the training length is less than 200. This result is reasonable, since the ensemble Kalman filter is applied with an explicit model in a relatively low-dimensional system.
Lorenz 96 model:
The Lorenz 96 model is a simplified mathematical model used to study atmospheric and climate dynamics. It was proposed by Edward Lorenz in 1996 to capture key nonlinear characteristics of such systems in a more manageable form.
The standard form of the Lorenz 96 model is given by the following set of differential equations:
$$\frac{dx_i}{dt} = (x_{i+1} - x_{i-2})\,x_{i-1} - x_i + F,$$
where
  • $x_i$ represents the $i$-th variable of the system, $i = 1, \ldots, N$;
  • $F$ is an external forcing term that represents the energy input into the system; and
  • the indices are treated cyclically, meaning $x_{N+1} = x_1$, $x_0 = x_N$, and $x_{-1} = x_{N-1}$.
If the Lorenz 96 model has four variables (i.e., x 1 , x 2 , x 3 , x 4 ), the system of equations becomes
$$\frac{dx_1}{dt} = (x_2 - x_4)\,x_3 - x_1 + F,$$
$$\frac{dx_2}{dt} = (x_3 - x_1)\,x_4 - x_2 + F,$$
$$\frac{dx_3}{dt} = (x_4 - x_2)\,x_1 - x_3 + F,$$
$$\frac{dx_4}{dt} = (x_1 - x_3)\,x_2 - x_4 + F.$$
This system is a four-dimensional nonlinear system, since it involves four state variables ( X = ( x 1 , x 2 , x 3 , x 4 ) T ), and each variable’s evolution is influenced by nonlinear interactions with the other variables. The Lorenz 96 model can be generalized to any number of variables (N), but in this case, we are considering the specific case where N = 4 .
Observation Y is expressed as follows:
$$Y_t = H X_t + \epsilon_Y,$$
where $\epsilon_Y$ is independent noise with $\epsilon_Y \sim N(0, \sigma_Y^2 I)$, and $X_t$ is the state at time $t$.
In this experiment, the observation matrix $H$ is expressed as follows:
$$H = \begin{pmatrix} 1 & 0 & 0 & 0 \\ 0 & 0 & 1 & 0 \end{pmatrix}.$$
In the filtering problem, the parameters are $F = 1$ and $N = 4$. The dynamical system is discretized with a time step of $dt = 0.1$. The latent-variable noise is set to zero, while the observation noise is characterized by $\sigma_Y = 0.1$. The test size for this experiment was set to 70. Additionally, the $\eta$ and $\lambda_1$ parameters are tuned by the kernel-based unconstrained least-squares importance fitting (KuLSIF) leave-one-out cross-validation procedure. The essential regularization parameter $\lambda$ is obtained by hyperparameter tuning, and the s-induced weight function was employed in this experiment. The results presented in Figure 4 consist of graphs depicting the relationship over the whole filtering process between $x_1$, $x_2$, and $x_3$, the first three variables of the latent variable vector $X$. It is clear that the KBF with re-weighting yields a better approximation of the actual test latent values, which are given by the dynamics, than the other two methods.
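For concreteness, the following sketch generates four-variable Lorenz 96 states and observations under the stated setup (forward-Euler discretization with $dt = 0.1$, $F = 1$, $N = 4$, $\sigma_Y = 0.1$, and the observation matrix $H$ above), following the general form of the model; the initial condition, sequence length, and seed are arbitrary.
```python
import numpy as np

def lorenz96_step(x, F=1.0, dt=0.1):
    """One forward-Euler step of dx_i/dt = (x_{i+1} - x_{i-2}) x_{i-1} - x_i + F
    with cyclic indices."""
    N = len(x)
    dxdt = np.array([(x[(i + 1) % N] - x[(i - 2) % N]) * x[(i - 1) % N] - x[i] + F
                     for i in range(N)])
    return x + dt * dxdt

def simulate_lorenz96(T=100, N=4, F=1.0, dt=0.1, sigma_y=0.1, seed=0):
    rng = np.random.default_rng(seed)
    H = np.array([[1, 0, 0, 0],
                  [0, 0, 1, 0]], dtype=float)   # observation matrix from the text
    X = np.zeros((T, N))
    X[0] = F + 0.01 * rng.normal(size=N)        # small perturbation of the fixed point
    for t in range(T - 1):
        X[t + 1] = lorenz96_step(X[t], F=F, dt=dt)
    Y = X @ H.T + sigma_y * rng.normal(size=(T, H.shape[0]))
    return X, Y

X_train, Y_train = simulate_lorenz96()
```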
We also computed the MSE between the posterior estimates and the test latent sequence. The results are depicted in Figure 5. As illustrated in Figure 5, the "re-weight" method consistently performs better than the other two methods, whereas the original kernel Bayesian filter exhibits significant instability in performance. However, it is worth noting that the obtained results are sensitive to the $\lambda$ parameter: in Figure 6, we observe that the performance of IRe-KBF declines significantly when $\lambda$ is small.
Furthermore, we conducted an experiment using the extended Kalman filter (EKF), resulting in an MSE of 375.21. This value is notably larger when compared to the performance of the three kernel-based filtering methods. The limitations of the EKF are apparent, particularly in handling complex nonlinear systems. Conversely, the iteratively re-weighted importance kernel Bayes filter (IRe-KBF) demonstrates its ability to capture intricate nonlinear dependencies between observations and latent states.

6. Conclusions

In this paper, we proposed a robust approach to kernel Bayes filtering that does not depend on an explicit dynamical system; the kernel Bayes filter process only depends on a training dataset to calculate the transition and observation operators. The method employs multiple weight functions, i.e., the density ratio and a weight function of the residuals. Additionally, we employed an iteratively re-weighted process to enhance robustness and efficiency.
Two examples representing high-dimensional dynamics, namely a nonlinear synthetic model and Lorenz 96, were used to illustrate the performance of the IRe-KBF. Experimental results demonstrate the effectiveness of IRe-KBF in handling complex data, with superior performance over the traditional KBF method and the importance-weighted KBF method on high-dimensional datasets.
In this paper, we also conducted some research on the weight function, although more research is needed in this area. Future work will focus on exploring and refining weight functions to optimize their performance in kernel Bayes filtering applications.

Funding

This research was funded by the China Postdoctoral Science Foundation (Certificate Number 2023TQ0015).

Institutional Review Board Statement

This study did not involve humans or animals.

Data Availability Statement

Data are contained within the article.

Conflicts of Interest

The author declares no conflicts of interest.

Appendix A

The Convergence Proof
We prove that $\hat{E}_{Q,\lambda,\omega}$ is consistent with $E_Q$; the argument relies on the following lemma.
Lemma A1
(Theorem 3.6 in [16]). Given $\{(x_i, y_i)\}_{i=1}^{n} \sim P$ and estimated covariance operators such that $\|\hat{C}_{YY}^Q - C_{YY}^Q\| \in O_p(n^{-\alpha})$ and $\|\hat{C}_{XY}^Q - C_{XY}^Q\| \in O_p(n^{-\alpha})$, by setting $\lambda = O\big(n^{-\frac{\alpha}{\beta+1}}\big)$, we have
$$\|\hat{E}_{Q,\lambda} - E_Q\| \in O_p\big(n^{-\frac{\alpha\beta}{\beta+1}}\big).$$
Proof of Theorem 1.
We prove that $\tilde{E}_Q$ is consistent with $E_Q$ in Equation (2), where
$$\tilde{E}_Q = \big(\omega\,\hat{C}_{YY}^Q + \lambda I\big)^{-1}\omega\,\hat{C}_{XY}^Q.$$
We decompose the error as follows:
$$\|E_Q - \tilde{E}_Q\| \le \|E_Q - \hat{E}_{Q,\lambda}\| + \|\hat{E}_{Q,\lambda} - \tilde{E}_Q\|.$$
According to Lemma A1,
$$\|\hat{E}_{Q,\lambda} - E_Q\| \in O_p\big(n^{-\frac{\alpha\beta}{\beta+1}}\big).$$
The second term can be bounded as follows:
$$\begin{aligned}
\|\hat{E}_{Q,\lambda} - \tilde{E}_Q\| &= \big\|(\omega\,\hat{C}_{YY}^Q + \lambda I)^{-1}\big(\omega\,\hat{C}_{YY}^Q\hat{E}_{Q,\lambda} + \lambda\hat{E}_{Q,\lambda} - \omega\,\hat{C}_{XY}^Q\big)\big\| \\
&\le \big\|(\omega\,\hat{C}_{YY}^Q + \lambda I)^{-1}\big\|\,\big\|\omega\,\hat{C}_{YY}^Q\hat{E}_{Q,\lambda} + \lambda\hat{E}_{Q,\lambda} - \omega\,\hat{C}_{XY}^Q\big\| \\
&\le \frac{1}{\lambda}\big\|\omega\,\hat{C}_{YY}^Q\hat{E}_{Q,\lambda} + \lambda\hat{E}_{Q,\lambda} - \omega\,\hat{C}_{XY}^Q\big\| \\
&= \frac{1}{\lambda}\big\|\omega\,\hat{C}_{YY}^Q\hat{E}_{Q,\lambda} + \big(\hat{C}_{XY}^Q - \hat{C}_{YY}^Q\hat{E}_{Q,\lambda}\big) - \omega\,\hat{C}_{XY}^Q\big\| \\
&= \frac{1}{\lambda}\big\|(\omega - 1)\,\hat{C}_{YY}^Q\hat{E}_{Q,\lambda} - (\omega - 1)\,\hat{C}_{XY}^Q\big\| \\
&= \frac{1}{\lambda}\big\|(1 - \omega)\,\lambda\hat{E}_{Q,\lambda}\big\| = \big\|(1 - \omega)\,\hat{E}_{Q,\lambda}\big\|,
\end{aligned}$$
where we used the identity $\lambda\hat{E}_{Q,\lambda} = \hat{C}_{XY}^Q - \hat{C}_{YY}^Q\hat{E}_{Q,\lambda}$, which follows from $\hat{E}_{Q,\lambda} = (\hat{C}_{YY}^Q + \lambda I)^{-1}\hat{C}_{XY}^Q$.
Therefore, we have
$$\|E_Q - \tilde{E}_Q\| \le \|\hat{E}_{Q,\lambda} - E_Q\| + \big\|(1 - \omega)\,\hat{E}_{Q,\lambda}\big\|.$$
Hence, if $\|\hat{C}_{YY}^Q - C_{YY}^Q\| \in O_p(n^{-\alpha})$ and $\|\hat{C}_{XY}^Q - C_{XY}^Q\| \in O_p(n^{-\alpha})$, we have
$$\|E_Q - \tilde{E}_Q\| \in O_p\big(n^{-\frac{\alpha\beta}{\beta+1}}\big)$$
by setting $\lambda = O\big(n^{-\frac{\alpha}{\beta+1}}\big)$.    □

References

  1. Chen, Z. Bayesian filtering: From Kalman filters to particle filters, and beyond. Statistics 2003, 182, 1–69.
  2. Zhang, F.; Xue, W.-F.; Liu, X. Overview of nonlinear Bayesian filtering algorithm. Procedia Eng. 2011, 15, 489–495.
  3. Särkkä, S.; Svensson, L. Bayesian Filtering and Smoothing; Cambridge University Press: Cambridge, UK, 2023; Volume 17.
  4. Candy, J.V. Bayesian Signal Processing: Classical, Modern, and Particle Filtering Methods; John Wiley & Sons: Hoboken, NJ, USA, 2016; Volume 54.
  5. Kim, D.; Park, M.; Park, Y.L. Probabilistic modeling and Bayesian filtering for improved state estimation for soft robots. IEEE Trans. Robot. 2021, 37, 1728–1741.
  6. Javaheri, A.; Lautier, D.; Galli, A. Filtering in finance. Wilmott 2003, 3, 67–83.
  7. Lopes, H.F.; Tsay, R.S. Particle filters and Bayesian inference in financial econometrics. J. Forecast. 2011, 30, 168–209.
  8. Muandet, K.; Fukumizu, K.; Sriperumbudur, B.; Schölkopf, B. Kernel mean embedding of distributions: A review and beyond. Found. Trends Mach. Learn. 2017, 10, 1–141.
  9. Smola, A.; Gretton, A.; Song, L.; Schölkopf, B. A Hilbert space embedding for distributions. In Proceedings of the International Conference on Algorithmic Learning Theory, Sendai, Japan, 1–4 October 2007; Springer: Berlin/Heidelberg, Germany, 2007; pp. 13–31.
  10. Song, L.; Huang, J.; Smola, A.; Fukumizu, K. Hilbert space embeddings of conditional distributions with applications to dynamical systems. In Proceedings of the 26th Annual International Conference on Machine Learning, Montreal, QC, Canada, 14–18 June 2009; pp. 961–968.
  11. Gretton, A.; Borgwardt, K.; Rasch, M.; Schölkopf, B.; Smola, A. A kernel method for the two-sample-problem. Adv. Neural Inf. Process. Syst. 2006, 19, 513–520.
  12. Gretton, A.; Fukumizu, K.; Teo, C.; Song, L.; Schölkopf, B.; Smola, A. A kernel statistical test of independence. Adv. Neural Inf. Process. Syst. 2007, 20, 585–592.
  13. Fukumizu, K.; Gretton, A.; Sun, X.; Schölkopf, B. Kernel measures of conditional dependence. Adv. Neural Inf. Process. Syst. 2007, 20, 489–496.
  14. Fukumizu, K.; Song, L.; Gretton, A. Kernel Bayes' rule: Bayesian inference with positive definite kernels. J. Mach. Learn. Res. 2013, 14, 3753–3783.
  15. Boots, B.; Gordon, G.; Gretton, A. Hilbert space embeddings of predictive state representations. arXiv 2013, arXiv:1309.6819.
  16. Xu, L.; Chen, Y.; Doucet, A.; Gretton, A. Importance Weighting Approach in Kernel Bayes' Rule. arXiv 2022, arXiv:2202.02474.
  17. Debruyne, M.; Christmann, A.; Hubert, M.; Suykens, J.A. Robustness of reweighted least squares kernel based regression. J. Multivar. Anal. 2010, 101, 447–463.
  18. Kanamori, T.; Suzuki, T.; Sugiyama, M. Statistical analysis of kernel-based least-squares density-ratio estimation. Mach. Learn. 2012, 86, 335–367.
  19. Huber, P.J. Robust Statistics; John Wiley & Sons: Hoboken, NJ, USA, 2004; Volume 523.
  20. Law, J. Robust Statistics—The Approach Based on Influence Functions; Taylor & Francis: Abingdon, UK, 1986.
  21. Dong, H.; Yang, L. Kernel-based regression via a novel robust loss function and iteratively reweighted least squares. Knowl. Inf. Syst. 2021, 63, 1149–1172.
  22. Fox, J.; Weisberg, S. Robust regression. R S-Plus Companion Appl. Regres. 2002, 91, 6.
  23. Rahimi, A.; Recht, B. Random features for large-scale kernel machines. Adv. Neural Inf. Process. Syst. 2007, 20, 1177–1184.
Figure 1. Posterior approximations of variables u (a) and v (b) using three methods.
Figure 2. The MSE results of a synthetic problem with the original KBF, importance weighted KBF, and iteratively re-weighted importance KBF.
Figure 3. The result of the proposed KBF with three different weight functions. "weight_Tu" denotes KBF with the Tukey weight function. "weight_ex" represents KBF with the s-induced weight function. "weight_tan" denotes the logistic weight function.
Figure 4. Posterior approximations of the latent variable using three methods, with training set to 70. (a) Observation noise of $\sigma_Y = 0.01$; (b) observation noise of $\sigma_Y = 0.1$.
Figure 5. The MSE results of Lorenz 96 using three methods with a test set of 70 and observation noise set at 0.1.
Figure 6. The MSE results of Lorenz 96 using different $\lambda$ values with a test size of 100 and observation noise set at 0.1.