1. Introduction
Optimal filtering addresses the problem of optimally estimating the state or signal of a system. The classical Wiener filter is a frequency-domain method that employs the spectral decomposition of stationary stochastic processes. However, Wiener filtering is limited by its restriction to stationary processes, the need to store all historical data, and high computational complexity; thus, its application is somewhat limited [1]. The Kalman filter (KF) proposed later is a time-domain filtering method. It employs state-space techniques and can be computed recursively in real time, making it suitable for time-varying systems, non-stationary stochastic processes, and multidimensional signal filtering. Moreover, the KF overcomes the shortcomings and limitations of the classical Wiener filter and is widely used in many fields, such as target tracking [2,3], relative orbit estimation [4,5], and navigation [6].
When the system obtains the observation vector, instability of the system and the large amount of clutter and noise in the observation background [7] often produce data points in the observations that deviate severely from the true target value, that is, outliers [8]. Outliers are characterized by irregularity, short duration, and large amplitude. Under their interference, a filtering algorithm is prone to numerical overflow, increased data processing errors, and even filter divergence [9,10].
Existing studies on outliers can generally be categorized into two main types. One approach incorporates outliers into conventional noise and investigates their probability density. Because this distribution has heavier tails than the Gaussian distribution, it is referred to as a heavy-tailed distribution [11]. Based on its statistical properties, a filtering procedure suited to non-Gaussian distributions can then be employed [12]. The other approach regards outliers as statistically irregular points to be identified and processed separately [13]. In practice, outliers often accompany abrupt system failures or sudden changes in the external environment and therefore lack predictable regularity, making offline methods difficult to apply in real time [14]. Moreover, introducing low-probability outliers into the noise distribution makes it difficult for the statistical characteristics of the observations to reflect the outlier-free statistics, which reduces the robustness of the filter [15]. Therefore, investigating how to identify and process outliers in real time is of both theoretical significance and practical value.
To precisely detect outliers, the authors of [16] proposed an innovation-based method, where the innovation denotes the difference between the observation and its predicted value. Ref. [17] proposes a semidefinite program for outlier detection. These filtering algorithms tend to detect outliers and then correct them, which is of little use in the face of prominent outliers or missing observations. Ref. [18] reveals that for outlier-prone processes the optimal filter is nonlinear, even if the system dynamics are linear. Since machine learning methods are well suited to finding complex nonlinear relationships in high-dimensional function spaces, this class of methods is currently a strong choice. Ref. [19] uses the innovation as the input of a deep neural network to train a Long Short-Term Memory (LSTM) network [20] that produces an initial estimate, and the uncertainty of the system noise variance is used as a new measurement in the filter. This method uses the network output directly as a measurement; its filtering effect is limited by the type of noise used for training and depends heavily on the stability of the network. Many neural-network-based methods are also used for state prediction [21,22], but these methods cannot make accurate predictions for data containing noise [23,24]. When filtering the states of complex and variable objects, it is hard to build a precise state prediction model [25]; the observations then significantly affect the filtering results, and aircraft tracking is a prime example of such problems [26]. In addition, transformers [27,28] exhibit superior global information extraction and processing capabilities compared to traditional neural networks such as LSTM, particularly for time-series prediction tasks. Therefore, an outlier-robust filter that combines accurate prediction with outlier processing capabilities is of great research value [29].
In this paper, drawing on the characteristics of existing outlier-robust filtering methods and of knowledge- and data-driven state prediction methods with high prediction accuracy and strong real-time performance, the Transformer-based Outlier-Robust Kalman Filter (TORKF) is proposed. This method accurately identifies outliers in the observations and corrects them with a high-precision prediction method, which improves the accuracy and robustness of the filtering algorithm. The main innovations in this study can be summarized as follows:
(1) Instead of investigating the probability density of outliers, an Outlier-Robust Kalman Filter (ORKF) based on innovation is proposed to identify and process outliers precisely.
(2) To address the issue of prediction errors caused by incomplete matching between the state space model and the actual state of objects, a Transformer-based Prediction Error Compensation (TPEC) model is proposed.
(3) A Prediction Error Covariance Correction (PECC) method is proposed to correct the misalignment of the prediction error covariance matrix caused by outliers.
The remainder of this paper is organized as follows. Section 2 briefly describes the outlier-robust filtering problem and its modeling method. Section 3 provides the details of the proposed TORKF algorithm. The experimental results and discussion are presented in Section 4. The conclusions are drawn in Section 5.
2. Problem Description
This paper considers the state space model as the following discretized linear system [30]:
where x_k is the n-dimensional state vector of the system at moment k, z_k is the m-dimensional measurement vector of the system at moment k, f is the system state function, and h is the system measurement function. w_k is the model noise at moment k, and v_k is the observation noise at moment k, which does not contain outliers. Both are zero-mean Gaussian white noises and are independent of each other; their covariance matrices are Q_k and R_k, respectively.
Outliers are challenging to predict and unrelated to other noise; they also have large amplitudes. Therefore, the observation equation is redefined as Equation (3).
where o_k is the outlier, which is bounded. o_k is statistically independent of w_k and v_k, and has its own covariance matrix.
Define the sequence of input vectors, the observation marker, and the outlier-robust filtering method accordingly. The objective is that the error between the current outlier-robust filtering result and the current true value is minimized, satisfying both objectives, as shown in Equation (4).
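To make the redefined observation equation concrete, the sketch below simulates measurements of the form z_k = H x_k + v_k + o_k, where v_k is regular zero-mean Gaussian noise and o_k is a sparse, large-amplitude outlier. The matrices, probabilities, and amplitudes are illustrative assumptions, not values from the paper.

```python
import numpy as np

rng = np.random.default_rng(0)

def simulate_measurements(x_true, H, R, outlier_prob=0.05, outlier_scale=50.0):
    """Generate z_k = H x_k + v_k + o_k: Gaussian noise v_k with
    covariance R plus a rare, bounded, large-amplitude outlier o_k."""
    m = H.shape[0]
    zs = []
    for x in x_true:
        v = rng.multivariate_normal(np.zeros(m), R)   # regular observation noise
        o = np.zeros(m)
        if rng.random() < outlier_prob:               # sparse outlier event
            o = outlier_scale * rng.standard_normal(m)
        zs.append(H @ x + v + o)
    return np.asarray(zs)

# toy example: 1D constant-velocity state, position-only measurement
x_true = np.stack([np.array([t * 1.0, 1.0]) for t in range(200)])
H = np.array([[1.0, 0.0]])
R = np.array([[1.0]])
z = simulate_measurements(x_true, H, R)
```

The resulting sequence mixes well-behaved Gaussian errors with occasional large spikes, which is exactly the regime the outlier-robust filter must handle.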
3. Proposed Algorithm
The overall structure of the proposed TORKF is shown in Figure 1. The innovation-based ORKF, the TPEC model, and the PECC method are all designed for TORKF. After initialization, TORKF first checks whether there is an observation input. In the absence of input, and before the termination condition is met, it treats the missing input as if it were an outlier. With input available, the filter proceeds with prediction. The prediction result and the observation serve as inputs for outlier detection, and the filter updates itself in the absence of outliers. When an outlier is detected or there is no observation, the TPEC model is employed to adjust the prediction result, after which the prediction error covariance is corrected to obtain the final filtered output.
A more detailed description of the filtering process is shown in Figure 2, which includes the complete judgment logic and clear inputs and outputs.
3.1. Outlier-Robust Kalman Filter
Considering the discrete linear model described in the problem statement, the core of the KF is to minimize the variance of the estimation error while guaranteeing that the linear optimal estimate is unbiased, namely,
where x̂_k is the state estimate of the KF. When outliers are present in the observation noise, the observation error no longer conforms to the statistical properties of zero-mean Gaussian white noise, and it is then difficult for the KF output to satisfy the minimum-variance principle. Therefore, each input observation vector is evaluated for outliers. Like the KF, the ORKF can be presented in two parts: the time update and the measurement update.
3.1.1. Time Update
During the time update, the ORKF propagates the linear estimate based on the state equation and updates the state error covariance, as shown in Equations (6) and (7), respectively.
where x_{k-1} is the n-dimensional state estimate of the system at moment k-1, x_{k|k-1} is the n-dimensional prediction vector of the system at moment k, F_k is the state transition matrix, and P_{k-1} is the system state covariance matrix at moment k-1. P_{k|k-1} is the prediction error covariance matrix at moment k, and Q_{k-1} is the model error covariance matrix at moment k-1.
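Equations (6) and (7) are the standard linear time update, which can be sketched in a few lines of Python; the constant-velocity matrices below are illustrative assumptions chosen only to exercise the function.

```python
import numpy as np

def time_update(x, P, F, Q):
    """KF/ORKF time update: propagate the state estimate (Eq. (6))
    and the prediction error covariance (Eq. (7))."""
    x_pred = F @ x                    # state prediction
    P_pred = F @ P @ F.T + Q          # covariance prediction
    return x_pred, P_pred

# assumed 1D constant-velocity example, dt matching the paper's 0.5 s period
dt = 0.5
F = np.array([[1.0, dt], [0.0, 1.0]])
Q = 0.01 * np.eye(2)
x, P = np.array([0.0, 1.0]), np.eye(2)
x_pred, P_pred = time_update(x, P, F, Q)
```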
3.1.2. Measurement Update
In the measurement update phase, outlier detection must first be carried out on the input measurement. The purpose of outlier detection is to identify observation vectors containing anomalous noise. To identify outliers accurately, this study uses the innovation variance as the identification vector and the prediction covariance as the identification control vector. The innovation is calculated as follows:
where z_k is the m-dimensional observation vector of the system at moment k, H_k is the measurement matrix, and o_k is the outlier, which is bounded. o_k is statistically independent of w_k and v_k, and has its own covariance matrix.
When there is no outlier, the prediction covariance matrix is as follows:
When there is an outlier, the prediction covariance matrix is shown in Equation (11).
Select the innovation covariance as the matrix for determining whether the innovation vector contains outliers; it is shown in Equation (12). The judgment matrix is given in Equation (13), and the judgment equation is shown in Equation (14).
where the mark vector indicates whether an outlier exists: one value marks an outlier, and the other indicates that no outlier exists. The coefficient vector used for determining an outlier depends on the statistical characteristics of the outlier; when those characteristics are apparent, the coefficients are determined according to their differences. When the statistical characteristics of the outlier are ambiguous, the criteria can be selected by reference to the three-sigma rule [31].
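As a concrete sketch of this innovation-based test, the fragment below flags an observation when any innovation component exceeds a three-sigma bound derived from the outlier-free innovation covariance. The per-component thresholding and all numerical values are illustrative assumptions, not the paper's exact Equations (12) to (14).

```python
import numpy as np

def detect_outlier(z, x_pred, P_pred, H, R, n_sigma=3.0):
    """Flag an observation as an outlier if the innovation exceeds
    n_sigma predicted standard deviations (three-sigma rule)."""
    e = z - H @ x_pred                    # innovation
    S = H @ P_pred @ H.T + R              # innovation covariance without outliers
    sigma = np.sqrt(np.diag(S))           # per-component standard deviation
    return np.any(np.abs(e) > n_sigma * sigma), e

H = np.array([[1.0, 0.0]])
R = np.array([[1.0]])
x_pred = np.array([10.0, 1.0])
P_pred = np.eye(2)
is_out_clean, _ = detect_outlier(np.array([10.5]), x_pred, P_pred, H, R)  # small error
is_out_bad, _ = detect_outlier(np.array([60.0]), x_pred, P_pred, H, R)    # large spike
```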
If no outlier is present, the filtering operation is carried out using Equations (15)-(17). When an outlier is detected, TORKF is updated as shown in Figure 2. The termination condition is determined by how long the outliers persist, and the termination judgment value depends on the capability of the TPEC model described below.
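For completeness, the outlier-free branch, Equations (15) to (17), is the standard KF measurement update: gain, state correction, and covariance update. The matrices below are assumed toy values.

```python
import numpy as np

def measurement_update(x_pred, P_pred, z, H, R):
    """Standard KF measurement update (Equations (15)-(17))."""
    S = H @ P_pred @ H.T + R                        # innovation covariance
    K = P_pred @ H.T @ np.linalg.inv(S)             # Kalman gain
    x = x_pred + K @ (z - H @ x_pred)               # corrected state estimate
    P = (np.eye(len(x_pred)) - K @ H) @ P_pred      # corrected error covariance
    return x, P

x_pred, P_pred = np.array([0.0, 1.0]), np.eye(2)
H = np.array([[1.0, 0.0]])
R = np.array([[1.0]])
x, P = measurement_update(x_pred, P_pred, np.array([0.4]), H, R)
```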
3.2. Transformer-Based Error Compensation Model
In this article, the TPEC model is designed to improve the prediction accuracy of the predictive model, and thereby the overall robustness and filtering precision of the filter, given the limited precision of the forecasting model. The TPEC model can be divided into three parts: input matrix construction, feature extraction, and a Multilayer Perceptron (MLP). The structure of the TPEC model is shown in Figure 3.
This article provides two methods for constructing the input matrix in the input matrix construction portion.
When the regularity of the time series is strong, the time-sequence feature extraction procedure is used. The model's input matrix at frame k is composed of the single-frame input vectors and a normalization matrix, as shown in Equation (18).
where each column is the input vector of a single frame and T is the total number of frames. The single-frame input vector contains m values, and the normalization matrix is a diagonal matrix of matching dimension; both are shown in Equations (19) and (20).
where x_k is the state vector at moment k, the second component is the difference between the state vector at moment k and the state vector at the previous moment, and the third component is the measurement error in the absence of outliers.
When a single sequence contains more complete characteristics, the prediction vector input of dimension s is used, with each dimension treated as a word vector; the input can then be divided into a matrix of appropriate dimension as the input of the feature extraction model.
The transformer model features fast forward propagation, low structural complexity, and high feature extraction efficiency. Therefore, this article employs the transformer encoder to extract features from the filter prediction vector. Attention is the foundation of the transformer model [32]; its calculation formula is shown in Equation (21).
where Q and K are the queries and keys of dimension d_k, and V contains the values of dimension d_v.
The transformer model is built on multi-head attention, together with residual connections, normalization, MLPs, and other structures, as shown in Figure 3. The encoder consists of a stack of N identical layers, each containing two sub-layers: a multi-head attention mechanism and a feed-forward neural network.
The transformer encoder is used to extract the features of the input matrix. In practice, the number of attention heads is limited by the available graphics memory and the running time of the algorithm. Although more heads extract richer feature information, these physical constraints must be balanced against that richness; the choice is usually made among even numbers of heads, often in the range of 2-18. The choice of the number of encoder layers is similar. The number of MLP layers usually requires experimentation based on the data volume and the complexity of the nonlinear relationships, with 2-3 layers often being a good starting point.
Considering the machine learning process of the transformer encoder and MLP as a TEC function, its output C_k has the same dimension as the state vector and is used to compensate the state vector, yielding the final output state vector, as shown in Equations (22) and (23).
where M is the diagonal transformation matrix mapping the output vector of the transformer model to the compensation vector.
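The compensation step itself is a small linear operation: the network output is scaled by the diagonal matrix M and added to the predicted state. In the sketch below the TEC network is replaced by a fixed stand-in vector, and M is assumed to be the identity; both are illustrative assumptions.

```python
import numpy as np

def compensate_state(x_pred, c, M):
    """Apply the TPEC compensation in the spirit of Equations (22)-(23):
    map the network output c through the diagonal matrix M and add it
    to the predicted state."""
    return x_pred + M @ c

x_pred = np.array([100.0, 200.0, 300.0])
c = np.array([1.5, -2.0, 0.5])        # stand-in for the TEC network output C_k
M = np.diag([1.0, 1.0, 1.0])          # assumed identity scaling
x_k = compensate_state(x_pred, c, M)
```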
3.3. Prediction Error Covariance Correction Method
When outliers appear in the observation vector, compensating for the prediction error alone is not enough: the prediction error covariance matrix updated in the Kalman prediction phase also becomes misaligned, so the error covariance matrix must be adjusted according to the result of the prediction error compensation to ensure subsequent filtering robustness. The structure of the PECC method is shown in Figure 4.
The input of the PECC method comprises the prediction compensation vector, the prediction vector, and the prediction error covariance matrix. The inputs are concatenated and flattened by the pre-processor, whose output is processed by a three-layer MLP to produce the error adjustment matrix Γ_k. Γ_k is defined as the prediction error covariance correction matrix, a diagonal matrix used to rectify the observation error vector after prediction error compensation when outliers appear in the observation vector; it is also used to correct the prediction error covariance, as indicated in Equation (24). The corrected state error covariance enters the error covariance update, and the corrected error covariance is obtained in the form of Equation (25).
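The exact form of Equations (24) and (25) is not reproduced here; as one plausible sketch of how a diagonal Γ_k can correct the covariance, the symmetric scaling Γ P Γ^T is used below, since it keeps the result symmetric and positive semi-definite. Both the form and the numbers are assumptions for illustration only.

```python
import numpy as np

def correct_covariance(P_pred, gamma_diag):
    """Sketch of a diagonal covariance correction: scale the prediction
    error covariance symmetrically by Γ = diag(gamma_diag). This is an
    assumed stand-in for Equations (24)-(25), not the trained PECC output."""
    G = np.diag(gamma_diag)
    return G @ P_pred @ G.T

P_pred = np.array([[4.0, 1.0], [1.0, 2.0]])
P_corr = correct_covariance(P_pred, np.array([1.2, 0.8]))
```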
The pseudo code of the TORKF is shown in Algorithm 1.
Algorithm 1 TORKF method
Require: number of time series length T ≥ 1, system state and measurement equations f, h, covariance matrices Q_k and R_k of motion and measurement noise.
Ensure: predicted object states.
Initialize: the error covariance matrix P and the system state vector x.
for k = 1 to T do
    predict as (6) and (7)
    detect outliers as (14)
    if outlier: discard the observation
    else: update as (15)-(17)
end for
for k = T to the end time do
    predict as (6) and (7)
    detect outliers as (14)
    if outlier:
        calculate the error compensation C_k as (22)
        calculate the state vector x_k as (23)
        calculate the correction matrix Γ_k as (24)
        calculate the error covariance matrix as (25)
    else:
        update as (15)-(17)
end for
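The control flow of Algorithm 1 can be made runnable by replacing the learned components (TPEC and PECC) with simple placeholders: on an outlier, the placeholder keeps the model prediction and mildly inflates the covariance. This only mirrors the algorithm's structure under assumed matrices, not the trained networks.

```python
import numpy as np

rng = np.random.default_rng(2)

def torkf_like_loop(zs, F, H, Q, R, x0, P0, n_sigma=3.0):
    """Skeleton of the TORKF loop: predict, detect, then either the
    standard update or a placeholder compensation branch."""
    x, P = x0.copy(), P0.copy()
    I = np.eye(len(x0))
    outliers = 0
    for z in zs:
        x_pred, P_pred = F @ x, F @ P @ F.T + Q              # Eqs. (6)-(7)
        e = z - H @ x_pred                                   # innovation
        S = H @ P_pred @ H.T + R
        if np.any(np.abs(e) > n_sigma * np.sqrt(np.diag(S))):  # test as in (14)
            outliers += 1
            x = x_pred                                       # placeholder for (22)-(23)
            P = 1.1 * P_pred                                 # placeholder for (24)-(25)
        else:
            K = P_pred @ H.T @ np.linalg.inv(S)              # Eqs. (15)-(17)
            x = x_pred + K @ e
            P = (I - K @ H) @ P_pred
    return x, outliers

dt = 0.5
F = np.array([[1.0, dt], [0.0, 1.0]])
H = np.array([[1.0, 0.0]])
Q, R = 0.01 * np.eye(2), np.array([[1.0]])
zs = [np.array([k * 0.5 + rng.normal(0.0, 1.0)]) for k in range(100)]
zs[50] = np.array([500.0])                                   # inject one large outlier
x_end, n_out = torkf_like_loop(zs, F, H, Q, R, np.array([0.0, 1.0]), np.eye(2))
```

Despite the injected spike, the estimate stays near the true trajectory because the outlier branch never lets the corrupted measurement enter the update.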
4. Experiment and Analysis
To verify the effectiveness and advancement of the proposed transformer-based outlier-robust filtering method, this study tests it on a three-dimensional aircraft position tracking model, shown in Equations (26) and (27).
where U and Q are shown in Equations (28) and (29), respectively.
where T is the filter time interval, and α is the motorization frequency, which expresses the maneuverability of an aircraft. The remaining terms are shown in Equations (30) and (31).
where a_k is the acceleration at the current moment, and ā is the average acceleration.
4.1. Datasets and Environments
The datasets used in this experiment consist of flight data generated by a pilot on a semi-physical simulator during one-on-one engagements, with detection errors superimposed to test the algorithm. The semi-physical simulator consists of a one-to-one reproduction of the aircraft operating unit, wide-angle scene reproduction hardware, and a server. The pilot is a professional pilot with many years of flight experience, so the data reproduces real scenarios with high fidelity. The flight datasets comprise two categories: a low-speed dataset (Dataset I) and a high-speed dataset (Dataset II), whose specifications are shown in Table 1.
The two datasets used in the experiment were generated by two distinct models of aircraft: the low-speed dataset corresponds to a propeller aircraft and the high-speed dataset to a jet aircraft.
The experiments on Dataset I and Dataset II used 90% of the data for model training and the remaining 10% for validation and testing; the following experimental results are from the test set. The experimental computer was configured with an Intel i7-13700K CPU, an NVIDIA RTX 4090 GPU, and 32 GB of RAM. The simulation environment was Windows 11, implemented under the PyTorch 2.4 framework using Python 3.11.
4.2. Experiment Details
The aircraft coordinate system used in the experiment is a geographic coordinate system. During the experiment, the sampling period T was set to 0.5 s and the motorization frequency to 0.5. All the neural network models were trained with the same loss function, optimizer, and learning rate: the L2 loss function and the Adam optimizer were selected, with the learning rate set to 0.001. Considering the limitations of computational resources and the real-time requirement, and because more iterations did not increase accuracy, we chose 12 attention heads and a 6-layer encoder. A 3-layer MLP network gives a good balance of accuracy and computation time in both TPEC and PECC. The input is a 128-frame matrix, as shown in Equation (32). All neural network models are trained with the same input for 50 epochs.
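For reference, the experimental hyperparameters stated above can be collected in one place; the dictionary layout itself is illustrative, not from the paper.

```python
# Hyperparameters from the experimental setup in Section 4.2.
# The dictionary structure is an illustrative convenience.
train_config = {
    "sampling_period_s": 0.5,      # filter time interval T
    "motorization_frequency": 0.5,
    "loss": "L2",                  # training loss function
    "optimizer": "Adam",
    "learning_rate": 1e-3,
    "attention_heads": 12,
    "encoder_layers": 6,
    "mlp_layers": 3,               # in both TPEC and PECC
    "input_frames": 128,
    "epochs": 50,
}
```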
The Root Mean Square Error (RMSE) is used as the evaluation criterion. Suppose that the current frame is k, x̂_k is the filter output, x_k is the true value of frame k, and N is the total number of computed frames. The RMSE of frame k is calculated as
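Since the RMSE equation itself is not reproduced here, the sketch below shows one common reading for tracking experiments: the squared position error is summed over the coordinate axes and averaged over repeated runs at each frame before taking the square root. The array shapes are assumptions for illustration.

```python
import numpy as np

def rmse_per_frame(est, truth):
    """Per-frame position RMSE: est and truth have shape
    (runs, frames, dims); returns one RMSE value per frame."""
    sq_err = np.sum((est - truth) ** 2, axis=-1)   # squared error per run and frame
    return np.sqrt(np.mean(sq_err, axis=0))        # average over runs, then root

# toy check: a constant 1 m error on each of 3 axes gives RMSE sqrt(3)
truth = np.zeros((4, 5, 3))
est = truth + 1.0
r = rmse_per_frame(est, truth)
```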
4.3. Overall Evaluation
In the experiment, TORKF is compared with the following robust filters.
(1) CS-MAEKF [33]: an adaptive extended Kalman filter based on the current statistical model of maneuvering acceleration.
(2) WRPF [34]: a statistically robust particle filter that exhibits high statistical efficiency and good robustness to outliers.
(3) LSTM-KF [35]: a hybrid LSTM and KF model.
(4) MMCKF [36]: a Kalman filter that combines M-estimation with information-theoretic learning (ITL) under impulsive noises.
Table 2 presents a comprehensive comparison between TORKF and state-of-the-art filters in terms of RMSE, running time, and maximum runtime per frame. Compared to the other algorithms, TORKF achieved significant reductions in RMSE, reaching maximum improvements of 133.11 m and 242.88 m across two datasets, with minimum reductions of 2.1 m and 5.67 m, respectively. Moreover, regardless of the dataset, the runtime of TORKF remains below 100 ms, satisfying the conditions for real-time operation.
Figure 5 and Figure 6 show the RMSE performance of the test results for all compared methods on Dataset I and Dataset II, respectively. For each dataset, we give the RMSE of the airplane's x, y, and z positions and the average position RMSE of the three axes over time. Comparing Figure 5 and Figure 6, it can be observed that the filters' outputs show similar patterns of RMSE variation when faced with aircraft of different maneuvering capabilities, maximum speeds, and maximum accelerations, as well as different noise and outliers in the datasets.
Specifically, WRPF converges more slowly than the KF-based filtering methods because its Monte Carlo-based mechanism differs from that of the KF. The accuracy of the filters that use an outlier-robust method or a neural network method is higher than that of those that do not. The transformer-based error compensation model and the PECC method also greatly improve the filtering accuracy, and TORKF achieves the best filtering results in both datasets.
Under different datasets and motion characteristics, the RMSEs of ORKF, CS-MAEKF, and MMCKF follow similar trends, with the RMSE of MMCKF and CS-MAEKF significantly higher than that of ORKF. The RMSE of WRPF grows more sluggishly owing to its 100 sampling points, which fit the noise better; however, ORKF converges, and its RMSE becomes significantly smaller than that of WRPF after about 200 s.
In particular, although MMCKF utilizes M-estimation and an ITL-based method to enhance its robustness, the algorithm shows poor accuracy when handling data containing numerous outliers in both datasets. Upon reviewing the filtering results, we observed that while the method prevents the results from diverging, addressing an outlier in one frame causes the results to deviate over multiple subsequent frames. As a result, the RMSE of MMCKF is significantly larger than that of the other methods.
To validate the effectiveness of our proposed three methods, this paper not only compares with existing state-of-the-art approaches but also conducts ablation studies. The comparison methods used in the ablation study involve the combination of the proposed methods, including ORKF, Attention-based Outlier-Robust Kalman Filter (AORKF), prediction-error-covariance correction Outlier-Robust Kalman Filter (PORKF), and LSTM-based prediction-error-covariance correction Outlier-Robust Kalman Filter (LPORKF). In the above methods, AORKF is enhanced by adding the TPEC model to ORKF, PORKF is the combination of ORKF and the PECC method, and LPORKF is an algorithm that integrates LSTM and PECC into ORKF.
4.4. Ablation Study
4.4.1. Innovation-Based ORKF Ablation Study
The effectiveness of the innovation-based ORKF is demonstrated through the comparative analysis presented in Table 2. Compared to other outlier-resistant Kalman filters such as CS-MAEKF and MMCKF, ORKF achieves significant performance improvements, reducing RMSE by at least 53.84 m and 82.51 m in the two datasets, respectively, while consuming approximately the same amount of time.
The convergence speed is also a crucial metric for assessing filter performance. According to Figure 5 and Figure 6, ORKF converges faster than CS-MAEKF, MMCKF, and WRPF in both datasets.
4.4.2. Transformer-Based Error Compensation Model Ablation Study
Table 3 shows that the RMSE of AORKF is reduced by 2.67 m and 10.91 m relative to ORKF in Dataset I and Dataset II, respectively. LSTM-KF employs LSTM for end-to-end processing of observations with outliers, and this approach demonstrates a certain effectiveness: compared to ORKF, the RMSE of LSTM-KF is reduced by 0.84 m and 8.88 m in Dataset I and Dataset II, respectively. However, compared to AORKF, LSTM-KF still has a higher RMSE, by 1.83 m and 2.03 m, respectively.
The experimental results indicate that the filtering accuracy and robustness are superior to those achieved by the filtering algorithm alone, whether using deep learning methods for error compensation or for observation data processing.
Figure 7 and Figure 8 show the comparison of RMSE before and after implementing the PECC method with different fixed compensation coefficients for Dataset I and Dataset II, respectively. The RMSE trends over time and the convergence rates of ORKF, LSTM-KF, and AORKF are basically the same in both datasets, as shown in Figure 7 and Figure 8. After the filtering results converge, the data of different dimensions in both datasets show the same accuracy characteristics. AORKF consistently outperforms the other two methods in both the combined RMSE and the RMSE of the X, Y, and Z axes, which reflects the superiority of the transformer-based error compensation model over the filter prediction method and the LSTM method.
4.4.3. Prediction Error Covariance Correction Method Ablation Study
The PECC method compensates for error covariance misalignment caused by missing or eliminated observation vectors. Figure 7 and Figure 8 compare the RMSE before and after PECC implementation and with different compensation matrices across the datasets. The RMSE of the input data and of PORKF remains invariant to the compensation coefficient, while the variation of ORKF demonstrates that an optimal coefficient minimizing the RMSE exists for different datasets and motion laws.
The experiments in Figure 7 and Figure 8 aim to verify whether, under different error compensation coefficients, there exists an optimal compensation coefficient other than 1 that yields the highest filtering accuracy. The experimental results show that such an optimal coefficient does exist for different datasets and noises. Although the PECC method proposed in this paper cannot reach the optimal value in every case, it can still improve the robustness and accuracy of the filter in the face of outliers.
According to the results in Table 3, the PECC method greatly improves the accuracy and robustness of ORKF, LSTM-KF, and AORKF in both datasets. Specifically, using the PECC method in TORKF reduces the RMSE of AORKF by 0.33 m and 3.64 m in Dataset I and Dataset II, respectively. The RMSE of LPORKF is lower than that of LSTM-KF, and that of PORKF lower than that of ORKF. The accuracy improvement brought by the PECC method is significant, although the more the error compensation model reduces the error, the more this improvement is attenuated.
It can also be observed that AORKF, which employs an attention-based approach, is more effective than the LSTM-based alternative. Additionally, whether considering the average runtime or the maximum runtime per frame, AORKF consistently requires more time than TORKF. Upon analysis, this is attributed to the introduction of the PECC method into the training process, which enhances the network's runtime efficiency.
4.5. 3D Comparison Results
The 3D comparison results for 303 s of continuous flight data are presented in Figure 9. The test data come from a complete maneuver trajectory in Dataset II and include the true data as well as the results of all algorithms tested in this paper.
As shown in Figure 9, under the interference of outliers far from the true value, MMCKF deviates significantly and converges slowly, while the other filtering algorithms have better convergence properties, so their filtering results remain near the true value.
Upon analyzing all experimental results, it is evident that the proposed improvement methods in this paper bring the filtering results closer to the true value when compared to existing algorithms. The outlier detection model accurately detects and effectively eliminates outliers. The transformer model effectively learns the motion characteristics of the trajectory and compensates for them. Additionally, the prediction error covariance correction method improves the accuracy of the filtering results.
5. Conclusions
(1) A Transformer-based Outlier-Robust Kalman Filter is proposed in this work. The outlier-robust method is utilized to detect the outliers precisely. The transformer-based error compensation model is developed to improve the prediction accuracy, and the PECC method is proposed to correct the error covariance further to enhance the accuracy and robustness of the filter.
(2) The proposed method was validated on two representative aircraft tracking datasets featuring complex motion models and outlier-contaminated observations: the propeller and jet aircraft datasets. Experimental results demonstrate that TORKF achieves a reduction in the RMSE of over 12.7% compared to current state-of-the-art methods while exhibiting faster convergence speed and enhanced robustness.
(3) It is noted that the average running time of the method proposed in this paper is longer than that of the compared methods, although it still meets the timeliness requirement for aircraft tracking.
(4) In future research, we aim to develop a more time-efficient model and extend it to other application areas, such as autonomous driving and robotics, whose sensor information is highly similar to that obtained from aircraft sensors.