Uncertainty-Aware Self-Attention Model for Time Series Prediction with Missing Values

Li, Jiabao; Wang, Chengjun; Su, Wenhang; Ye, Dongdong; Wang, Ziyang

doi:10.3390/fractalfract9030181

Open AccessArticle

Uncertainty-Aware Self-Attention Model for Time Series Prediction with Missing Values

by

Jiabao Li

^1,2,3

,

Chengjun Wang

^1,2,3

,

Wenhang Su

^1,2

,

Dongdong Ye

^3,4 and

Ziyang Wang

^5,*

¹

School of Artificial Intelligence, Anhui University of Science & Technology, Huainan 232001, China

²

Anhui Artificial Intelligence Laboratory, Artificial Intelligence Research Institute of Hefei Comprehensive National Science Center, Hefei 230022, China

³

School of Artificial Intelligence, Anhui Polytechnic University, Wuhu 241000, China

⁴

Anhui Key Laboratory of Mine Intelligent Equipment and Technology, Anhui University of Science & Technology, Huainan 232001, China

⁵

Department of Computer Science, University of Oxford, Oxford OX1 3QG, UK

^*

Author to whom correspondence should be addressed.

Fractal Fract. 2025, 9(3), 181; https://doi.org/10.3390/fractalfract9030181

Submission received: 17 January 2025 / Revised: 13 March 2025 / Accepted: 14 March 2025 / Published: 16 March 2025

(This article belongs to the Special Issue Applied Fractional Calculus in Machine Learning and Biomedical Engineering)

Download

Browse Figures

Review Reports Versions Notes

Abstract

Missing values in time series data present a significant challenge, often degrading the performance of downstream tasks such as classification and forecasting. Traditional approaches address this issue by first imputing the missing values and then independently solving the predictive tasks. Recent methods have leveraged self-attention models to enhance imputation quality and accelerate inference. These models, however, predict values based on all input observations—including the missing values—thereby potentially compromising the fidelity of the imputed data. In this paper, we propose the Uncertainty-Aware Self-Attention (UASA) model to overcome these limitations. Our approach introduces two novel techniques: (i) A self-attention mechanism with a partially observed diagonal that effectively captures complex non-local dependencies in time series data—a characteristic also observed in fractional-order systems. This approach draws inspiration from fractional calculus, where non-integer-order derivatives better characterize complex dynamical systems with long-memory effects, providing a more comprehensive mathematical framework for handling temporal data. And (ii) uncertainty quantification in data imputation to better inform downstream tasks. The UASA model comprises an upstream component for data imputation and a downstream component for time series prediction, trained jointly in an end-to-end fashion to optimize both imputation accuracy and task-specific objectives simultaneously. For classification tasks, the UASA model demonstrates remarkable performance even under high missing data rates, achieving a ROC-AUC of

99.5 %

, a PR-AUC of

58.5 %

, and an F1-SCORE of

49.3 %

. For forecasting tasks on the AUST-Gait dataset, the UASA model achieves a Mean Squared Error (MSE) of 0.72 under

0 %

missing data conditions (i.e., complete data input). Under the end-to-end training strategy evaluated across all missing data rates, the model achieves an average MSE of 0.74, showcasing its adaptability and robustness across diverse missing data scenarios.

Keywords:

time series predictive; self-attention; uncertainty; missing values

1. Introduction

Time series predictive tasks such as classification, forecasting, and event detection are fundamental for extracting valuable information from sequential data across various fields, including environmental monitoring [1,2,3], financial forecasting [4,5], and healthcare [6,7,8,9]. However, missing values remain one of the most significant challenges in time series modeling, as they can severely diminish the reliability and accuracy of predictive models. These missing values often arise from sensor malfunctions [10,11,12], communication failures [13,14], or other operational disruptions [15,16,17]. Consequently, effectively handling missing data is crucial for enhancing model performance and ensuring robust predictive analytics in real-world applications.

Existing approaches to time series predictive tasks often address the challenge of missing data by first learning the underlying data distribution and then imputing the missing values before proceeding to downstream tasks [18,19]. Recently, self-attention mechanisms have gained traction in the data imputation stage for their ability to enhance imputation quality while accelerating inference [20,21]. These methods process sequences of multivariate time series data through self-attention layers that consider correlations between all dimensions and time steps. However, a recent work claiming state-of-the-art (SOTA) performance (SAITS [22]), argues that self-attention models should not consider correlations involving missing dimensions and proposes a self-attention mechanism with a masked diagonal to exclude these influences. While this approach achieves SOTA data imputation performance, it potentially discards valuable information from observed dimensions. To mitigate information loss, we propose a novel self-attention model with a Partially Masked Diagonal (PMD), which processes the time series data dimension by dimension. It discards attention scores computed with missing dimensions by masking the corresponding diagonal values, thus preserving useful information from observed data while avoiding the negative influence of missing values.

Another significant challenge is effectively handling the uncertainty inherent in data imputation, especially when dealing with complex and highly fluctuating data distributions [23,24]. Traditional imputation models tend to predict the expected value of the missing data, which can lead to the loss of critical information about data volatility and variability [25]. This oversimplification not only smooths out essential fluctuations but can also misguide downstream tasks that rely on capturing the dynamic nature of the time series. For example, in financial applications such as swap contracts involving assets with volatile stock prices, a profit rate prediction model that ignores data volatility may produce inaccurate forecasts, potentially resulting in substantial financial losses. To quantify the uncertainty of the imputed data, we propose a method that masks a small percentage of observed data and performs inference multiple times. By repeatedly masking different subsets of input data and inferring, we obtain multiple predictions for each missing value; the mean serves as the imputation, and the standard deviation quantifies the uncertainty.

The self-attention model with PMD and the uncertainty quantification comprise the main components of our proposed method, i.e., the Uncertainty-Aware Self-Attention (UASA) model. The UASA model consists of two primary components: an upstream model and a downstream model. The upstream model is responsible for imputing missing values while simultaneously generating an uncertainty map. The imputed values and uncertainty maps are then passed to the downstream model to perform downstream tasks such as classification, forecasting, or anomaly detection. The entire UASA framework is trained in an end-to-end manner, jointly optimizing the objectives of data imputation and task-specific performance, ensuring that both processes inform and improve one another. This paper makes five key contributions:

We introduce a self-attention model with PMD to minimize the negative influence of missing values on the imputation process while preserving useful information from observed data.
We develop an uncertainty quantification method by performing multiple imputation inferences with randomly masked input data, enabling the estimation of uncertainty associated with each imputed value.
We empirically demonstrate that an end-to-end training approach allows for the simultaneous optimization of the upstream and the downstream models, outperforming traditional methods that separately learn the models.
We develop a public dataset AUST-Gait for clinical gait recognition and prediction dataset using a single kinematic sensor.
We propose the UASA model, which achieves state-of-the-art performance in classification, forecasting, and rare event detection tasks on the AUST-Gait, PhysioNet-2012, and Multi-Modal Gait Database benchmarks.

2. Related Work

2.1. Time Series Prediction

Time series predictive tasks aim to learn models to complete downstream tasks such as classification [26,27] and forecasting [28]. Early works solve the time series classification by transforming the time series data into a new feature space [29,30] and classify them using the nearest neighbor algorithm [31] or a diversity of ensemble methods [32,33,34,35]. Current deep learning methods solve the classification problem by learning Multi-Layer Perceptron (MLP) [36,37], convolutional neural network (CNN) [38] or Recurrent Neural Network (RNN) [39] with an end-to-end manner. Che et al. utilize a CNN model with an adversarial learning objective to predict risk according to patients’ historical medical records. A study on physiological signal classification [40] shows the advantage of CNN in extracting features and maintaining spatial relationships.

The building blocks in time series forecasting models are similar to the models for classification. Convolution filters can preserve the past information for forecasting [41], and dilated CNN layers are further developed to capture long-term dependencies [42,43,44]. These features enable the models not only to make accurate short-term predictions but also to effectively model data trends over longer time periods [45,46]. In recent years, research on Transformers and their variants has achieved significant advancements in the field of time series forecasting [47,48], markedly enhancing model performance. Specifically, the Informer model [49] addresses the computational efficiency challenges of standard Transformers [50] in long-sequence processing by introducing sparse attention mechanisms and more efficient data representations. Similarly, the Longformer [51] and Reformer [52] further optimize the efficiency of long-sequence data processing through innovative sparse self-attention and reversible structures. These advancements not only improve predictive accuracy but also significantly reduce computational resource consumption, offering more effective solutions for handling long-sequence data.

Currently, many emerging models demonstrate immense potential in capturing long-term dependencies in time series and enhancing computational efficiency. For instance, the Temporal Fusion Transformers model [53] integrates dynamic decoders with static metadata, endowing the model with flexibility and interpretability, thereby enabling precise predictions in complex time series contexts. Additionally, the N-BEATS framework [54], through a neural basis expansion analysis independent of specific time series attributes, has exhibited superior performance and adaptability across various real-world applications. Furthermore, Autoformer [55] is renowned for its ability to focus on capturing long-term dependencies while ensuring efficient computation, leveraging an adaptive autocorrelation mechanism to enhance the accuracy and efficiency of long-sequence forecasting. The Frequency Enhanced Decomposed Transformer [56] excels in handling highly periodic and structurally complex time series data by employing frequency-domain enhancement and task decomposition methods. While these models have achieved significant strides in improving prediction accuracy and computational efficiency, there remains room for enhancement in handling high-dimensional data and further optimizing computational efficiency. These challenges present valuable opportunities for future research endeavors.

Our proposed UASA is similar to these end-to-end methods for solving time series classification, but we use an expressive self-attention model to extract features and maintain the relationships between the time steps and the dimensions.

2.2. Time Series Imputation

UASA has strong connections to time series imputation, which models the time series data and generates values from the learned distribution to predict the missing values. Previous work utilizes a unidirectional RNN to capture the dependency between the current time step and its previous time steps [57], or uses a bidirectional RNN to consider the global correlations for time series imputation [58,59]. Due to RNN’s autoregressive nature, an error happening in an earlier time step will be compounded and affect the accuracy of successive imputations [60]. NAOMI [61] reduces the compounding error by using a non-autoregressive model, but it reduces the model’s inference speed.

Another drawback of the RNN-based method is its inability to capture complex time series data distribution. Luo et al. optimize an RNN-based generator with a computation-intensive two-stage optimization for adversarial learning and noise matching. Later, they propose a one-stage method E²GAN with an auto-encoder generator to accelerate model training and improve the imputation quality, and [62] et al. propose a real-data forcing method to reduce the computation in the noise matching stage. However, the GAN-based method is infamous for its non-convergence and mode collapse issues [63].

Recent self-attention models successfully address the problems. CDSA [64] has a cross-dimensional self-attention model to impute spatiotemporal data with cross-dimension correlation. Empirical results documented in the original study demonstrate that CDSA performs robustly across a variety of datasets, including the METR-LA traffic speed prediction and KDD-2018 air quality and meteorological datasets, confirming its potential for broader applications. DeepMVI [65] proposes a transformer architecture with a convolutional window feature and a kernel regression for multi-dimensional time series imputation. On large-scale datasets, DeepMVI reduces computational complexity and enhances imputation accuracy, improving precision in predicting missing values in multi-dimensional time series and demonstrating broad applicability. NRTSI [66] builds a time-value tuple for each time step in the series for the transformer architecture. By mapping time and value, NRTSI refines the capture of subtle variations in time series data. It significantly improves imputation robustness and accuracy in noisy, irregular datasets, showcasing its applicability in complex data processing. CATSI [67] utilizes the information from the context and applies bidirectional LSTM to capture past and future information. Distinctively, the UASA model captures the context information using a novel self-attention model and further solves unreliable information from missing values. The most related work to our proposed UASA is SAITS, which designs multiple imputation blocks with a diagonally masked self-attention to jointly predict the missing values. Our work enhances SAITS to reduce information loss by the diagonally masked self-attentions and proposes an end-to-end learning method.

2.3. Uncertainty Estimation

Uncertainty estimation has become a foundational component in time series imputation and prediction, providing a mechanism to quantify a model’s confidence in its predictions. One of the most prominent approaches is MC Dropout [68], which employs stochastic Dropout during inference to approximate a Bayesian posterior. An alternate approach is Deep Ensembles, proposed as a highly scalable and robust method [69,70]. By training multiple independent models and calculating the variance of their outputs, this approach has shown great promise in various domains, but its computational cost is considerably higher.

The Bayesian Neural Network (BNN) framework takes uncertainty estimation a step further by introducing prior distributions over model parameters [71,72]. Through Bayesian inference, BNNs provide theoretically sound uncertainty estimates, particularly valuable in small-scale or high-risk tasks. However, the computational complexity of traditional Bayesian inference methods poses a challenge for scaling to large datasets.

Recent advancements have refined these classical methods. Concrete Dropout [73] enhances MC Dropout by learning optimal dropout rates and positions during training, thereby improving its adaptability and effectiveness across different scenarios. Bootstrap Ensembles [74] extends traditional ensemble techniques by employing bootstrapping strategies combined with neural networks. Moreover, Deep Evidential Regression has been introduced as a game-changer [75], directly outputting predictive distributions from a single model without the need for sampling or model ensembles.

Uncertainty estimation has also found tailored applications in time series tasks. For instance, Time series Temporal Variational [76] integrates a temporal variational autoencoder to simultaneously address sequence reconstruction and uncertainty quantification. Enabling even greater scalability, Transformer-based models such as the Temporal Fusion Transformer (TFT) [55] and conditional attention mechanisms [77] have incorporated uncertainty estimation.

The UASA model facilitates reliable uncertainty estimation with both imputation and forecasting tasks while further validating the importance of uncertainty quantification in improving downstream application performance.

3. Methods

The UASA model consists of an upstream model and a downstream model, as illustrated in Figure 1. The upstream model is designed to impute the missing values and generate an uncertainty map measuring the confidence of each imputation. The downstream model then utilizes these outputs to perform downstream tasks such as classification and forecasting. Section 3.1 presents the problem setting. Next, Section 3.2 and Section 3.3 introduce the upstream model and the downstream model, respectively. Finally, Section 3.4 presents an end-to-end learning method to update the two models simultaneously. The frequently used notations are listed in Table 1.

3.1. Problem Setup

The time series predictive tasks aim to build a model that can estimate the current state or future states given a time series. The input of the predictive model is multivariate time series with time step T and dimension D, denoted as

X = [X_{1}, X_{2}, \dots, X_{T}] \in R^{T \times D}

, where the t-th step is

X_{t} = {[X_{t, 1}, X_{t, 2}, \dots, X_{t, D}]}^{⊤} \in R^{D}

.

In this paper, we aim to solve the time series predictive tasks with missing values. This means that any step or dimension in the input can be missing. Accordingly, a missing mask matrix

M \in R^{T \times D}

is introduced to indicate the presence of the values in X:

M_{t, d} = \{\begin{matrix} 1 & if X_{t, d} is observed, \\ 0 & otherwise . \end{matrix}

where

M_{t, d}

denotes the presence of the value in the dimension d of the time step t in X.

Given the inputs X and M, the predictive tasks aim to find a model that can complete tasks such as forecasting future steps, classifying time series into the correct class, or detecting rare but disruptive events. To address these challenges, we propose the UASA model.

3.2. UASA Upstream Model

The UASA upstream model is a network featuring a self-attention layer with PMD. This section introduces the PMD self-attention layer and the complete UASA upstream model architecture.

3.2.1. Self-Attention with Partially Masked Diagonal

In the conventional self-attention model, the input time series X and its missing mask matrix M can be concatenated and then transformed into a query matrix

Q \in R^{d_{k} \times T}

, a key matrix

K \in R^{d_{k} \times T}

, and a value matrix

V \in R^{d_{v} \times T}

. The self-attention process can be written as

Attention (Q, K, V) = V Softmax (\frac{K^{⊤} Q}{\sqrt{d_{k}}}),

(1)

where the Softmax function is applied to each column of the attention matrix, ensuring that the elements of each column sum to 1.

In this model, the attention score for any input

X_{t}

is influenced not only by itself but also by all other inputs, but ideally, when the input

X_{t}

contains missing values, its output should not depend on the missing values. To reduce the missing value dependency, we propose self-attention with PMD, where the input is passed dimension by dimension, and the diagonal elements corresponding to a missing value are masked to

- \infty

. As shown in Figure 2, the reshaped input time series

X \in R^{T D}

and its corresponding missing task matrix

M \in R^{T D}

are transformed into a query matrix

Q \in R^{d_{k} \times T D}

, a key matrix

K \in R^{d_{k} \times T D}

, and a value matrix

V \in R^{d_{v} \times T D}

. The self-attention process becomes

AttentionPMD (Q, K, V) = V Softmax (PMD (\frac{K^{⊤} Q}{\sqrt{d_{k}}})) = V A,

(2)

where A is attention weights. Different from the conventional self-attention model whose token is a time step, here, a dimension is a token and any dimension has a matched column in Q, K, and V. To reduce the prediction’s dependency on the missing values, we propose to remove the diagonal elements whose computation involves these missing values. For instance, the diagonal element in the i-th row and the i-th column is computed by the dimension (i mod D) of the state at time step

⌊ i / D ⌋

. The diagonal element should be removed if

X_{⌊ i / D ⌋, i \mod D}

is missing, i.e.,

M_{⌊ i / D ⌋, i \mod D} = 0

. Here, we define the PMD function as

PMD (X) [i, i] = \{\begin{matrix} X_{i, i}, & if M_{⌊ i / D ⌋, i \mod D} = 1 \\ - \infty, & otherwise \end{matrix} for i \in {1, \dots, T D} .

(3)

In this step, we eliminate the influence of a missing value on its corresponding output. To improve the model’s capability to capture the data distribution, we use a multi-head PMD attention layer in UASA: each input X has h PMD attention matrices. The multi-head PMD attention layer is formulated as:

MultiHeadAttentionPMD (x) = Concat ({head}_{1}, {head}_{2}, \dots, {head}_{h}) W^{O},

(4)

where

{head}_{i} = AttentionPMD (Q, K, V)

, and

W^{O} \in R^{h T D \times T D}

transform the concatenated attention matrices to a vector of length

T D

, where the missing positions indicated by the missing mask matrix M are filled with values by considering all other values besides itself.

The UASA upstream model involves three main components that determine its computational cost: (1) the PMD model, (2) feed-forward network layers, and (3) repeated forward passes for uncertainty estimation. Specifically, each time step and dimension is treated as a token, so if the time series has T steps and D dimensions, the effective token length is

N = T \times D

. A single self-attention layer requires

O (N^{2} \cdot d_{k})

operations per head, where

d_{k}

is the key/query dimension. With h attention heads and L stacked layers, this scales to

O (L \times h \times {(T D)}^{2} \times d_{k})

for one forward pass. The feed-forward sub-blocks in each layer add a lower-order term of

O (L \times T D \times d_{ff})

. Finally, to generate the uncertainty map, the model repeats the forward pass K times with different artificial masks—thus multiplying the overall cost by

(K + 1)

. Although PMD introduces an extra masking operation, it adds minimal overhead since masking specific diagonal elements is essentially a constant-time step per token. Consequently, UASA retains the same

O ({(T D)}^{2})

complexity profile typical of attention architectures and remains practical for moderate T and D in real-world time series tasks.

3.2.2. Model Forward Propagation

After introducing the PMD self-attention layer, we now introduce the forward propagation of the USUA model. As shown in Figure 1, the UASA model receives a value

X_{t, d} \in R

and a mask

M_{t, d} \in R

. They are firstly concatenated and projected to

d_{m}

dimensions. Added with a broadcasted positional embedding, we obtain a new vector

h_{1} \in R^{1 \times d_{m}}

:

h_{1} = Concat (X_{t, d}, M_{t, d}) W_{1} + b_{1} + Embed (t),

(5)

where the parameter

W_{1} \in R^{2 \times d_{m}}

and

b_{1} \in R^{1 \times d_{m}}

. Sequentially, the vector

h_{1}

is fed into our PMD self-attention layers and a feed-forward network:

h_{2} = MultiHeadAttentionPMD (h_{1}) W_{2} + b_{2},

(6)

where the parameter

W_{2} \in R^{T D \times d_{m}}

and

b_{2} \in R^{1 \times d_{m}}

. In this layer, the hidden states

h_{1}

and

h_{2}

have the same shape and we stack the self-attention layer for N times. Sequentially, we have

h_{1 : 1 + N}

with the parameters

W_{1 : 1 + N}

and

b_{1 : 1 + N}

. Finally, we add two linear layers to project

h_{2 + N} \in R^{d_{m}}

to

R^{D}

, in the same way as the input:

{\hat{X}}_{t, d} = ReLU (h_{1 + N} W_{2 + N} + b_{2 + N}) W_{3 + N} + b_{3 + N} .

(7)

At this step, we have the output

{\hat{X}}_{t, d}

for every dimension in the multivariate time series data X.

When a large percentage of input values is missing, the model tends to generate uncertain data and negatively influence the downstream tasks. To this end, we perform another K forward propagation to obtain an uncertainty map, which indicates the level of uncertainty for each imputed value. For the k-th forward propagation, we artificially mask some percentage of the observed values and obtain a data representation

{\tilde{X}}^{k}

. After K forward propagation, we have

{\tilde{X}}^{1}, {\tilde{X}}^{2}, \dots, {\tilde{X}}^{K}

. The corresponding uncertainty map is

U_{t, d} = \{\begin{matrix} 0, & if M_{t, d} = 1, \\ \frac{1}{K} \sum_{k = 1}^{K} {({\tilde{X}}_{t, d}^{k} - {\bar{X}}_{t, d})}^{2}, & otherwise, \end{matrix}

(8)

where

{\bar{X}}_{t, d} = \frac{1}{K} \sum_{k = 1}^{K} {\tilde{X}}_{t, d}^{k}

. To summarize, the inference of UASA includes one forward propagation with the original input X and M for a data representation

\hat{X}

and K forward propagation with artificially masked inputs for an uncertainty map U.

3.3. UASA Downstream Model

In this part, we introduce the UASA downstream model, which receives the data representation

\hat{X}

and the uncertainty map U from the upstream model, and generate results for the targeted tasks. Here, we leverage an RNN to process the learned representation

\hat{X}

and the uncertainty map U.

The downstream model receives a time series

{\hat{X}}_{1}

, …,

{\hat{X}}_{T}

and an uncertainty series

U_{1}

, …,

U_{T}

. To handle the sequential nature of the input data and capture long-term dependencies, we use a stack of Long Short-Term Memory (LSTM) units in the downstream model:

h_{t}, c_{t} = LSTM (Concat ({\hat{X}}_{t}, U_{t}), h_{t - 1}, c_{t - 1}) for t = 1, \dots, T .

(9)

h_{0}

and

c_{0}

are zeros. After processing the input through the LSTM, the final hidden state

h_{T}

is passed through a fully connected layer. Here, we use the classification task as the downstream task, the output of the connected layer is fed to a Softmax function, which produces the likelihood for each class. Mathematically, this is represented as

y = Softmax (h_{T} W_{D} + b_{D}),

(10)

where

W_{D}

and

b_{D}

are the weights and biases of the fully connected layer.

3.4. Learning Objective

To learn an effective representation for the downstream task, we propose an end-to-end objective, which simultaneously updates the upstream model with parameter

θ

and the downstream model with parameter

ϕ

:

L (θ, ϕ) = L_{m a i n} (θ, ϕ) + α_{1} L_{M I T} (θ) + α_{2} L_{O R T} (θ) .

(11)

The first term

L_{m a i n}

is the objective of the downstream tasks. The second term

L_{M I T}

refers to the masked imputation task (MIT), encouraging the upstream model to impute the missing value with plausible values. The third term

L_{O R T}

is the observed reconstruction task (ORT), expecting the upstream model to reconstruct the observed data.

α_{1}

and

α_{2}

are hyperparameters to control the strength of the regularization. The details of

L_{m a i n}

,

L_{M I T}

, and

L_{O R T}

are detailed as follows.

The main objective is optimized to solve the downstream tasks. The downstream tasks could be a classification problem or a forecasting problem. For classification problems with C classes, the objective is the cross-entropy:

L_{m a i n} (θ, ϕ) = E_{(X, y) \sim D} [log M_{θ, ϕ} (y ∣ X, M)],

(12)

where

M_{θ, ϕ} (y ∣ X, M)

is the probability on the class y given the time series X and its missing mask matrix M. When we aim to solve a forecasting problem, the objective is to minimize the mean square error (MSE):

L_{m a i n} (θ, ϕ) = E_{(X, y) \sim D} [| | y - M_{θ, ϕ} (y ∣ X, M) {| |}_{2}^{2}],

(13)

where

M_{θ, ϕ} (y ∣ X, M)

is the predicted next state of the given time series X and its missing mask matrix M.

This masked imputation task (MIT) objective aims to impute the missing values in the time series with plausible data. We artificially mask a percentage of the observed values as unobserved, and use the upstream model to predict the masked values with the rest of the observed values. The artificially masked values are indicated by a matrix

I_{t, d} = \{\begin{matrix} 1, & M_{t, d} = 1 and X_{t, d} is artificially masked, \\ 0, & otherwise . \end{matrix}

(14)

The objective aims to minimize the MSE between the ground truth of the artificially masked values and the predicted values from the upstream model. The objective function is defined as follows:

L_{M I T} (θ) = \frac{| | (M_{θ} (X) - X) ⊙ I {| |}_{F}^{2}}{{| | I | |}_{F}^{2}},

(15)

where

| | \cdot {| |}_{F}^{2}

is squared Frobenius norm, the sum of the squares of all the elements in a matrix.

After model processing, observed values may be different from their corresponding output. In this step, the observed reconstruction task (ORT) objective encourages the upstream output to reconstruct the input data and converge to the real data distribution. We define a matrix indicating the values that are observed and not artificially masked:

O_{t, d} = \{\begin{matrix} 1, & if M_{t, d} = 1 and X_{t, d} is not artificially masked, \\ 0, & otherwise . \end{matrix}

(16)

O_{t, d} = 1

means that the value is observable and not artificially masked. The objective function is defined as follows:

L_{O R T} (θ) = \frac{| | (M_{θ} (X) - X) ⊙ O {| |}_{F}^{2}}{{| | O | |}_{F}^{2}} .

(17)

This objective minimizes the MSE between the observed, unmasked data and their predictions.

4. Experimental Results

In the experiments, time series prediction was benchmarked using the newly developed AUST-Gait dataset, which encompasses classification, forecasting, and rare event detection tasks. To comprehensively evaluate the performance of the UASA model, we also incorporated the Multi-Modal Gait Database [78], a dataset focused on analyzing the natural walking behaviors of healthy participants. The similarity between these two datasets in gait analysis allows us to validate the model’s effectiveness under varying experimental conditions. Additionally, we utilized the publicly available PhysioNet-2012 dataset [79], which contains extensive physiological time series data widely used in clinical research.

During the experiments, all time series data processing and model training were conducted using an NVIDIA RTX 3090 GPU, a widely accessible piece of hardware in many research environments. Based on our experience, the NVIDIA RTX 3090 GPU is sufficient for non-real-time applications. For uncertainty estimation via ten forward passes (K = 10), a single run consumed less than 1 s. This indicates that the overall resource consumption is relatively low, making the experiments reproducible under standard workstation conditions and feasible for broader implementation.

Through the comparative analysis of these three datasets, the UASA model demonstrated SOTA performance across AUST-Gait, the Multi-Modal Gait Database, and PhysioNet-2012, further confirming the model’s generalization capabilities and applicability. Ablation studies indicate that the performance enhancements of the UASA model can be attributed to the PMD mechanism, uncertainty quantification, and the end-to-end learning approach, all of which contribute to the model’s efficacy in handling complex time series data.

4.1. Dataset Description

The performance of the UASA model was evaluated using three datasets, focusing on its ability to handle missing data, classification tasks, and prediction tasks. The first dataset, PhysioNet-2012, is significant in time series forecasting due to its comprehensive physiological signal collection. The second dataset, AUST-Gait, provides professional gait data reflecting various clinical patterns. The third dataset, the Multi-Modal Gait Database, enhances understanding of the model’s performance across different gait patterns. By integrating these datasets, a thorough assessment of the model’s adaptability and effectiveness under various conditions can be achieved, deepening insights into the UASA model’s performance.

4.1.1. AUST-Gait Dataset

To achieve accurate gait pattern recognition and prediction, a clinical gait recognition and prediction dataset was constructed using a single kinematic sensor. Data collection and prediction validation were carried out using a hip-joint lower limb exoskeleton system that integrates high-performance power units and IMU sensors, encompassing various daily walking modes. We recruited 65 healthy adults (age:

23 \pm 2

years, height:

1.74 \pm 0.05

m, weight:

80 \pm 8.6

kg), excluding individuals under 20 years of age, those with physical activity restrictions due to injury or illness, and those with a history of lower limb surgery, neurological disorders, or cardiovascular diseases. The study was approved by the Biomedical Research Ethics Committee of Anhui University of Science and Technology. All participants provided written informed consent in accordance with the Declaration of Helsinki, allowing their participation in the study and the public release of gait data and images. The AUST-Gait dataset is publicly available. Github repository: https://github.com/LIbbbao/AUST_gait (accessed on 13 March 2025).

All participants underwent appropriate training by professional clinicians and performed four different walking modes under supervised conditions. The four walking modes include the following:

Flat Ground (FG): Mimicking daily walking patterns, participants walked back and forth 3 times on a 6 m long, $1.2$ m wide track.
Stair Ascent (SA): Participants ascended a staircase with 10 steps, each step having a height of 15 cm, repeating this action three times.
Stair Descent (SD): Participants descended a staircase with 10 steps, each step having a height of 15 cm, repeating this action three times.
Inclined Treadmill (IT): Participants walked on a treadmill set at a 20° incline for 45 s.

By integrating IMU sensors with hip joint velocity and angle sensors, a comprehensive set of kinematic features was collected, including acceleration, angular velocity, Euler angles, and hip joint velocity and angles (as shown in Figure 3). These features provide high-quality data inputs for deep learning algorithms. Figure 4 illustrates the data collection setup, where volunteers wore lower-limb exoskeleton devices under standardized conditions to complete data acquisition for four gait patterns.

During the experiments, we strictly performed a person-wise split when dividing the dataset. Specifically, the data in the training and testing sets came exclusively from different individuals to ensure that no data from the same individual could simultaneously appear in both the training and testing sets.

4.1.2. PhysioNet-2012 Dataset

The PhysioNet-2012 dataset, a publicly available clinical dataset, is widely used for predicting outcomes in Intensive Care Unit (ICU) patients. The dataset is derived from the MIMIC-II database of the Beth Israel Deaconess Medical Center (BIDMC) at the Massachusetts Institute of Technology. It includes detailed records of over 8000 ICU patients. The types of data encompass demographic information, time series physiological measurements, laboratory results, treatment details, and patient outcomes.

The time series data in the dataset cover approximately 42 physiological variables, including heart rate, blood pressure, respiratory rate, and oxygen saturation. The duration of data records for each patient ranges from several hours to several days. The data are typically stored in Comma-Separated Values (CSV) format and organized by patient ID. Given the extensive amount of physiological time series data, preprocessing steps such as data cleaning, handling missing values, and standardization are crucial.

4.1.3. Multi-Modal Gait Database

The Multi-Modal Gait Database is a publicly available urban gait dataset focused on analyzing the natural walking behavior of 20 healthy participants aged between 18 and 69. Each participant wore a full-body suit equipped with 9 IMU sensors and foot pressure sensors. The data were sampled at a frequency of 60 Hz, accumulating a total of 846,715 data points. This study conducted detailed time series data analysis using 16 pressure sensors and triaxial accelerometers from various body parts to identify multiple walking patterns, including flat walking, stair climbing, and slope walking. The data are organized in CSV format, categorized by participant ID, with recording times ranging from a few minutes to several hours, providing a solid foundation for data analysis and pattern recognition research.

4.2. Baselines

Median: Median imputation, by replacing missing time series values with the median of adjacent observations such as using the population median for missing data, offers a straightforward and efficient solution for continuous predictive variables [80].
M-RNN: M-RNN effectively captures dynamic patterns through a multi-range temporal modeling mechanism, utilizing specialized neural structures to handle incomplete time series data. It demonstrates exceptional practicality and effectiveness in enhancing the accuracy of clean energy predictions [81].
SAITS: SAITS applies a pure self-attention mechanism for time series imputation, overcoming limitations of traditional RNN [22]. It achieves significant accuracy improvements and efficiency over leading models, with a compact architecture that supports diverse predictive tasks.
Transformer: Transformers use self-attention to capture long-range dependencies, enabling efficient global modeling. With parallel processing and multi-head attention, they excel in tasks like forecasting, anomaly detection, and data imputation.
CATSI: The CATSI method employs bidirectional LSTM networks to enhance time series data imputation accuracy, particularly in medical data tasks [67]. It effectively captures global contextual information and manages temporal dependencies, offering precise and continuous imputation. This approach provides a fresh perspective and shows significant potential across various applications.
TFT: TFT introduces a hybrid architecture that combines attention mechanisms for temporal patterns and gating mechanisms for variable selection. This design enhances the model’s ability to handle multivariate time series tasks, providing both accurate predictions and interpretable results across diverse applications, particularly in forecasting and anomaly detection [53].

4.3. Classification Tasks

Classification is a pivotal downstream task in time series prediction and serves as an essential benchmark for evaluating model performance. In our study, we employed two distinct datasets, the AUST-Gait dataset and the PhysioNet-2012 dataset, to thoroughly assess the classification capabilities of the UASA model. The AUST-Gait dataset comprises 92,110 samples distributed across four classes: FG, SA, SD, and IT. This dataset was partitioned into a

70 %

training set, a

20 %

test set, and a

10 %

validation set. Each input feature was characterized by time steps

T = 10

and dimension

D = 13

. To emulate real-world scenarios of missing data, we randomly omitted

20 %

,

35 %

,

50 %

,

65 %

, and

80 %

of the data values, and experimental variance (standard deviation) was obtained through five independent runs of the same model with fixed data splits. Standard deviation bars (shown in Figure 5 and Figure 6) represent these repeated experiments. Performance was evaluated using ROC-AUC, PR-AUC, and F1-SCORE metrics. As illustrated in Figure 5, the UASA model demonstrated remarkable robustness against varying levels of data incompleteness. Specifically, at a

20 %

missing rate, the UASA model achieved ROC-AUC, PR-AUC, and F1-SCORE of 0.98, 0.22, and 0.92, respectively. Even with

80 %

data missingness, these scores remained high at 0.86, 0.20, and 0.45. In contrast, baseline models such as Median, M-RNN, Transformer, and SAITS underperformed noticeably under higher missing data proportions, underscoring the UASA model’s superior ability to handle incomplete datasets—an essential trait for practical applications.

The PhysioNet-2012 dataset was employed to further evaluate the performance of the UASA model in classification tasks. This dataset is a classic benchmark in the field of time series prediction, used for training binary classifiers to predict patient survival outcomes. Unlike the AUST-Gait experiments, the PhysioNet-2012 dataset inherently contains missing values, reflecting the complexity of real-world clinical data. The model’s performance was assessed based on these inherent patterns of missing data.

Figure 6 empirically demonstrates that the UASA model achieves SOTA performance in terms of ROC-AUC, PR-AUC, and F1-SCORE. Specifically, the UASA model achieved a ROC-AUC of 0.88, a PR-AUC of 0.52, and an F1-SCORE of 0.44. Despite the inherent missing values in the dataset, the UASA model maintained its performance advantage, demonstrating its robustness and reliability in handling imperfect datasets. This resilience highlights the model’s capability to provide consistent predictive accuracy in critical healthcare applications, particularly where data quality may be unstable and incomplete.

The Multi-Modal Gait Database was utilized to further evaluate the performance of the UASA model in classification tasks. The Multi-Modal Gait Database inherently contains missing values, reflecting the complexities of real-world gait data. The model’s performance was assessed based on these intrinsic patterns of missing data, ensuring a comprehensive understanding of its effectiveness and robustness in practical applications. Table 2 provides a detailed comparison of the performance of various models on the Multi-Modal Gait Database. The comparison reveals that the UASA model still achieves SOTA performance across all metrics.

4.4. Forecasting Tasks

To assess the performance of the USUA model in forecasting future time steps, a dataset comprising 92,110 samples was constructed from the AUST-Gait dataset. In this task, the algorithms are required to predict the future 10 time steps given the past 10 time steps. The average MSE (AVE_MSE) across the different time steps is calculated by

AVE_MSE = \frac{1}{10} \sum_{t = 1}^{10} | | X_{t} - {\hat{X}}_{t} {| |}_{2}^{2},

(18)

where

{\hat{X}}_{t}

is the predicted future step at time step t. The results are presented in Figure 7, demonstrating that UASA maintains the lowest error across different data missing rates. Furthermore, the prediction error across different time steps with a data missing rate of

20 %

is evaluated by

MSE (t) = | | X_{t} - \hat{X} {| |}_{2}^{2} .

(19)

Figure 8 illustrates the performance of UASA in multi-step forecasting tasks. The left panel shows the average MSE across different prediction step sizes (1 to 10) under a

20 %

missing rate. As the step size increases, all models experience increasing prediction errors. UASA, however, exhibits the slowest error growth. The middle plot shows the prediction errors under a

50 %

missing rate. Although the missing rate increases, UASA consistently outperforms baseline models across all prediction step sizes. The right plot shows the prediction errors under a

65 %

missing rate. Even with a high missing rate, UASA demonstrates significantly lower error growth compared with baseline models, showcasing its robustness. These results demonstrate that UASA achieves significantly better error growth curves than baseline methods in both increasing prediction steps and varying missing rates, demonstrating its superior long-term stability and generalization capabilities in sequence forecasting.

4.5. Fall Detection Tasks

Fall events, though rare, are high-risk in gait analysis. To evaluate our model’s effectiveness in detecting these events, we curated a dataset with diverse activities, including flat surface walking, stair climbing, stair descending, and inclined treadmill walking. By integrating fall data with existing datasets, we assess the model’s performance under various masking levels.

This fall detection task is formulated as a binary classification problem involving sparse labels—only about 5% of the labeled windows contain a fall. We segment the time series into fixed-length windows and assign each window a label of either “fall” or “non-fall”. The model’s goal is to predict whether a fall event occurs in each window, addressing both imbalance (due to the rarity of fall events) and temporal segmentation in a unified framework.

Initially, classification performance is evaluated using the AUC-ROC curve, a key metric for binary classification. The x-axis represents the False Positive Rate (FPR) and the y-axis the True Positive Rate (TPR). An AUC value close to 1 indicates superior performance, calculated as:

AUC = \int_{0}^{1} TPR (x) d FPR (x) .

Figure 9 shows the UASA model’s AUC-ROC curves under different masking conditions (

0 %

,

20 %

,

35 %

). At

20 %

masking, the AUC is nearly 1, indicating high accuracy without missing data. Even at

20 %

and

35 %

masking, AUC values remain strong at 0.987 and 0.968, showing resilience despite data omissions.

To analyze false alarms, confusion matrices are used, which include False Positives (FP), True Negatives (TN), and False Negatives (FN). The FPR is calculated as

FPR = \frac{FP}{FP + TN} .

Figure 10 presents confusion matrices for different masking levels. Notably, at

0 %

masking, the UASA model has zero false positives (FP = 0), demonstrating high reliability. At

20 %

and

35 %

masking, false positives are minimal at 1 and 2, maintaining a low false alarm rate. In summary, the UASA model excels in classification performance and maintains a low false alarm rate across data omission levels, highlighting its potential in fall detection applications where safety and rapid response are critical.

4.6. Ablation Studies

To rigorously evaluate the contributions of individual components within the UASA model, a systematic series of ablation experiments was conducted on the AUST-Gait dataset, which is widely used for time series classification tasks. By incrementally excising key modules, the specific contributions of each component to the predictive accuracy and robustness of time series analysis were quantified. Table 3 details the performance variations following the removal of each component.

4.6.1. Key Model Components

Table 3 illustrates the ablation study results of the UASA model on the classification task of the AUST-Gait dataset, highlighting the exceptional performance of the complete model across various metrics, particularly in handling complex datasets. When the PMD self-attention layer was removed, the model’s accuracy plummeted from

97.7 %

to

92.5 %

, and the F1-SCORE dropped from

94.1 %

to

92.4 %

. This underscores the critical role of the PMD mechanism in addressing missing data. Additionally, the removal of the uncertainty map function and the reduction in the number of attention heads led to reductions in the accuracy of

3.2 %

and

1.2 %

, respectively. These findings further emphasize the importance of these components in enhancing the model’s stability and precision. Although the attention heads mechanism had a relatively minor impact, causing just a

1.2 %

decrease in accuracy, it still contributed to incremental performance gains, reinforcing the overall architecture.

Overall, the PMD mechanism with the uncertainty map and the attention heads provides empirical support for the UASA model design. In complex time series tasks, the model achieved a high accuracy of

97.7 %

and an F1-SCORE of

94.1 %

. This study validates the role of key components in the UASA model and guides future efforts to enhance predictive accuracy and robustness in data analysis tasks, laying a solid foundation for expanding the model’s applications.

4.6.2. Training Strategy Comparison

This study compares the performance of the “end-to-end” strategy with the “Train Separately” strategy, as detailed in Table 4. The “end-to-end” approach demonstrates significant advantages by conducting training and testing in a unified process, thereby fully leveraging global data insights to optimize model parameters. In contrast, the “Train Separately” strategy, which handles training and testing in distinct phases, may offer benefits in specific scenarios but underperforms in this study. This underperformance is likely due to its inability to capture comprehensive global features. The findings indicate that the “end-to-end” strategy is more effective for handling complex time series data, providing crucial insights for selecting training strategies in similar tasks and underscoring the importance of global optimization in model training.

Overall, the synergistic integration of the UASA model components results in a robust framework for time series prediction, particularly effective in scenarios involving missing data. This unified training strategy demonstrates substantial superiority over traditional methods, achieving higher accuracy and reduced prediction error, thereby underscoring its potential in complex data environments.

4.7. Sensitivity to Hyperparameters

The hyperparameter K in the UASA model determines the number of forward passes required to generate the uncertainty map. To evaluate the impact of K on classification accuracy and forecasting performance (measured by Average Mean Squared Error (MSE)), we conducted sensitivity analysis experiments. The results are summarized in Table 5.

The results show that as K increases, the classification task accuracy improves steadily, but the rate of improvement diminishes. This indicates that increasing the number of forward passes enhances the quality of uncertainty estimation and improves model performance, especially for smaller values of K. However, when K becomes larger, the performance gains plateau, and further increases in K yield diminishing returns. Similarly, for the forecasting task, as K increases, the average MSE decreases, but the improvements gradually diminish and stabilize.

From Table 5, it can be observed that while K = 20 achieves slightly better results, the performance gains compared with K = 10 are marginal. Considering that a higher K significantly increases computational complexity, we recommend selecting K = 10 as it provides a balance between performance and computational cost. At K = 10, the model achieves near-optimal classification accuracy and forecasting error with significantly reduced computational overhead.

We also analyzed the sensitivity of the hyperparameters

α_{1}

and

α_{2}

, which control the weights of the MIT and ORT objectives, respectively. The experiments were conducted under two missing data scenarios: 50% missing data (Table 6) and 30% missing data (Table 7). In the 50% missing data scenario, classification performance strongly depends on the balance between

α_{1}

and

α_{2}

. The highest accuracy is achieved when

α_{1} \approx α_{2}

. In the experiments,

α_{1}

=

α_{2}

= 5 results in 96.8% classification accuracy, which outperforms other combinations. However, further increasing the weights (e.g., to 10) produces diminishing returns and may slightly lower accuracy. These results indicate that under high data missingness, balancing the optimization efforts between MIT and ORT is crucial for achieving optimal classification performance.

In the 30% missing data scenario, the model benefits more from emphasizing the reconstruction task, as data quality is relatively higher. The best classification performance occurs when

α_{1} < α_{2}

(e.g.,

α_{1}

= 2,

α_{2}

= 5 or

α_{1}

= 5,

α_{2}

= 10), achieving accuracies of 97.1% and 97.6%, respectively. While balanced weights (e.g.,

α_{1}

=

α_{2}

= 5) perform slightly worse than the optimal configurations, the difference is minimal, showcasing the robustness of weight balancing.

5. Conclusions and Discussions

This study presents the UASA model, a novel approach addressing missing values and uncertainty in time series data by integrating uncertainty estimation with a partial diagonal zeroing mechanism. Through both data imputation and prediction tasks, the proposed model has shown outstanding performance, surpassing state-of-the-art benchmarks in classification, forecasting, and rare event detection tasks.

In our experiments on the AUST-Gait dataset, the UASA model achieved a ROC-AUC of 99.5%, a PR-AUC of 58.5%, and an F1-SCORE of 49.3% in classification tasks, even under high missing data rates. For forecasting tasks, the model obtained an MSE of 0.71 under 0% missing data and maintained a low average MSE of 0.74 across all missing rates, underscoring its robustness and adaptability. Ablation studies further highlight the importance of the PMD self-attention layer and multi-head architecture: removing the PMD mechanism decreased the model’s accuracy from 97.7% to 92.5%, emphasizing its pivotal role in capturing complex time series features and modeling long-term dependencies.

Despite these promising results, certain limitations warrant further investigation. In particular, handling very long sequences can pose computational challenges due to the quadratic scaling inherent to self-attention mechanisms, and the partial diagonal zeroing strategy may require additional refinements to capture extremely long-range dependencies effectively. Moreover, while our current experiments indicate that the UASA model’s computational demands are relatively modest for most non-real-time applications, resource-constrained or real-time scenarios may require more efficient uncertainty quantification (e.g., using dropout rather than multiple forward passes) and model compression techniques such as structured pruning and model distillation. These optimizations would help reduce inference latency and memory usage without compromising performance. Model optimization is crucial for broader deployment scenarios. Future research will explore quantization techniques (e.g., float16) to reduce computational overhead, alongside knowledge distillation methods to achieve a better balance between performance and efficiency, thereby enhancing the model’s adaptability and applicability in real-time and resource-constrained environments.

In future research, there is potential to further extend UASA for domain-specific applications such as healthcare monitoring and financial market forecasting, where it can leverage convolutional neural networks to better capture intricate temporal patterns and employ reinforcement learning to enhance the self-attention decision-making process. We also intend to explore the model’s behavior on diverse time series types and varying sequence lengths, seeking strategies to mitigate potential performance or memory bottlenecks. By addressing these challenges, we aim to solidify UASA’s practical utility and generalizability across a wide range of real-world time series applications.

Author Contributions

J.L., data collection, methodology, experiments, visualization, and writing; C.W., funding acquisition, project administration, resources, supervision, and proofreading; W.S., data collection; D.Y., methodology and writing; Z.W., methodology, project administration, and writing. All authors have read and agreed to the published version of the manuscript.

Funding

This work was supported by the University Synergy Innovation Program of Anhui Province (grant numbers GXXT-2022-053); Anhui New Era Education Quality Project (Graduate Education); Provincial Graduate students “Innovation and Entrepreneurship Star” (grant number 2022cxcyzx127); and the Open Fund of Anhui Key Laboratory of Mine Intelligent Equipment and Technology (No. ZKSYS202201).

Institutional Review Board Statement

This study was approved by the Biomedical Research Ethics Committee of Anhui University of Science and Technology (Anhui, China) in accordance with the Declaration of Helsinki. All methods were carried out in accordance with relevant guidelines and regulations. Written informed consent was obtained from all individual patients included in the study.

Data Availability Statement

The original data presented in the study are openly available in https://github.com/LIbbbao/AUST_gait (accessed on 13 March 2025). Any further question is available by the first author or corresponding author on request.

Acknowledgments

The authors thank the volunteers who contributed to the experiment.

Conflicts of Interest

The authors declare no conflicts of interest.

References

Garg, S.; Krishnamurthi, R. A survey of long short term memory and its associated models in sustainable wind energy predictive analytics. Artif. Intell. Rev. 2023, 56, 1149–1198. [Google Scholar] [CrossRef]
Tanase, M.A.; Mihai, M.C.; Miguel, S.; Cantero, A.; Tijerin, J.; Ruiz-Benito, P.; Domingo, D.; Garcia-Martin, A.; Aponte, C.; Lamelas, M.T. Long-term annual estimation of forest above ground biomass, canopy cover, and height from airborne and spaceborne sensors synergies in the Iberian Peninsula. Environ. Res. 2024, 259, 119432. [Google Scholar] [CrossRef]
He, R.; Zhang, L.; Chew, A.W.Z. Modeling and predicting rainfall time series using seasonal-trend decomposition and machine learning. Knowl.-Based Syst. 2022, 251, 109125. [Google Scholar] [CrossRef]
Barra, S.; Carta, S.M.; Corriga, A.; Podda, A.S.; Recupero, D.R. Deep learning and time series-to-image encoding for financial forecasting. IEEE/CAA J. Autom. Sin. 2020, 7, 683–692. [Google Scholar] [CrossRef]
Yilmaz, M.; Keskin, M.M.; Ozbayoglu, A.M. Algorithmic stock trading based on ensemble deep neural networks trained with time graph. Appl. Soft Comput. 2024, 163, 111847. [Google Scholar] [CrossRef]
Romanowski, K.; Law, M.R.; Karim, M.E.; Campbell, J.R.; Hossain, M.B.; Gilbert, M.; Cook, V.J.; Johnston, J.C. Healthcare utilization after respiratory tuberculosis: A controlled interrupted time series analysis. Clin. Infect. Dis. 2023, 77, 883–891. [Google Scholar] [CrossRef] [PubMed]
Morid, M.A.; Sheng, O.R.L.; Kawamoto, K.; Abdelrahman, S. Learning hidden patterns from patient multivariate time series data using convolutional neural networks: A case study of healthcare cost prediction. J. Biomed. Inform. 2020, 111, 103565. [Google Scholar] [CrossRef]
Li, J.; Wang, Z.; Wang, C.; Su, W. GaitFormer: Leveraging Dual-Stream Spatial–Temporal Vision Transformer via a Single Low-Cost RGB Camera for Clinical Gait Analysis. Knowl.-Based Syst. 2024, 295, 111810. [Google Scholar] [CrossRef]
Wang, Z.; Deligianni, F.; Voiculescu, I.; Yang, G.-Z. A Single RGB Camera Based Gait Analysis with a Mobile Tele-Robot for Healthcare. In Proceedings of the 2021 43rd Annual International Conference of the IEEE Engineering in Medicine & Biology Society (EMBC), Virtual, 1–5 November 2021; pp. 6933–6936. [Google Scholar]
Wang, Y.; Zhou, X.; Ao, Z.; Xiao, K.; Yan, C.; Xin, Q. Gap-filling and missing information recovery for time series of MODIS data using deep learning-based methods. Remote Sens. 2022, 14, 4692. [Google Scholar] [CrossRef]
Campagner, A.; Barandas, M.; Folgado, D.; Gamboa, H.; Cabitza, F. Ensemble Predictors: Possibilistic Combination of Conformal Predictors for Multivariate Time Series Classification. IEEE Trans. Pattern Anal. Mach. Intell. 2024, 46, 7205–7216. [Google Scholar] [CrossRef]
Wang, P.; Zhang, Q.; Qu, H.; Xu, X.; Yang, S. Time series prediction for production quality in a machining system using spatial-temporal multi-task graph learning. J. Manuf. Syst. 2024, 74, 157–179. [Google Scholar] [CrossRef]
Orang, O.; Bitencourt, H.V.; de Souza, L.A.F.; de Oliveira Lucas, P.; Silva, P.C.L.; Guimarães, F.G. Multiple-Input Multiple-Output Randomized Fuzzy Cognitive Map Method for High-Dimensional Time Series Forecasting. IEEE Trans. Fuzzy Syst. 2024, 32, 3703–3715. [Google Scholar] [CrossRef]
Li, W.; Yao, Z.; Pan, X.; Wei, Z.; Jiang, B.; Wang, J.; Xu, M.; Cui, Y. A ground-independent method for obtaining complete time series of in situ evapotranspiration observations. J. Hydrol. 2024, 632, 130888. [Google Scholar] [CrossRef]
Ji, S.; Ni, H.; Hu, T.; Sun, J.; Yu, H.; Jin, H. DT-CEPA: A digital twin-driven contour error prediction approach for machine tools based on hybrid modeling and sparse time series. Robot. Comput.-Integr. Manuf. 2024, 88, 102738. [Google Scholar] [CrossRef]
Karkaria, V.; Goeckner, A.; Zha, R.; Chen, J.; Zhang, J.; Zhu, Q.; Cao, J.; Gao, R.X.; Chen, W. Towards a digital twin framework in additive manufacturing: Machine learning and bayesian optimization for time series process optimization. J. Manuf. Syst. 2024, 75, 322–332. [Google Scholar] [CrossRef]
Huang, C.S.; Le, Q.T.; Su, W.C.; Chen, C.H. Wavelet-based approach of time series model for modal identification of a bridge with incomplete input. Comput.-Aided Civ. Infrastruct. Eng. 2020, 35, 947–964. [Google Scholar] [CrossRef]
Kuranga, C.; Muwani, T.S.; Ranganai, N. A multi-population particle swarm optimization-based time series predictive technique. Expert Syst. Appl. 2023, 233, 120935. [Google Scholar] [CrossRef]
Zhang, M.; Ding, D.; Pan, X.; Yang, M. Enhancing time series predictors with generalized extreme value loss. IEEE Trans. Knowl. Data Eng. 2021, 35, 1473–1487. [Google Scholar] [CrossRef]
Tu, Y.; Tu, F.; Yang, Y.; Qian, J.; Wu, X.; Yang, S. Optimization of battery charging and discharging strategies in substation DC systems using the dual self-attention network-N-BEATS model. Sci. Prog. 2024, 107, 00368504241274999. [Google Scholar] [CrossRef]
Rußwurm, M.; Körner, M. Self-attention for raw optical satellite time series classification. ISPRS J. Photogramm. Remote Sens. 2020, 169, 421–435. [Google Scholar] [CrossRef]
Du, W.; Côté, D.; Liu, Y. SAITS: Self-Attention-based Imputation for Time Series. Expert Syst. Appl. 2023, 219, 119619. [Google Scholar] [CrossRef]
Siddique, J.; Harel, O.; Crespi, C.M. Addressing Missing Data Mechanism Uncertainty Using Multiple-Model Multiple Imputation: Application to a Longitudinal Clinical Trial. Ann. Appl. Stat. 2012, 6, 1814. [Google Scholar] [CrossRef] [PubMed]
Wang, J.; Zhang, Y.; Wang, K.; Lin, X.; Zhang, W. Missing Data Imputation with Uncertainty-Driven Network. Proc. ACM Manag. Data 2024, 2, 117. [Google Scholar] [CrossRef]
Luo, Y.; Zhang, Y.; Cai, X.; Yuan, X. E²GAN: End-to-End Generative Adversarial Network for Multivariate Time Series Imputation. In Proceedings of the Twenty-Eighth International Joint Conference on Artificial Intelligence (IJCAI-19), Macao, China, 10–16 August 2019; pp. 3094–3100. [Google Scholar] [CrossRef]
Bagnall, A.; Lines, J.; Bostrom, A.; Large, J.; Keogh, E. The Great Time Series Classification Bake Off: A Review and Experimental Evaluation of Recent Algorithmic Advances. Data Min. Knowl. Discov. 2017, 31, 606–660. [Google Scholar] [CrossRef]
Fawaz, I.; Forestier, G.; Weber, J.; Idoumghar, L.; Muller, P.-A. Deep Learning for Time Series Classification: A Review. Data Min. Knowl. Discov. 2019, 33, 917–963. [Google Scholar] [CrossRef]
Lim, B.; Zohren, S. Time-Series Forecasting with Deep Learning: A Survey. Philos. Trans. R. Soc. A Math. Phys. Eng. Sci. 2021, 379, 20200209. [Google Scholar] [CrossRef] [PubMed]
Bostrom, A.; Bagnall, A. Binary Shapelet Transform for Multiclass Time Series Classification. In Transactions on Large-Scale Data- and Knowledge-Centered Systems XXXII: Special Issue on Big Data Analytics and Knowledge Discovery; Springer: Berlin/Heidelberg, Germany, 2017; pp. 24–46. [Google Scholar] [CrossRef]
Kate, R.J. Using Dynamic Time Warping Distances as Features for Improved Time Series Classification. Data Min. Knowl. Discov. 2016, 30, 283–312. [Google Scholar] [CrossRef]
Lines, J.; Bagnall, A. Time Series Classification with Ensembles of Elastic Distance Measures. Data Min. Knowl. Discov. 2015, 29, 565–592. [Google Scholar] [CrossRef]
Baydogan, M.G.; Runger, G.; Tuv, E. A Bag-of-Features Framework to Classify Time Series. IEEE Trans. Pattern Anal. Mach. Intell. 2013, 35, 2796–2802. [Google Scholar] [CrossRef]
Bagnall, A.; Lines, J.; Hills, J.; Bostrom, A. Time-Series Classification with COTE: The Collective of Transformation-Based Ensembles. In Proceedings of the 2016 IEEE 32nd International Conference on Data Engineering (ICDE), Helsinki, Finland, 16–20 May 2016; pp. 1548–1549. [Google Scholar] [CrossRef]
Lines, J.; Taylor, S.; Bagnall, A. HIVE-COTE: The Hierarchical Vote Collective of Transformation-Based Ensembles for Time Series Classification. In Proceedings of the 2016 IEEE 16th International Conference on Data Mining (ICDM), Barcelona, Spain, 12–15 December 2016; pp. 1041–1046. [Google Scholar] [CrossRef]
Lines, J.; Taylor, S.; Bagnall, A. Time Series Classification with HIVE-COTE: The Hierarchical Vote Collective of Transformation-Based Ensembles. ACM Trans. Knowl. Discov. Data 2018, 12, 52. [Google Scholar] [CrossRef]
Wang, Z.; Yan, W.; Oates, T. Time Series Classification from Scratch with Deep Neural Networks: A Strong Baseline. In Proceedings of the International Joint Conference on Neural Networks, Anchorage, AK, USA, 14–19 May 2017. [Google Scholar]
Geng, Y.; Luo, X. Cost-Sensitive Convolution Based Neural Networks for Imbalanced Time-Series Classification. arXiv 2018, arXiv:1801.04396. [Google Scholar]
Che, Z.; Cheng, Y.; Zhai, S.; Sun, Z.; Liu, Y. Boosting Deep Learning Risk Prediction with Generative Adversarial Networks for Electronic Health Records. In Proceedings of the 2017 IEEE International Conference on Data Mining (ICDM), Orleans, LA, USA, 18–21 November 2017; pp. 787–792. [Google Scholar] [CrossRef]
Che, Z.; Purushotham, S.; Cho, K.; Sontag, D.; Liu, Y. Recurrent Neural Networks for Multivariate Time Series with Missing Values. Sci. Rep. 2018, 8, 6085. [Google Scholar] [CrossRef]
Faust, O.; Hagiwara, Y.; Hong, T.J.; Lih, O.S.; Acharya, U.R. Deep Learning for Healthcare Applications Based on Physiological Signals: A Review. Comput. Methods Programs Biomed. 2018, 161, 1–13. [Google Scholar] [CrossRef] [PubMed]
Borovykh, A.; Bohte, S.; Oosterlee, C.W. Conditional Time Series Forecasting with Convolutional Neural Networks. arXiv 2018, arXiv:1703.04691. [Google Scholar]
van den Oord, A.; Dieleman, S.; Zen, H.; Simonyan, K.; Vinyals, O.; Graves, A.; Kalchbrenner, N.; Senior, A.; Kavukcuoglu, K. WaveNet: A Generative Model for Raw Audio. arXiv 2016, arXiv:1609.03499. [Google Scholar]
Wang, Z.; Voiculescu, I. Quadruple Augmented Pyramid Network for Multi-Class COVID-19 Segmentation via CT. In Proceedings of the 2021 43rd Annual International Conference of the IEEE Engineering in Medicine & Biology Society (EMBC), Virtual, 1–5 November 2021; pp. 2956–2959. [Google Scholar]
Bai, S.; Kolter, J.Z.; Koltun, V. An Empirical Evaluation of Generic Convolutional and Recurrent Networks for Sequence Modeling. arXiv 2018, arXiv:1803.01271. [Google Scholar]
Li, Z.; Cai, R.; Fu, T.Z.J.; Hao, Z.; Zhang, K. Transferable time-series forecasting under causal conditional shift. IEEE Trans. Pattern Anal. Mach. Intell. 2023, 46, 1932–1949. [Google Scholar] [CrossRef]
Yang, Z.; Yan, W.; Huang, X.; Mei, L. Adaptive temporal-frequency network for time-series forecasting. IEEE Trans. Knowl. Data Eng. 2020, 34, 1576–1587. [Google Scholar] [CrossRef]
Shen, L.; Wang, Y. TCCT: Tightly-coupled convolutional transformer on time series forecasting. Neurocomputing 2022, 480, 131–145. [Google Scholar] [CrossRef]
Deihim, A.; Alonso, E.; Apostolopoulou, D. STTRE: A Spatio-Temporal Transformer with Relative Embeddings for multivariate time series forecasting. Neural Netw. 2023, 168, 549–559. [Google Scholar] [CrossRef]
Cui, Y.; Li, Z.; Wang, Y.; Dong, D.; Gu, C.; Lou, X.; Zhang, P. Informer model with season-aware block for efficient long-term power time series forecasting. Comput. Electr. Eng. 2024, 119, 109492. [Google Scholar] [CrossRef]
Vaswani, A. Attention is all you need. In Proceedings of the Advances in Neural Information Processing Systems 30 (NIPS 2017), Long Beach, CA, USA, 4–9 December 2017. [Google Scholar]
Beltagy, I.; Peters, M.E.; Cohan, A. Longformer: The long-document transformer. arXiv 2020, arXiv:2004.05150. [Google Scholar]
Kitaev, N.; Kaiser, Ł.; Levskaya, A. Reformer: The efficient transformer. arXiv 2020, arXiv:2001.04451. [Google Scholar]
Lim, B.; Arık, S.Ö.; Loeff, N.; Pfister, T. Temporal fusion transformers for interpretable multi-horizon time series forecasting. Int. J. Forecast. 2021, 37, 1748–1764. [Google Scholar] [CrossRef]
Oreshkin, B.N.; Carpov, D.; Chapados, N.; Bengio, Y. N-BEATS: Neural basis expansion analysis for interpretable time series forecasting. arXiv 2019, arXiv:1905.10437. [Google Scholar]
Wu, H.; Xu, J.; Wang, J.; Long, M. Autoformer: Decomposition transformers with auto-correlation for long-term series forecasting. Adv. Neural Inf. Process. Syst. 2021, 34, 22419–22430. [Google Scholar]
Zhou, T.; Ma, Z.; Wen, Q.; Wang, X.; Sun, L.; Jin, R. Fedformer: Frequency enhanced decomposed transformer for long-term series forecasting. In Proceedings of the International Conference on Machine Learning, PMLR, Baltimore, MD, USA, 17–23 July 2022; pp. 27268–27286. [Google Scholar]
Luo, Y.; Cai, X.; Zhang, Y.; Xu, J.; Yuan, X. Multivariate Time Series Imputation with Generative Adversarial Networks. In Proceedings of the Advances in Neural Information Processing Systems (NeurIPS), Montreal, QC, Canada, 3–8 December 2018; Curran Associates, Inc.: New York, NY, USA, 2018; Volume 31. [Google Scholar]
Yoon, J.; Zame, W.R.; van der Schaar, M. Estimating Missing Data in Temporal Data Streams Using Multi-Directional Recurrent Neural Networks. IEEE Trans. Biomed. Eng. 2019, 66, 1477–1490. [Google Scholar] [CrossRef] [PubMed]
Cao, W.; Wang, D.; Li, J.; Zhou, H.; Li, L.; Li, Y. BRITS: Bidirectional Recurrent Imputation for Time Series. In Proceedings of the Advances in Neural Information Processing Systems (NeurIPS), Montreal, QC, Canada, 3–8 December 2018; Curran Associates, Inc.: New York, NY, USA, 2018; Volume 31. [Google Scholar]
Venkatraman, A.; Hebert, M.; Bagnell, J. Improving Multi-Step Prediction of Learned Time Series Models. Proc. AAAI Conf. Artif. Intell. 2015, 29, 1. [Google Scholar] [CrossRef]
Liu, Y.; Yu, R.; Zheng, S.; Zhan, E.; Yue, Y. NAOMI: Non-Autoregressive Multiresolution Sequence Imputation. In Proceedings of the Advances in Neural Information Processing Systems (NeurIPS), Vancouver, BC, Canada, 8–14 December 2019; Curran Associates, Inc.: New York, NY, USA, 2019; Volume 32. [Google Scholar]
Zhang, Y.; Zhou, B.; Cai, X.; Guo, W.; Ding, X.; Yuan, X. Missing Value Imputation in Multivariate Time Series with End-to-End Generative Adversarial Networks. Inf. Sci. 2021, 551, 67–82. [Google Scholar] [CrossRef]
Salimans, T.; Goodfellow, I.; Zaremba, W.; Cheung, V.; Radford, A.; Chen, X.; Chen, X. Improved Techniques for Training GANs. In Proceedings of the Advances in Neural Information Processing Systems (NeurIPS), Barcelona, Spain, 5–10 December 2016; Curran Associates, Inc.: New York, NY, USA, 2016; Volume 29. [Google Scholar]
Ma, J.; Shou, Z.; Zareian, A.; Mansour, H.; Vetro, A.; Chang, S.-F. CDSA: Cross-Dimensional Self-Attention for Multivariate, Geo-tagged Time Series Imputation. arXiv 2019, arXiv:1905.09904. [Google Scholar]
Bansal, P.; Deshpande, P.; Sarawagi, S. Missing Value Imputation on Multidimensional Time Series. In Proceedings of the International Conference on Very Large Data Bases (VLDB), Vancouver, BC, Canada, 28 August–1 September 2023. [Google Scholar]
Shan, S.; Li, Y.; Oliva, J.B. NRTSI: Non-Recurrent Time Series Imputation. In Proceedings of the 2023 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), Rhodes Island, Greece, 4–10 June 2023; pp. 1–5. [Google Scholar] [CrossRef]
Yin, K.; Feng, L.; Cheung, W.K. Context-aware Time Series Imputation for Multi-analyte Clinical Data. J. Healthc. Inform. Res. 2020, 4, 411–426. [Google Scholar] [CrossRef] [PubMed]
Gal, Y.; Ghahramani, Z. Dropout as a Bayesian Approximation: Representing Model Uncertainty in Deep Learning. In Proceedings of the International Conference on Machine Learning (ICML), New York, NY, USA, 19–24 June 2016; pp. 1050–1059. [Google Scholar]
Lakshminarayanan, B.; Pritzel, A.; Blundell, C. Simple and Scalable Predictive Uncertainty Estimation Using Deep Ensembles. Adv. Neural Inf. Process. Syst. 2017, 30, 6402–6413. [Google Scholar]
Wang, M.; Yang, R.; Chen, X.; Sun, H.; Fang, M.; Montana, G. GOPLan: Goal-Conditioned Offline Reinforcement Learning by Planning with Learned Models. arXiv 2024, arXiv:2310.20025. [Google Scholar]
Hernández-Lobato, J.M.; Adams, R. Probabilistic Backpropagation for Scalable Learning of Bayesian Neural Networks. In Proceedings of the International Conference on Machine Learning (ICML), Lille, France, 6–11 July 2015; pp. 1861–1869. [Google Scholar]
Blundell, C.; Cornebise, J.; Kavukcuoglu, K.; Wierstra, D. Weight Uncertainty in Neural Networks. In Proceedings of the International Conference on Machine Learning (ICML), Lille, France, 6–11 July 2015; pp. 1613–1622. [Google Scholar]
Gal, Y.; Hron, J.; Kendall, A. Concrete Dropout. Adv. Neural Inf. Process. Syst. 2017, 30, 3581–3590. [Google Scholar]
Osband, I.; Blundell, C.; Pritzel, A.; Van Roy, B. Deep Exploration via Bootstrapped DQN. Adv. Neural Inf. Process. Syst. 2016, 29, 4026–4034. [Google Scholar]
Amini, A.; Schwarting, W.; Soleimany, A.; Rus, D. Deep Evidential Regression. Adv. Neural Inf. Process. Syst. 2020, 33, 14927–14937. [Google Scholar]
Ma, X.; Tao, C.; Kuang, K.; Tang, Z.; Zhou, Y. T-TV: Improving Time-Series Imputation with Temporal Variational Autoencoder. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition Workshops (CVPRW), Virtual Event, 19–25 June 2021; pp. 52–61. [Google Scholar] [CrossRef]
Xu, S.; Zhang, S.; Du, B. Learning Conditional Attention for Time-Series Forecasting and Imputation. Neural Netw. 2022, 151, 10–22. [Google Scholar] [CrossRef]
Losing, V.; Hasenjäger, M. A Multi-Modal Gait Database of Natural Everyday-Walk in an Urban Environment. Sci. Data 2022, 9, 473. [Google Scholar] [CrossRef]
Goldberger, A.L.; Amaral, L.A.N.; Glass, L.; Hausdorff, J.M.; Ivanov, P.C.; Mark, R.G.; Mietus, J.E.; Moody, G.B.; Peng, C.-K.; Stanley, H.E. PhysioBank, PhysioToolkit, and PhysioNet: Components of a New Research Resource for Complex Physiologic Signals. Circulation 2000, 101, e215–e220. [Google Scholar] [CrossRef]
Berkelmans, G.F.N.; Read, S.H.; Gudbjörnsdottir, S.; Wild, S.H.; Franzen, S.; van der Graaf, Y.; Eliasson, B.; Visseren, F.L.J.; Paynter, N.P.; Dorresteijn, J.A.N. Population Median Imputation Was Noninferior to Complex Approaches for Imputing Missing Values in Cardiovascular Prediction Models in Clinical Practice. J. Clin. Epidemiol. 2022, 145, 70–80. [Google Scholar] [CrossRef]
Mirza, A.F.; Mansoor, M.; Usman, M.; Ling, Q. A Comprehensive Approach for PV Wind Forecasting by Using a Hyperparameter Tuned GCVCNN-MRNN Deep Learning Model. Energy 2023, 283, 129189. [Google Scholar] [CrossRef]

$Fractalfract 09 00181 g001$

Figure 1. The architecture of the UASA model. On the left are the upstream and the downstream models, which predict missing values and complete downstream tasks such as classification and forecasting. The right side details the upstream model, which receives the time series X and its missing mask matrix M and outputs the time series

\hat{X}

with predicted missing values and the uncertainty map U indicating the uncertainty of every imputation.

Figure 1. The architecture of the UASA model. On the left are the upstream and the downstream models, which predict missing values and complete downstream tasks such as classification and forecasting. The right side details the upstream model, which receives the time series X and its missing mask matrix M and outputs the time series

\hat{X}

with predicted missing values and the uncertainty map U indicating the uncertainty of every imputation.

$Fractalfract 09 00181 g001$

$Fractalfract 09 00181 g002$

Figure 2. Self-attention with partially masked diagonal. It begins by receiving time series data and its missing mask matrix dimension by dimension and uses a linear transformation to generate query Q, key K, and value V vectors. Q is multiplied by K and scaled to obtain attention scores. The PMD strategy assigns

- \infty

to the diagonal attention scores that are computed from missing values. Softmax normalizes the attention scores as attention weights A, which are then multiplied by V to form new feature representations.

Figure 2. Self-attention with partially masked diagonal. It begins by receiving time series data and its missing mask matrix dimension by dimension and uses a linear transformation to generate query Q, key K, and value V vectors. Q is multiplied by K and scaled to obtain attention scores. The PMD strategy assigns

- \infty

to the diagonal attention scores that are computed from missing values. Softmax normalizes the attention scores as attention weights A, which are then multiplied by V to form new feature representations.

$Fractalfract 09 00181 g002$

$Fractalfract 09 00181 g003$

Figure 3. The Illustration of Dataset Features.

$Fractalfract 09 00181 g003$

$Fractalfract 09 00181 g004$

Figure 4. On-site Photos in Data Collection Process.

$Fractalfract 09 00181 g004$

$Fractalfract 09 00181 g005$

Figure 5. Illustration of the classification results in AUST-Gait dataset. The x-axis is the percentage of the dropped values and the y-axis is the score from their corresponding evaluation metrics. The I-shaped lines extending from the top of each bar represent the standard deviation.

$Fractalfract 09 00181 g005$

$Fractalfract 09 00181 g006$

Figure 6. Illustration of the classification results in PhysioNet-2012. The x-axis is the percentage of the dropped values and the y-axis is the score from their corresponding evaluation metrics. The I-shaped lines extending from the top of each bar represent the standard deviation.

$Fractalfract 09 00181 g006$

$Fractalfract 09 00181 g007$

Figure 7. Illustration of prediction results from the AUST-Gait Dataset. The circular charts represent the MSE for different models at various mask values. The segments in each chart correspond to different models, and the values indicate the MSE. The UASA model consistently achieves the lowest error across different mask values.

$Fractalfract 09 00181 g007$

$Fractalfract 09 00181 g008$

Figure 8. Illustration of the forecasting results. The x-axis is the predicted steps and the y-axis shows its mean square error to the ground truth. The three subfigures, from left to right, show the results for the dataset with missing rates of

20 %

,

50 %

, and

65 %

, respectively.

Figure 8. Illustration of the forecasting results. The x-axis is the predicted steps and the y-axis shows its mean square error to the ground truth. The three subfigures, from left to right, show the results for the dataset with missing rates of

20 %

,

50 %

, and

65 %

, respectively.

$Fractalfract 09 00181 g008$

$Fractalfract 09 00181 g009$

Figure 9. AUC-ROC curves for fall detection with varying masking values:

0 %

,

20 %

, and

35 %

. The x-axis shows the FPR, and the y-axis shows the TPR. The UASA model achieves AUC values near 1, indicating superior performance.

Figure 9. AUC-ROC curves for fall detection with varying masking values:

0 %

,

20 %

, and

35 %

. The x-axis shows the FPR, and the y-axis shows the TPR. The UASA model achieves AUC values near 1, indicating superior performance.

$Fractalfract 09 00181 g009$

$Fractalfract 09 00181 g010$

Figure 10. Confusion matrices for fall detection with masking values of

0 %

,

20 %

, and

35 %

. The x-axis and y-axis denote predicted and actual classes. The UASA model shows low false alarm rates, demonstrating high reliability in early fall detection.

Figure 10. Confusion matrices for fall detection with masking values of

0 %

,

20 %

, and

35 %

. The x-axis and y-axis denote predicted and actual classes. The UASA model shows low false alarm rates, demonstrating high reliability in early fall detection.

$Fractalfract 09 00181 g010$

Table 1. Mathematical Notations Used in This Article.

Notation	Description
$X_{t}$	The t-th step of time series X.
$X_{t, d}$	The d-th dimension of $X_{t}$ .
$M_{t, d}$	Missing mask matrix, indicating if $X_{t, d}$ is observed.
$\hat{X}$	Output of the upstream model given X.
${\tilde{X}}^{k}$	Output of the k-th upstream model inference given partially masked X.
$U_{t, d}$	Uncertainty matrix, indicating the uncertainty of the imputed value of $X_{t, d}$ .
$I_{t, d}$	Matrix indicating if $X_{t, d}$ is observed but artificially masked as unobserved.
$O_{t, d}$	Matrix indicating if $X_{t, d}$ is observed and unmasked.

Table 2. Comparison of different models on the Multi-Modal Gait Database.

Model	ROC-AUC	PR-AUC	F1-SCORE
UASA	98.7 ± 0.9%	58.5 ± 1.3%	49.3 ± 0.9%
SAITS	98.0 ± 1.2%	57.7 ± 1.5%	48.2 ± 1.1%
CATSI	97.8 ± 1.5%	57.5 ± 1.4%	47.8 ± 1.3%
Transformer	96.3 ± 1.0%	56.4 ± 0.8%	46.7 ± 1.6%
Median	95.1 ± 1.1%	55.2 ± 0.9%	44.4 ± 1.4%
M-RNN	94.5 ± 1.2%	54.1 ± 1.2%	42.9 ± 1.4%

Table 3. Ablation Study Results.

Component	Task Type	Accuracy	ROC-AUC	F1-SCORE
Full Model	Classification	97.7 ± 0.9%	99.5 ± 0.8%	94.1 ± 1.2%
Without PMD Mechanism	Classification	92.5 ± 1.1%	94.3 ± 1.1%	90.2 ± 0.9%
Without Uncertainty Map	Classification	94.5 ± 1.0%	96.6 ± 0.9%	91.8 ± 1.3%
Reduced Attention Heads	Classification	96.5 ± 1.2%	97.4 ± 1.5%	92.4 ± 1.1%

Table 4. Training Strategy Performance Analysis.

Strategy	Classification Task Accuracy	Prediction Task MSE
End-to-End	97.7 ± 0.9%	0.74
Train Separately	95.4 ± 1.1%	0.87

Table 5. Sensitivity to hyperparameter K.

	K = 3	K = 5	K = 10	K = 20
Classification Accuracy	96.5%	97.2%	97.6%	97.7%
Forecasting Average MSE	0.82	0.79	0.75	0.74

Table 6. Sensitivity to hyperparameter

α_{1}

and

α_{2}

in classification task (50% value missed).

Table 6. Sensitivity to hyperparameter

α_{1}

and

α_{2}

in classification task (50% value missed).

	$α_{1} = 2$	$α_{1} = 5$	$α_{1} = 10$
$α_{2} = 2$	96.8%	97.0%	95.2%
$α_{2} = 5$	96.9%	96.8%	94.3%
$α_{2} = 10$	95.0%	94.5%	92.1%

Table 7. Sensitivity to hyperparameter

α_{1}

and

α_{2}

in classification task (30% value missed).

Table 7. Sensitivity to hyperparameter

α_{1}

and

α_{2}

in classification task (30% value missed).

	$α_{1} = 2$	$α_{1} = 5$	$α_{1} = 10$
$α_{2} = 2$	96.2%	97.3%	97.5%
$α_{2} = 5$	97.1%	97.4%	97.7%
$α_{2} = 10$	96.8%	97.6%	97.3%

Disclaimer/Publisher’s Note: The statements, opinions and data contained in all publications are solely those of the individual author(s) and contributor(s) and not of MDPI and/or the editor(s). MDPI and/or the editor(s) disclaim responsibility for any injury to people or property resulting from any ideas, methods, instructions or products referred to in the content.

© 2025 by the authors. Licensee MDPI, Basel, Switzerland. This article is an open access article distributed under the terms and conditions of the Creative Commons Attribution (CC BY) license (https://creativecommons.org/licenses/by/4.0/).

Share and Cite

MDPI and ACS Style

Li, J.; Wang, C.; Su, W.; Ye, D.; Wang, Z. Uncertainty-Aware Self-Attention Model for Time Series Prediction with Missing Values. Fractal Fract. 2025, 9, 181. https://doi.org/10.3390/fractalfract9030181

AMA Style

Li J, Wang C, Su W, Ye D, Wang Z. Uncertainty-Aware Self-Attention Model for Time Series Prediction with Missing Values. Fractal and Fractional. 2025; 9(3):181. https://doi.org/10.3390/fractalfract9030181

Chicago/Turabian Style

Li, Jiabao, Chengjun Wang, Wenhang Su, Dongdong Ye, and Ziyang Wang. 2025. "Uncertainty-Aware Self-Attention Model for Time Series Prediction with Missing Values" Fractal and Fractional 9, no. 3: 181. https://doi.org/10.3390/fractalfract9030181

APA Style

Li, J., Wang, C., Su, W., Ye, D., & Wang, Z. (2025). Uncertainty-Aware Self-Attention Model for Time Series Prediction with Missing Values. Fractal and Fractional, 9(3), 181. https://doi.org/10.3390/fractalfract9030181

Article Menu

Uncertainty-Aware Self-Attention Model for Time Series Prediction with Missing Values

Abstract

1. Introduction

2. Related Work

2.1. Time Series Prediction

2.2. Time Series Imputation

2.3. Uncertainty Estimation

3. Methods

3.1. Problem Setup

3.2. UASA Upstream Model

3.2.1. Self-Attention with Partially Masked Diagonal

3.2.2. Model Forward Propagation

3.3. UASA Downstream Model

3.4. Learning Objective

4. Experimental Results

4.1. Dataset Description

4.1.1. AUST-Gait Dataset

4.1.2. PhysioNet-2012 Dataset

4.1.3. Multi-Modal Gait Database

4.2. Baselines

4.3. Classification Tasks

4.4. Forecasting Tasks

4.5. Fall Detection Tasks

4.6. Ablation Studies

4.6.1. Key Model Components

4.6.2. Training Strategy Comparison

4.7. Sensitivity to Hyperparameters

5. Conclusions and Discussions

Author Contributions

Funding

Institutional Review Board Statement

Data Availability Statement

Acknowledgments

Conflicts of Interest

References

Share and Cite

Article Metrics

Article Access Statistics

Further Information

Guidelines

MDPI Initiatives

Follow MDPI