1. Introduction
The volume of time-series data is rapidly growing, with applications in a wide variety of domains. Considerable developments have been noted in several fields, such as signal processing and machine learning [1,2,3,4]. Recently, deep learning models for time-series data have demonstrated remarkable performance [5,6,7,8,9,10].
Most of these models adopt a supervised learning approach, which requires a massive amount of data with high-quality annotations. Therefore, we explore a time-series unsupervised learning approach to tackle this data acquisition problem.
Unsupervised learning attempts to identify meaningful generalized properties from unlabeled data and has recently attracted significant attention, particularly in computer vision. Contrastive learning is prominent among the various unsupervised learning methods [11,12,13,14,15,16,17]. In addition, recent attempts have been made to remove negative pairs, which are a burden in contrastive learning [15,18].
However, unsupervised learning with time-series data has not been studied as extensively as in computer vision, and some challenges remain in existing methods. Most time-series data are unpredictable and nonstationary [19,20]; thus, existing methods are limited with regard to extracting meaningful generalized properties.
Unsupervised learning-based time-series models can be broadly categorized into two approaches: those that learn inter-sample modality representations [21,22] and those that learn intra-temporal modality representations [23,24]. An inter-sample modality representation derives relationships between two samples. In contrast, an intra-temporal modality representation derives features according to time within the same sample.
Most previous studies focused on training a specific modality representation. In addition, contrastive learning requires careful treatment when collecting proper negative pairs.
Therefore, in this paper, we propose the Bootstrap Inter–Intra Modality at Once (BIMO) method, an unsupervised learning method for multivariate time series that simultaneously explores inter–intra modality representations without negative pairs. The proposed BIMO method comprises three neural networks: the main network and two auxiliary networks (i.e., the inter-auxiliary and intra-auxiliary networks). These three networks interact and learn from each other.
From given raw time-series data, two transformed samples are generated using an augmentation strategy: (1) the input to the main network and (2) the input to the inter-auxiliary network. A third sample, the input to the intra-auxiliary network, is generated from the main network's input using a subsampling strategy. The main network simultaneously predicts the representations produced by the two auxiliary networks. The proposed BIMO method learns the complementary properties of both modalities efficiently and simultaneously by dynamically adjusting the weight of each auxiliary network.
We measured the performance of the learned representation with various datasets to validate the generalizability of the proposed method. Here, we used the univariate UCR datasets [25], which are well-known time-series datasets. We showed that the proposed BIMO method is universal, comparable to state-of-the-art (SOTA) time-series supervised methods, and superior to previous time-series unsupervised methods.
We also evaluated the performance of the proposed method on the multivariate UEA datasets [26]. Here, we found that the proposed BIMO method is suitable for representation learning with multivariate time-series data. We then used a real-world wearable stress and affect detection (WESAD) dataset to demonstrate the noise robustness of the proposed BIMO method.
Our primary contributions are summarized as follows. (1) We propose a simple unsupervised learning method for time series that trains the main network using two auxiliary networks while exploring inter–intra modality representations simultaneously. (2) We remove the constraint of negative pairs from contrastive learning-based time-series analysis. (3) We present comprehensive analyses of extracting robust features, considering inter–intra modality representations, from the unsupervised learning perspective of time-series data. (4) We use various datasets to verify that the proposed BIMO method is universal, robust against noise, and outperforms contemporary SOTA methods.
2. Materials and Methods
BIMO’s goal is to be easily usable in downstream tasks by discovering the most significant modalities for representation learning in all domains of time-series data. This study was inspired by existing SOTA contrastive learning-based unsupervised learning methods [15,23,27].
As shown in Figure 1, the proposed BIMO method consists of the main network and two auxiliary networks. The main network consists of an encoder, a projector, and a predictor, and each auxiliary network comprises an encoder and a projector. The main network learns to match the distribution of its predictor's output with the outputs of the respective projectors of the two auxiliary networks.
Simultaneously learning both inter- and intra-modality representations is a significant challenge. We trained the proposed BIMO method to learn inter–intra modality representations efficiently and stably based on the fundamental concept that high-level features comprise low-level and intermediate-level features [28].
An overview of the training process of the proposed BIMO method is given in Algorithm 1. The complexity of the proposed BIMO method is , while the complexity of USRL, an existing SOTA method, is at least .
During training, we first applied a hard constraint to the inter-auxiliary network so that the model learns sufficient low-level coarse information, i.e., the time characteristics within samples, from the intra-auxiliary network. As the number of epochs increased, we gradually shifted the hard constraint from the inter-auxiliary network to the intra-auxiliary network. Consequently, the proposed BIMO method then sufficiently learns fine-grained features, i.e., the correlation between two augmented samples, from the inter-auxiliary network.
Therefore, BIMO learns low-level features sufficiently at the initial training step, and gradually learns high-level features.
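The gradual shift described above can be written as a simple schedule. The linear form and the `loss_weights` name below are our own illustrative assumptions, since the paper's exact schedule is not shown here:

```python
def loss_weights(epoch, num_epochs):
    """Return (inter_weight, intra_weight) for a given epoch.

    Early epochs weight the intra-auxiliary loss heavily (low-level,
    within-sample temporal features); later epochs shift the weight to
    the inter-auxiliary loss (correlation between two augmented views).
    A linear schedule is assumed for illustration only.
    """
    alpha = epoch / max(1, num_epochs - 1)  # grows from 0.0 to 1.0
    inter_weight = alpha
    intra_weight = 1.0 - alpha
    return inter_weight, intra_weight
```

Any monotone schedule would realize the same idea; the essential property is that the intra term dominates at the start of training and the inter term dominates at the end.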
Algorithm 1 BIMO’s training procedure
Input: time-series training set of size N, number of epochs M
Output: trained main network
1: initialize the weights of the three networks
2: m ← 0
3: repeat
4:   for each sample in the training set do
5:     generate two views v and v′ from different augmentations t and t′
6:     select a random subsample length
7:     extract a subsample among the subseries of v of the selected length
8:     forward v through the main network
9:     forward v′ through the inter-auxiliary network
10:    forward the subsample through the intra-auxiliary network
11:    compute the inter loss from the main network’s prediction and the inter-auxiliary projection
12:    compute the intra loss from the main network’s prediction and the intra-auxiliary projection
13:    compute the total loss as an epoch-dependent weighted sum of the inter and intra losses
14:    update the main network’s weights using the gradient of the total loss
15:    update the auxiliary networks’ weights using an exponential moving average of the main network’s weights
16:   end for
17:   m ← m + 1
18: until m = M
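The loss computation (lines 11–13) and the two weight updates (lines 14–15) of Algorithm 1 can be sketched as follows. This is a minimal NumPy illustration under our own naming (`normalized_mse`, `ema_update`), not the paper's implementation:

```python
import numpy as np

def l2_normalize(z, eps=1e-12):
    """Scale a vector to unit length (the normalization applied to all outputs)."""
    return z / (np.linalg.norm(z) + eps)

def normalized_mse(prediction, target):
    """Squared distance between l2-normalized vectors (equals 2 - 2*cosine)."""
    p, t = l2_normalize(prediction), l2_normalize(target)
    return float(np.sum((p - t) ** 2))

def total_loss(pred, proj_inter, proj_intra, inter_weight):
    """Epoch-dependent weighted sum of the inter and intra losses (line 13)."""
    return (inter_weight * normalized_mse(pred, proj_inter)
            + (1.0 - inter_weight) * normalized_mse(pred, proj_intra))

def ema_update(aux_weights, main_weights, tau):
    """Move the auxiliary weights slowly toward the main-network weights (line 15)."""
    return {name: tau * aux_weights[name] + (1.0 - tau) * main_weights[name]
            for name in aux_weights}
```

Only the main network receives gradients; the auxiliary networks are updated solely through the exponential moving average, which is what prevents collapse without negative pairs.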
2.1. BIMO’s Components
Given time-series data, where N is the volume of data, each sample comprises T ordered real values.
The proposed BIMO method consists of three networks, each of which uses its own set of weights.
A sample x generates two augmented views by applying two different augmentations (5). For the augmentation strategy, we employ a magnitude domain augmentation method, which transforms the values of the time-series data, and a time domain augmentation method, which transforms the time-series data sequence. Here, v is the input of the main network, and the other view is the input of the inter-auxiliary network; the input of the intra-auxiliary network is subsampled from v (6–7), and M is the number of epochs.
2.2. Training Details
We first forward the three generated samples (8–10). The main and inter-auxiliary networks learn representations from samples generated from the same time-series data with different augmentations. Therefore, the proposed BIMO method learns to match the distribution of the main network's predictor output with that of the inter-auxiliary network's projector output.
The input of the intra-auxiliary network is a subsample of the input of the main network. Hence, the two samples are highly likely to have similar distributions since they cover similar periods. The proposed BIMO method thus also learns to match the distribution of the main network's predictor output with that of the intra-auxiliary network's projector output.
First, we train the main network with the intra-auxiliary network at a high ratio and the inter-auxiliary network at a low ratio to learn the low-level coarse information at an early stage, based on the fundamental principles of deep learning [28]. Then, we gradually decrease the ratio of the intra-auxiliary network and increase the ratio of the inter-auxiliary network in every epoch. We only minimize the loss function with respect to a single set of weights, those of the main network, in each training step. The other weights, i.e., those of the two auxiliary networks, prevent network collapse by being updated with a slowly moving average of the main network's weights.
The outputs of the main network, the inter-auxiliary network, and the intra-auxiliary network are each ℓ2-normalized. Thus, the training objective aims to minimize the difference between the main network's normalized prediction and each auxiliary network's normalized projection. Losses are defined as follows:
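The loss equations themselves were lost in typesetting. A reconstruction consistent with the surrounding description (BYOL-style squared distances between ℓ2-normalized vectors and an epoch-dependent mixing weight; the symbols $\bar{z}$, $\bar{z}'$, and $\alpha_m$ are our own notation, not necessarily the paper's) would read:

```latex
% Inter loss: normalized main-network prediction vs. normalized
% inter-auxiliary projection (Equation (1))
\mathcal{L}_{\mathrm{inter}} = \left\lVert \bar{z} - \bar{z}'_{\mathrm{inter}} \right\rVert_2^2
% Intra loss: normalized main-network prediction vs. normalized
% intra-auxiliary projection (Equation (2))
\mathcal{L}_{\mathrm{intra}} = \left\lVert \bar{z} - \bar{z}'_{\mathrm{intra}} \right\rVert_2^2
% Total loss: epoch-dependent weighted sum, with \alpha_m shifting weight
% from the intra term toward the inter term as the epoch m grows (Equation (3))
\mathcal{L} = \alpha_m \, \mathcal{L}_{\mathrm{inter}} + (1 - \alpha_m) \, \mathcal{L}_{\mathrm{intra}}
```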
Equations (1) and (2) represent the inter and intra losses, respectively, and Equation (3) represents the total loss. In Equation (3), the inputs v and the other augmented view are exchanged to symmetrize the losses, where m denotes a training epoch.
2.3. Architecture and Optimization
A time-series model must accommodate varying lengths and be efficient in terms of time and memory, as such data are often updated in real time. Thus, we used a dilated causal convolution network [23,29,30] as a backbone to fulfil these requirements.
The dilated causal convolution network comprises 20 layers, each of which exponentially increases the dilation parameter (2^i for the i-th layer). We employ an adaptive max-pooling layer as the last layer to squeeze the temporal dimension and output a vector of fixed size. Here, the representation r is projected by a multilayer perceptron (MLP) comprising two layers, and the projection p is forwarded into another MLP with the same structure. We used output dimensions of 512 and 320 for the first and second layers of the MLPs, respectively. For the auxiliary networks, we began with an initial exponential moving average parameter and increased it to 1 during training.
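To illustrate, a causal dilated 1-D convolution (the core operation of this backbone) can be sketched in NumPy as follows. The kernel convention, where the last tap multiplies the current time step, is an illustrative choice:

```python
import numpy as np

def causal_dilated_conv1d(x, kernel, dilation):
    """Causal 1-D convolution: output[t] depends only on x[t], x[t-d], x[t-2d], ...

    x: (T,) input sequence; kernel: (K,) filter taps (kernel[-1] multiplies
    the current time step); dilation: gap d between taps. The input is
    left-padded with zeros so the output keeps length T and never looks at
    future time steps.
    """
    K = len(kernel)
    pad = dilation * (K - 1)
    xp = np.concatenate([np.zeros(pad), x])
    out = np.zeros(len(x), dtype=float)
    for t in range(len(x)):
        # taps at original positions t, t - d, ..., t - d*(K-1)
        taps = xp[t : t + pad + 1 : dilation]
        out[t] = float(np.dot(taps, kernel))
    return out
```

Stacking such layers with exponentially increasing dilation grows the receptive field exponentially with depth, which is what lets a 20-layer stack cover very long sequences at modest cost.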
3. Results and Discussion
We performed classification tasks to evaluate the proposed BIMO method's validity in representation learning. We used typical time-series datasets: the univariate UCR datasets [25] and the multivariate UEA datasets [26]. We also used a public wearable dataset, the WESAD dataset [31], to validate BIMO's robustness against noisy data. The encoder was trained on an unlabeled training set, and the learned encoder was used to perform a classification task. In addition, we trained a simple single-layer linear classifier on a labeled training set [32,33,34].
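The linear evaluation protocol above can be sketched as follows; ridge regression to one-hot targets serves here as a minimal stand-in for the single-layer linear classifier (the paper's actual classifier and hyperparameters may differ):

```python
import numpy as np

def fit_linear_probe(features, labels, n_classes, reg=1e-3):
    """Fit a single linear layer on frozen encoder features via ridge
    regression to one-hot targets. `reg` is an illustrative default."""
    Y = np.eye(n_classes)[labels]                           # one-hot targets
    F = np.hstack([features, np.ones((len(features), 1))])  # append bias column
    W = np.linalg.solve(F.T @ F + reg * np.eye(F.shape[1]), F.T @ Y)
    return W

def predict_linear_probe(W, features):
    """Classify by the highest linear score."""
    F = np.hstack([features, np.ones((len(features), 1))])
    return np.argmax(F @ W, axis=1)
```

The encoder is never updated during this step; only the linear layer sees the labels, so accuracy reflects the quality of the unsupervised representation.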
3.1. Implementation
Sample Generation: Time-series augmentation can be divided into magnitude-based and time-based methods. In this study, we used a time-series augmentation set comprising the magnitude-based magnitude-warping and scaling methods and the time-based time-slicing and time-warping methods [35,36].
The time-series subsampling strategy is based on the literature [23]. We randomly extracted a part of each sample by selecting a length and a starting point. We selected different lengths and starting points in each epoch and trained with various lengths of subsamples to learn a sufficient intra-temporal modality representation.
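A minimal sketch of one magnitude-based augmentation and the random subsampling strategy follows; the scaling range and minimum subsample length are illustrative assumptions, not the paper's settings:

```python
import numpy as np

def scale_augment(x, rng, low=0.7, high=1.3):
    """Magnitude-domain augmentation: multiply the whole series by a random
    factor. The (low, high) range is an illustrative choice."""
    return x * rng.uniform(low, high)

def random_subsample(x, rng, min_len=2):
    """Extract a contiguous subseries with a random length and starting
    point, as used for the intra-auxiliary network's input."""
    T = len(x)
    length = rng.integers(min_len, T + 1)    # random subsample length
    start = rng.integers(0, T - length + 1)  # random starting point
    return x[start : start + length]
```

Redrawing the length and start in every epoch exposes the intra-auxiliary network to many different temporal contexts of the same sample.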
Encoder Selection: A time-series encoder must consider temporal information, accommodate unequal lengths, and be efficient in terms of both time and memory. Note that deep convolutional neural networks (CNNs) do not consider temporal information and are difficult to apply to data of various lengths, and long short-term memory (LSTM) is inefficient in terms of time and memory. Thus, we used exponentially dilated causal convolutions to handle these issues [23,29,30].
To verify the conformity of our encoder selection, we measured the classification performance on the UCR datasets using dilated causal convolutions, ResNet, and a two-layer LSTM encoder. Each model outperformed the other two on 65%, 35%, and 5% of the first 20 UCR datasets, respectively. This result confirmed that the encoder with dilated causal convolutions was the most suitable for the proposed BIMO method. The accuracy results are detailed in Table 1.
3.2. Univariate Time Series
We validated the proposed BIMO method's performance using the 85 initially released UCR datasets, which are representative univariate time-series datasets [25]. (1) We compared the BIMO method's performance with that of existing SOTA unsupervised models and (2) with that of existing SOTA supervised models, and (3) we compared performance depending on combinations of the auxiliary networks.
Overall Performance: We compared the proposed BIMO method with unsupervised models for time series, i.e., USRL (which utilizes triplet loss) [23], DTW (which employs a kernel-based estimation method) [37], and RWS (which uses a similarity matrix) [38], as shown in Table 2.
We also compared BIMO with supervised models, i.e., PF (which uses a decision tree ensemble) [39], BOSS (which employs a dictionary-based classifier) [5], InceptionTime (ITime) [7], and HIVE-COTE (which uses ensemble methods) [8]. As shown in Figure 2, we compared performance based on the average rank according to the accuracy results on the UCR datasets. All accuracy results are detailed in Table 2.
For the unsupervised models, the proposed BIMO method obtained the best rank scores: 3.71, 3.91, and 6.11 for BIMO, USRL, and DTW, respectively. For the supervised models, BIMO showed the third-highest score: 2.41, 2.52, 3.71, 3.73, and 3.91 for HIVE-COTE, ITime, BIMO, BOSS, and PF, respectively. These results demonstrate that BIMO is superior to existing SOTA unsupervised models and comparable to well-known supervised models.
Inter–Intra Modality Representation Ablation: We compared performance depending on the combination of auxiliary networks based on the average rank according to the accuracy results on the UCR datasets. We used a single auxiliary network, i.e., the inter-auxiliary or intra-auxiliary network alone, and multiple auxiliary networks, i.e., both the inter-auxiliary and intra-auxiliary networks. As shown in Table 3, we compared the performance in terms of the average rank score. More detailed overall accuracy results are shown in Table 4.
Given multiple auxiliary networks, we employed the static and dynamic loss functions. During training, the static loss function had an equal ratio of inter-auxiliary and intra-auxiliary networks (Inter and Intra). The dynamic loss function had different ratios of the inter- and intra-auxiliary networks for every epoch. Herein, the main network was initially trained with the inter-auxiliary network at a higher ratio than that of the intra-auxiliary network. Then, the ratio of the intra-auxiliary network was increased gradually (Inter ↦ Intra). In contrast, the main network was trained with the intra-auxiliary network in a higher ratio than that used for the inter-auxiliary network at first; gradually, the ratio of the inter-auxiliary network was increased (Intra ↦ Inter), which is the training method of BIMO.
As shown in Table 3, the Intra ↦ Inter method obtained the best rank score. We confirmed that the initial training sufficiently learned the intra-modality representations, which are relatively low-level features, followed by the inter-modality representations, which are relatively high-level features. The proposed dynamic training method allowed the main network to learn both modality representations evenly.
Representation Metric Space: We also validated the performance of representation learning on some UCR datasets using embedding visualization with dimensionality reduction. The results are shown in Figure 3.
3.3. Multivariate Time Series
We validated the performance of BIMO on the UEA datasets. Here, we compared the performance of BIMO with that of USRL and DTW. The accuracy results are shown in Table 5. The BIMO, USRL, and DTW models showed the best accuracies on approximately 50%, 32%, and 18% of the datasets, respectively. Overall, BIMO's performance is comparable to that of SOTA unsupervised models for multivariate time series.
3.4. Robustness to Noisy Data
Most real-world time-series data contain some noise. Typically, the photoplethysmogram (PPG) signal, which is also referred to as the blood volume pulse, contains substantial noise. A PPG signal is simple and highly useful in daily life since it can be easily measured from the wrist. However, it is difficult to use in an end-to-end deep learning model because it is susceptible to internal and external noise from the measurement environment [40,41]. Therefore, most existing PPG-based studies have focused on signal processing and feature engineering [4,31,42,43,44].
In this study, we validated the noise robustness of BIMO, an end-to-end deep learning model, using noisy PPG signals from the WESAD dataset [31]. The WESAD dataset is labeled with four emotional states: baseline, stress, amusement, and meditation. We performed a classification task, stress versus nonstress, with leave-one-subject-out cross-validation, where nonstress is defined by combining the baseline and amusement states [31].
We compared the performance of BIMO with existing SOTA supervised learning models for PPG: a weak feature engineering method [31] and a strong feature engineering method named OMDP [4]. The weak feature engineering-based method uses a peak detection algorithm from which simple statistical features are computed. OMDP employs a two-step signal processing method in both the time and frequency domains and an ensemble-based peak detection method; it extracts diverse features from the detected peaks.
As a result, we found that BIMO outperformed the supervised learning methods (Table 6), indicating that BIMO is at least comparable to previous SOTA models. This is a meaningful result, since it demonstrates that unsupervised, end-to-end, data-driven feature learning is possible even for noisy time-series data.