Entropy
  • Article
  • Open Access

17 August 2021

PFC: A Novel Perceptual Features-Based Framework for Time Series Classification

1 College of Computer Science and Technology, Harbin Institute of Technology, Shenzhen 518055, China
2 College of Computer Science and Technology, Harbin Institute of Technology, Harbin 150001, China
* Author to whom correspondence should be addressed.

Abstract

Time series classification (TSC) is a significant problem in data mining with applications in many different domains. Mining distinguishing features is the primary approach. One promising direction is algorithms based on the morphological structure of time series, which are interpretable and accurate. However, existing structural feature-based algorithms, such as time series forest (TSF) and shapelet-based methods, traverse all features through many random combinations, which means that a lot of training time and computing resources are required to filter out meaningless features, and important distinguishing information may be ignored. To overcome this problem, in this paper, we propose a perceptual features-based framework for TSC. We are inspired by how humans observe time series and realize that there are usually only a few essential points that need to be remembered for a time series. Although a complex time series has many details, a small number of data points is enough to describe the shape of the entire sample. First, we use improved perceptually important points (PIPs) to extract key points and use them as the basis for time series segmentation to obtain a combination of interval-level and point-level features. Second, we propose a framework to explore the effects of perceptual structural features combined with decision trees (DT), random forests (RF), and gradient boosting decision trees (GBDT) on TSC. The experimental results on the UCR datasets show that our work achieves leading accuracy, which is instructive for follow-up research.

1. Introduction

In the information age, massive amounts of data have been generated over time, and these data are closely related to many studies. In mathematics, a time series is a series of data points indexed in time order. Most commonly, a time series [1] is a sequence taken at successive, equally spaced points in time. A time series carries information in both the time dimension and the value dimension, and time series exist in many fields such as economics, life science, military science, space science, geology and meteorology, and industrial automation. Time series classification [2,3,4] is an essential task that has attracted widespread attention. Normally, time series classification refers to assigning time series patterns to a specific category, for example, judging whether it will rain through a series of temperature data [5] or determining whether a patient has Parkinson's disease through a period of physiological data [6,7]. Dau et al. [8] proposed the UCR Time Series Classification Archive (UCR) for this task, including 128 datasets from different fields such as ECG, Sensor, and Image. In order to understand TSC more intuitively, Figure 1 shows some representative datasets in UCR. These datasets almost cover the existing TSC tasks, show the morphological structure of various time series, and lay the foundation for researchers to explore general classification methods. To solve this problem, many methods have been proposed, which can be divided into five categories according to their cores: dictionary-based, distance-based, interval-based, shapelet-based, and kernel-based.
Figure 1. Three representative datasets from the UCR Time Series Classification Archive. Due to the large size of the original dataset, only some samples are shown as examples.
The dictionary-based method borrows the ideas of natural language processing: researchers regard a time series as a special sentence composed of discrete characters or words. How to segment and map the time series into characters is the first issue that needs to be considered. There are three main time series symbolization methods: Piecewise Aggregate Approximation (PAA) [9,10], Symbolic Aggregate approXimation (SAX) [11,12], and Symbolic Fourier Approximation (SFA) [13]. Subsequently, the Bag-of-SFA Symbols (BOSS) method, based on the bag-of-words model, was proposed [14]. This method records high-frequency symbol features and uses them to distinguish different types of time series samples. Middlehurst et al. [15] and Large et al. [16] further proposed Contract BOSS (cBOSS) and Spatial BOSS (S-BOSS). In addition, Word Extraction for Time Series Classification (WEASEL) [17] is also a typical dictionary-based method, composed of a supervised symbolic time series representation for discriminative word generation and the Bag of Patterns (BOP) [18] model for building a discriminative feature vector.
Many TSC methods focus on the distance between time series. Generally, a time series can be regarded as a point in a multi-dimensional space, where the dimension of the space equals the length of the time series. Different types of time series form different aggregations, and distance is then an effective way to distinguish them. K-Nearest Neighbors (KNN) and the Elastic Ensemble (EE) [19] are two commonly used methods. Lucas et al. [20] proposed Proximity Forest, a forest of decision trees that uses distance measures to partition data. It should be noted that since most distance calculations take a "one-to-one" form, samples of equal length are necessary. For sequences of unequal length, dynamic time warping (DTW) [21,22,23] is a robust calculation method that can accommodate differences in length and shape. Combining KNN and DTW is a way to take advantage of both at the same time [24,25].
In reality, different types of time series may have precisely the same statistical characteristics such as mean, variance, standard deviation, and so on [26]. In order to avoid this problem, the interval-based method focuses on local features rather than overall features. Deng et al. [27] proposed a Time Series Forest (TSF) model that converts time series into statistical features of sub-sequences and uses random forest for classification. Cabello et al. [28] further constructed Supervised Time Series Forest (STSF), an ensemble of decision trees built on intervals selected through a supervised process. Random Interval Spectral Ensemble (RISE) is a popular variant of time series forest [29]. RISE differs from time series forest in two ways. First, it uses a single time series interval per tree. Second, it is trained using spectral features extracted from the series instead of summary statistics. Since RISE relies on frequency information extracted from the time series, it can be defined as a frequency-based classifier.
The shapelet-based method draws inspiration from pattern recognition. Shapelets are defined in [30,31] as "subsequences that are in some sense maximally representative of a class". Informally, assuming a binary classification setting, a shapelet is discriminant if it is present in most series of one class and absent from the series of the other class. However, any subsequence may be distinguishing, and the length of a subsequence is arbitrary, which means that all samples and their subsequences need to be checked through a sliding window, so the search space for shapelets is enormous. In response to this problem, Ji et al. [32,33] proposed a fast shapelet selection algorithm.
Building on the recent success of convolutional neural networks for time series classification, Dempster et al. [34] realized that simple linear classifiers using random convolutional kernels achieve state-of-the-art accuracy with a fraction of the computational expense of existing methods. They therefore proposed ROCKET, a kernel-based time series classification method. This is a new direction for TSC that can both reduce computational complexity and improve accuracy.
By analyzing the five classes of methods, we realized that existing algorithms essentially try to find efficient distinguishing features by learning all the original information of the sample, which leads to high computational complexity and resource consumption. In fact, human beings do not require all the information to distinguish time series; on the contrary, we only pay attention to a few critical data points, which are enough to describe the approximate outline of time series samples and present a significant distribution. This paper proposes a classification framework based on perceptual features, which can extract the support points of the morphological structure from the original time series and further obtain interval-level and point-level features for classifiers such as decision trees. The contributions of our work are described below.
  • An improved algorithm called globally restricted matching perceptually important points (GRM-PIPs) is proposed, which avoids the redundancy caused by sequential matching in traditional important point extraction.
  • How many data points are necessary to describe complete information? We conducted in-depth research on this question and verified our opinions through mathematical proofs and experiments.
  • The data points extracted by GRM-PIPs can divide the time series into sub-sequences similar to shapelets. We used statistical features such as mean, standard deviation, slope, skewness, and kurtosis to enhance discrimination further.
  • Most classifiers learn the information of the original time series, which is not suitable for perceptual features. Therefore, we matched a suitable classifier and proposed a complete perceptual features-based framework.
The remainder of this paper is organized as follows. In Section 2, related work on PIPs, decision trees, random forests, and gradient boosting decision trees is presented. Section 3 describes the details of PFC, including GRM-PIPs, perceptual feature extraction, and classifier adaptation. Section 4 presents the experimental setup and the performance of our approach, together with comparative experiments; a discussion of the differences in experimental results is also given in Section 4. Finally, conclusions and directions for future research are given in Section 5.

3. Perceptual Features-Based Framework

This section introduces the perceptual features-based framework (PFC) in detail, divided into three parts: time series preprocessing with GRM-PIPs, feature extraction, and the classifier. These parts are executed in a fixed order in our framework.

3.1. Time Series Preprocessing with GRM-PIPs

The purpose of this part is to traverse the time series and extract a certain number of PIPs. Based on the traditional PIPs algorithm, we determine a globally optimal selection strategy and propose a restrictive selection method. The relevant definition is as follows.
Definition 2.
Globally Restricted Matching Perceptually Important Points.
Given a time series sample $T = \{(x_1, y_1), (x_2, y_2), \dots, (x_n, y_n)\}$ with $n > 2$, $n \in \mathbb{Z}^+$, an empty list $L_p$ is set up to save the extracted perceptually important points, and the interval between adjacent PIPs is defined as $\delta$, with $\delta \in \mathbb{Z}^+$, $\delta \ge 4$. Commonly, when the number of extracted PIPs $m$ is large enough ($m = n$), all points in $T$ will be identified as PIPs; but if the parameter $\delta$ is considered, the upper limit on the number of PIPs is further restricted. The calculation steps of GRM-PIPs are as follows.
Step 1: Put the first point $P_1(x_1, y_1)$ and the last point $P_n(x_n, y_n)$ of $T$, as the initial two PIPs, into $L_p$.
Step 2: Set a temporary PIP $P_t$, which can be any point in $T$, and calculate the vertical distance $VD_t$ between $P_t$ and the line $\overline{P_1 P_n}$. $P_t$ divides the sequence $\{P_1, \dots, P_n\}$ into two subsequences, $\{P_1, \dots, P_t\}$ and $\{P_t, \dots, P_n\}$. If the length of either subsequence is less than $\delta$, the current $P_t$ should not be considered, and a new point is set as $P_t$ to continue the calculation, until a $P_t$ is found that maximizes $VD_t$ while the length of every subsequence is no less than $\delta$; this $P_t$ is then saved in $L_p$ as the third PIP.
Step 3: The fourth PIP is the point that maximizes the vertical distance to the line connecting its adjacent PIPs (either the first and the third, or the third and the second PIP) while keeping the length of all segmented subsequences no less than $\delta$. The fourth PIP is likewise saved into $L_p$.
Step 4: For each new PIP, use the same recursive method as for the fourth PIP; repeat Step 3 until the length of $L_p$ equals $m$.
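The steps above can be sketched in Python. This is a minimal illustration, not the authors' implementation; in particular, we read the $\delta$-restriction as "every subsequence between adjacent PIPs must contain at least $\delta$ points", so adjacent PIP indices must differ by at least $\delta - 1$.

```python
import math

def vertical_distance(p, a, b):
    """Perpendicular distance from point p to the line through points a and b."""
    (x1, y1), (x2, y2), (x0, y0) = a, b, p
    num = abs((y2 - y1) * x0 - (x2 - x1) * y0 + x2 * y1 - y2 * x1)
    den = math.hypot(y2 - y1, x2 - x1)
    return num / den

def grm_pips(series, m, delta=4):
    """Sketch of GRM-PIPs: repeatedly add the admissible point with the
    largest vertical distance to the line through its adjacent PIPs,
    subject to every resulting subsequence having at least delta points."""
    n = len(series)
    pts = list(enumerate(series))          # (x, y) pairs with unit-spaced x
    pips = [0, n - 1]                      # Step 1: first and last point
    while len(pips) < m:
        best, best_vd = None, -1.0
        ordered = sorted(pips)
        # Steps 2-4: scan every gap between adjacent PIPs for candidates
        for left, right in zip(ordered, ordered[1:]):
            for t in range(left + 1, right):
                # delta-restriction on both resulting subsequences
                if t - left + 1 < delta or right - t + 1 < delta:
                    continue
                vd = vertical_distance(pts[t], pts[left], pts[right])
                if vd > best_vd:
                    best, best_vd = t, vd
            # (left, right) gaps shorter than 2*delta - 1 yield no candidate
        if best is None:                   # no admissible candidate remains
            break
        pips.append(best)
    return sorted(pips)
```

Running this on the example series $T$ of Section 3.1 with $m = 7$ and $\delta = 4$ returns seven well-spread PIPs that include the first and last points.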
GRM-PIPs ensures a good distribution of PIPs over the entire time series by adding a restriction on the interval length. A simple example in Figure 5 illustrates the difference between the traditional PIPs algorithm and our proposed GRM-PIPs.
$T = (0, 1, 2, 10, 9, 10, 9, 9, 6, 4, 3, 1, 5, 3, 10, 10, 8, 9, 10, 11, 9, 6, 3, 0) \quad (7)$
Figure 5. GRM-PIPs and the traditional PIPs algorithms were used to extract PIPs from time series sample T.
In this example, we set a time series sample $T$ in (7) with length $n = 24$. Figure 5 shows that the morphological structure of $T$ is composed of two peaks and one trough. Seven PIPs were extracted from it. There is an apparent difference between the results of GRM-PIPs and PIPs, which are highlighted by red and green circles, respectively. The traditional PIPs algorithm easily falls into local optima because it has no interval constraint, and the PIPs it selects do not contribute to the depiction of the overall structure. GRM-PIPs avoids this problem and accurately extracts PIPs that are more conducive to generalizing structural features.
In GRM-PIPs, constrained by the interval length $\delta$, the number of extracted PIPs has an upper limit. In order to calculate this upper limit, we first define the "quotient". Suppose there are two integers $a$ and $b$, $b \ne 0$; then there is a unique pair of integers $q$ and $r$ satisfying $a = q \cdot b + r$ with $0 \le r < |b|$, and $q$ is called the quotient of $a$ divided by $b$, abbreviated as $q = Q(a, b)$. In this way, the number of extracted PIPs can be bounded as follows:
$2 \le m \le Q(n, \delta) + 2$
Obviously, the upper limit is closely related to $\delta$. In our research, we set $\delta = 4$ because the subsequent feature extraction determines this value; we explain the reason in detail in Section 3.2.
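As a quick check of the bound above (with $Q$ the integer quotient), the upper limit is easy to compute:

```python
def pip_upper_limit(n, delta=4):
    """Upper bound on the number of extractable PIPs: 2 <= m <= Q(n, delta) + 2,
    where Q(n, delta) is the integer quotient of n divided by delta."""
    return n // delta + 2
```

For instance, with $\delta = 4$ a series of length 100 admits at most $Q(100, 4) + 2 = 27$ PIPs, the limit discussed in the experiments of Section 4.2.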

3.2. Feature Extraction

In this paper, we extract two types of features from time series: point-level features $F_P$ and interval-level features $F_I$.
The point-level feature is straightforward: it is simply the coordinates of the PIPs. We found that for different classes of time series, the distributions of PIPs in two-dimensional space are also significantly different. Most importantly, these special distributions are consistent between the training set and the test set. Therefore, the point-level feature is distinctive and consistent and should be taken seriously. Some representative UCR datasets shown in Figure 6 confirm our view.
Figure 6. The distributions of PIPs in two UCR datasets, which are Coffee (a) and ECGFiveDays (b). The figures above show that the PIPs extracted from the original sample are discriminative, while the figures below show that the distribution of PIPs is consistent on the training set and the test set.
On the other hand, PIPs generate an excellent segmentation of the time series. Many datasets show no significant differences in the distribution of PIPs; in such cases, interval-level features need to be supplemented to help the classifier further distinguish samples of different categories. We use five interval-level features:
  • Arithmetic mean. The arithmetic mean (or simply mean) $\bar{x}$ of a sequence is the sum of all of the amplitudes divided by the length of the sequence $n$. This is a rough feature used to describe the average level of all data in the sequence. The arithmetic mean is calculated as follows.
    $\bar{x} = \frac{1}{n}\sum_{i=1}^{n} x_i = \frac{x_1 + x_2 + \cdots + x_n}{n}$
  • Standard deviation. In statistics, the standard deviation $\sigma$ is a measure of the amount of variation or dispersion of a set of values. A low standard deviation indicates that the values tend to be close to the mean of the set, while a high standard deviation indicates that the values are spread out over a wider range. The standard deviation plays an important role in distinguishing frequently fluctuating series from stably changing series. The calculation of this feature is shown below.
    $\sigma = \sqrt{\frac{1}{n}\sum_{i=1}^{n}\left(x_i - \bar{x}\right)^2}$
  • Slope. In mathematics, the slope or gradient of a line is a number that describes both the direction and the steepness of the line. Slope is calculated as the ratio of the "vertical change" to the "horizontal change" between (any) two distinct points on a line. We can abstract any subsequence as a straight line connecting two adjacent PIPs, and the trend can be judged by calculating the slope of the interval. For a sequence $S = \{(x_1, y_1), \dots, (x_n, y_n)\}$, the slope can be calculated according to the following formula.
    $m = \frac{\Delta y}{\Delta x} = \frac{y_n - y_1}{x_n - x_1}$
  • Kurtosis. In probability theory and statistics, kurtosis is a measure of the "tailedness" of the probability distribution of a real-valued random variable. The standard measure of a distribution's kurtosis is a scaled version of the fourth moment of the distribution. Strictly speaking, kurtosis is not exactly the same as peakedness: higher kurtosis means that the data contain large deviations or extreme abnormal points far from the mean. However, in most cases, when the amplitude over a period of the time series is high, the corresponding kurtosis is also high. In the calculation of kurtosis $G_2$ we use the standard unbiased estimator. It is worth noting that $n$ represents the number of samples and that the formula contains the factor $n - 3$. Since it appears in the denominator, it is required that $n - 3 > 0$, which means that $n$ must be a positive integer greater than 3. This is why we require the parameter $\delta$ to be equal to 4.
    $G_2 = \frac{k_4}{k_2^2} = \frac{n^2\left[(n+1)m_4 - 3(n-1)m_2^2\right]}{(n-1)(n-2)(n-3)} \cdot \frac{(n-1)^2}{n^2 m_2^2} = \frac{(n+1)\,n}{(n-1)(n-2)(n-3)} \cdot \frac{\sum_{i=1}^{n}\left(x_i - \bar{x}\right)^4}{k_2^2} - 3 \cdot \frac{(n-1)^2}{(n-2)(n-3)}$
  • Skewness. In probability theory and statistics, skewness is a measure of the asymmetry of the probability distribution of a real-valued random variable about its mean. Skewness can be visually understood as the degree to which the shape leans to the left or right. For example, in the two sequences shown in Figure 7, $S_2$ is almost a mirror flip of $S_1$, which is an indistinguishable situation for the mean, standard deviation, slope, and kurtosis. The use of skewness makes up for this deficiency. The calculation formula of skewness $G_1$ is similar to that of kurtosis; it is a scaled version of the third central moment.
    $G_1 = \frac{k_3}{k_2^{3/2}} = \frac{n^2}{(n-1)(n-2)} \cdot b_1 = \frac{n^2}{(n-1)(n-2)} \cdot \frac{m_3}{s^3} = \frac{n^2}{(n-1)(n-2)} \cdot \frac{\frac{1}{n}\sum_{i=1}^{n}\left(x_i - \bar{x}\right)^3}{\left[\frac{1}{n-1}\sum_{i=1}^{n}\left(x_i - \bar{x}\right)^2\right]^{3/2}}$
    Figure 7. An instance that cannot be distinguished by features such as mean and standard deviation. The sequence $S_1$ on the left is flipped to obtain the sequence $S_2$ on the right.
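The five interval-level features above can be computed for one subsequence with a few lines of Python. This is a sketch under two stated assumptions: the x-coordinates are unit-spaced (so $\Delta x = n - 1$ for the slope), and skewness and kurtosis use the standard unbiased estimators ($G_1$, $G_2$), whose $n - 3$ factor is the reason $\delta = 4$.

```python
import math

def interval_features(seq):
    """Five interval-level features of a subsequence between adjacent PIPs:
    mean, standard deviation, slope, skewness (G1), excess kurtosis (G2).
    The unbiased G2 estimator divides by (n - 3), hence n > 3 is required."""
    n = len(seq)
    if n <= 3:
        raise ValueError("unbiased kurtosis needs n > 3")
    mean = sum(seq) / n
    dev = [x - mean for x in seq]
    m2 = sum(d ** 2 for d in dev) / n          # biased central moments
    m3 = sum(d ** 3 for d in dev) / n
    std = math.sqrt(m2)
    slope = (seq[-1] - seq[0]) / (n - 1)       # delta-y over delta-x, unit-spaced x
    s2 = sum(d ** 2 for d in dev) / (n - 1)    # unbiased variance k2
    skew = n ** 2 / ((n - 1) * (n - 2)) * m3 / s2 ** 1.5
    kurt = ((n + 1) * n / ((n - 1) * (n - 2) * (n - 3))
            * sum(d ** 4 for d in dev) / s2 ** 2
            - 3 * (n - 1) ** 2 / ((n - 2) * (n - 3)))
    return mean, std, slope, skew, kurt
```

For a symmetric ramp such as `[1, 2, 3, 4, 5]`, the skewness is zero and the slope is 1, which is a convenient sanity check.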

3.3. Classifier and the PFC Framework

In a TSC dataset, the data format is $D = (\mathit{data}, \mathit{label}) = (\{T_1, \dots, T_d\}, \{L_1, \dots, L_d\})$, with $d$ time series and their corresponding labels. We extract $m$ PIPs through GRM-PIPs and obtain $m - 1$ intervals, thereby converting the original dataset into the corresponding feature set $F_D = (F_P, F_I)$. Subsequently, the training portion of $F_D$ is input into the classifier, and the test portion is used for verification.
We realize that $F_D$ is a high-level representation of the raw data: essentially a combination of many features and an explicit expression of morphological information. Therefore, we prefer a classifier that is well suited to feature processing. In the PFC framework, we selected three levels of classifiers: the decision tree as the basic estimator, the random forest based on the bagging idea, and the gradient boosting decision tree based on boosting theory.
There are many ways to implement decision trees, such as ID3, C4.5, and CART. Under normal circumstances, CART performs better than the other methods, so we decided to use CART. The reason for choosing RF and GBDT is that they are classifiers developed from decision trees. RF performs joint learning by constructing a large number of decision trees and integrating all their classification results. RF assigns equal weights to all basic estimators, while GBDT gradually upgrades weak classifiers into a robust classifier by iteratively adjusting weights.
A schematic diagram of the PFC framework is shown in Figure 8. The contribution of our work is to propose GRM-PIPs, extract a combination of point-level and interval-level features, and use a suitable classifier to form a framework for TSC tasks. What we want to explore is the effect of the entire framework. Therefore, we did not make any special optimizations to the classifiers, and all classifiers use traditional implementations. Further improvement of the classifiers is left for future work.
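The classifier stage can be sketched end to end with scikit-learn's stock implementations of the three classifier levels (matching the paper's choice of unoptimized, conventional configurations with 600 trees / boosting stages). The feature matrix here is a synthetic stand-in for $F_D = (F_P, F_I)$, not features from a real UCR dataset.

```python
import numpy as np
from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import RandomForestClassifier, GradientBoostingClassifier

rng = np.random.default_rng(0)

# Synthetic stand-in for the feature set F_D: two classes whose point-level
# coordinates and interval-level statistics are shifted apart.
def toy_features(shift, size):
    return rng.normal(loc=shift, scale=0.2, size=(size, 6))

X_train = np.vstack([toy_features(0.0, 50), toy_features(1.0, 50)])
y_train = np.array([0] * 50 + [1] * 50)
X_test = np.vstack([toy_features(0.0, 20), toy_features(1.0, 20)])
y_test = np.array([0] * 20 + [1] * 20)

# Three levels of classifiers in conventional configurations, as in PFC.
classifiers = {
    "DT": DecisionTreeClassifier(random_state=0),
    "RF": RandomForestClassifier(n_estimators=600, random_state=0),
    "GBDT": GradientBoostingClassifier(n_estimators=600, random_state=0),
}
scores = {name: clf.fit(X_train, y_train).score(X_test, y_test)
          for name, clf in classifiers.items()}
```

On such well-separated toy features, all three classifiers reach near-perfect accuracy; the interesting differences among DT, RF, and GBDT only appear on real feature sets, as discussed in Section 4.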
Figure 8. The schematic diagram of the PFC framework.

4. Performance Evaluation and Discussion

4.1. Experimental Design

The UCR archive has been widely used as a benchmark to evaluate TSC algorithms [8] (see details at http://www.timeseriesclassification.com, accessed on 1 May 2021). It currently contains 128 datasets; 15 of these have unequal lengths, 15 contain missing values, and one (Fungi) has only a single instance per class in the training files. Given this situation, in order to evaluate PFC, we select part of the UCR datasets. Since two-category datasets form a natural group of their own, we divide the verification into two types: two-category and hybrid.
In the two-category verification, we selected all the two-category datasets in the UCR Archive and excluded the two with many missing values. Finally, 40 datasets were used for comparison experiments. Considering that PFC is a fast and straightforward classification method, it would be unfair to compare it with methods that use neural networks and consume substantial computing resources and time. Therefore, we excluded such resource-intensive algorithms, for example ResNet and HIVE-COTE, from the benchmarks. The following five classification algorithms were selected for comparison: word extraction for time series classification (WEASEL), bag of symbolic-Fourier approximation symbols (BOSS), time series forest (TSF), random interval spectral ensemble (RISE), and canonical time-series characteristics (Catch22). The results of these comparison algorithms have been officially recognized and released.
In the hybrid verification, we introduce some recently published methods as comparisons, including extreme-SAX (E-SAX, 2020) [56], interval feature transformation (IFT, 2020) [57], and discriminative virtual sequences learning (DVSL, 2020) [58]. PFC is tested on the same datasets as these methods, including both two-category and multi-category datasets.
In addition, through the analysis of the experimental results, we would find answers to the following questions:
  • What is the appropriate number of PIPs? Is more always better?
  • Does the number of PIPs have the same effect on different classifiers?
All experiments strictly follow UCR's division into training and test sets. Classification accuracy is uniformly adopted as the metric; some methods report classification error, which we convert to accuracy. The number of correctly classified time series is defined as $n_c$, and the total number of time series in the test set is denoted by $n_t$. The calculation formulas for classification accuracy ($ACC$) and error ($ERR$) are shown below.
$ACC = \frac{n_c}{n_t}, \quad ERR = 1 - ACC$
Due to the randomness of RF and GBDT, each final experimental result is an average of 50 runs under the same parameters. At the same time, we do not perform particular parameter optimization for DT, RF, and GBDT. DT uses information gain to measure the quality of a split, and nodes are expanded until all leaves are pure. There are 600 trees in RF, and the number of boosting stages in GBDT is also 600.

4.2. The Verification of Two-Category

The information for the 40 two-category datasets in the UCR Archive is listed in Table 2. These datasets cover various situations, such as short-sequence classification (Chinatown and ItalyPowerDemand), long-sequence classification (HandOutlines, HouseTwenty, and SemgHandGenderCh2), unbalanced training and test sets (ECGFiveDays and FreezerSmallTrain), and so on.
Table 2. Summary of 40 two-category datasets in UCR Archive.
The classification accuracy of the five benchmark methods and PFC on these datasets is shown in Table 3. We found that not all datasets have public results for the five benchmark methods; the results for two datasets (FordB and HandOutlines) are missing. These two datasets were excluded when counting the number of best-accuracy results, so the experimental results of the remaining 38 datasets were considered.
Table 3. Classification accuracy of PFC and five benchmarks on 40 two-category UCR datasets.
The PFC framework achieved the best accuracy on 13 of the 38 UCR datasets. Interestingly, when DT or GBDT is used as the classifier, the best accuracy is obtained 6 times, which is less than the 10 times obtained when RF is used. Nevertheless, their performance is still better than RISE, TSF, and Catch22.
This seems to be a counter-intuitive result: GBDT, the most complex classifier, did not achieve the best results. However, this situation can be explained. We noticed that there is a significant difference in the number of PIPs extracted by the GRM-PIPs algorithm when the best results are obtained (for details, see Appendix A). When DT and GBDT achieve their best results, the numbers of PIPs are almost the same, while RF requires more PIPs to achieve higher accuracy. This means that the upper limit of RF performance is the highest among the three classifiers. This may be caused by the absence of parameter optimization: GBDT and DT usually rely on parameter tuning to improve accuracy, while RF is not sensitive to parameters, and its large number of random decisions can effectively compensate for parameter defects.
We conduct an in-depth analysis of the experimental results shown in Table 3, which are divided into two aspects:
  • The impact of the length of the time series on accuracy. We sort all the datasets by length; those with a length less than 100 form group $G_1$, which contains 11 datasets. $G_2$ has 11 datasets whose lengths are greater than 100 but less than 300. $G_3$ covers 15 datasets ranging in length from 300 to 1000. The remaining three datasets, whose lengths exceed 1000, form $G_4$. From $G_1$ to $G_4$, the number of times that PFC achieves the best accuracy is 3, 6, 4, and 0, respectively. The results show that PFC is good at distinguishing time series samples whose length ranges from 100 to 1000. For samples with a length less than 100, GRM-PIPs can extract at most 27 PIPs and generate 26 intervals, which makes the feature dimension much larger than the original sequence dimension, and the resulting information redundancy prevents the classifier from obtaining robust decision rules. On the other hand, since we set the experiment to extract only 30 PIPs, the features of samples longer than 1000 may be extracted incompletely.
  • Does the imbalance of the training set and test set affect the accuracy of PFC? As far as the current results are concerned, the balance between training and test sets is not a factor that affects accuracy.

4.3. The Hybrid Verification

In the hybrid verification, we compare with the TSC methods of three recently published papers. Since the datasets validated by these methods differ, we compare with each method separately, using the same datasets it reported.
First, we test the performance of PFC against DVSL. Abhilash et al. [58] believed that existing VSML methods employ fixed virtual sequences, which might not be optimal for subsequent classification tasks. Therefore, they proposed DVSL to learn a set of discriminative virtual sequences that help separate time series samples in a feature space. This method was validated on 15 UCR datasets. The results of the comparative experiment are shown in Table 4.
Table 4. Comparison of PFC and DVSL on 15 UCR datasets.
The results show that PFC performed better on the same 15 UCR datasets, surpassing DVSL and achieving the best accuracy on 12 of them. At the same time, we also notice that the accuracy of PFC is much lower than that of DVSL on datasets such as Beef. Figure 9 shows the distribution of PIPs in Beef. We can clearly see that only the distribution of $Label = 1$ (represented by the red dots) is distinguishable, while the distributions of the other categories are highly similar. We believe that PFC can distinguish samples with obvious distinguishing characteristics, but if these characteristics are highly similar across multiple classes, PFC becomes ineffective. Although this situation is incidental, PFC is based on morphological perception information, and it is difficult for it to process samples with small morphological differences.
Figure 9. The original time series and the distribution of PIPs in Beef.
The second comparison method is IFT [57], which also uses PIPs. The difference is that IFT adopts information gain-based selection of interval features, which makes the whole method a special decision tree. Since both PFC and IFT perceive the importance of morphological features, this is a meaningful comparative experiment. IFT was validated on 22 UCR datasets, and we tested on the same datasets. The comparison results are shown in Table 5.
Table 5. Comparison of PFC and IFT on 20 UCR datasets (excluding two datasets with missing values).
On these datasets, the performance of PFC almost completely surpasses IFT. One exception, however, is the huge difference in accuracy between PFC and IFT on the ShapeletSim dataset. The samples in ShapeletSim resemble high-frequency sinusoidal signals, which causes most of the PIPs to be located at the peaks and troughs. In this case, the distribution of PIPs can only describe the boundary of the sample, the rectangle shown in Figure 10. The crux of the problem is not just the abnormality of these distributions; we realize that they lack the necessary distinguishability. On this dataset, the performance of IFT is almost perfect. The reason may be that its feature selection differs from ours, and these unique features play an important role in classification.
Figure 10. The distribution of PIPs in ShapeletSim. The distribution on the training set is on the left, and the right is the distribution on the test set.
Finally, we turn to E-SAX. One of the most popular dimensionality reduction techniques for time series data is Symbolic Aggregate Approximation (SAX), which is inspired by algorithms from text mining and bioinformatics. E-SAX uses only the extreme points of each segment to represent the time series [56]. The essence of SAX is to reduce the dimensionality of the time series, which is the same goal as PIPs. For these reasons, we chose E-SAX as a comparison method.
There are 45 UCR datasets used in the comparison experiments, and all the results are listed in Table 6. It is important to point out that E-SAX originally used the classification error $ERR$ as its metric. In order to facilitate comparison, we convert the classification error $ERR$ into classification accuracy $ACC$ according to Formula (9).
Table 6. Comparison of PFC and E-SAX on 45 UCR datasets. All results are converted to accuracy uniformly.
As shown in Table 6, PFC achieves the best ACC most often, performing best on 34 out of 45 datasets. These datasets comprise 17 two-category datasets and 28 multi-category datasets; PFC achieves significant advantages on 13 of the two-category and 21 of the multi-category datasets. Although PFC is still at a disadvantage on some datasets, we found that its results are very close to those of E-SAX, and this is under the premise that we did not optimize any parameters or model structure. We believe that PFC still has room for improvement.
This comparison experiment uses nearly the same number of datasets as the earlier two-category verification: part of the two-category datasets are removed and a large number of multi-category datasets are introduced. However, the number of times that PFC with RF as the classifier achieves the best accuracy increases greatly, far exceeding the DT and GBDT cases. RF can rely on a large ensemble of decision trees to handle multi-class tasks, and this advantage is clearly demonstrated here.

4.4. Discussion on the Number of PIPs

This is a meaningful discussion, because most current papers ignore the problem. Whatever operations are performed later, we usually begin by extracting m PIPs from the original time series. Two questions then need to be answered:
  • What is an appropriate value of m? Is more always better?
  • Does the number of PIPs affect different classifiers in the same way?
The second question is relatively easy to answer. The data listed in Appendix A give the answer: the same m has different effects on different classifiers. RF and GBDT always require a large number of PIPs to achieve high accuracy, but DT is less demanding. As ensemble methods, RF and GBDT are suited to more features, yet on some simple datasets DT can outperform them with only a few PIPs.
In fact, the first question is the most difficult to answer. As shown in Figure 11, with the series length of each dataset on the horizontal axis, we plot the number of PIPs at which the best accuracy is achieved on that dataset. The three distributions are similar, but the appropriate number of PIPs for RF and GBDT is larger than for DT.
Figure 11. The distribution of the number of PIPs in different classifiers.
On the other hand, for a given dataset, a larger m is not necessarily better. Across a large number of experimental records we found no specific rule. For time series with quite different morphological structures, a small number of PIPs is enough to highlight the differences, and more PIPs may introduce redundant, confusing information. When the morphological structure of the time series is complex, the situation is the opposite, and more PIPs are needed to describe the characteristics of the sample.
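As a sketch of how these PIP counts arise, the greedy extraction underlying this discussion can be written as follows. We use the vertical-distance criterion, which is an assumption; PIP variants also use Euclidean or perpendicular distance, and our GRM-PIPs additionally controls the interval length, which is not shown here.

```python
import numpy as np

def extract_pips(series, m):
    """Greedy PIP selection: start from the two endpoints, then repeatedly
    add the point with the largest vertical distance to the chord joining
    its nearest already-selected neighbours."""
    n = len(series)
    selected = [0, n - 1]
    while len(selected) < m:
        pts = sorted(selected)
        best_idx, best_dist = None, -1.0
        for left, right in zip(pts[:-1], pts[1:]):
            for i in range(left + 1, right):
                # linear interpolation of the chord at position i
                t = (i - left) / (right - left)
                chord = series[left] + t * (series[right] - series[left])
                d = abs(series[i] - chord)
                if d > best_dist:
                    best_idx, best_dist = i, d
        if best_idx is None:  # every gap between selected points is filled
            break
        selected.append(best_idx)
    return sorted(selected)

# On a high-frequency sinusoid the interior PIPs land on peaks and troughs,
# which is exactly the ShapeletSim behaviour described above.
x = np.sin(np.linspace(0, 8 * np.pi, 200))
pips = extract_pips(x, 9)
```

With m = 9 the selection consists of the two endpoints plus seven near-extremum points; raising m further only adds points along the slopes, illustrating why a larger m does not always help.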

5. Conclusions

The introduction of morphological structure features is an important improvement to time series classification. Based on the way humans visually perceive shapes, many studies have pointed out that the shape of a time series can be described by a sequence of important turning points. Inspired by these studies, we proposed GRM-PIPs, which controls the length of the intervals. We then used the PIPs to segment the time series and extracted a combination of interval-level and point-level features. Adding three classifiers, DT, RF, and GBDT, completes the perceptual features-based framework. Finally, we compared against five benchmark methods and three recently published methods on a large number of UCR datasets. The experimental results show that our work performs excellently on the TSC task. In addition, we analyzed the threshold on the interval length and discussed the influence of the number of PIPs, addressing gaps in previous work.
In future work, we plan to add more types of classifiers and to optimize them. We will also consider further improvements to feature extraction.

Author Contributions

S.W.: Conceptualization, methodology, programming, validation of the results, analyses, writing, review and editing, supervision, investigation, and data curation. X.W.: Resources, supervision, and project administration. M.L.: Supervision and review. D.W.: Investigation and review. All authors have read and agreed to the published version of the manuscript.

Funding

This work was supported by the Science and Technology Planning Project of Shenzhen Municipality, Grant number JCYJ20190806112210067.

Institutional Review Board Statement

Not applicable: this study did not involve humans or animals.

Data Availability Statement

The UCR dataset comes from https://www.cs.ucr.edu/~eamonn/time_series_data_2018/ (accessed on 1 May 2021). The complete data package can be downloaded from https://www.cs.ucr.edu/~eamonn/time_series_data_2018/UCRArchive_2018.zip (accessed on 1 May 2021). The briefing documents of the UCR dataset can be downloaded here (https://www.cs.ucr.edu/~eamonn/time_series_data_2018/BriefingDocument2018.pdf and https://www.cs.ucr.edu/~eamonn/time_series_data_2018/BriefingDocument2018.pptx) (accessed on 1 May 2021). More information about the UCR dataset such as baseline and comparison can be found in http://www.timeseriesclassification.com (accessed on 1 May 2021).

Conflicts of Interest

The authors declare that they have no known competing financial interests or personal relationships that could have appeared to influence the work reported in this paper.

Appendix A. Summary of the Number of PIPs When the Best Results Are Obtained

Table A1. The best accuracy and the number of PIPs used at the time.
No.  ACC (PFC-DT / PFC-RF / PFC-GBDT)   PIPs (PFC-DT / PFC-RF / PFC-GBDT)
1    0.9000 / 0.9500 / 0.9000          9 / 7 / 21
2    0.9500 / 0.9000 / 0.9500          5 / 7 / 5
3    0.9795 / 0.9795 / 0.9767          5 / 7 / 4
4    1.0000 / 1.0000 / 1.0000          3 / 15 / 3
5    0.7000 / 0.7400 / 0.7640          25 / 22 / 12
6    0.7500 / 0.7899 / 0.7753          4 / 6 / 5
7    0.7913 / 0.8058 / 0.7986          5 / 4 / 29
8    0.8100 / 0.8600 / 0.8500          6 / 7 / 6
9    0.9988 / 0.9501 / 0.9954          6 / 18 / 6
10   0.7136 / 0.8530 / 0.8734          7 / 29 / 29
11   0.6432 / 0.7025 / 0.7271          9 / 22 / 26
12   0.9782 / 0.9621 / 0.9775          6 / 7 / 6
13   0.9281 / 0.9081 / 0.9421          6 / 6 / 7
14   0.9533 / 0.9933 / 0.9533          5 / 7 / 7
15   0.9589 / 0.9936 / 0.9873          11 / 10 / 11
16   0.9810 / 1.0000 / 0.9810          5 / 6 / 5
17   1.0000 / 1.0000 / 1.0000          3 / 3 / 3
18   0.6667 / 0.7714 / 0.7486          31 / 24 / 24
19   0.8865 / 0.9216 / 0.9351          7 / 13 / 11
20   0.6875 / 0.6563 / 0.6875          26 / 12 / 16
21   0.8740 / 0.9243 / 0.8740          19 / 21 / 19
22   0.9417 / 0.9485 / 0.9105          5 / 6 / 5
23   0.7541 / 0.8197 / 0.8197          6 / 3 / 3
24   0.7423 / 0.8316 / 0.8178          12 / 11 / 11
25   0.7764 / 0.8594 / 0.7572          22 / 23 / 16
26   0.7145 / 0.7995 / 0.8007          7 / 8 / 11
27   0.9333 / 0.9556 / 0.9500          3 / 29 / 3
28   0.8419 / 0.8797 / 0.8965          16 / 14 / 17
29   0.8250 / 0.8867 / 0.8867          3 / 10 / 5
30   0.5667 / 0.5944 / 0.5667          21 / 29 / 21
31   0.8469 / 0.7804 / 0.8469          6 / 12 / 6
32   0.8001 / 0.7827 / 0.7901          14 / 9 / 14
33   0.9108 / 0.9824 / 0.9622          4 / 7 / 7
34   0.7675 / 0.8915 / 0.8333          4 / 5 / 4
35   0.7692 / 0.8308 / 0.7462          5 / 19 / 9
36   0.9543 / 0.9719 / 0.9543          3 / 5 / 3
37   1.0000 / 1.0000 / 1.0000          19 / 27 / 19
38   0.7037 / 0.7963 / 0.7222          9 / 26 / 10
39   0.7143 / 0.7403 / 0.7800          15 / 11 / 13
40   0.7497 / 0.8200 / 0.8097          7 / 7 / 7

References

  1. Wei, W.W. Time series analysis. In The Oxford Handbook of Quantitative Methods in Psychology: Volume 2; Oxford University Press: Oxford, UK, 2006. [Google Scholar]
  2. Bagnall, A.; Lines, J.; Bostrom, A.; Large, J.; Keogh, E. The great time series classification bake off: A review and experimental evaluation of recent algorithmic advances. Data Min. Knowl. Discov. 2016, 31, 606–660. [Google Scholar] [CrossRef]
  3. Fawaz, H.I.; Forestier, G.; Weber, J.; Idoumghar, L.; Muller, P.A. Deep learning for time series classification: A review. Data Min. Knowl. Discov. 2019, 33, 917–963. [Google Scholar] [CrossRef]
  4. Geurts, P. Pattern Extraction for Time Series Classification. In Principles of Data Mining and Knowledge Discovery; Springer: Berlin/Heidelberg, Germany, 2001; pp. 115–127. [Google Scholar] [CrossRef]
  5. Elhoseiny, M.; Huang, S.; Elgammal, A. Weather classification with deep convolutional neural networks. In Proceedings of the 2015 IEEE International Conference on Image Processing (ICIP), Quebec City, QC, Canada, 27–30 September 2015. [Google Scholar] [CrossRef]
  6. Pham, T.D.; Wardell, K.; Eklund, A.; Salerud, G. Classification of short time series in early Parkinsons disease with deep learning of fuzzy recurrence plots. IEEE/CAA J. Autom. Sin. 2019, 6, 1306–1317. [Google Scholar] [CrossRef]
  7. Joshi, D.; Khajuria, A.; Joshi, P. An automatic non-invasive method for Parkinson’s disease classification. Comput. Methods Progr. Biomed. 2017, 145, 135–145. [Google Scholar] [CrossRef] [PubMed]
  8. Dau, H.A.; Bagnall, A.; Kamgar, K.; Yeh, C.C.M.; Zhu, Y.; Gharghabi, S.; Ratanamahatana, C.A.; Keogh, E. The UCR time series archive. IEEE/CAA J. Autom. Sin. 2019, 6, 1293–1305. [Google Scholar] [CrossRef]
  9. Keogh, E.J.; Pazzani, M.J. Scaling up dynamic time warping for datamining applications. In Proceedings of the Sixth ACM SIGKDD International Conference on Knowledge Discovery and Data Mining (KDD’00), Boston, MA, USA, 20–23 August 2000; ACM Press: New York, NY, USA, 2000. [Google Scholar] [CrossRef]
  10. Keogh, E.; Chakrabarti, K.; Pazzani, M.; Mehrotra, S. Dimensionality Reduction for Fast Similarity Search in Large Time Series Databases. Knowl. Inf. Syst. 2001, 3, 263–286. [Google Scholar] [CrossRef]
  11. Zhang, H.; Dong, Y.; Xu, D. Entropy-based Symbolic Aggregate Approximation Representation Method for Time Series. In Proceedings of the 2020 IEEE 9th Joint International Information Technology and Artificial Intelligence Conference (ITAIC), Chongqing, China, 11–13 December 2020. [Google Scholar] [CrossRef]
  12. Sun, Y.; Li, J.; Liu, J.; Sun, B.; Chow, C. An improvement of symbolic aggregate approximation distance measure for time series. Neurocomputing 2014, 138, 189–198. [Google Scholar] [CrossRef]
  13. Schäfer, P.; Högqvist, M. SFA: A symbolic fourier approximation and index for similarity search in high dimensional datasets. In Proceedings of the 15th International Conference on Extending Database Technology (EDBT’12), Berlin, Germany, 27–30 March 2012; ACM Press: New York, NY, USA, 2012. [Google Scholar] [CrossRef]
  14. Schäfer, P. The BOSS is concerned with time series classification in the presence of noise. Data Min. Knowl. Discov. 2014, 29, 1505–1530. [Google Scholar] [CrossRef]
  15. Middlehurst, M.; Vickers, W.; Bagnall, A. Scalable Dictionary Classifiers for Time Series Classification. In Intelligent Data Engineering and Automated Learning—IDEAL 2019; Springer International Publishing: Berlin/Heidelberg, Germany, 2019; pp. 11–19. [Google Scholar] [CrossRef]
  16. Large, J.; Bagnall, A.; Malinowski, S.; Tavenard, R. On time series classification with dictionary-based classifiers. Intell. Data Anal. 2019, 23, 1073–1089. [Google Scholar] [CrossRef]
  17. Schäfer, P.; Leser, U. Fast and Accurate Time Series Classification with WEASEL. In Proceedings of the 2017 ACM on Conference on Information and Knowledge Management, Singapore, 6–10 November 2017; ACM: New York, NY, USA, 2017. [Google Scholar] [CrossRef]
  18. Lin, J.; Khade, R.; Li, Y. Rotation-invariant similarity in time series using bag-of-patterns representation. J. Intell. Inf. Syst. 2012, 39, 287–315. [Google Scholar] [CrossRef]
  19. Lines, J.; Bagnall, A. Time series classification with ensembles of elastic distance measures. Data Min. Knowl. Discov. 2014, 29, 565–592. [Google Scholar] [CrossRef]
  20. Lucas, B.; Shifaz, A.; Pelletier, C.; O’Neill, L.; Zaidi, N.; Goethals, B.; Petitjean, F.; Webb, G.I. Proximity Forest: An effective and scalable distance-based classifier for time series. Data Min. Knowl. Discov. 2019, 33, 607–635. [Google Scholar] [CrossRef]
  21. Xi, X.; Keogh, E.; Shelton, C.; Wei, L.; Ratanamahatana, C.A. Fast time series classification using numerosity reduction. In Proceedings of the 23rd International Conference on MACHINE Learning (ICML’06), Pittsburgh, PA, USA, 25–29 June 2006; ACM Press: New York, NY, USA, 2006. [Google Scholar] [CrossRef]
  22. Górecki, T.; Łuczak, M. Non-isometric transforms in time series classification using DTW. Knowl. Based Syst. 2014, 61, 98–108. [Google Scholar] [CrossRef]
  23. Datta, S.; Karmakar, C.K.; Palaniswami, M. Averaging Methods using Dynamic Time Warping for Time Series Classification. In Proceedings of the 2020 IEEE Symposium Series on Computational Intelligence (SSCI), Canberra, ACT, Australia, 1–4 December 2020. [Google Scholar] [CrossRef]
  24. Yu, D.; Yu, X.; Hu, Q.; Liu, J.; Wu, A. Dynamic time warping constraint learning for large margin nearest neighbor classification. Inf. Sci. 2011, 181, 2787–2796. [Google Scholar] [CrossRef]
  25. Forechi, A.; Souza, A.F.D.; Badue, C.; Oliveira-Santos, T. Sequential appearance-based Global Localization using an ensemble of kNN-DTW classifiers. In Proceedings of the 2016 International Joint Conference on Neural Networks (IJCNN), Vancouver, BC, Canada, 24–29 July 2016. [Google Scholar] [CrossRef]
  26. Ryabko, D.; Mary, J. Reducing statistical time-series problems to binary classification. Adv. Neural Inf. Process. Syst. 2012, 3, 2069–2077. [Google Scholar]
  27. Deng, H.; Runger, G.; Tuv, E.; Vladimir, M. A time series forest for classification and feature extraction. Inf. Sci. 2013, 239, 142–153. [Google Scholar] [CrossRef]
  28. Cabello, N.; Naghizade, E.; Qi, J.; Kulik, L. Fast and Accurate Time Series Classification Through Supervised Interval Search. In Proceedings of the 2020 IEEE International Conference on Data Mining (ICDM), Sorrento, Italy, 17–20 November 2020. [Google Scholar] [CrossRef]
  29. Lines, J.; Taylor, S.; Bagnall, A. HIVE-COTE: The Hierarchical Vote Collective of Transformation-Based Ensembles for Time Series Classification. In Proceedings of the 2016 IEEE 16th International Conference on Data Mining (ICDM), Barcelona, Spain, 12–15 December 2016. [Google Scholar] [CrossRef]
  30. Ye, L.; Keogh, E. Time series shapelets. In Proceedings of the 15th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining (KDD’09), Paris, France, 28 June–1 July 2009; ACM Press: New York, NY, USA, 2009. [Google Scholar] [CrossRef]
  31. Hills, J.; Lines, J.; Baranauskas, E.; Mapp, J.; Bagnall, A. Classification of time series by shapelet transformation. Data Min. Knowl. Discov. 2013, 28, 851–881. [Google Scholar] [CrossRef]
  32. Ji, C.; Liu, S.; Yang, C.; Pan, L.; Wu, L.; Meng, X. A Shapelet Selection Algorithm for Time Series Classification: New Directions. Procedia Comput. Sci. 2018, 129, 461–467. [Google Scholar] [CrossRef]
  33. Ji, C.; Zhao, C.; Liu, S.; Yang, C.; Pan, L.; Wu, L.; Meng, X. A fast shapelet selection algorithm for time series classification. Comput. Netw. 2019, 148, 231–240. [Google Scholar] [CrossRef]
  34. Dempster, A.; Petitjean, F.; Webb, G.I. ROCKET: Exceptionally fast and accurate time series classification using random convolutional kernels. Data Min. Knowl. Discov. 2020, 34, 1454–1495. [Google Scholar] [CrossRef]
  35. Yu, J.; Yin, J.; Zhou, D.; Zhang, J. A Pattern Distance-Based Evolutionary Approach to Time Series Segmentation. In Intelligent Control and Automation; Springer: Berlin/Heidelberg, Germany, 2006; pp. 797–802. [Google Scholar] [CrossRef]
  36. Chung, F.; Fu, T.; Luk, W.; Ng, V. Flexible time series pattern matching based on perceptually important points. In Workshop on Learning from Temporal and Spatial Data in International Joint Conference on Artificial Intelligence; The Hong Kong Polytechnic University: Hong Kong, China, 2001; pp. 1–7. [Google Scholar]
  37. Phetchanchai, C.; Selamat, A.; Rehman, A.; Saba, T. Index Financial Time Series Based on Zigzag-Perceptually Important Points. J. Comput. Sci. 2010, 6, 1389–1395. [Google Scholar] [CrossRef]
  38. Chi, X.; Jiang, Z. Feature recognition of the futures time series based on perceptually important points. In Proceedings of the 2012 2nd International Conference on Computer Science and Network Technology, Changchun, China, 29–31 December 2012. [Google Scholar] [CrossRef]
  39. Lintonen, T.; Raty, T. Self-learning of multivariate time series using perceptually important points. IEEE/CAA J. Autom. Sin. 2019, 6, 1318–1331. [Google Scholar] [CrossRef]
  40. Fu, T.C.; Chung, F.L.; Ng, C.M. Financial Time Series Segmentation based on Specialized Binary Tree Representation. In Proceedings of the 2006 International Conference on Data Mining (DMIN 2006), Las Vegas, NV, USA, 26–29 June 2006; pp. 3–9. [Google Scholar]
  41. Azimifar, M.; Araabi, B.N.; Moradi, H. Forecasting stock market trends using support vector regression and perceptually important points. In Proceedings of the 2020 10th International Conference on Computer and Knowledge Engineering (ICCKE), Mashhad, Iran, 29–30 October 2020; pp. 268–273. [Google Scholar] [CrossRef]
  42. Fenton, N.; Neil, M. Decision Analysis, Decision Trees, Value of Information Analysis, and Sensitivity Analysis. In Risk Assessment and Decision Analysis with Bayesian Networks; Chapman and Hall/CRC: Boca Raton, FL, USA, 2018; pp. 347–369. [Google Scholar] [CrossRef]
  43. Kamiński, B.; Jakubczyk, M.; Szufel, P. A framework for sensitivity analysis of decision trees. Cent. Eur. J. Oper. Res. 2017, 26, 135–159. [Google Scholar] [CrossRef] [PubMed]
  44. Quinlan, J. Simplifying decision trees. Int. J. Man-Mach. Stud. 1987, 27, 221–234. [Google Scholar] [CrossRef]
  45. Kretowski, M. Decision Trees in Data Mining. In Studies in Big Data; Springer International Publishing: Berlin/Heidelberg, Germany, 2019; pp. 21–48. [Google Scholar] [CrossRef]
  46. Qiu, W.; Liu, X.; Wang, L. Forecasting shanghai composite index based on fuzzy time series and improved C-fuzzy decision trees. Expert Syst. Appl. 2012, 39, 7680–7689. [Google Scholar] [CrossRef]
  47. Zalewski, W.; Silva, F.; Maletzke, A.; Ferrero, C. Exploring shapelet transformation for time series classification in decision trees. Knowl. Based Syst. 2016, 112, 80–91. [Google Scholar] [CrossRef]
  48. He, Y.; Chu, X.; Wang, Y. Neighbor Profile: Bagging Nearest Neighbors for Unsupervised Time Series Mining. In Proceedings of the 2020 IEEE 36th International Conference on Data Engineering (ICDE), Dallas, TX, USA, 20–24 April 2020; pp. 373–384. [Google Scholar] [CrossRef]
  49. Biau, G.; Scornet, E. Rejoinder on: A random forest guided tour. Test 2016, 25, 264–268. [Google Scholar] [CrossRef]
  50. Freund, Y.; Schapire, R.; Abe, N. A short introduction to boosting. J. Jpn. Soc. Artif. Intell. 1999, 14, 1612. [Google Scholar]
  51. Wang, J.; Tang, S. Time series classification based on arima and adaboost. MATEC Web Conf. 2020, 309, 03024. [Google Scholar] [CrossRef]
  52. Friedman, J.H. Stochastic gradient boosting. Comput. Stat. Data Anal. 2002, 38, 367–378. [Google Scholar] [CrossRef]
  53. Elish, M. Enhanced prediction of vulnerable Web components using Stochastic Gradient Boosting Trees. Int. J. Web Inf. Syst. 2019, 15, 201–214. [Google Scholar] [CrossRef]
  54. Johnson, N.E.; Bonczak, B.; Kontokosta, C.E. Using a gradient boosting model to improve the performance of low-cost aerosol monitors in a dense, heterogeneous urban environment. Atmos. Environ. 2018, 184, 9–16. [Google Scholar] [CrossRef]
  55. Džeroski, S.; Ženko, B. Is Combining Classifiers with Stacking Better than Selecting the Best One? Mach. Learn. 2004, 54, 255–273. [Google Scholar] [CrossRef]
  56. Fuad, M.M.M. Extreme-SAX: Extreme Points Based Symbolic Representation for Time Series Classification. In Big Data Analytics and Knowledge Discovery; Springer International Publishing: Berlin/Heidelberg, Germany, 2020; pp. 122–130. [Google Scholar] [CrossRef]
  57. Yan, L.; Liu, Y.; Liu, Y. Interval Feature Transformation for Time Series Classification Using Perceptually Important Points. Appl. Sci. 2020, 10, 5428. [Google Scholar] [CrossRef]
  58. Dorle, A.; Li, F.; Song, W.; Li, S. Learning Discriminative Virtual Sequences for Time Series Classification. In Proceedings of the 29th ACM International Conference on Information & Knowledge Management, Online, 19–23 October 2020; ACM: New York, NY, USA, 2020; pp. 2001–2004. [Google Scholar] [CrossRef]
Publisher’s Note: MDPI stays neutral with regard to jurisdictional claims in published maps and institutional affiliations.
