#### *2.2. Feature Extraction Based on Stacked Denoising Autoencoder (SDAE) Network*

As shown in the left part of Figure 2, feature extraction relied primarily on unsupervised training to shorten the sequence lengths listed in the second column of Table 1; labels served only as a supplement for fine-tuning in the second training stage of the SDAE network, which is generally formed by stacking multiple three-layer DAE models. Figure 3 shows the structure of a typical SDAE network formed by stacking three sub-DAE networks. Because noise is deliberately added to the input data, the hidden layers of such networks can retain more robust sample features during learning [41]. Greedy layer-wise training [42], which improves learning efficiency, was the preferred way to conduct the pre-training process. In the first stage of feature extraction, the initial features of the input samples were extracted by the unsupervised network. To improve the feature extraction, the labels of the input samples were then used to build a classification output layer for supervised fine-tuning. A feature extraction model based on the SDAE network was thereby obtained by training on dataset A in Table 1. When the new sample dataset B was fed into the trained model, the representation in the last hidden layer can be regarded as the reduced-dimensional features of the original input.
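
The greedy layer-wise pre-training stage can be sketched as follows. This is a minimal illustration assuming Keras/TensorFlow; the layer sizes, noise level (Gaussian corruption is only one common scheme), and training settings are placeholders rather than the configuration used in this work, and the supervised fine-tuning stage is omitted for brevity.

```python
from tensorflow.keras import layers, models

def pretrain_dae(x, n_hidden, noise_std=0.1, epochs=50):
    """Train one three-layer DAE on (corrupted input -> clean input)."""
    n_in = x.shape[1]
    inp = layers.Input(shape=(n_in,))
    noisy = layers.GaussianNoise(noise_std)(inp)      # corrupt input (training only)
    code = layers.Dense(n_hidden, activation="relu")(noisy)
    out = layers.Dense(n_in, activation="linear")(code)
    dae = models.Model(inp, out)
    dae.compile(optimizer="adam", loss="mse")
    dae.fit(x, x, epochs=epochs, verbose=0)           # reconstruct the clean input
    return models.Model(inp, code)                    # keep only the encoder half

def build_sdae(x, hidden_sizes, noise_std=0.1):
    """Stack encoders trained one layer at a time (greedy layer-wise pre-training)."""
    encoders, h = [], x
    for n_hidden in hidden_sizes:
        enc = pretrain_dae(h, n_hidden, noise_std)
        encoders.append(enc)
        h = enc.predict(h, verbose=0)                 # codes feed the next sub-DAE
    return encoders, h   # h holds the reduced-dimensional features
```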

**Figure 3.** Schematic diagram of the stacked denoising autoencoder (SDAE) network stacking process.

#### *2.3. Dimensionality Reduction Evaluation with Silhouette Coefficients*

Through the above processing, a sequence shorter than the original sequence listed in the second column of Table 1 can be obtained. Naturally, the effect and rationality of the dimensionality reduction needed to be evaluated, which amounts to assessing the hyperparameter settings of the SDAE network. Ideally, the feature vector produced by dimensionality reduction should represent the category information of the original sample to the greatest extent: good feature extraction should bring reduced-dimensional sequences of the same category closer together and push those of different categories farther apart. The silhouette coefficient [43], defined in Equation (1), provides a single value that measures both of these traits.

$$s\_i = \frac{b\_i - a\_i}{\max(a\_i, b\_i)}\tag{1}$$

where *si* is the silhouette coefficient for observation *i*, *ai* is the mean distance between *i* and all other observations of the same class, and *bi* is the smallest mean distance between *i* and all observations of any other class. Silhouette coefficients range between −1 and 1, with values near 1 indicating dense, well-separated categories. The mean silhouette coefficient over all observations can therefore be used to evaluate how the selection of key hyperparameters affects the performance of SDAE-based feature extraction for the two UCR datasets in Table 1. Grid search [44] combined with cross-validation [45] is guaranteed to find the best hyperparameter combination within a specified range, but it requires iterating through all possible combinations, which is very time-consuming for large datasets and multiple parameters of interest. A feasible alternative was to optimize the hyperparameters step by step. Considering the characteristics of the SDAE network, the key hyperparameters of interest are the number of network layers, the number of hidden-layer nodes, and the noise level [41]. As shown in Figure 4, the number of hidden layers of the SDAE network was determined first using the mean silhouette coefficient. During training, the adaptive moment estimation optimizer [46] was used to adapt the learning rate automatically, and the maximum number of training epochs was controlled by early stopping [47]. The other hyperparameters under each number of hidden layers were derived by trial and error under the control of the maximum number of training epochs and the minimum reconstruction error.
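
As a concrete illustration, the mean silhouette coefficient of Equation (1) can be computed with scikit-learn; the random features and labels below are merely stand-ins for the SDAE codes and the class labels of dataset A.

```python
import numpy as np
from sklearn.metrics import silhouette_samples, silhouette_score

rng = np.random.default_rng(0)
features = rng.normal(size=(60, 10))   # stand-in for reduced-dimensional SDAE codes
labels = rng.integers(0, 3, size=60)   # stand-in for the class labels of dataset A

s_i = silhouette_samples(features, labels)   # per-observation s_i, Equation (1)
print(f"mean silhouette coefficient: {silhouette_score(features, labels):.3f}")
print(f"range of s_i: [{s_i.min():.3f}, {s_i.max():.3f}]")
```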

As shown in Figure 4, with the numbers of hidden layers for the datasets CinCECGTorso and SemgHandMovementCh2 set to two and three, respectively, the node number of the last hidden layer, which reflects the dimensionality reduction effect, can be analyzed further. The hyperparameter configuration determined by trial and error in each step was used as the initial setting for the next tuning step, and the key hyperparameter fixed in a previous step remained constant in subsequent steps. As shown in Figure 5, the number of nodes in the last hidden layer was expressed as a percentage of the original sequence length. After the number of hidden layers and the number of nodes in the last hidden layer were determined in turn, a reasonable value of the denoising coefficient [48] in the input layer of the SDAE network can be chosen. Figure 6 gives the relationship between different denoising coefficients and the corresponding mean silhouette coefficients under the tuning strategy described above. Hence, using the hyperparameters determined by the maximum mean silhouette coefficients, the SDAE network structures used to obtain the dimensionality reduction sequences for the two selected datasets can be established.
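
The step-by-step strategy can be sketched as follows; `train_sdae_and_score` is a hypothetical helper that trains an SDAE with the given settings and returns the mean silhouette coefficient, and the candidate grids are illustrative, not the values used in this work.

```python
# Step-by-step hyperparameter tuning: fix the best value found for one
# hyperparameter before scanning the next (layers -> nodes -> noise).
def tune_stepwise(x, labels, train_sdae_and_score):
    best = {"n_layers": 1, "node_pct": 0.2, "noise": 0.1}   # initial settings
    steps = [
        ("n_layers", [1, 2, 3, 4]),           # step 1: number of hidden layers
        ("node_pct", [0.1, 0.2, 0.3, 0.4]),   # step 2: last-layer size (% of input length)
        ("noise",    [0.05, 0.1, 0.2, 0.3]),  # step 3: denoising coefficient
    ]
    for key, candidates in steps:
        scores = {c: train_sdae_and_score(x, labels, {**best, key: c})
                  for c in candidates}
        best[key] = max(scores, key=scores.get)   # keep the best value fixed afterwards
    return best
```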

**Figure 4.** Mean silhouette coefficients for different numbers of hidden layers.

**Figure 5.** Mean silhouette coefficients for different reduced dimensionalities.

**Figure 6.** Mean silhouette coefficients for different denoising coefficients.

After this step-by-step tuning, the optimal mean silhouette coefficients of the datasets CinCECGTorso and SemgHandMovementCh2, with the feature dimension reduced to 20% of the original sequence length, were 0.455 and 0.432, respectively. Figure 7 further shows that SDAE-based feature extraction performed significantly better than the other methods, although none of the silhouette coefficients exceeded 0.5. All comparison methods used the same dimensionality reduction ratio. The SDAE feature extraction applied to dataset B in Table 1 therefore provided the basis for the subsequent similarity measure on the reduced-dimensional sequences.

**Figure 7.** Comparison of different dimensionality reduction methods.

#### *2.4. Distance Measure Based on Improved Dynamic Time-Warping (DTW) Algorithm*

Since the features were extracted as equal-length dimensionality reduction sequences, the high time complexity and low computational efficiency of distance measurement based on the DTW algorithm can be effectively avoided. Although the reported window-based constraint methods help keep the DTW matching path from falling into a suboptimum under certain circumstances, improvements against the influence of undesired warping [49] still deserve attention. Based on DTW with a Sakoe–Chiba band constraint [50] (hereinafter abbreviated as SDTW), a warping offset distance (*d*WOD) was defined in the proposed improved DTW algorithm to further mitigate the effects of undesired warping. The defined *d*WOD is the area between the optimal matching path and the diagonal path under the SDTW algorithm. As shown in Figure 8, these two paths were derived from the distance matrix *D* of two equal-length sequences after feature extraction, and the *d*WOD described in Equation (2) is the cumulative sum of the offsets between each point on the optimal matching path and the corresponding point in the unbiased (diagonal) state. By aligning the feature points of two sequences processed by the SDAE network, this method not only ensured that the matching path can recognize slight warping of the time axis but also constrained the length of the matching path. The detailed definition of the DTW distance matrix and the dynamic-programming search for the optimal matching path can be found in [51]:

$$d\_{\rm WOD} = \sum\_{i=1}^{m} \left| w\_i - dia(i) \right| \tag{2}$$

where *wi* and *dia*(*i*) represent the *i*-th point on the optimal matching path of length *m* and the *i*-th point on the diagonal of the distance matrix *D*, respectively. The sum of *d*WOD and the distance based on the SDTW (*d*SDTW) was used as the distance metric of the improved DTW algorithm in Equation (3), and the resulting *d*similarity was regarded as the result of the similarity measure:

$$d\_{\text{similarity}} = d\_{\text{SDTW}} + d\_{\text{WOD}} \tag{3}$$
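
A minimal sketch of Equations (2) and (3), assuming NumPy and two equal-length scalar sequences: the band-constrained DTW distance *d*SDTW and the path offsets from the diagonal (*d*WOD, read here as the offset of each path point from the diagonal) are computed together. The backtracking rule is one common variant, not necessarily the exact implementation in [51].

```python
import numpy as np

def improved_dtw(x, y, r):
    """d_similarity = d_SDTW + d_WOD for equal-length sequences x and y."""
    n = len(x)
    D = np.full((n, n), np.inf)
    for i in range(n):
        for j in range(max(0, i - r), min(n, i + r + 1)):   # Sakoe-Chiba band
            cost = abs(x[i] - y[j])
            if i == 0 and j == 0:
                D[i, j] = cost
            else:
                D[i, j] = cost + min(
                    D[i - 1, j] if i > 0 else np.inf,
                    D[i, j - 1] if j > 0 else np.inf,
                    D[i - 1, j - 1] if i > 0 and j > 0 else np.inf,
                )
    # Backtrack the optimal matching path from (n-1, n-1) to (0, 0).
    i, j, path = n - 1, n - 1, [(n - 1, n - 1)]
    while (i, j) != (0, 0):
        candidates = [(i - 1, j - 1), (i - 1, j), (i, j - 1)]
        i, j = min((c for c in candidates if c[0] >= 0 and c[1] >= 0),
                   key=lambda c: D[c])
        path.append((i, j))
    d_sdtw = D[n - 1, n - 1]
    d_wod = sum(abs(p - q) for p, q in path)   # Equation (2): offset from diagonal
    return d_sdtw + d_wod                      # Equation (3)
```

For example, `improved_dtw(a, b, r=2)` returns the *d*similarity between two reduced sequences `a` and `b` under constraint bandwidth *r* = 2.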

**Figure 8.** Warping offset distance illustrated on the DTW distance matrix.

#### *2.5. Similarity Measure Evaluation with One Nearest Neighbor (1-NN) Classifier*

The bandwidth *r* defines the constraint range of the matching path in the distance matrix and suppresses the influence of undesired warping on the matching path [52]. Because the defined warping offset distance is coupled to the SDTW algorithm, and the SDTW-based distance depends on the constraint bandwidth *r*, a different *r* not only affects the optimal matching path of the SDTW but also changes *d*similarity. The choice of *r* therefore determined the efficacy of the proposed similarity measurement method. It has been reported that a 1-NN classifier on labeled data is a feasible way to evaluate a selected distance metric, with its classification accuracy directly reflecting the effectiveness of the similarity measure [53]. Moreover, the 1-NN classifier can be used to search for a proper *r*: the idea was to train on a labeled dataset under different bandwidth constraints using the two distance metrics *d*SDTW and *d*WOD separately. Two sets of classification error rates, *E*SDTW(*r*) and *E*WOD(*r*), at different *r* can then be derived from the 1-NN classifier. We defined *E*SUM as the sum of *E*SDTW(*r*) and *E*WOD(*r*), and the constraint bandwidth *r* that minimized *E*SUM was considered the appropriate choice for calculating *d*similarity.
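
The error rates *E*SDTW(*r*) and *E*WOD(*r*) can be estimated with a leave-one-out 1-NN classifier, as in this NumPy sketch; `dist` is any two-argument distance function, and the `sdtw_r`/`wod_r` names in the usage comment are hypothetical metrics fixed at a given *r*.

```python
import numpy as np

def one_nn_loo_error(X, y, dist):
    """Leave-one-out 1-NN classification error rate under a given distance metric."""
    errors = 0
    for i in range(len(X)):
        d = [dist(X[i], X[j]) if j != i else np.inf for j in range(len(X))]
        errors += int(y[int(np.argmin(d))] != y[i])   # nearest neighbor mislabeled?
    return errors / len(X)

# e.g. E_SUM(r) = one_nn_loo_error(X, y, sdtw_r) + one_nn_loo_error(X, y, wod_r)
```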

Figure 9 depicts typical variations of *E*SUM with *r*. For cases I and IV, the appropriate *r* is easily determined from the minimum *E*SUM. For case II, the constraint bandwidth can be considered to have no effect on the distance measured by the SDTW algorithm, and the first *r* corresponding to the minimum is taken as the candidate. For case III, where multiple candidate values within the convergence region correspond to the same minimum *E*SUM, the median of these candidates was selected as *r*. General rules for determining and adjusting the preset range of *r* can be found in [52].
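
The selection rule for the cases in Figure 9 can be written compactly, as in this sketch assuming NumPy, with `r_values` the preset range and `e_sum` the corresponding *E*SUM values.

```python
import numpy as np

def select_bandwidth(r_values, e_sum):
    """Pick r: first value if E_SUM is flat (case II), else median of the ties."""
    e_sum = np.asarray(e_sum, dtype=float)
    ties = np.flatnonzero(e_sum == e_sum.min())   # indices attaining the minimum
    if len(ties) == len(e_sum):                   # case II: r has no effect
        return r_values[0]
    return r_values[ties[len(ties) // 2]]         # cases I/IV (unique) and III (median)
```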

According to the data-processing procedure in the right part of Figure 2, dataset B in Table 1 was further divided into sub-training and sub-test sets after dimension reduction through the SDAE network. The dataset information used for the supervised learning of the 1-NN classifier is given in Table 2. The sample size of the test set was made significantly larger than that of the training set, following the ratio commonly adopted in the dataset sheet of the UCR archive [38]. First, the best *r* was searched for based on the training results of the 1-NN classifier under the two distance metrics. Figure 10 shows the variation of the classification error rate *E*SUM with *r* for the two reduced datasets in Table 2; *r* = 2 and *r* = 3 were chosen for CinCECGTorso and SemgHandMovementCh2, respectively. Next, the defined *d*similarity under the specified *r* was used as the distance metric of the 1-NN classifier for supervised training on the sub-training set, and other distance metrics were applied to the 1-NN classifier in the same way. The performance evaluation of the similarity measure can thus be transformed into a comparison of the classification error rates of the 1-NN classifier under different distance measures. Figure 11 compares, by classification error rate on the sub-test set, the generalization capacity of the 1-NN classifier with different distance metrics. The bar distribution shows that the distance based on the improved DTW achieved lower classification error rates on both sub-test datasets than the other distance functions, which means the proposed distance metric was more suitable for similarity evaluation.
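
The final comparison step reduces to scoring each metric on the held-out split, for example as below; `X_train`, `y_train`, `X_test`, and `y_test` are the assumed sub-training/sub-test arrays, and `improved_dtw` refers to the sketch in Section 2.4.

```python
import numpy as np

def one_nn_test_error(X_train, y_train, X_test, y_test, dist):
    """1-NN classification error on a held-out test set for a given distance metric."""
    errors = 0
    for x, y_true in zip(X_test, y_test):
        nearest = int(np.argmin([dist(x, xt) for xt in X_train]))
        errors += int(y_train[nearest] != y_true)
    return errors / len(X_test)

# e.g. one_nn_test_error(X_train, y_train, X_test, y_test,
#                        lambda a, b: improved_dtw(a, b, r=2))
```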

**Figure 9.** Typical variation of the sum of classification error rates at different constraint bandwidths.

**Figure 10.** Variation of the sum of classification error rates *E*SUM with the constraint bandwidth *r*.

**Figure 11.** Performance comparison of 1-NN classifier under different distance metrics.
