*4.1. Performance*

Of the 720 segments, one ECG segment was significantly distorted by motion noise; we excluded this segment and its label from further analysis. To evaluate model performance, we applied five-fold cross validation to both the machine learning models and DeepER Net, a method commonly used to assess a model's predictive ability [5]. Specifically, the remaining 719 segments were randomly shuffled and split into five folds. This cross validation scheme does not take subjects into account, so segments extracted from the same subject can appear in both the training set and the test set. The resulting similarity between the two sets might lead to higher accuracy on the test set.
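The split described above can be illustrated with a minimal sketch in plain Python. The function name `shuffled_kfold`, the seed, and the subject assignment (24 consecutive segments per subject) are assumptions made purely for illustration; the actual subject layout and the splitting code used in the study are not given here.

```python
import random


def shuffled_kfold(n_samples, n_folds=5, seed=0):
    """Yield (train_idx, test_idx) pairs from a shuffled k-fold split."""
    indices = list(range(n_samples))
    random.Random(seed).shuffle(indices)
    # Spread the remainder over the first folds so sizes differ by at most one.
    base, extra = divmod(n_samples, n_folds)
    start = 0
    for fold in range(n_folds):
        size = base + (1 if fold < extra else 0)
        test_idx = indices[start:start + size]
        train_idx = indices[:start] + indices[start + size:]
        start += size
        yield train_idx, test_idx


N_SEGMENTS = 719  # 720 segments minus the one excluded for motion noise

# Hypothetical subject labels (assumption): 24 consecutive segments per subject.
subject_of = [i // 24 for i in range(N_SEGMENTS)]

folds = list(shuffled_kfold(N_SEGMENTS))
fold_sizes = [len(test_idx) for _, test_idx in folds]
print(fold_sizes)  # → [144, 144, 144, 144, 143]

# Subjects that contribute segments to both sides of the first split.
train_idx, test_idx = folds[0]
shared = {subject_of[i] for i in train_idx} & {subject_of[i] for i in test_idx}
print(len(shared), "hypothetical subjects appear in both training and test sets")
```

Because the shuffle ignores subject identity, almost every subject ends up on both sides of each split, which is the source of the optimistic bias noted above. A subject-wise alternative would assign whole subjects to folds instead of individual segments.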

We calculated the average performance of the machine learning models, as shown in Table 4, and then selected the best model for comparison purposes. Of these models, the random forest (RF) yielded the highest average accuracy (71.8 ± 2.3%), F1 score (0.67 ± 0.04), and AUC (0.80 ± 0.02). It was followed by the decision tree (DT), the SVM, and the KNN, with logistic regression (LR) showing the lowest performance. This shows that different models can perform differently even when trained on the same handcrafted feature set, and that the most suitable model must be identified for each problem. In addition, the fact that the RF and LR demonstrated the highest and lowest performance, respectively, suggests that an ensemble model can be suitable for recognizing stress. However, the RF's AUC was less than 0.9, so it cannot be considered highly accurate [24].

Turning now to the proposed DeepER Net, we find that it showed the highest average accuracy (83.9 ± 2.3%), F1 score (0.81 ± 0.05), and AUC (0.92 ± 0.01). Compared with the RF, its average accuracy was 12.1 percentage points higher, its average F1 score was 0.14 higher, and its average AUC was 0.12 higher (*p*-value < 0.05 with paired *t*-test in each case), indicating that our deep learning approach was a substantial improvement. In addition, DeepER Net's AUC was greater than 0.9, so we can conclude that it is highly accurate for recognizing stress [24]. These results thus suggest that our deep learning approach is a promising option for accurately recognizing stress. Loss and accuracy information of the proposed DeepER Net during training is shown in Figure S1.
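As a rough illustration of the paired *t*-test used for these comparisons, the sketch below computes the test statistic from fold-wise scores of two models evaluated on the same five folds. The per-fold accuracies are hypothetical placeholders, not the study's values (only the averages are reported here), and the helper name `paired_t_statistic` is likewise an assumption.

```python
import math
import statistics


def paired_t_statistic(scores_a, scores_b):
    """Return (t, degrees of freedom) for a paired t-test on matched scores."""
    diffs = [a - b for a, b in zip(scores_a, scores_b)]
    n = len(diffs)
    mean_diff = statistics.mean(diffs)
    sd_diff = statistics.stdev(diffs)  # sample standard deviation (n - 1)
    return mean_diff / (sd_diff / math.sqrt(n)), n - 1


# Hypothetical fold-wise accuracies (placeholders for illustration only).
deep_net_acc = [0.84, 0.81, 0.86, 0.85, 0.83]
rf_acc = [0.72, 0.70, 0.75, 0.71, 0.73]

t_stat, dof = paired_t_statistic(deep_net_acc, rf_acc)

# Two-sided critical value of Student's t for 4 degrees of freedom, alpha = 0.05.
T_CRITICAL_DF4 = 2.776
significant = abs(t_stat) > T_CRITICAL_DF4
print(f"t = {t_stat:.2f}, df = {dof}, significant at 0.05: {significant}")
```

With five folds the test has only four degrees of freedom, so the difference between the models must be consistent across folds to exceed the critical value.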

**Table 4.** Average metrics after five-fold cross validation. We used Equations (2) and (3) to calculate the average accuracy, F1 score, and AUC, as well as their standard deviations, and show these results as average ± standard deviation. Abbreviations: SVM, support vector machine; RF, random forest; KNN, k-nearest neighbors; LR, logistic regression; DT, decision tree; AUC, area under the ROC curve; ROC, receiver operating characteristic.

