**3. Results**

We collected two self-reports (i.e., the SAM and DT) from the participants after each of the two levels of the mental arithmetic task and after the initial rest. Lower SAM and higher DT scores indicate stronger negative emotions and higher perceived stress, respectively. Table 4 shows the results of the self-reports. We calculated the change in each score after the mental arithmetic tasks relative to the baseline measurement (i.e., after the initial rest). The SAM score decreased after the tasks, and the decrease for the high-level task was larger than that for the moderate-level task. The DT score increased, compared to the baseline measurement. Similar to the SAM score, the larger difference in the DT score occurred after the high-level arithmetic task.

**Table 4.** Difference in self-reported scores, compared to baseline measurement.


In this section, we show the results of the proposed model. It consists of the extracted feature maps and evaluation metrics, including a comparison with the other models and within the proposed model itself. As mentioned in Section 2.4, Type I training indicates the pretrained model using the driving data set. For Type II, the model was trained using the mental arithmetic data set without the pretrained model. In the case of Type III training, we used the same data set as in Type II to train the model, but based it on the pretrained model.

First, we tested conventional machine learning methods before evaluating the proposed model. We used conventional algorithms, including decision tree (DT), k-nearest neighbors (kNN), logistic regression (LR), random forest (RF), and support vector machine (SVM). All the algorithms were trained and validated with the same data set as the proposed model. Table 5 shows the accuracy of the conventional methods. None of the machine learning algorithms reached satisfactory accuracy, which suggests that these trainable algorithms cannot learn proper features from a raw ECG input.
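As a rough illustration of this baseline experiment, the sketch below trains the same five classifiers directly on fixed-length raw-signal windows, as the paper does with raw ECG. The data here is synthetic stand-in noise, and the window length and cross-validation settings are assumptions for illustration only, not the paper's configuration.

```python
import numpy as np
from sklearn.tree import DecisionTreeClassifier
from sklearn.neighbors import KNeighborsClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.ensemble import RandomForestClassifier
from sklearn.svm import SVC
from sklearn.model_selection import cross_val_score

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 128))      # 200 raw-signal windows, 128 samples each (synthetic)
y = rng.integers(0, 2, size=200)     # binary rest/stress labels (synthetic)

models = {
    "DT": DecisionTreeClassifier(random_state=0),
    "kNN": KNeighborsClassifier(),
    "LR": LogisticRegression(max_iter=1000),
    "RF": RandomForestClassifier(random_state=0),
    "SVM": SVC(),
}
# Mean cross-validated accuracy per algorithm, on the raw windows directly.
accuracy = {name: cross_val_score(m, X, y, cv=5).mean()
            for name, m in models.items()}
```

On random noise all five hover near chance, which mirrors the paper's point: without engineered features, classical learners extract little from a raw 1-D signal.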


**Table 5.** Accuracy of the conventional methods. (DT; Decision Tree, kNN; k-Nearest Neighbors, LR; Logistic Regression, RF; Random Forest, SVM; Support Vector Machine)

#### *3.1. Feature Representation*

We observed all the extracted features from each stage using the t-SNE method, which converts high-dimensional features (the number of components, width, and channel) into 2-dimensional features that can be analyzed in a scatter plot. Figure 3 shows the t-SNE scatter plots for the input (raw ECG) and the extracted features from each stage. Each point represents one labeled state (i.e., rest or stress). The input is from a subject who participated in the mental arithmetic task and is sliced using a 10 s window. The proposed model, trained by Type III training, generated the features at each stage. As shown in Figure 3, there was almost no difference between the stress- and rest-labeled raw ECGs. As the features pass through successive stages, a distinction between the labels emerges. The t-SNE plots imply that the softmax classifier after the last stage can distinguish the two labels clearly.
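The visualization step can be sketched as follows. The feature matrix below is a synthetic stand-in for the flattened per-stage feature maps (components × width × channel), and the t-SNE hyperparameters are illustrative assumptions rather than the paper's settings.

```python
import numpy as np
from sklearn.manifold import TSNE

rng = np.random.default_rng(1)
# 100 windows, each with a flattened feature map of 64 values (synthetic stand-in)
features = rng.normal(size=(100, 64))
labels = rng.integers(0, 2, size=100)   # 0 = rest, 1 = stress

# Embed the high-dimensional features into 2-D for a scatter plot.
emb = TSNE(n_components=2, perplexity=30, init="random",
           random_state=1).fit_transform(features)
# emb[:, 0] and emb[:, 1] are the scatter-plot coordinates, colored by `labels`.
```

Repeating this per stage (raw input, stage 1, ..., last stage) yields plots like Figure 3, where separation between the two label clouds increases with depth.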

**Figure 3.** The t-SNE plots of raw ECG and extracted features from the stages. Round points denote features of ECG labeled as rest, and crosses represent stress-labeled features. This figure shows only the extracted features from stage 1, stage 5, and the last stage.

#### *3.2. Performance of the End-to-End Model*

Figure 4 shows the accuracy of the proposed model for the binary classification of rest and stress. We compared the results of the three training types, based on the input windows. Overall, Type I, which was trained using the driving data set, showed the best performance for all the windows. It reached the highest mean accuracies of 89.38%, 87.16%, and 79.12% for the 10 s, 30 s, and 60 s windows, respectively. The accuracy of Type I training, 89.38%, was significantly different from both the Type II accuracy, 61.33%, and the Type III accuracy, 69.71%, for the 10 s window (*p* < 0.001). Additionally, there was a significant difference between the accuracies of Type II and Type III training (*p* < 0.05). In the case of the 30 s window, Type I training achieved an accuracy of 87.16%, whereas Type II and Type III training achieved 68.38% and 72.13%, respectively (*p* < 0.001). For the 60 s window, the accuracies of Type I training, 79.12%, and Type III training, 79.50%, differed only slightly, but the accuracy of Type II training, 71.50%, was significantly different from that of Type III training (*p* < 0.05). Considering Type II and Type III training, which were both trained with the same data set (mental arithmetic), there were improvements of 12.01% and 10.06% in accuracy for the 10 s and 60 s windows (*p* < 0.05), respectively, while there was no significant improvement for the 30 s window at the 0.05 level.
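The section does not state which statistical test produced the reported *p*-values; assuming a two-sample t-test over per-fold accuracies, the comparison could be sketched as below. The fold accuracies are illustrative placeholders, not the paper's data.

```python
import numpy as np
from scipy import stats

# Hypothetical per-fold accuracies for two training types (NOT the paper's values).
acc_type1 = np.array([0.90, 0.88, 0.91, 0.89, 0.88])
acc_type2 = np.array([0.60, 0.63, 0.59, 0.62, 0.61])

# Two-sample t-test on the fold-wise accuracies (assumed test).
t_stat, p_value = stats.ttest_ind(acc_type1, acc_type2)
significant = p_value < 0.001   # significance threshold used for "***" in Figure 4
```

A paired test over matched folds would also be defensible; the choice depends on whether both models were evaluated on identical cross-validation splits.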

**Figure 4.** Accuracy of the end-to-end model in binary classification. Types are grouped by each raw ECG window (i.e., 10 s, 30 s, and 60 s) fed to the model. \* and \*\* indicate that the difference in means is significant at the 0.001 and 0.05 level, respectively.

We plotted the ROC and PR curves for each type and window in Figure 5. An ROC curve must lie above the chance baseline (*y* = *x*) for the model to perform better than random guessing. We can observe that Type I training showed the best performance in both the ROC and PR curves. Based on the curves alone, Type III training demonstrated only a small improvement over Type II training; however, both the ROC and PR curves of Type III training are generally positioned above the curves of Type II. It is difficult to evaluate the performance of a model with the ROC and PR curves alone. Therefore, we calculated the AUC of the ROC curves, which summarizes performance as a single number and makes it possible to compare the models directly.
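A minimal sketch of how such curves and the AUC can be computed from a classifier's output scores (the labels and scores below are synthetic, not the model's predictions):

```python
import numpy as np
from sklearn.metrics import roc_curve, precision_recall_curve, auc

rng = np.random.default_rng(2)
y_true = rng.integers(0, 2, size=300)
# A score correlated with the label, as a softmax output for "stress" might be.
y_score = np.clip(y_true * 0.6 + rng.normal(0, 0.3, size=300), 0, 1)

fpr, tpr, _ = roc_curve(y_true, y_score)                   # points of the ROC curve
precision, recall, _ = precision_recall_curve(y_true, y_score)  # points of the PR curve
roc_auc = auc(fpr, tpr)   # single scalar summary used to compare models
```

Plotting `(fpr, tpr)` and `(recall, precision)` reproduces the two panels of Figure 5; `roc_auc` is the quantity reported in Table 6.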

*Sensors* **2019**, *19*, 4408

**Figure 5.** ROC and PR curves. Each line represents a curve from Type I, Type II, and Type III training, respectively. Crosses refer to the performances of the conventional models. (**a**) ROC curves and (**b**) PR curves.

Table 6 shows the evaluation metrics, including the AUC, *F*1 score, sensitivity, and specificity. Both the mean and standard deviation were calculated based on cross-validation. We compared the performance between Type I and Type II training, which were both trained without any pretrained model, but with data sets of different sizes (i.e., the driving and the mental arithmetic data sets). Type I training for the 10 s window shows the best performance for the AUC, *F*1 score, sensitivity, and specificity. It had a value of 0.938 for the AUC (*p* < 0.001), 0.922 for the *F*1 score (*p* < 0.001), and 0.930 for sensitivity (*p* < 0.001). Although the specificity of Type I training for the 10 s window, 0.854, showed the highest value, it did not show a significant difference from Type II training. Based on the mean values, Type III training showed an improvement over Type II training, except for specificity with the 10 s window. For the 10 s window, the improvements were 8.00% for the AUC, 19.90% for the *F*1 score (*p* < 0.001), and 29.77% for sensitivity (*p* < 0.05). For the 30 s window, the improvements were 5.07% for the AUC, 7.42% for the *F*1 score, 1.81% for sensitivity, and 16.61% for specificity. The 60 s window showed improvements of 18.66% for the AUC (*p* < 0.05), 13.23% for the *F*1 score (*p* < 0.05), 7.32% for sensitivity, and 20.71% for specificity. In summary, the transfer learning method improved performances by 11.57%, 10.57%, 13.52%, 12.96%, and 9.41% on average for accuracy, the AUC, the *F*1 score, sensitivity, and specificity, respectively, across all window lengths.
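For reference, the table's threshold-based metrics reduce to simple functions of the confusion-matrix counts. The counts below are illustrative, not the paper's results.

```python
def binary_metrics(tp, fp, tn, fn):
    """Sensitivity, specificity, and F1 score from confusion-matrix counts."""
    sensitivity = tp / (tp + fn)   # recall on the stress (positive) class
    specificity = tn / (tn + fp)   # recall on the rest (negative) class
    precision = tp / (tp + fp)
    f1 = 2 * precision * sensitivity / (precision + sensitivity)
    return sensitivity, specificity, f1

# Illustrative counts for a 200-window test split (hypothetical numbers).
sens, spec, f1 = binary_metrics(tp=93, fp=15, tn=85, fn=7)
```

Averaging each metric over the cross-validation folds gives the means and standard deviations reported in Table 6.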


**Table 6.** Evaluation metrics.

#### *3.3. Comparison with Different Models*

We compared the proposed end-to-end model with conventional methods [9–11]. Rigas et al. [9] used physiological signals including HRV, SC, and respiration with a 10 s window. Smets et al. [11] additionally utilized skin temperature. Castaldo et al. [10] used non-linear HRV parameters, including the sample entropy (SampEn), recurrence plot mean line length (RPlmean), and Shannon entropy (ShanEn). Figure 5a compares these results with the proposed model using the ROC curves. Each blue cross is positioned at the best performance reported in [9–11]. For a fair assessment, we compared [9,11], which used 10 s and 30 s windows, with the proposed model at the same window lengths. To best match Castaldo et al. [10], who used a 3 m window to extract the HRV parameters from the ECG, we compared the proposed model with the 1 m window. Based on Figure 5a, all blue crosses are positioned below Type I, or close to it. From the perspective of sensitivity and specificity, the proposed model shows better performance than the conventional methods over a certain range of thresholds.

Both Hwang et al. [12] and Saeed et al. [13] utilized DNNs to classify stress. The comparison results are shown in Table 7. Hwang et al. [12] used a CNN and an LSTM with a raw ECG signal and achieved accuracies of 87.39% and 73.96% for the two cases, respectively. Their architecture consisted of one convolutional layer and two LSTM layers. Our proposed model shows improvements in accuracy of 3.10% and 18.00% for the respective cases, with the same window (10 s). Saeed et al. [13] used raw HR signals derived from the ECG and raw SC signals. They used the same driving data set from Healey et al. [14] to train and evaluate their model. Their model [13] showed its best performance with a value of 0.918 for the area under the ROC curve, while the proposed model reached 0.938.


**Table 7.** Comparison with models featuring a DNN algorithm.

#### **4. Discussion and Conclusions**

We have proposed a novel end-to-end architecture that uses raw ECG signals for stress detection and validated its performance with two different data sets. We believe that our model could replace the conventional machine learning-based methods in several ways. First, in terms of model simplicity, our model has an advantage over conventional methods, which require several additional steps, such as preprocessing, feature selection, and feature extraction, before classification. As our model was built with an end-to-end architecture, it does not require such additional steps. The end-to-end architecture enables the detection of stress by automatically extracting features without feature selection. We observed that the successive deep convolutional layers extract distinguishable features, as shown in Figure 3. Second, in the same vein, our model does not depend on the performance of these steps. The methods that use HRV parameters depend highly on the performance of the R-peak detection algorithm. Considering stress management in daily life, R-peak detection in ECG signals recorded in real-world environments may require additional steps, as proposed in [33]. In addition to this independence, our results showed that the detection performance of the proposed model was superior to that of the conventional methods [9–11], as shown in Figure 5a and Table 5. With raw ECG signals, conventional machine learning methods did not show acceptable performance in detecting stress, as they can rarely be trained effectively on non-linear inputs (i.e., raw signals). Finally, whereas the HRV parameters require at least a short-term (5 m) or long-term (24 h) window to properly reflect the stress response, our model used much shorter windows (10 s, 30 s, and 60 s). Our approach demonstrates a practically applicable system for daily stress management. As it takes an average of 2.490 ms to estimate the stress state from raw ECG inputs, it is possible to apply the proposed model in the real world to detect stress in real time.
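The short-window input described above amounts to slicing the continuous ECG stream into fixed-length segments. A minimal sketch, assuming a 256 Hz sampling rate (the actual rate is not restated in this section):

```python
import numpy as np

fs = 256   # assumed sampling rate in Hz (illustrative)
# 60 s of raw ECG, here replaced by synthetic noise.
ecg = np.random.default_rng(4).normal(size=fs * 60)

def slice_windows(signal, fs, win_s):
    """Cut a 1-D signal into non-overlapping windows of win_s seconds."""
    n = fs * win_s
    k = len(signal) // n
    return signal[: k * n].reshape(k, n)   # one row per window

windows = slice_windows(ecg, fs, win_s=10)   # 6 windows of 2560 samples each
```

Each row can then be fed to the model as-is, with no R-peak detection or HRV extraction in between.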

Despite the advantages described above, the performance of a DNN depends highly on the size of the data set used to train the neural network. To investigate the effect of the data set size on stress detection, we compared three types of models with different training strategies. As expected, the Type I model, which was trained using a larger data set, showed better performance than the Type II model, trained using a smaller data set. The driving data set [14] was more than four times larger than the mental arithmetic data set. For the last type of model (Type III), we utilized the pretrained model, trained using the driving data set, to train the model with the smaller data set. In the comparison between Type II and Type III training, Type III, which used the pretrained model, showed an improvement over Type II, which did not. Although the size of the mental arithmetic data set might not be large enough to train the neural network from scratch, it is possible to achieve a fine-tuned model based on pretraining with a larger data set. However, it could not reach the performance of the Type I model, trained using the larger data set, which presented the best performance. Unlike data in other domains, such as speech or images, a sufficient amount of physiological data may not be easily accessed or obtained. Thus, our approach can be utilized to train a DNN with a smaller data set, based on a pretrained model.
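The Type III strategy can be sketched generically as pretraining on the larger set and continuing training on the smaller one. The snippet below uses scikit-learn's `MLPClassifier` with `warm_start` as a lightweight stand-in for the paper's convolutional network, on synthetic data; the real pipeline fine-tunes the pretrained CNN weights instead.

```python
import numpy as np
from sklearn.neural_network import MLPClassifier

rng = np.random.default_rng(3)
# Stand-ins for the larger "driving" set and the smaller "mental arithmetic" set.
X_large = rng.normal(size=(400, 32)); y_large = rng.integers(0, 2, 400)
X_small = rng.normal(size=(100, 32)); y_small = rng.integers(0, 2, 100)

model = MLPClassifier(hidden_layer_sizes=(16,), max_iter=200, random_state=3)
model.fit(X_large, y_large)                  # pretraining on the larger data set

model.set_params(warm_start=True, max_iter=50)
model.fit(X_small, y_small)                  # fine-tune from the pretrained weights
acc = model.score(X_small, y_small)
```

With `warm_start=True`, the second `fit` resumes from the weights learned during pretraining rather than reinitializing, which is the essence of the Type II vs. Type III contrast.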

In this study, we used two different data sets (i.e., driving and mental arithmetic), recorded under ambulatory and laboratory environments, for model development and validation. Mental arithmetic is one of the representative test paradigms used to assess mental stress. Two questionnaires (the self-assessment manikin and the distress thermometer) confirmed that the mental arithmetic induced a mental load in the participants. However, to develop a stress management method for daily life, the method also needs to be validated outside the laboratory. Thus, we additionally chose the driving data set to assess stress out of the laboratory. Although these data sets cannot represent all of the stress situations that can occur in everyday life, such as workload stress, physical stress, anxiety, and so on, we demonstrated an end-to-end architecture to detect mental stress for both in- and out-of-laboratory environments. However, there were still limitations in this study. Although two different sensors were used for the two data sets, in view of generalization, the model needs to be validated using ECGs from diverse sensors, including other electrode configurations. We fed other data sets, different from those used during training, into the model; the results were highly biased toward a specific type of stress and showed a dependency on the recording sensor. Even though such bias or dependency remains, transfer learning from one data set to another may provide a solution to overcome the limited applicability in real-world settings. As mentioned above, all of the data sets used in this study were acquired during specific stressful tasks; however, ECGs recorded during daily activities are necessary for daily monitoring of stress. In future studies, we will apply this model to detect other stressful events, such as workload stress or anxiety, and will extend it to multi-class problems or continuous level recognition. Additionally, we will investigate how to augment physiological signals to train a neural network, to overcome the limitations of the data set.

**Author Contributions:** Conceptualization, H.-M.C., S.-Y.D., and I.Y.; data curation, S.-Y.D., H.P., and H.-M.C.; funding acquisition, I.Y.; investigation, H.P. and H.-M.C.; methodology, H.-M.C. and S.-Y.D.; software, H.-M.C.; supervision, S.-Y.D. and I.Y.; writing–original draft, H.-M.C.; writing–review and editing, S.-Y.D. and I.Y.

**Funding:** This research was supported in part by the Bio Medical Technology Development Program of the National Research Foundation (NRF) funded by the Korean government, MSIP (2014M3A9D7070128); a grant of the Korea Health Technology R&D Project through the Korea Health Industry Development Institute (KHIDI), funded by the Ministry of Health & Welfare, Republic of Korea (grant number: HI14C3477); and the National Research Council of Science & Technology (NST) grant by the Korea government (MSIT) (No. CAP-18-01-KIST).

**Conflicts of Interest:** The authors declare no conflict of interest.
