**1. Introduction**

As interest in health care increases, the importance of stress management has grown. As many people are exposed to stressful environments, they are more likely to suffer from physical and mental disorders. Indeed, stress has been shown to cause diseases such as depression, asthma, and autoimmune diseases [1]. To observe the changes in our body caused by stress, many researchers have focused on physiological signals, such as electrocardiography (ECG) signals and the galvanic skin response [2].

When a person receives a stress stimulus, his/her autonomic nervous system reacts to the stress, which results in physiological changes [3]. Among the physiological signals, the ECG enables us to observe how our bodies react to stress. An ECG is an electrical signal generated by heart activity, and it has three main components: The P-wave, QRS-complex, and T-wave. Among them, the time-series intervals between successive R peaks are used to calculate the heart rate variability (HRV) [4]. The HRV can be represented by various parameters calculated in the time, frequency, and non-linear domains. These HRV parameters have often been used for stress recognition [5–7].

With the recent development of mobile sensors for ECG recording, HRV analysis and stress studies have been actively carried out. However, HRV analysis has an inherent limitation: It requires sufficient data to observe the variability, so the longer the time window of the ECG record, the more accurately the statistical characteristics can be estimated. In other words, it is difficult to perform an HRV analysis with short-term ECG measurements. Previous research has established the minimum time window required for an HRV analysis; Camm et al. [7] recommended at least a 5 min ECG measurement. Moreover, the R peaks along the ECG time-series must be detected, typically with computational algorithms based on the Pan-Tompkins algorithm [8]. These algorithms contain preprocessing steps, such as filtering and differential operations, to locate the QRS-complex reliably. In short, the classical HRV approach requires additional preprocessing steps and a minimum window length.
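To make the two-step nature of the classical approach concrete, the sketch below shows the HRV computation that follows R-peak detection: given peak positions (assumed already found, e.g., by a Pan-Tompkins-style detector), it derives basic time-domain parameters. The helper name `hrv_time_domain` is hypothetical, not from the cited works.

```python
import numpy as np

def hrv_time_domain(r_peaks, fs):
    """Compute basic time-domain HRV parameters from R-peak sample indices.

    r_peaks : 1-D array of R-peak positions (in samples)
    fs      : sampling rate in Hz
    Returns (mean_rr, sdnn, rmssd) in milliseconds.
    """
    rr = np.diff(r_peaks) / fs * 1000.0          # successive R-R intervals (ms)
    mean_rr = rr.mean()                          # mean interval
    sdnn = rr.std(ddof=1)                        # SD of the intervals
    rmssd = np.sqrt(np.mean(np.diff(rr) ** 2))   # RMS of successive differences
    return mean_rr, sdnn, rmssd

# Example: a perfectly steady 60 bpm rhythm sampled at 256 Hz
peaks = np.arange(0, 256 * 10, 256)
mean_rr, sdnn, rmssd = hrv_time_domain(peaks, fs=256)
print(mean_rr)  # 1000.0 ms between beats; sdnn and rmssd are 0 for a steady rhythm
```

With a constant rhythm the variability parameters collapse to zero, which illustrates why a sufficiently long window is needed to observe meaningful variability.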

To recognize stress states, several previous studies [9–11] have reported classical HRV analyses using machine learning algorithms, such as the support vector machine (SVM), k-nearest neighbors (kNN), adaptive boosting (AB), and logistic regression (LR) methods, along with the ECG and other physiological signals. These other physiological signals included skin conductance (SC) [9,11], respiration [9,11], and skin temperature (ST) [11].
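As an illustration of this family of classifiers, a minimal kNN over HRV-like features can be written in a few lines of NumPy. The feature values and the `knn_predict` helper below are hypothetical toys for illustration only, not the pipelines of the cited studies.

```python
import numpy as np

def knn_predict(train_X, train_y, x, k=3):
    """Classify one feature vector by majority vote among its k nearest
    training samples (Euclidean distance)."""
    d = np.linalg.norm(train_X - x, axis=1)       # distance to every training sample
    nearest = train_y[np.argsort(d)[:k]]          # labels of the k closest samples
    vals, counts = np.unique(nearest, return_counts=True)
    return vals[np.argmax(counts)]                # majority vote

# Toy 2-D feature space (e.g., [SDNN, RMSSD] in ms); labels: 0 = rest, 1 = stress
X = np.array([[50., 45.], [55., 48.], [20., 15.], [22., 18.]])
y = np.array([0, 0, 1, 1])
print(knn_predict(X, y, np.array([21., 16.])))  # 1 (low-variability, stress-like)
```

Even in this toy form, the pipeline makes the dependency visible: the classifier only works after HRV features have been extracted and selected, which is exactly the overhead the end-to-end approach in this paper avoids.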

Table 1 shows a comparative summary of previous studies which used conventional machine learning algorithms, as well as DNN models. The column "Window length" refers to the duration of the ECG measurement used to extract the HRV parameters. The column "Performance" indicates the reported accuracy of the classifier listed in the column "Classifier". Most studies used more than one classifier, but only those that gave the best results are shown in Table 1. The study in [11] used different stressors from the other studies [9,10], including the Stroop color word test (SCWT), mental arithmetic (MA), and counting numbers. These stimuli are considered laboratory-environment stress stimulation. Castaldo et al. [10] designed an experiment in which an ECG was acquired on two different days: One when the participants were undergoing a university verbal examination, and the other after a vacation. Although the studies performed in a laboratory-controlled environment showed an accuracy of about 85%, outside of the laboratory the accuracy reached only 80% [10]. These results might indicate the limitations of the stressors used in the laboratory environment and of the conventional machine learning methods. A conventional machine learning method based on HRV features involves not only the preprocessing of the ECG signals, but also feature selection among the HRV parameters.



As the deep neural network (DNN) approach has recently demonstrated superior pattern recognition accuracy [15], some studies [12,13,16–21] have utilized DNNs for biomedical engineering applications, including heart arrhythmia classification, medical image classification and enhancement, stress detection, and other medical diagnoses. U-net [21], which consists of a convolutional neural network (CNN), forms an autoencoder architecture using skip-connections between the encoder and decoder. This architecture can be used in medical imaging applications, including augmentation, classification, and detection. Hannun et al. [16] achieved a performance in arrhythmia detection, using a deep convolutional architecture, that was similar to or exceeded that of cardiologists.

Hwang et al. [12] proposed the DeepECGNet method, which detects stress using an ultra-short-term (10 s) ECG. They used raw ECG signals as the input of the DNN, without extracting the HRV parameters. The model was configured with one CNN and two long short-term memory (LSTM) [22] layers. They suggested an optimal convolution filter width and pooling window width of 0.6 s and 0.8 s, respectively, recommending hyperparameter values capable of covering the QRS-complex of an ECG. They designed an experiment that induced mental and physical stress in participants through MA, SCWT, visual stimuli, and a cold pressor. There were 20 and 30 participants, separated into two cases, in which the model [12] reached 87.39% and 73.96% accuracy, respectively. Saeed et al. [13] suggested a multitask convolutional neural network (MT-CNN). Raw physiological signals, consisting of the heart rate (HR) derived from an ECG and the SC, are fed to the MT-CNN to detect stress. The duration of the ECG measurement was not mentioned; only that 300 samples of ECG signals were used. They achieved 0.918 for the area under the receiver operating characteristic curve (AUROC).

Though these studies suggested models that detect stress automatically, some issues remain unsolved, especially for the conventional machine learning approaches. One of these is the complexity of the classification steps. Four steps are involved in conventional methods: Preprocessing, feature selection, classifier training, and classification. Preprocessing, including R-peak extraction, is required to obtain the proper R peaks and calculate the HRV. Most of these studies used other physiological signals (i.e., respiration, SC, and ST) as well as an ECG, and needed to select the proper features from among both the physiological signals and the HRV parameters. These processes make it difficult to apply such models in a practical environment, from the perspective of real-time classification. Additionally, a person must measure the ECG for at least 1 min to detect a stress state, even with a short-term HRV analysis.

In this paper, we suggest a DNN model to detect stress from a raw ECG signal, which can potentially overcome the limitations mentioned above. An end-to-end method using a raw ECG signal requires neither preprocessing (i.e., filtering and R-peak extraction) nor additional feature selection. The other contribution of our research is a method for training the DNN with a pretrained model. A DNN requires a large amount of data to train, but it is difficult to acquire a large data set of physiological signals. We therefore trained the proposed model based on a pretrained model, which had learned from a large amount of data [14]. We evaluated the performance of the proposed model by calculating evaluation metrics (e.g., accuracy and area under the curve) which are widely used in the evaluation of DNN models. We also assessed the proposed model by comparing it with conventional machine learning methods [9–11] which used only the ECG to detect stress, and with other DNN methods [12,13].

#### **2. Material and Methods**

#### *2.1. Subjects and Data Acquisition*

We used two data sets, obtained from two different experiments, to train and evaluate the proposed model. The two data sets can be considered as ambulatory and laboratory stress, respectively. One data set consisted of ECG measurements collected from drivers who drove through a city and on a highway [14]. The other was recorded in an experimental environment, where mental stress was induced in the participants by arithmetic tasks. Figure 1 shows detailed information about both protocols, including durations and tasks.

**Figure 1.** The experimental procedures and their durations.

#### 2.1.1. Driving Data Set

PhysioNet [23] offers free access to a large number of physiological signals recorded under various conditions. We selected the driver stress data set [14] from among the freely accessible databases. It consists of various physiological signals, including ECG, electromyography (EMG), SC, and respiration signals, which were recorded under driving and resting conditions. Healey et al. [14] aimed to monitor real-world stress during driving situations. Among the recorded physiological signals, we chose the ECG, which was sampled at 496 Hz. Modified lead II configuration electrodes were used to measure the ECG. Sixteen participants' records were uploaded to PhysioNet. We excluded two subjects' data, as they did not contain a marker record of when the participants changed the driving region between highway and city. Finally, we selected 14 subjects' records for use in our study. All the selected records included approximately 50–85 min of ECG measurements during driving and resting; due to differences in traffic conditions, the total duration of the experiment differed by subject. The participants were made to take a rest for 15 min before and after driving.

#### 2.1.2. Mental Arithmetic Data Set

Seventeen people (6 female and 11 male, 27.9 ± 3.3 years old) participated in our experiment. We designed the experiment to induce mental fatigue in the participants using a mental arithmetic task. The mathematical tests were developed based on the Montreal Imaging Stress Task (MIST) to elicit two different levels of mental stress in participants. To do so, we simplified the MIST paradigm to two levels of difficulty. The participants had to try to solve the arithmetic problems and push the keypad to answer the questions. The problems had two levels: Moderate and high. The moderate level included three integers with plus and minus operations. For the high level, four integers and all of the arithmetic operations were used. All participants encountered the same level of complexity for the arithmetic problems. The ECG data were sampled at 256 Hz and measured by a T-REX TR100A sensor; the electrodes were placed in a modified lead II configuration, the same as for the driving data set [14]. First, we measured the baseline ECG for 5 min while the participants took a rest. After the baseline measurement, the participants took the mental arithmetic test two times (5 min each), at a moderate and a high level; they encountered more complicated mathematical problems during the high-level test. We provided a 5 min rest between the mental arithmetic tests. To measure whether the mental arithmetic induced mental stress in the participants, we used two questionnaires: The self-assessment manikin (SAM) [24] and the distress thermometer (DT) [25]. The self-reports were completed after the first rest period and after the two repetitions of the tasks. The mental arithmetic experiments were approved by the institutional review board at the Korea Institute of Science and Technology (2017-030).

#### *2.2. Data Preprocessing and Annotation Procedures*

We performed some preprocessing procedures before using the data sets to train the neural network. As the ECGs of the two data sets were recorded by different sensors, we standardized them to a common scale (zero mean, unit variance) using z-score normalization,

$$\mathbf{x}\_{i}^{s} \leftarrow \frac{\mathbf{x}\_{i}^{s} - \mu^{s}}{\sigma^{s}}, \quad \text{for } 0 \le i < n, \quad s \in \{\text{Driving}, \text{MA}\}$$

where *n* denotes the number of samples for each stressor and window. We calculated the mean (*μ*) and standard deviation (*σ*) over all the data for each stressor (i.e., driving and mental arithmetic), and normalized the ECG using both. After normalization, the driving data set was downsampled to match the sampling rate of the mental arithmetic data set (256 Hz). A fixed input dimension was needed, as the same neural network was used to train the model and detect stress on the two different data sets. The ECG signals were sliced into 10 s, 30 s, and 60 s windows to detect stress in short-term windows; the short-term windows were chosen in order to recognize stress in nearly real-time.
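The normalization and windowing steps described above can be sketched as follows. The helper name `normalize_and_window` is hypothetical, and the record here is random noise standing in for a real ECG.

```python
import numpy as np

def normalize_and_window(ecg, fs, window_s):
    """Z-score normalize an ECG record and slice it into fixed-length,
    non-overlapping windows."""
    x = (ecg - ecg.mean()) / ecg.std()   # z-score normalization
    w = fs * window_s                    # samples per window
    n = len(x) // w                      # number of complete windows (tail dropped)
    return x[:n * w].reshape(n, w)

# 5 minutes of fake ECG at 256 Hz, sliced into 10 s windows
ecg = np.random.randn(256 * 300)
windows = normalize_and_window(ecg, fs=256, window_s=10)
print(windows.shape)  # (30, 2560)
```

Each row of the result (2560 samples for a 10 s window at 256 Hz) matches the fixed input width *w* the network expects.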

We needed to annotate the data with specific labels to train the neural network in a supervised manner. We segmented the driving data set based on the boundaries between driving on the highway, driving in the city, and resting. The ECG measurements recorded while driving on the highway and in the city were labeled as stress, and the other measurements were labeled as rest. In the case of the mental arithmetic data set, we labeled the ECG measurements recorded during the mathematical tasks as stress. The other ECG measurements, recorded during the rest periods, were annotated as rest.

Table 2 shows the numbers of samples in the data sets and their label distributions for each window. The number of samples labeled as stress in the driving data set was much larger than the number labeled as rest, while the mental arithmetic data set shows a balanced label distribution. The drivers took a rest for approximately 30 min, including the initial and final rests, but drove for over 45 min, resulting in an imbalanced distribution. The participants who took the mental arithmetic test were exposed to the stress task and rest for the same amount of time: 10 min each. The driving data set and the mental arithmetic data set contained over 72,000 and 16,000 ECG cycles, respectively.


**Table 2.** Number of samples in the data sets.

#### *2.3. The Deep Neural Network*

We propose a deep convolutional neural network to detect stress events. Figure 2 shows the architecture of the proposed model. It takes only raw ECG signals as input, not HRV parameters or other physiological signals. The input dimension can be defined as *x* ∈ R<sup>*m*×*w*×*c*</sup>, where *m* and *c* denote the size of the mini-batch and the number of channels, respectively, and *w* refers to the width of the ECG, defined as the product of the sampling rate and the window length. Our model contains successive stages (*N* = 8) to extract features from the ECG. Table 3 lists the operations and the detailed parameters used in the stages. The number of filters in the convolutional layers is defined as 8 × 2<sup>*k*</sup>, where *k* begins at 0 and is increased by one every second stage. Each stage consists of two convolutional layers and one pooling layer. A convolutional layer performs a convolution operation with its filter and a specific stride, where the stride defines how far the filter moves within a layer (i.e., the convolutional and pooling layers). The output of the first convolutional layer is fed to both a strided convolutional layer and a pooling layer. The stride values of the strided convolutional layer and pooling layer are set to 2, so the inputs of these two layers are subsampled by a factor of 2. Max-pooling, which selects the maximum value within the filter width, is used in the pooling layers. Both the strided convolutional layer and the pooling layer subsample their inputs, and their outputs are then concatenated along the channel axis. If the output of the previous stage has dimension *σ*<sup>(*N*−1)</sup> ∈ R<sup>*m*×*w*×*c*</sup>, both the strided convolutional layer and the pooling layer produce outputs *C*<sup>(*N*)</sup>, *P*<sup>(*N*)</sup> ∈ R<sup>*m*×(*w*/2)×(*c*/2)</sup>, where *C*<sup>(*N*)</sup> and *P*<sup>(*N*)</sup> denote the outputs of the strided convolutional layer and the pooling layer in stage *N*. Concatenating them along the channel axis gives a dimension of *σ*<sup>(*N*)</sup> ∈ R<sup>*m*×(*w*/2)×*c*</sup>. When passing through the stages (*N* = 1, 2, ..., 8), dimension reduction along the width is performed. For example, if the input (raw ECG) has dimensions *x* ∈ R<sup>*m*×*w*×1</sup>,

$$\begin{aligned} \sigma^{(N)} & \in \mathbb{R}^{m \times (w/2^N) \times c}, \quad N \in \{1, 2, \dots, 8\} \\ w &= 256 \times \text{window}, \quad \text{window} \in \{10, 30, 60\} \end{aligned}$$

$$c = 8 \times 2^k, \quad k = \begin{cases} \frac{N}{2} - \frac{1}{2}, & \text{when } N \text{ is odd} \\ \frac{N}{2} - 1, & \text{when } N \text{ is even} \end{cases}$$
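The width and channel progression implied by these relations can be checked with a short script. The `stage_shapes` helper below is hypothetical bookkeeping only; it traces the shapes, not the convolutions themselves.

```python
def stage_shapes(window_s, fs=256, n_stages=8):
    """Trace the feature-map width and channel count through the N = 8 stages:
    the width is halved at each stage, and c = 8 * 2^k with k increasing by
    one every second stage."""
    w = fs * window_s                 # input width = sampling rate * window
    shapes = []
    for n in range(1, n_stages + 1):
        k = (n - 1) // 2              # equals N/2 - 1/2 (odd N) or N/2 - 1 (even N)
        w //= 2                       # strided conv + pooling halve the width
        c = 8 * 2 ** k                # channels after concatenation
        shapes.append((w, c))
    return shapes

# 10 s ECG at 256 Hz: input width 2560
print(stage_shapes(10))  # [(1280, 8), (640, 8), (320, 16), ..., (10, 64)]
```

For the 10 s window, the last stage thus emits a 10 × 64 feature map, which is what the softmax classifier receives.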

The extracted features (*σ*(8)) generated by the last stage are fed to the softmax classifier, which performs a binary classification between stress and rest:

$$h\_j = \frac{\exp\left(\sigma\_j^{(8)}\right)}{\sum\_{k=1}^2 \exp\left(\sigma\_k^{(8)}\right)}, \quad j = 1, 2 \tag{1}$$

where *j* = 1, 2 for the binary classification. The output of the softmax classifier, *hj*, represents the probabilistic distributions of each class—stress and rest—the sum of which is 1. Table A1 shows more detailed information of the proposed network, including the shape of the output for each operation with the input of 10 s of ECG.
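Equation (1) can be sketched directly in NumPy; the class scores below are arbitrary illustrative values, and the max-subtraction is a standard numerical-stability detail not present in the equation itself.

```python
import numpy as np

def softmax(logits):
    """Softmax over the class scores (Equation (1)); subtracting the max
    before exponentiating keeps the computation numerically stable."""
    z = logits - np.max(logits)
    e = np.exp(z)
    return e / e.sum()

h = softmax(np.array([2.0, 0.5]))  # scores for [stress, rest]
print(h.sum())  # the two class probabilities sum to 1
```

The larger logit maps to the larger probability, so the predicted state is simply the argmax of `h`.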

We used a rectified linear unit (ReLU) as the activation function to generate a non-linear decision boundary from the successive linear combinations of the weighted inputs. This activation function outputs the maximum of zero and its input. We applied dropout [26] with a drop rate of 0.3 and batch normalization [27] to prevent overfitting. A neural network can easily overfit the training data when a model learns from a small data set, and many studies have made use of dropout and batch normalization to overcome overfitting. Dropout requires a drop rate, which specifies the fraction of neurons dropped in each layer. Batch normalization makes the input data follow a specific distribution,


based on a normalized input distribution. The distribution can be changed during training through the trainable variables [27] *γ* and *β*, where *γ* scales the normalized input and *β* shifts it.

**Figure 2.** Deep neural network architecture and the components of each stage. Raw ECG signals are provided into the input layer. The successive stages extract features from an output of a previous stage. After the last stage, a softmax classifier performs a binary classification between the rest and stress.



**Table 3.** List of operations and hyperparameters used in each stage.

#### *2.4. Training the Neural Network*

There are three types of training methods for the proposed model: Type I generates a pretrained model trained on the driving data set; Type II trains a model with the mental arithmetic data set; and Type III trains the pretrained model (i.e., from Type I) further with the mental arithmetic data set.

All three types of training use the same end-to-end architecture, using the raw ECG signals from each data set. A loss function needs to be set to train the DNN model. We utilized the cross-entropy loss function:

$$L = -\frac{1}{m} \sum\_{i=1}^{m} (y\_i \log \left( h\_i \right) + (1 - y\_i) \log \left( 1 - h\_i \right)),\tag{2}$$

where *hi* and *yi* denote the prediction result from the proposed model and the true label from the data set, respectively, and *m* represents the size of the mini-batch, which is set to 64. When the model predicts a state (i.e., stress or rest) properly, the loss becomes nearly zero; however, it diverges from zero in the opposite situation, that is, when the model produces an output different from the data set label. A proper optimizer must be selected to train the DNN stably, because the optimizer drives the loss function toward zero. We used the Adam optimizer (*β*1 = 0.9, *β*2 = 0.999, *ε* = 10<sup>−8</sup>) [28] to train the proposed model. This optimizer calculates the gradients of the loss function by back-propagation, which adjusts the weights of the neurons in an end-to-end model. All the weights of the neurons are initialized by the He initializer [29]:
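The behavior of the loss in Equation (2) can be sketched as follows; the clipping constant is an implementation detail added here to avoid log(0), not part of the original formulation.

```python
import numpy as np

def cross_entropy(h, y):
    """Mini-batch binary cross-entropy (Equation (2)).
    h: predicted stress probabilities, y: true labels (1 = stress, 0 = rest)."""
    eps = 1e-12                      # guard against log(0)
    h = np.clip(h, eps, 1 - eps)
    return -np.mean(y * np.log(h) + (1 - y) * np.log(1 - h))

perfect = cross_entropy(np.array([0.999, 0.001]), np.array([1, 0]))
wrong = cross_entropy(np.array([0.001, 0.999]), np.array([1, 0]))
print(perfect < wrong)  # True: the loss grows as predictions diverge from labels
```

This matches the description in the text: near-correct predictions give a loss near zero, while confidently wrong predictions drive the loss far from zero.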

$$\mathcal{W} \sim \mathcal{N}(0, Var(\mathcal{W})).\tag{3}$$

The weight distribution is initialized to a normal distribution with a mean of zero and a standard deviation equal to the square root of the weight variance. The weight variance is defined as follows:

$$Var(\mathcal{W}) = \frac{2}{n\_{in}},\tag{4}$$

where *nin* denotes the number of input weights.
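The He initialization of Equations (3) and (4) amounts to drawing each weight from a zero-mean normal distribution with variance 2/*nin*. A minimal sketch (the layer sizes are arbitrary):

```python
import numpy as np

rng = np.random.default_rng(0)

def he_init(n_in, n_out):
    """He initialization: weights drawn from N(0, 2/n_in), i.e. a standard
    deviation of sqrt(2/n_in)."""
    std = np.sqrt(2.0 / n_in)
    return rng.normal(0.0, std, size=(n_in, n_out))

W = he_init(512, 64)
print(W.std())  # empirically close to sqrt(2/512) ≈ 0.0625
```

Scaling the variance by the fan-in keeps activation magnitudes roughly constant through deep ReLU stacks, which is why it pairs naturally with the ReLU activation used here.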

There are many hyperparameters (e.g., the number of layers, number of neurons, size of the mini-batch, filter width, and number of channels, among others) to be decided, in order to train a model properly. We first considered how many stages are adequate for the proposed model. The number of stages began at 1, and the accuracy improved as the number of stages was increased by 1, up to 8. With the number of stages fixed, we tuned the filter width and the number of filters to find the best-fitting parameters. We searched the hyperparameters of the DNN through a grid search and a manual search. Finally, we chose the model that achieved the highest accuracy on the test data set across a maximum of 10 epochs. Section 2.5.1 shows how we split the data set into a training set and testing set, based on cross-validation.
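A grid search of this kind can be sketched generically; here `evaluate` is a hypothetical stand-in for a full training-and-validation run (which is far more expensive in practice), and the candidate values are illustrative only.

```python
from itertools import product

def grid_search(evaluate, filter_widths, n_filters):
    """Exhaustively evaluate every (filter width, number of filters) pair and
    return the best setting with its score."""
    best = (None, -1.0)
    for fw, nf in product(filter_widths, n_filters):
        acc = evaluate(fw, nf)       # e.g., validation accuracy of a trained model
        if acc > best[1]:
            best = ((fw, nf), acc)
    return best

# Toy stand-in for model training/evaluation, peaked at (7, 8)
score = lambda fw, nf: 1.0 - abs(fw - 7) * 0.01 - abs(nf - 8) * 0.005
print(grid_search(score, [3, 5, 7, 9], [8, 16])[0])  # (7, 8)
```

In practice the grid is coarse and a manual search refines the neighborhood of the best grid point, as described above.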

#### 2.4.1. Type I Training

We generated a pretrained model with the driving data set. As the DNN requires a large amount of data for training, we used the driving data set, which was larger than the mental arithmetic data set. The learning rate was set to 1 × 10<sup>−3</sup> and was reduced by a factor of 10 every 5 epochs.

#### 2.4.2. Type II Training

Using the mental arithmetic data set, the same method of end-to-end training was applied as in Type I training. Additionally, we used all the same hyperparameters to observe how the size of the data set affected training the neural network.

#### 2.4.3. Type III Training

We applied a transfer learning method, using the pretrained model generated by Type I training. As mentioned above, it is difficult to train a neural network with a small data set; however, if a model is trained from a pretrained model, it can easily be fine-tuned. We hypothesized that the ECG measurements obtained from the participants who had taken the mental arithmetic task were similar to those obtained from the drivers, as transfer learning is effective when the distribution of the data set to be learned is similar to that of the pretrained model. The softmax classifier required re-training, because the data to be trained still differed somewhat from the data used in pretraining, even though their distributions were similar. However, re-training only the softmax classifier did not show an acceptable performance. The number of layers to retrain was, thus, considered as a hyperparameter. We applied the grid search method to find the proper stages (Stage 1, Stage 2, . . . , Stage 8) to be retrained: the starting stage was varied until a satisfactory performance was achieved. We kept the pretrained model fixed, except for the last stage (*N* = 8) and the softmax layer (*N* = 9). The trainable variables in the softmax classifier were initialized before training.

$$\mathcal{W}^{(N)} \sim \mathcal{N}(0, Var(\mathcal{W}^{(N)})), \quad N = 9 \text{ (softmax)}\tag{5}$$

We utilized the Adam optimizer [28] to update the trainable variables of the last stages and softmax classifier through backpropagation, but the variables in the other stages were kept constant:

$$\mathcal{W}^{(N)} \leftarrow \mathcal{W}^{(N)} - \alpha \cdot m\_l / (\sqrt{v\_l} + \varepsilon), \quad N = 8, 9 \tag{6}$$

where *m* and *v* denote the first and second moments of the Adam optimizer, respectively. We used a different learning rate (*α*), which started at 1 × 10<sup>−4</sup> with the same decay schedule (decreasing by a factor of 10 every 5 epochs), to train the model. As the pretrained model was already well fitted, a lower learning rate was preferable. The results of the three training types and comparisons between them are shown in Section 3.
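For completeness, a single Adam update as in Equation (6) can be sketched as follows. This version includes the bias-corrected moments of the original Adam formulation [28], which the equation in the text abbreviates; the gradient value is arbitrary.

```python
import numpy as np

def adam_step(w, grad, m, v, t, lr=1e-4, b1=0.9, b2=0.999, eps=1e-8):
    """One Adam update with bias-corrected first and second moments."""
    m = b1 * m + (1 - b1) * grad           # first moment (running mean of gradients)
    v = b2 * v + (1 - b2) * grad ** 2      # second moment (running mean of squares)
    m_hat = m / (1 - b1 ** t)              # bias correction for step t
    v_hat = v / (1 - b2 ** t)
    w = w - lr * m_hat / (np.sqrt(v_hat) + eps)
    return w, m, v

w, m, v = np.array([1.0]), np.zeros(1), np.zeros(1)
w, m, v = adam_step(w, np.array([0.5]), m, v, t=1)
print(w)  # moved against the gradient by roughly the learning rate
```

Only the variables of stages *N* = 8, 9 would receive such updates in Type III training; the weights of the earlier stages are left frozen.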

#### *2.5. Model Evaluation*

In this section, we describe how we evaluated the proposed model. All three training types, as mentioned above, performed training on a training set using cross-validation. We tested each type of end-to-end model with its test set and calculated the evaluation metrics (i.e., the receiver operating characteristic curves). Additionally, we observed the features not only at the end of the neural network, but also in the middle stages. T-distributed stochastic neighbor embedding (t-SNE) [30] makes high-dimensional features visible in a two- or three-dimensional space.
