#### 3.2.1. Protocol

Each experiment set took about 1 h, and comprised two stages: (1) an initial relaxation stage before the experiment began (about 15–20 min) and (2) the main experimental stage (about 45 min). These are described in more detail in Figure 1.

**Figure 1.** The experiment procedure. The whole experiment comprises two stages: an initial relaxation stage (light green) and a regular experiment stage (gray bold lines). The regular experiment stage is segmented into 5-min relaxation tasks (dark green) and stress tasks (dark orange). From the start to the end of the regular experiment stage, the subjects' physiological signals were captured by a wearable device. At the end of each relaxation or stress task, a mental stress level assessment was carried out (orange arrows).

During the initial relaxation stage, each subject was asked to wear a wearable device that collected ECG and RESP data and to relax completely, eliminating any excitement or nervousness about the experiment. In addition, we explained our protocol in detail to prevent any mistakes by the subject. During the main experimental stage, the subjects alternately experienced simulated relaxing states (called RELAX tasks in Figure 1) and stressful states. During the relaxation tasks, the subject was asked to sit on a chair in a comfortable position without any mental activity. The first relaxation task aimed to establish a psychological baseline and remove unwanted excitement before the regular experiment stage. Similarly, the other relaxation tasks aimed to remove stress after a stressful task and to prepare for the next task by resetting the psychological baseline. This design improves the reliability of the experiment's results [5]. During the stressful tasks, the subject was presented with one of two types of stressors: (1) a math task, namely a quiz requiring the subject to solve a series of subtraction problems by mental arithmetic, or (2) a Stroop task, namely a quiz in which the subject was asked to respond with the color of a given word while ignoring its meaning. Because all the subjects were Korean, the words were presented in Korean. Both are typical tasks that have been used to induce stress in previous studies [12,15,17].

The tasks also varied in difficulty, based on the results of a previous study by Cho et al. [15]. For example, an easy math task might be to repeatedly subtract 1 from a four-digit number, responding within 7.5 s, while a hard math task might involve repeatedly subtracting a two-digit number, rather than 1, from a four-digit number within the same time limit. Likewise, easy Stroop tasks involved words whose color and meaning matched, whereas these were mismatched in hard Stroop tasks; in either case, the time limit for each Stroop problem was 1.5 s. Appropriate sound feedback was also provided to indicate whether or not the entered answer was correct, encouraging the subject to pay attention to the task and inducing additional stress.

The subjects were presented with these four stress-inducing tasks (two types with two difficulties) in random order. Although a previous study [15] used a fixed order, we chose not to do this, for two reasons: (1) a fixed order could bias the stress level, and (2) a random order better replicates real stress-inducing situations. All of the relaxation and stressor tasks lasted for 5 min. At the end of each task, the subject was asked to evaluate how mentally stressed he or she felt, based on a visual analogue scale (VAS) [15], an evaluation method in which an individual rates his or her subjective stress from 0 (not at all) to 10 (extreme stress). For example, a relaxed subject's VAS score should be close to 0, and a highly stressed subject's score close to 10. In this study, the VAS was used to confirm the average effect (e.g., inducing or relieving stress) of each task across the 16 subjects. During the experimental stage, the subjects were asked not to use their cellphones and to minimize unexpected mental stimuli. When the experimental stage was complete, the subject took off the wearable device, completing the experiment.

#### 3.2.2. Experimental Setup

A BioHarness module 3.0 (Zephyr Technology, Annapolis, MD, USA) was used to collect the subjects' ECG and RESP data. This wearable device is compact and can be tightened with a strap, making it a good choice to minimize movement disturbance during the experiment. Because making the strap too tight could induce unnecessary stress or pain, we asked the subjects not to tighten it so much that it was uncomfortable to wear.

The experiments were conducted on a laptop computer with a 2.8 GHz Intel Core i7 processor (Intel, Santa Clara, CA, USA) and 16 GB of RAM, in a closed room. During each experiment, the subject was alone in the room, as shown in Figure 2. On the laptop computer, a graphical user interface application was installed, which we developed with MATLAB R2016a (MathWorks, Natick, MA, USA). This was designed to be as simple as possible so as not to confuse the subjects.

**Figure 2.** Setup of the experiment in a closed room. Each subject performed the experiment alone with a laptop computer; no other person and no camera were present, so as not to make the subject nervous or embarrassed.

#### 3.2.3. Data Preprocessing

After running the experiment a total of 16 times, we collected 16 datasets, each consisting of ECG and RESP data and stress-level indices (VAS scores).

During the preprocessing step, the captured ECG signal was filtered first by a 2000th-order finite impulse response (FIR) notch filter with a 58–62 Hz stop band, and then by a 3000th-order FIR band-pass filter with a 1.5–150 Hz pass band [12]. This de-noising process makes it easier to find the R-peaks of the ECG. In contrast, we did not filter the RESP signal, because it was captured from torso expansion and contraction, and thus any motion noise might not be independent of the subject's breathing.
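
The two-step FIR de-noising described above can be sketched with SciPy as follows. The sampling rate is an assumption (1 kHz, matching the point counts quoted in Section 3.4); the filter orders and band edges are taken from the text.

```python
import numpy as np
from scipy import signal

FS = 1000  # Hz; assumed ECG sampling rate (not stated explicitly for this step)

# 2000th-order (2001-tap) FIR notch filter with a 58-62 Hz stop band
notch = signal.firwin(2001, [58.0, 62.0], fs=FS, pass_zero="bandstop")
# 3000th-order (3001-tap) FIR band-pass filter with a 1.5-150 Hz pass band
bandpass = signal.firwin(3001, [1.5, 150.0], fs=FS, pass_zero="bandpass")

def denoise_ecg(ecg):
    """Apply the notch filter and then the band-pass filter (zero-phase)."""
    x = signal.filtfilt(notch, 1.0, ecg)
    return signal.filtfilt(bandpass, 1.0, x)
```

Here `filtfilt` applies each filter forward and backward so the R-peak timing is not shifted; whether the original pipeline used zero-phase filtering is not stated.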

We divided the segment for each task into six clips, consisting of 50-s windows with no overlap. We chose 50-s windows because at least 50 s of physiological data are required to extract several important features [17]. The ECG and RESP segments' start and end times were all synchronized, and there was only one data point of overlap between one clip and the next. We then excluded the first clip from each segment owing to the initialization time needed for each task. After preprocessing, we obtained a total of 720 clips (16 subjects, each with nine task segments of five clips each). Finally, we labeled each clip with its binary class (stressed or unstressed) according to the task type (relaxation or stressor).
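
The windowing step can be sketched as follows; the single-point overlap between adjacent clips is omitted for simplicity, and the function name is ours:

```python
import numpy as np

def make_clips(segment, fs, win_s=50, drop_first=True):
    """Split one 5-min task segment into non-overlapping 50-s clips,
    dropping the first clip to allow for task initialization."""
    win = int(win_s * fs)
    n = len(segment) // win
    clips = [np.asarray(segment[i * win:(i + 1) * win]) for i in range(n)]
    return clips[1:] if drop_first else clips
```

For a 5-min RESP segment at 25 Hz this yields six 50-s windows, of which the last five are kept.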

#### *3.3. Machine Learning Approaches*

To compare our deep learning approach with conventional machine learning approaches, we also developed several machine learning models for use as benchmarks. Here, we selected ECG and RESP features that have been used in many previous studies [11,12,17–19].

We extracted 11 handcrafted features from the ECG data, including four time-domain features and seven frequency-domain features (Table 1). As time-domain features, we extracted the mean HR (HR mean), standard deviation of the Normal-to-Normal (NN) interval (sdNN), root mean square of the successive differences of the R peak-to-R peak (RR) intervals (rmssd), and percentage of the differences between adjacent RR intervals that were greater than 50 ms (pNN50). As frequency-domain features, we extracted the NN interval powers in the following ranges: 0.00–0.04 Hz (VLF), 0.04–0.15 Hz (LF), 0.15–0.40 Hz (HF), and 0.00–0.40 Hz (TF). In addition, we included the ratios of LF to LF+HF (nLF), HF to LF+HF (nHF), and LF to HF (LF2HF) as frequency-domain features.
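
The four ECG time-domain features can be computed directly from the detected RR (NN) intervals. A minimal sketch, assuming the intervals are given in milliseconds (the function name is ours):

```python
import numpy as np

def hrv_time_features(rr_ms):
    """Time-domain HRV features from RR (NN) intervals in milliseconds."""
    rr = np.asarray(rr_ms, dtype=float)
    diff = np.diff(rr)
    return {
        "HRmean": 60000.0 / rr.mean(),                  # mean heart rate (bpm)
        "sdNN": rr.std(ddof=1),                         # SD of NN intervals (ms)
        "rmssd": np.sqrt(np.mean(diff ** 2)),           # RMS of successive diffs
        "pNN50": 100.0 * np.mean(np.abs(diff) > 50.0),  # % of diffs > 50 ms
    }
```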

We also extracted a total of eight handcrafted RESP features: three time-domain features and five frequency-domain features (Table 1). As time-domain features, we used the square root of the mean squared RESP (RMS), the interquartile range (IQR), and the mean difference between adjacent elements of each RESP segment (MDA). As frequency-domain features, we used the powers in the 0.00–1.00 Hz (LF1), 1.00–2.00 Hz (LF2), 2.00–3.00 Hz (HF1), and 3.00–4.00 Hz (HF2) ranges, as well as the LF1 + LF2 to HF1 + HF2 ratio (L2H). As with the ECG frequency-domain features, the RESP features were computed using Welch's method of estimating the data's power spectral density.
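
The Welch-based band powers can be sketched as below. The `nperseg` value is an illustrative choice, not taken from the paper, and the band edges are those listed above for RESP:

```python
import numpy as np
from scipy import signal

RESP_BANDS = {"LF1": (0.0, 1.0), "LF2": (1.0, 2.0),
              "HF1": (2.0, 3.0), "HF2": (3.0, 4.0)}

def band_powers(x, fs, bands=RESP_BANDS, nperseg=256):
    """Band powers from Welch's power spectral density estimate."""
    f, psd = signal.welch(x, fs=fs, nperseg=min(len(x), nperseg))
    df = f[1] - f[0]
    return {name: psd[(f >= lo) & (f < hi)].sum() * df
            for name, (lo, hi) in bands.items()}
```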

We then developed several machine learning models that have previously been proposed to classify stress states [20]. While the models were being trained and evaluated, the features were normalized by a MinMax scaler to bring them into the 0–1 range. To prevent data leakage during training, the scaler parameters were fitted using only the training set features, but were used to normalize both the training and test set features. We tuned the models' hyper-parameters via grid search and calculated their average performance using five-fold cross validation.
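
This leakage-free setup maps naturally onto a scikit-learn pipeline, sketched below. The classifier (an SVM) and the parameter grid are illustrative assumptions; the paper does not specify which models from [20] were used.

```python
from sklearn.model_selection import GridSearchCV, StratifiedKFold
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import MinMaxScaler
from sklearn.svm import SVC

# Putting the scaler inside the pipeline means it is re-fitted on the
# training portion of each cross-validation fold only (no data leakage).
model = Pipeline([("scale", MinMaxScaler()), ("clf", SVC())])
grid = GridSearchCV(model, param_grid={"clf__C": [0.1, 1, 10]},
                    cv=StratifiedKFold(n_splits=5))
```

Calling `grid.fit(X, y)` then performs the grid search with five-fold cross validation, scaling each fold independently.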

**Table 1.** A list of features extracted from ECG and RESP. We computed the power spectral density of the ECG's NN interval and of RESP, using Welch's method, to extract the frequency-domain features. Abbreviations: ECG, electrocardiogram; RESP, respiration; NN, normal-to-normal; RR, R peak-to-R peak.

| Signal | Domain | Features |
| --- | --- | --- |
| ECG | Time | HR mean, sdNN, rmssd, pNN50 |
| ECG | Frequency | VLF, LF, HF, TF, nLF, nHF, LF2HF |
| RESP | Time | RMS, IQR, MDA |
| RESP | Frequency | LF1, LF2, HF1, HF2, L2H |

#### *3.4. Deep Learning Approaches*

Unlike machine learning approaches, deep learning approaches are based on deep neural networks that can extract features directly from the data and do not rely on well-defined handcrafted features. As the name implies, deep neural networks are artificial neural networks with two or more hidden layers. Having many hidden layers enables such networks to learn more complex nonlinear patterns and hierarchical information than shallow networks can. Despite these advantages, however, deep neural networks usually have a large number of parameters, which can lead to over-fitting, and they can suffer from vanishing gradients when they have many layers. These problems can result in a failure to learn and an increase in generalization error. Recent algorithmic advances (e.g., rectified linear units, batch normalization, dropout, stochastic gradient descent, and data augmentation), more powerful computational hardware (e.g., general-purpose graphics processing units), and innovative network architectures, such as CNNs and LSTMs, have partially resolved these over-fitting and vanishing-gradient problems, enabling high performance to be achieved. These developments have encouraged the use of deep learning approaches in numerous fields, including physiological signal analysis [21] and stress recognition [5,12,15].

We designed our proposed network based on Deep ECG Net's structure [12]. First, a batch-normalization layer is used to normalize each physiological signal, so that the network can learn to normalize the signals from the data itself. Then, there is a 1D convolutional layer and a 1D max-pooling layer for each signal, which extract stress-related waveform patterns from the ECG and RESP data. Here, a rectified linear unit (ReLU) is used as the activation function. Next comes another 1D convolutional layer; there is no additional max-pooling layer this time because the previous max-pooling operation has already greatly reduced the dimensionality. After that, multiple LSTM layers obtain sequential information from the features extracted by the preceding convolutional layer. Next, we concatenate the extracted ECG and RESP features and add a dense layer. Finally, there is a fully-connected layer with a sigmoid activation function, which classifies the data as stressed or unstressed. To prevent over-fitting, we also add dropout and batch-normalization layers. Figure 3 shows the structure of the proposed DeepER (ECG–RESP) Net.

**Figure 3.** The structure of the proposed DeepER Net. Each signal is processed in its own network branch, and the branches are then concatenated to recognize stress. The basic structure is based on that of Deep ECG Net [12].

As noted by the developers of Deep ECG Net [12], both the first 1D convolutional layer's kernel length and the 1D max-pooling layer's pooling length are important factors. They determined that a kernel length of 0.6 s (i.e., 600 points at a sampling frequency of 1 kHz) and a pooling length of 0.8 s (i.e., 800 points) were optimal. These choices are plausible. First, the duration of the PQRST complex of the ECG is the sum of the PR and QT intervals, which is between 0.57 and 0.67 s [12]; thus, selecting a kernel length in this range is reasonable. Furthermore, so that the max-pooling operation covers an interval including at least one R-peak, which is related to HR and HRV, an average heart-beat period (about 0.8 s) is a reasonable pooling length. Based on these heuristic choices, we designed our first 1D convolutional layer with the same kernel and max-pooling lengths (0.6 s and 0.8 s, respectively) for processing the ECG data. The kernel and max-pooling lengths of the network processing the RESP data were designed similarly, using a single respiration period for both. Because the RESP pattern is simple, comprising an expiration (nadir) and an inspiration (peak), this size is sufficient to extract the RESP features. Because adults normally respire 12–20 times per minute [22], we set both lengths to 5 s (i.e., 125 points at a sampling frequency of 25 Hz).

Our proposed network has 50 filters in each of the initial 1D convolutional layers, with a stride of 1. The second 1D convolutional layers also have 50 filters each and a stride of 1, with kernel lengths of 25 (ECG) and 4 (RESP) so that both focus on the same time interval (20 s). Zero-padding was used in all the convolutional layers to maintain the input size. There are 32 and 16 units in the first and second LSTM layers, respectively, and 512 units in the dense layer. All dropout layers have a dropout rate of 0.5, and the weight-decay regularization strength is 10<sup>−4</sup>.
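
The architecture described above can be sketched with Keras as follows. The exact placement of the dropout and batch-normalization layers, the ReLU on the second convolution, and which weights carry the L2 regularizer are assumptions on our part; the filter counts, kernel/pooling lengths, LSTM and dense sizes are taken from the text.

```python
from tensorflow.keras import layers, models, regularizers

def branch(x, kernel, pool, kernel2):
    # batch normalization, first conv + max-pooling, second conv, two LSTMs
    x = layers.BatchNormalization()(x)
    x = layers.Conv1D(50, kernel, padding="same", activation="relu")(x)
    x = layers.MaxPooling1D(pool)(x)
    x = layers.Conv1D(50, kernel2, padding="same", activation="relu")(x)
    x = layers.LSTM(32, return_sequences=True)(x)
    return layers.LSTM(16)(x)

reg = regularizers.l2(1e-4)                  # weight-decay strength from the text
ecg_in = layers.Input(shape=(50 * 1000, 1))  # 50-s ECG clip at 1 kHz
resp_in = layers.Input(shape=(50 * 25, 1))   # 50-s RESP clip at 25 Hz
ecg = branch(ecg_in, 600, 800, 25)           # 0.6-s kernel, 0.8-s pooling
resp = branch(resp_in, 125, 125, 4)          # 5-s kernel and pooling
x = layers.concatenate([ecg, resp])
x = layers.Dense(512, activation="relu", kernel_regularizer=reg)(x)
x = layers.Dropout(0.5)(x)
out = layers.Dense(1, activation="sigmoid")(x)
model = models.Model([ecg_in, resp_in], out)
model.compile(optimizer="adam", loss="binary_crossentropy")
```

Note how the 0.8-s pooling reduces the 50,000-point ECG clip to 62 time steps before the LSTMs, while the 5-s pooling reduces the 1250-point RESP clip to 10 steps.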

For training, we used the Adam optimizer [23] with a learning rate of 10<sup>−3</sup> and a step-decay scheduler (i.e., the learning rate is halved every 50 epochs). The binary cross-entropy loss was used to calculate the losses between the labels and predictions, as follows:

$$L = -\frac{1}{M} \sum\_{i=1}^{M} \left[ y\_i \log(\hat{y}\_i) + (1 - y\_i) \log(1 - \hat{y}\_i) \right], \tag{1}$$

where $M$ is the number of samples, $y\_i$ is the ground-truth label, and $\hat{y}\_i$ is the predicted probability of the stressed class.

We used a total of 250 epochs, a batch size of 32, and a 0.3 validation split (i.e., 30% of the training set). Finally, the model with the lowest loss on the validation set after 250 epochs was used for evaluation. As with the machine learning models, we used five-fold cross validation to evaluate the performance of the network.
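
The step-decay schedule described above (usable, e.g., with Keras's `LearningRateScheduler` callback) can be sketched as a simple function; the name and keyword arguments are ours:

```python
def step_decay(epoch, initial_lr=1e-3, drop=0.5, epochs_per_drop=50):
    """Halve the initial learning rate of 1e-3 every 50 epochs."""
    return initial_lr * (drop ** (epoch // epochs_per_drop))
```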

All training processes were conducted using the well-known Keras deep learning library with Python 2.7, running under Ubuntu 16.04.5, on a PC with a 3.6 GHz Intel Core i7 processor, 128 GB of RAM, and four NVIDIA GeForce GTX 1080 Ti GPUs (NVIDIA, Santa Clara, CA, USA).
