#### 3.3.1. Preprocessing

The physiological signals coming from different sensors were segmented into non-overlapping time windows. We selected a window size of two minutes since the duration of stress stimulation and recovery processes was reported to be approximately a few minutes, and segments of this size can capture such processes [41]. Consequently, the response time of our system was approximately two minutes. The IBI (interbeat interval) and EDA signals were sent to the artifact detection units.

The artifact detection and removal unit for the heart rate signal flagged an RR interval as an artifact when it differed from the preceding interval by more than a percentage threshold, selected as 20% [42]. Detected artifacts were replaced using cubic spline interpolation, as applied in Kubios [43]. Interpolation was preferred over applying time and sample constraints on the remaining data since it achieved better results [18]. Windows containing more than 10% RR interval artifacts were removed entirely. This unit was developed in our previous work [18].
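The artifact rule above can be sketched as follows. This is a minimal illustration of the 20% threshold, spline replacement, and 10% window-rejection steps, not the authors' exact implementation; the function name and the choice to compare each beat against the last accepted one are our assumptions.

```python
# A minimal sketch (not the authors' exact implementation) of the RR
# artifact rule: intervals deviating by more than 20% from the last
# accepted beat are flagged, windows with more than 10% artifacts are
# rejected, and remaining artifacts are replaced by a cubic spline.
import numpy as np
from scipy.interpolate import CubicSpline

def clean_rr(rr_ms, threshold=0.20, max_artifact_ratio=0.10):
    rr = np.asarray(rr_ms, dtype=float)
    artifact = np.zeros(len(rr), dtype=bool)
    last_good = rr[0]
    for i in range(1, len(rr)):
        if abs(rr[i] - last_good) / last_good > threshold:
            artifact[i] = True
        else:
            last_good = rr[i]
    if artifact.mean() > max_artifact_ratio:
        return None                      # discard the whole window
    good = np.flatnonzero(~artifact)
    rr[artifact] = CubicSpline(good, rr[good])(np.flatnonzero(artifact))
    return rr
```

Comparing against the last accepted beat (rather than the immediately preceding sample) avoids cascading flags after a single ectopic interval.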

For the EDA artifact detection unit, we used the toolbox developed by Taylor et al. [44]. It uses an SVM classifier and detects artifacts in the EDA signal with 95% accuracy by analyzing the skin temperature, accelerometer, and skin conductance signals. We added a batch processing feature to this toolbox and removed the detected artifacts. The resulting signal was then transferred to the EDA feature extraction unit.

#### 3.3.2. Feature Extraction

To create a multi-modal stress recognition framework, we extracted state-of-the-art features from the EDA signal and the RR intervals to form a feature vector. In this section, we describe the feature extraction methodology for each physiological signal. We decomposed the EDA signal into its phasic and tonic components using the convex optimization-based EDAcvx tool [45], which also removes noise from the EDA signal. We extracted seven features from each of the phasic and tonic components. The following EDA features were selected from the literature [13,26].
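As an illustration of the kind of statistics computed over a decomposed component, the sketch below derives a few simple features from an already-extracted phasic signal. The decomposition itself is performed by the EDAcvx tool and is not reproduced here, and the peak-height threshold is a hypothetical value, not taken from the paper.

```python
# Illustrative statistics over an already-decomposed phasic EDA
# component; the decomposition itself (EDAcvx) is not reproduced,
# and `peak_height` is a hypothetical threshold, not from the paper.
import numpy as np
from scipy.signal import find_peaks

def phasic_features(phasic, peak_height=0.01):
    phasic = np.asarray(phasic, dtype=float)
    peaks, _ = find_peaks(phasic, height=peak_height)
    return {
        "num_peaks": int(len(peaks)),        # SCR-like peak count
        "mean": float(phasic.mean()),
        "std": float(phasic.std()),
        "max": float(phasic.max()),
    }
```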


We extracted 13 heart rate variability (HRV) time and frequency domain features from the RR intervals. These features were commonly used in previous works [11,26,43,46,47]. To compute the frequency domain features, we re-sampled the RR intervals at 4 Hz [48] and applied the fast Fourier transform (FFT). The computed HRV features are listed below:
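A minimal sketch of computing a few common time- and frequency-domain HRV features from RR intervals, following the 4 Hz re-sampling step described above. We use a Welch (FFT-based) spectral estimate here for brevity, and the specific feature subset is illustrative rather than the paper's exact 13 features.

```python
# A minimal sketch (not the paper's exact 13-feature set) of HRV
# feature extraction, with the RR series re-sampled at 4 Hz before
# an FFT-based (Welch) spectral estimate, as described in the text.
import numpy as np
from scipy.interpolate import interp1d
from scipy.signal import welch
from scipy.integrate import trapezoid

def hrv_features(rr_ms, fs=4.0):
    rr = np.asarray(rr_ms, dtype=float)
    diff = np.diff(rr)
    feats = {
        "mean_rr": float(rr.mean()),              # mean RR interval (ms)
        "sdnn": float(rr.std(ddof=1)),            # overall variability
        "rmssd": float(np.sqrt(np.mean(diff ** 2))),
        "pnn50": float(np.mean(np.abs(diff) > 50) * 100),
    }
    # Re-sample the irregularly spaced RR series onto a uniform 4 Hz grid.
    t = np.cumsum(rr) / 1000.0                    # beat times (s)
    t_even = np.arange(t[0], t[-1], 1.0 / fs)
    rr_even = interp1d(t, rr, kind="cubic")(t_even)
    f, pxx = welch(rr_even, fs=fs, nperseg=min(256, len(rr_even)))
    lf = (f >= 0.04) & (f < 0.15)                 # low-frequency band
    hf = (f >= 0.15) & (f < 0.40)                 # high-frequency band
    feats["lf_power"] = float(trapezoid(pxx[lf], f[lf]))
    feats["hf_power"] = float(trapezoid(pxx[hf], f[hf]))
    return feats
```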


#### 3.3.3. Feature Selection

We applied correlation-based feature selection (CBF) [49] using the Weka tool [50]. The importance of the features is shown in Figure 4. Since our goal was to develop a stress detection model that works in daily life settings, we conducted experiments with the best 1 to 20 features for the DDSR model. We achieved the best results with ten features for HRV + EDA, five features for HRV, and five features for EDA. The classifiers were then applied to the selected features.
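As a simplified illustration of correlation-based ranking, the sketch below orders features by their absolute Pearson correlation with the class label. Note that Weka's CFS evaluator additionally penalizes redundancy between features, which this toy version omits.

```python
# Toy illustration of correlation-based ranking: order features by
# |Pearson r| with the class label. Weka's CFS evaluator additionally
# penalizes redundancy between features, which is omitted here.
import numpy as np

def rank_by_correlation(X, y):
    X = np.asarray(X, dtype=float)
    y = np.asarray(y, dtype=float)
    scores = [abs(np.corrcoef(X[:, j], y)[0, 1]) for j in range(X.shape[1])]
    return np.argsort(scores)[::-1]  # indices of features, best first
```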

**Figure 4.** Features listed in order of importance based on correlation-based feature selection for the DDSR model. The EDA peaks feature has the highest importance, whereas the EDA strong peak feature has the lowest.

#### 3.3.4. Preparation of the Data for ML Algorithms

Although we had evenly distributed data in terms of known context labels in the laboratory, we had a class imbalance problem in the self-reported ground truth labels obtained from the daily life and laboratory sessions. In the laboratory, 71.5% of the data was labeled relaxed and 28.5% stressed. Similarly, 73% of the daily life data was labeled relaxed and the remaining 27% stressed. We overcame this problem by randomly undersampling the extra samples of the majority class, the most commonly used procedure for imbalanced datasets [51]. We further applied normalization to the features to prevent overfitting. Lastly, we converted the numeric labels to the nominal type as input to the Weka toolkit classification algorithms.
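Random undersampling of the majority class can be sketched as follows; this is a minimal illustration, not the exact procedure used with Weka, and the function name and seed handling are our choices.

```python
# A minimal sketch of random undersampling: majority-class samples are
# randomly dropped until every class has as many samples as the
# smallest one. The function name and seed handling are our choices.
import numpy as np

def undersample(X, y, seed=0):
    rng = np.random.default_rng(seed)
    X = np.asarray(X, dtype=float)
    y = np.asarray(y)
    classes, counts = np.unique(y, return_counts=True)
    n_min = counts.min()
    keep = np.concatenate([
        rng.choice(np.where(y == c)[0], size=n_min, replace=False)
        for c in classes
    ])
    keep.sort()  # preserve the original sample order
    return X[keep], y[keep]
```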

#### 3.3.5. Forming ML Models

We developed five different models that used different ground truth types and training-test environment combinations (see Figure 5). The data were divided into two minute windows in the preprocessing step. The label of each window, either the stressor level of the session or the perceived stress level obtained from the participant, was determined from its timestamp. Features extracted from all windows belonging to a session were averaged, and the label was appended to this feature vector. Data coming from all participants were merged and randomly shuffled. We then formed one general model from our dataset; in other words, separate models for individual participants were not formed. To evaluate the performance of the classifiers, we applied 10-fold cross-validation: the original sample was partitioned into ten subsets, the model was trained on nine and tested on the remaining one, and the procedure was repeated until each partition had been used as the test set. We further created separate training (80%) and test (20%) sets and evaluated the results to compare with the 10-fold cross-validation results. In the 10-fold results, the standard deviations of the folds are provided in parentheses.
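The 10-fold cross-validation procedure described above can be sketched as follows, with fold-wise accuracies aggregated into a mean and standard deviation; the `fit` and `predict` callables stand in for any of the classifiers used.

```python
# A sketch of 10-fold cross-validation as described above: shuffle,
# split into ten folds, train on nine, test on the held-out fold, and
# report the mean accuracy with the standard deviation across folds.
# `fit` and `predict` stand in for any of the classifiers used.
import numpy as np

def ten_fold_accuracy(X, y, fit, predict, k=10, seed=0):
    X = np.asarray(X)
    y = np.asarray(y)
    rng = np.random.default_rng(seed)
    folds = np.array_split(rng.permutation(len(y)), k)
    accs = []
    for i in range(k):
        test = folds[i]
        train = np.concatenate([folds[j] for j in range(k) if j != i])
        model = fit(X[train], y[train])
        accs.append(np.mean(predict(model, X[test]) == y[test]))
    return float(np.mean(accs)), float(np.std(accs))
```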

**Figure 5.** We developed five different stress level classification models with varying ground truth labels and training-test environments. Red and green arrows indicate the ground truth type used. The incoming black arrow shows the training environment, and the outgoing black arrow shows the testing environment.

#### 3.3.6. Classification Algorithms

We used five different machine learning classifiers for the recognition of stress events. These classifiers are among the most widely used and best performing in the literature [26,42]: multi-layer perceptron (MLP), random forest (RF), k-nearest neighbor (kNN), support vector machine (SVM), and logistic regression (LR). For LR, the output probability was divided into two classes, above and below 0.5 [52]. We used the Weka Machine Learning Software [53] for the classification stage of our proposed system.

To select the best parameter set for the MLP, we experimented with different numbers of hidden layers (1, 2, and 3) and units per layer (from 1 to 20); the best performing configuration had two hidden layers with five units each. For the kNN, we tuned the number of neighbors N and selected N = 3. For the SVM, we tried both the radial basis function (RBF) kernel and the linear kernel and selected the RBF kernel because it outperformed the linear one. For the RF, we set the number of trees to 100.
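Parameter tuning of the kind described for kNN can be sketched as a small validation-set search. The plain-NumPy classifier and the candidate list below are our own illustration, not the Weka implementation used in the paper.

```python
# Illustrative validation-set tuning of the kNN neighbour count; the
# plain-NumPy classifier and candidate list are our own sketch, not
# the Weka implementation used in the paper.
import numpy as np

def knn_predict(X_train, y_train, X_test, k):
    X_train = np.asarray(X_train, dtype=float)
    y_train = np.asarray(y_train)
    preds = []
    for x in np.asarray(X_test, dtype=float):
        d = np.linalg.norm(X_train - x, axis=1)   # Euclidean distances
        nearest = y_train[np.argsort(d)[:k]]      # k closest labels
        vals, counts = np.unique(nearest, return_counts=True)
        preds.append(vals[np.argmax(counts)])     # majority vote
    return np.array(preds)

def tune_k(X_train, y_train, X_val, y_val, candidates=(1, 3, 5, 7)):
    scores = {k: np.mean(knn_predict(X_train, y_train, X_val, k) == y_val)
              for k in candidates}
    return max(scores, key=scores.get)            # best-scoring k
```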

#### **4. Experimental Results and Discussion**

We divided our tests into two groups: laboratory experiments and in-the-wild experiments. We examine the results in the following two sections.

#### *4.1. Laboratory Experiments*

In this experiment, our aim was to differentiate between the stress and baseline states. In this section, we investigate the performance of our stress detection scheme in two different ways. The first was using the known context as the ground truth; these labels were provided as classes to the ML algorithms. The second was using the perceived stress levels collected from self-reports as the ground truth. To measure perceived stress levels, we administered the PSS-5. Among these five questions, the positive emotions (happy and cheerful) were scored inversely. In other words, if a participant reported six (extremely happy) on the 1–6 scale, it was scored as one, because happiness and cheerfulness are inversely related to stress levels. Anger, sadness, and frustration, on the other hand, were scored directly when calculating the score (see Equation (1)).

$$\text{PercStress} = (7 - H_i) + (7 - C_i) + A_i + S_i + F_i \tag{1}$$

where $H_i$ is the happiness score, $C_i$ the cheerfulness score, $A_i$ the anger score, $S_i$ the sadness score, and $F_i$ the frustration score. Individual scores can range from 5 to 30, with higher scores indicating higher perceived stress. Scores of 15 or below were considered low stress, and scores above 15 were considered high perceived stress. This division was made by adapting the three-class division of the PSS-14 [54]. The performance of our system is presented in Tables 2 and 3.
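Equation (1) can be checked with a small worked example; the reversal constant 7 follows from the 1–6 scale described above (a report of 6 maps to 1).

```python
def perc_stress(h, c, a, s, f):
    """Perceived stress per Equation (1): happiness (h) and
    cheerfulness (c) are reverse-scored on the 1-6 scale, while
    anger (a), sadness (s), and frustration (f) count directly."""
    return (7 - h) + (7 - c) + a + s + f

# All items at their calmest values give the minimum score,
# all at their most stressed values give the maximum.
print(perc_stress(6, 6, 1, 1, 1))  # minimum score: 5
print(perc_stress(1, 1, 6, 6, 6))  # maximum score: 30
```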

**Table 2.** Stress detection accuracies with different ML algorithms (two-class classification). The left side presents stress recognition results that use only self-reports as ground truth labels; the right side uses known context information as the ground truth label. LLKC stands for laboratory-to-laboratory known context, whereas LLSR stands for laboratory-to-laboratory self-report. Ten-fold cross-validation is used. Standard deviations are shown in parentheses. HRV, heart rate variability.


**Table 3.** Stress detection accuracies with different ML algorithms (two-class classification). The left side presents stress recognition results that use only self-reports as ground truth labels; the right side uses known context information as the ground truth label. Separate training (80%) and test (20%) sets are used. LLKC stands for laboratory-to-laboratory known context, whereas LLSR stands for laboratory-to-laboratory self-report.


We successfully differentiated stress from baseline states, as seen in Table 2 (with a maximum accuracy of 94.4%). Perceived stress level detection performance is always higher than known-context stress level detection because participants may experience stress levels different from those the context is expected to induce. For example, some participants may experience lower stress in the TSST while preparing the presentation, presenting in a foreign language, or performing the counting tasks. This shows that the choice of ground truth, known context labels or perceived stress labels, has a significant influence on the performance of the system. The combination of the two physiological signals always achieved higher accuracy than the weaker single modality, but it did not yield the maximum accuracy in all conditions. We could therefore state that multi-modality has an observable effect under some conditions. Overall, we successfully differentiated between baseline and stress states in the laboratory environment.

#### *4.2. Testing the Models in the Wild*

As mentioned, we had two types of class labels in the laboratory environment stress detection experiment: PSS-5 self-reports and the known context (the stressor level). In the wild, on the other hand, we had only the self-reports of individuals, which is among the reasons why daily life stress detection performance has been low [34]; there is still room for improvement in daily life stress detection research.

We used both types of laboratory ground truth labels separately when developing machine learning models trained in the laboratory and tested in daily life. The LDSR model was trained with the laboratory self-report labels, whereas the LDKC model was trained with the laboratory known context labels. We expected laboratory self-report based labeling to have higher performance than laboratory context-based labeling because we had only perceived stress labels (questionnaires) in the wild, which were more coherent with the laboratory self-reports. Furthermore, participants might experience different perceived stress than the known context implied, which reduced the performance of the LDKC models. We further developed the DDSR model, which was trained and tested solely with daily life data. We compare these three models in Tables 4–7. As expected, the LDSR model had the best performance, whereas the DDSR model had the lowest. The self-reports collected in the laboratory were more reliable because the environment was controlled. The noise coming from the daily life environment (i.e., forgetting stressful events in a session, unrestricted movements) decreased the performance of the DDSR models compared to the LDSR models, which had cleaner training data and labels. We could state that collecting laboratory data and training a stress level detection model with it improved stress level detection performance in the wild. Furthermore, choosing the self-report labels resulted in better performance than the known context labels. As far as the performance of the modalities is concerned, we achieved the best single-modality results with the HRV signal. In most of the daily life cases (15/20 tests), the combination of the signals achieved better results than either single modality alone. In the remaining tests, negative correlations between the selected best features from different modalities could have decreased the performance of the system when the modalities were combined. In daily life, RF and SVM achieved the best performances, which aligns with the recent literature [26]. With the EDA signal in particular, RF always outperformed the other classifiers in daily life tests. Lastly, when the contribution of the features was examined (see Figure 4), the best ten features included five EDA and five HRV features, which also showed that the combination of these modalities was important.


**Table 4.** The classification accuracy using the combination of the two modalities and each single modality for the DDSR (daily-to-daily self-report) model is provided. Ten-fold cross-validation is used. Standard deviations are shown in parentheses.


**Table 5.** The classification accuracy using the combination of the two modalities and each single modality for the DDSR (daily-to-daily self-report) model is provided. Separate training (80%) and test (20%) sets are used.

**Table 6.** The classification accuracy using the combination of the two modalities and each single modality for the LDKC (lab-to-daily known context) model is provided.


**Table 7.** The classification accuracy using the combination of the two modalities and each single modality for the LDSR (lab-to-daily self-report) model is provided.

