**4. Experiments**

Experiments were conducted using both classification and regression models, and user-independent and personal models were compared. The accuracies presented in this section are balanced accuracies: accuracy is calculated separately for each class, and the reported balanced accuracy is the mean of the two. Balanced accuracy was selected as the performance metric because it is less vulnerable to unbalanced data than plain accuracy.
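As a concrete illustration (a minimal sketch of the metric described above, not the authors' code), balanced accuracy can be computed as the mean of the per-class accuracies:

```python
import numpy as np

def balanced_accuracy(y_true, y_pred):
    """Mean of per-class accuracies (recalls): accuracy is computed
    separately for each class and the results are averaged."""
    y_true, y_pred = np.asarray(y_true), np.asarray(y_pred)
    per_class = [np.mean(y_pred[y_true == c] == c) for c in np.unique(y_true)]
    return float(np.mean(per_class))

# Unbalanced toy data: 8 non-stress (0) samples vs. 2 stress (1) samples.
y_true = [0, 0, 0, 0, 0, 0, 0, 0, 1, 1]
y_pred = [0, 0, 0, 0, 0, 0, 0, 0, 1, 0]   # misses one stress sample
# Plain accuracy is 0.9, but balanced accuracy is (1.0 + 0.5) / 2 = 0.75,
# making the missed stress sample visible despite the class imbalance.
```

This illustrates why the metric was chosen: with unbalanced classes, plain accuracy can look high even when the minority class is recognized poorly.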

#### *4.1. User-Independent Stress Detection*

To compare regression and classification models, a binary Random Forest classifier was trained and its results were compared to those of a Bagged Tree based ensemble regression model. Both were trained using the leave-one-subject-out method, meaning that each person's data were used in turn for testing while the data from the other persons were used for training. For the classification model, the target values were binarized so that data from the baseline were labelled as 0 and data from driving as 1, whereas the regression model was trained using the continuous target values. As the outputs of the regression model are continuous, they were transformed into binary predictions by finding, for each person, the optimal threshold dividing the outputs into stress and non-stress. This was done by analyzing the obtained continuous predictions subject-wise and searching for the stress level value that splits the predictions into stressed and non-stressed such that the balanced accuracy is as high as possible. This of course over-estimates the performance of the regression model, as the threshold is calculated based on personal data, but it still shows the potential of regression-model-based stress detection.
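The subject-wise threshold search described above could be sketched as follows (a minimal illustration with hypothetical prediction values, not the authors' code): each candidate threshold binarizes the continuous predictions, and the threshold maximizing balanced accuracy is kept.

```python
import numpy as np

def best_threshold(y_true_bin, y_pred_cont):
    """Find the cut-off on continuous regression outputs that maximizes
    balanced accuracy when outputs are binarized into stress/non-stress.
    Note: computing this on the test subject's own labels over-estimates
    performance, as discussed in the text."""
    best_t, best_ba = None, -1.0
    for t in np.unique(y_pred_cont):
        y_hat = (y_pred_cont >= t).astype(int)
        recalls = [np.mean(y_hat[y_true_bin == c] == c) for c in (0, 1)]
        ba = float(np.mean(recalls))
        if ba > best_ba:
            best_t, best_ba = float(t), ba
    return best_t, best_ba

# Hypothetical continuous predictions for one held-out subject.
y_true = np.array([0, 0, 0, 0, 1, 1, 1, 1])
y_pred = np.array([0.1, 0.2, 0.3, 0.6, 0.4, 0.7, 0.8, 0.9])
t, ba = best_threshold(y_true, y_pred)   # t = 0.4, balanced accuracy = 0.875
```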

The user-independent recognition rates using the regression and classification algorithms and different sensor combinations are presented in Table 2. Using the leave-one-subject-out cross-validation method, a separate recognition model was trained for each study subject, and the performance of each model was measured using balanced accuracy, sensitivity and specificity. The values presented in Table 2 show the mean balanced accuracies, sensitivities and specificities calculated over all nine study subjects, with the standard deviation between the study subjects given in parentheses.
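The evaluation loop can be sketched with scikit-learn's `LeaveOneGroupOut` splitter; the data below are synthetic stand-ins, and the model hyperparameters and features are placeholders rather than the paper's actual setup.

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import balanced_accuracy_score
from sklearn.model_selection import LeaveOneGroupOut

rng = np.random.default_rng(0)
n_subjects, n_windows, n_features = 9, 40, 12         # stand-in sizes
X = rng.normal(size=(n_subjects * n_windows, n_features))
y = rng.integers(0, 2, size=n_subjects * n_windows)   # 0 = baseline, 1 = driving
groups = np.repeat(np.arange(n_subjects), n_windows)  # subject id per window

# One model per held-out subject; report mean and std over subjects.
scores = []
for train_idx, test_idx in LeaveOneGroupOut().split(X, y, groups):
    clf = RandomForestClassifier(random_state=0).fit(X[train_idx], y[train_idx])
    scores.append(balanced_accuracy_score(y[test_idx], clf.predict(X[test_idx])))

print(f"balanced accuracy: {np.mean(scores):.3f} ({np.std(scores):.3f})")
```

With random features the scores hover around chance level; the point is the grouping structure: no window from the test subject ever appears in the training set.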


**Table 2.** Average recognition accuracies, sensitivities and specificities (standard deviation in parentheses) using the regression and classification models and different sensor combinations.

Interestingly, according to Table 2, in both cases (regression and classification) the best recognition rates are obtained using only accelerometer features. This is not in line with previous stress detection studies, where it has been noted that features extracted from biosignals are more useful for detecting stress than accelerometer features (see, for instance, [19]). However, the reason is that, in this case, the accelerometer-based models do not recognize stress at all; instead, they recognize the mode of transportation, as the baseline signal and the stress signal are recorded during different activities (sitting in a non-moving car vs. driving a car). In fact, it has been shown that these two activities can be recognized based on accelerometer data [20]. Therefore, as the aim of the study is to recognize stress, only models based on biosignals (EDA, BVP, and ST) are worth studying.

Table 2 shows the potential of regression models. The regression model outperforms the classification algorithm with every sensor combination, and the highest recognition rates are obtained using BVP and ST features. With this setting, the average balanced accuracy was 74.1% with the classification model and 82.3% with the regression model. In addition, the sensitivity and specificity values are the highest with this combination. The confusion matrix for the regression model using these features is shown in Table 3, where it can be seen that both classes are detected with reasonable accuracy. Moreover, according to Table 4, where the user-independent accuracies based on BVP and ST features are studied subject-wise, the regression model performs better than the classification model for all nine study subjects. Partly, the strong performance of the regression model is due to optimizing the threshold for classifying outputs as stressed/non-stressed based on personal data, but this is not the only reason. Another reason is that the regression model gets more information as input than the classification model: the classification model is trained with binary class labels, while the regression model is trained with continuous targets. These results show that stress detection benefits from continuous targets and is by nature a regression problem rather than a classification problem. This was expected, as a person's stress level can be high, low, or anything in between.

**Table 3.** Confusion matrix, when user-independent regression model based on BVP and ST features is used for stress detection.


**Table 4.** Subject-wise balanced accuracy using user-independent model and a combination of BVP and ST features.


While regression models outperform classification models in stress detection, the real benefit of regression models is that, unlike classification models, they can be used to estimate the level of stress. Figure 1 shows how well the regression models estimate the level of stress compared to the user-reported stress level values. The figure shows that in most cases the regression model has approximately managed to predict the level of stress. However, this approximation is not very accurate, and it is inconsistent. This suggests that the trained regression models are very sensitive to small changes in feature values, and due to this, the predicted stress level can change rapidly. Moreover, the approximation does not work at all for some study subjects (see study subjects MT, Figure 1d, and SJ, Figure 1i). In the case of MT, the model has not managed to recognize the non-stressed stage, and in the case of SJ, the recognition of the stressed stage has caused problems. These exceptions suggest that stress does not cause similar reactions in every person. Moreover, Figure 1 shows the personal threshold used to divide the regression model outputs into stressed and non-stressed (black horizontal line). In most cases, the value of this threshold is between 0.3 and 0.5. However, MT and SJ are again exceptions: in the case of MT, the threshold is close to 1, and in the case of SJ, close to 0. This underlines how exceptional these two cases are compared to the other seven study subjects.

Table 5 shows the root mean square error (RMSE) between the predicted and user-reported values for each study subject. As the stress level of the baseline was not measured during the data gathering, it was assumed that during the baseline measurement the level of stress was constant and zero. This is of course only a best guess and cannot be considered a fact; indeed, the predictions shown in Figure 1 suggest that the level of stress is neither constant nor zero during the baseline. Due to this, the RMSE values in Table 5 are calculated from two parts of the signal: the second column shows the RMSE for the whole signal, and the third column shows the RMSE calculated only from the driving part, as only it contains target values defined by the driver. For instance, in the case of subject MT, there is a big difference between the two. Moreover, the R-squared value was calculated for each subject, and these are also shown in Table 5. They too underline the fact that the prediction does not work for every subject: the R-squared value is zero for three study subjects, indicating that for these persons the user-independent model is not able to predict any variation in the response. Overall, the results of Table 5 show the potential of predicting the level of stress using regression models. However, they also show that at this point of the research, the prediction only works for some study subjects and can only give a rough estimate of the stress value; it needs to be further studied how the exactness of the prediction could be improved.
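For reference, the two error measures used in Table 5 can be computed as follows. This is a sketch; whether negative R-squared values are clipped to zero is our assumption, made because zero is reported for subjects with no explained variance.

```python
import numpy as np

def rmse(y_true, y_pred):
    """Root mean square error between predicted and reported stress levels."""
    y_true, y_pred = np.asarray(y_true, float), np.asarray(y_pred, float)
    return float(np.sqrt(np.mean((y_true - y_pred) ** 2)))

def r_squared(y_true, y_pred):
    """Coefficient of determination, clipped at 0 (assumption: a reported
    value of zero means the model explains no variance in the response)."""
    y_true, y_pred = np.asarray(y_true, float), np.asarray(y_pred, float)
    ss_res = np.sum((y_true - y_pred) ** 2)
    ss_tot = np.sum((y_true - y_true.mean()) ** 2)
    return float(max(0.0, 1.0 - ss_res / ss_tot))

# A constant prediction of 0.5 against a 0/1 target explains no variance:
# rmse = 0.5, r_squared = 0.0.
```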

**Figure 1.** Predicted stress level (blue line) vs. the subject-estimated continuous stress level target (orange line). The personal threshold used to divide outputs into stressed and non-stressed is shown as a black horizontal line.

**Table 5.** Subject-wise RMSE and R-Squared using user-independent model and a combination of BVP and ST features.


The combination of BVP and ST features produced, on average, the highest recognition rates using the user-independent regression model. However, this combination was not the best for every subject. Table 6 shows, subject-wise, which sensor combination produces the best result. It can be seen that for most subjects BVP + ST produces the best results, but there are also exceptions. For instance, for subject EK, a different selection of sensors improves the recognition rate by over 20 percentage points (59.4% vs. 83.2%, respectively). The effect on the standard deviation is even bigger: it is much smaller when features are selected subject-wise (17.0 vs. 11.7, respectively). Therefore, the possibility to select features subject-wise, and in this way personalize the recognition model, could have a significant positive effect on the recognition rates.
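Subject-wise selection of the sensor combination could be sketched as below. The accuracy table is hypothetical, apart from subject EK's 59.4 (BVP + ST) and 83.2 (best) values, which come from the text; the text does not specify which combination gives EK's best score, so its placement here is illustrative.

```python
import numpy as np

combos = ["EDA", "BVP", "ST", "BVP+ST", "EDA+BVP+ST"]
# Balanced accuracies per subject and combination (illustrative values;
# only EK's 59.4 for BVP+ST and its 83.2 best score come from the text).
scores = {
    "EK": [55.0, 58.2, 57.1, 59.4, 83.2],
    "NM": [70.3, 80.1, 75.6, 84.6, 79.0],
    "RY": [88.0, 92.5, 85.3, 93.7, 90.1],
}

# Global choice: one combination (BVP+ST) for every subject.
global_mean = np.mean([s[combos.index("BVP+ST")] for s in scores.values()])
# Personalized choice: the best combination per subject.
per_subject_mean = np.mean([max(s) for s in scores.values()])

for subject, s in scores.items():
    print(f"{subject}: best combo = {combos[int(np.argmax(s))]} ({max(s):.1f}%)")
```

Even in this toy table, choosing the combination per subject raises the mean accuracy and, as the text notes, mainly reduces the spread between subjects.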


**Table 6.** Best subject-wise recognition rates obtained using user-independent regression model, and which sensor combination was used to obtain this result.

#### *4.2. Personal Stress Detection*

The dataset contains more than one data gathering session from three study subjects: three sessions from subject NM (named NM1, NM2 and NM3), and two each from RY (RY1 and RY2) and GM (GM1 and GM2). Therefore, these three study subjects can be used to test whether stress recognition models trained using personal data are more accurate than user-independent models, as has been suggested in some papers. The data gathering protocol was the same in each of these sessions, and in each session the study subject drove the same route. However, each session is unique in that the traffic differed from session to session, as the study subjects drove on public roads.

User-dependent recognition rates for NM, RY and GM are shown in Table 7 using different sensor combinations and both classification and regression models. In each case, the models were trained with data from one session and tested with data from another session from the same user. NM1, RY1 and GM1 were used for testing, as these were also used in the previous section to train and test the user-independent models. According to Table 4, using a user-independent model and a combination of BVP and ST features, which based on Table 2 provide the highest average accuracy, subject NM's stress can be detected with a balanced accuracy of 64.3%/84.6% (classification vs. regression), RY's with 90.8%/93.7%, and GM's with 69.9%/95.0%. When these are compared to the personal recognition rates based on the same features, it can be noted that personal training data do not have a big effect on the recognition rates. In fact, in the case of RY, the results based on the personal model are much worse than the results based on the user-independent model. Moreover, according to Table 7, the results using personal models are surprisingly poor no matter which sensor combination is used. There are some exceptions, though: RY's results are very good when using BVP features, and GM's stress can be recognized with high accuracy using a combination of BVP and ST features. However, the results for these users were already very good with the user-independent model, and the use of personal data does not have a big effect compared to those.

There can be several reasons for the low personal recognition rates. For instance, in the case of data from subject GM, there seem to be problems with the EDA signal, and thus predictions based on it always lead to very poor results. Another reason can be that the training data were not large enough: in the personal case, the training data consist of measurements from one session, whereas in the user-independent case they consist of measurements from eight sessions. In fact, when the model for session NM1 was trained using data from two personal data gathering sessions (NM2 and NM3), the results were much better than those reported in Table 7, and over 98% balanced accuracy was obtained using BVP and ST features, see Table 8. However, also in this case the recognition rate is highly dependent on which data are used for training and which for validation: when NM3 is used for testing, the recognition rates are very poor no matter which sensor combination is used. This shows that there is a lot of variation between data gathering sessions, even when the data are collected from the same person.
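The cross-session evaluation in Tables 7 and 8 could be sketched as follows; the data are synthetic stand-ins, and the model and feature dimensions are placeholders rather than the authors' exact setup.

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import balanced_accuracy_score

rng = np.random.default_rng(1)
# Hypothetical per-session feature matrices and labels for subject NM.
sessions = {name: (rng.normal(size=(60, 12)), rng.integers(0, 2, size=60))
            for name in ("NM1", "NM2", "NM3")}

def cross_session(train_names, test_name):
    """Train a personal model on some sessions, test on another one."""
    X_tr = np.vstack([sessions[n][0] for n in train_names])
    y_tr = np.concatenate([sessions[n][1] for n in train_names])
    X_te, y_te = sessions[test_name]
    clf = RandomForestClassifier(random_state=0).fit(X_tr, y_tr)
    return balanced_accuracy_score(y_te, clf.predict(X_te))

# One training session (as in Table 7) vs. two (as in Table 8).
score_one = cross_session(["NM2"], "NM1")
score_two = cross_session(["NM2", "NM3"], "NM1")
```

The key point is that training and test data never come from the same session, so session-to-session variation in the biosignals directly affects the reported rates.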


**Table 7.** Recognition rates of users NM, RY, and GM using models trained using personal data.

**Table 8.** Cross-validation of three datasets collected from study subject NM.


#### **5. Discussion and Conclusions**

In this article, regression and classification models were compared for stress detection. Both personal and user-independent models were experimented with. The article was based on a publicly available dataset [14], which, unlike most other stress detection datasets, contains continuous target variables. The classification model used was Random Forest, and the regression model was a Bagged Tree based ensemble.

The study shows that regression models are superior to classification models for user-independent stress detection, if continuous target values are available. The best results were obtained using a combination of BVP and ST features, with which the average balanced accuracy was 74.1% with the classification model and 82.3% with the regression model. In fact, essentially no matter which sensor combination was used, the results using regression models were better than those obtained using classification models. This is because, during the training process, the regression model gets more information than the classification model: the regression model gets continuous targets as input, while the classification model's inputs are discrete. Moreover, the results show that stress is not a binary problem; a person can be highly stressed, not stressed at all, or anything between these two. Therefore, the results of this article show that the prediction model should be trained so that all the available information can be given as input to it.

The main advantage of using regression models is that they can not only be used to recognize whether a person is stressed or not, but also to predict the level of stress. However, based on our experiments, this prediction is not very accurate, and for some study subjects it does not work at all. The main goal of future work is therefore to study how the quality of this prediction could be improved. To this end, different regression models need to be compared to find out which is the most capable of predicting the level of stress. This also includes a comparison of different quantitative indicators for measuring the goodness of the prediction. Unfortunately, continuous target variables are rarely available in existing datasets, which is why regression models are not often used for stress detection, and why stress detection based on classification models remains an important topic to study in the future.

Stress detection based on personal training data was also studied using data from three persons. The results obtained using personal models were surprisingly poor. The situation was better when more personal data were used for training, but there was still a lot of variation in the recognition rates depending on how the data gathering sessions were divided into training and testing. This shows that biosignals vary a lot not only between study subjects but also between sessions gathered from the same person. Therefore, personal recognition models should be able to constantly adapt to changes in the human body; for instance, incremental learning-based methods should be experimented with [21]. On the other hand, it was shown that subject-wise feature selection for the user-independent model can be a better approach than the use of personal training data to personalize and improve recognition models. With subject-wise feature selection, the average detection rate improved by 4 percentage points, and it was especially useful for reducing the variance in the recognition rates between the study subjects.

Moreover, it can be noted that the recognition rates obtained in this article are not as high as those obtained in several other stress detection studies using the same features ([7,19]). The reason for this may be the data gathering protocol used. In many other studies, stress data are collected in contexts (giving a public speech, performing difficult arithmetic tasks, etc.) that are most likely more stressful than driving a car. Therefore, the dataset used in this study does not make as big a difference between the relaxed and stressed states as other datasets, making it more difficult to analyze. It is also possible that the type of stress is different in these contexts. Part of the future work is to run experiments using different datasets. To extend this study, these datasets should include data from new types of sensors (for instance, electroencephalography (EEG)) to show that the proposed method is not dependent on the sensor used. The expected result is that the proposed method works similarly to this study if appropriate features are extracted from the sensor in use; the extracted features need to be selected sensor-wise, and thus the same features should not be extracted from every sensor. In addition, the experimental datasets should contain data not only on stress but also on other affective states, to show that the proposed method is not dependent on the studied affective state. The final aim is to detect multiple affective states at the same time; for example, a person can be stressed and angry simultaneously. This could be done, for instance, by training a separate regression model for each affective state.

Lastly, when dealing with stress and affective state data labeled by the study subjects themselves, it should be noted that study subjects do not necessarily know what they feel [22]. Therefore, the target variables obtained from the study subjects can be unreliable. Due to this, while the preliminary results presented in this article are encouraging, more experiments with more datasets should be carried out to get a better understanding of how accurate the presented method actually is.

**Author Contributions:** Conceptualization, P.S.; Formal analysis, P.S.; Supervision, J.R.; Validation, P.S.; Writing–original draft, P.S.; Writing–review & editing, J.R. Both authors have read and agreed to the published version of the manuscript.

**Funding:** This research is supported by the Business Finland funding for the Reboot IoT Factory project (www.rebootiotfactory.fi).

**Acknowledgments:** The authors are thankful to Infotech Oulu. The authors would also like to thank Neska El Haouij for collecting a great dataset and for helping us connect the right observations to the right target values.

**Conflicts of Interest:** The authors declare no conflict of interest.
