**5. Conclusions**

Since stress detection systems achieve lower accuracies in the wild than in laboratory environments, new techniques are needed to improve their performance. In this study, we examined the effect of developing ML models in different environments and with varying ground truth labels. To the best of our knowledge, this was the first work to examine all possible combinations of perceived stress measurements for daily life and laboratory settings along with different ground truth labels. We used EDA, HRV, ST, and ACC signals (ST and ACC were used for artifact detection and removal) in our unobtrusive stress detection system. We first trained and tested our system in the laboratory environment. We obtained a maximum of 94.4% accuracy with HR, 86.70% with EDA, and 92.30% with HRV + EDA, which showed that our system detected the stress levels in the laboratory successfully. These results were aligned with the literature [14]. Using self-reports as the ground truth while training the ML model always achieved higher accuracies than using the known context labels, which could be explained by the fact that stressor levels (i.e., known context) might not represent the perceived stress of participants.

We then moved to daily life environments and evaluated the DDSR model, which achieved 68.30% accuracy with HRV, 63.60% accuracy with EDA, and 71.40% accuracy with multimodal HRV + EDA. Then, we applied the model trained in the laboratory with self-reports and showed that the performance increased with all types of physiological signals for daily life stress recognition (7% increase for HRV, 14.8% increase for EDA, 2.8% increase for HRV + EDA). We also observed that the mean accuracy was enhanced with the LDKC model for the HRV + EDA- and EDA-based stress recognition frameworks. The classification performance of the proposed system changed significantly based on the event labeling methodology. Models trained in the laboratory for daily life stress detection outperformed the ones trained with daily life data. We achieved the best results (73.81%) in the LDSR model, where the daily life stress detection system was trained in the laboratory environment with self-report labels. This demonstrated that we could increase the accuracy of the system by training the model in the laboratory with the same kind of ground truth, since we also had self-reports as the ground truth in the wild. We also showed that multimodal sensing provided a more robust framework for all types of session labeling approaches. The performance of the LDKC model was better than that of the DDSR model and worse than that of the LDSR model in the multimodal framework. On the other hand, the accuracies of the LDKC model were lower than those of both the LDSR and DDSR models when only a single modality was used. We could infer that the DDSR method suffered from relatively noisy training labels and data when compared to LDSR, where training data were obtained from the controlled laboratory environment.
Using different types of labels in training and testing (known context labels in the laboratory for training and self-reports in the wild for testing) might be responsible for the low performance of LDKC models. RF and SVM classifiers outperformed other classifiers, and these results were aligned with the daily life stress recognition studies mentioned in the Related Work Section. Feature and modality selection is vital for achieving better performance. We selected the best ten features; five of them were from EDA, and five were from HRV, which also suggested that a multimodal approach was crucial for daily life stress detection. In most of the test cases (15/20), the combination of modalities increased the performance of the system. In the remaining tests, anticorrelations between the features of different modalities might be the cause of lower accuracies. There is still room for improvement in daily life stress recognition. Moreover, our study was not without limitations. To generalize the conclusions, additional studies based on larger heterogeneous sample groups are needed. As future work, we plan to develop personalized perceived stress models to overcome the subjectivity problem of self-reports. We will try to exploit baseline surveys and daily session-based questionnaires of individuals to prevent the bias caused by subjective self-reports.
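
The three training and testing schemes compared in this section can be illustrated with a minimal sketch. It assumes, based on how the schemes are described here, that LDSR trains on laboratory data with self-report labels, LDKC trains on laboratory data with known context labels, and DDSR trains on daily life data with self-report labels, and that all three are tested on daily life self-reports. The `Session` type, the function names, and the toy feature values are hypothetical, and a nearest-centroid classifier stands in for the RF and SVM classifiers used in the study so that the example stays dependency-free.

```python
from dataclasses import dataclass
from typing import Dict, List, Optional, Tuple


@dataclass
class Session:
    """One recording session (hypothetical structure, not the study's format)."""
    features: List[float]                 # e.g., a few HRV and EDA features
    self_report: int                      # perceived stress label (0 = low, 1 = high)
    known_context: Optional[int] = None   # stressor label, available only in the lab


Sample = Tuple[List[float], int]


def make_split(scheme: str, lab: List[Session], daily_train: List[Session],
               daily_test: List[Session]) -> Tuple[List[Sample], List[Sample]]:
    """Pick training data and labels per scheme; always test on daily self-reports."""
    if scheme == "LDSR":      # lab data, self-report labels
        train = [(s.features, s.self_report) for s in lab]
    elif scheme == "LDKC":    # lab data, known context labels
        train = [(s.features, s.known_context) for s in lab]
    elif scheme == "DDSR":    # daily life data, self-report labels
        train = [(s.features, s.self_report) for s in daily_train]
    else:
        raise ValueError(f"unknown scheme: {scheme}")
    test = [(s.features, s.self_report) for s in daily_test]
    return train, test


def fit_centroids(samples: List[Sample]) -> Dict[int, List[float]]:
    """Stand-in classifier: store the mean feature vector of each class."""
    sums: Dict[int, List[float]] = {}
    counts: Dict[int, int] = {}
    for x, label in samples:
        acc = sums.setdefault(label, [0.0] * len(x))
        for i, value in enumerate(x):
            acc[i] += value
        counts[label] = counts.get(label, 0) + 1
    return {label: [v / counts[label] for v in acc] for label, acc in sums.items()}


def predict(centroids: Dict[int, List[float]], x: List[float]) -> int:
    """Return the label whose centroid is closest in squared Euclidean distance."""
    return min(centroids,
               key=lambda label: sum((a - b) ** 2 for a, b in zip(x, centroids[label])))


def evaluate(scheme: str, lab: List[Session], daily_train: List[Session],
             daily_test: List[Session]) -> float:
    """Accuracy of the stand-in classifier under the given scheme."""
    train, test = make_split(scheme, lab, daily_train, daily_test)
    centroids = fit_centroids(train)
    hits = sum(predict(centroids, x) == y for x, y in test)
    return hits / len(test)


# Invented toy data. The second lab session has a stressor present
# (known_context=1) although the participant reported low stress.
LAB = [
    Session([0.1, 0.2], self_report=0, known_context=0),
    Session([0.2, 0.1], self_report=0, known_context=1),
    Session([0.9, 0.8], self_report=1, known_context=1),
    Session([0.8, 0.9], self_report=1, known_context=1),
]
DAILY_TRAIN = [Session([0.3, 0.3], 0), Session([0.7, 0.6], 1)]
DAILY_TEST = [Session([0.15, 0.25], 0), Session([0.85, 0.75], 1), Session([0.4, 0.4], 0)]

for name in ("DDSR", "LDSR", "LDKC"):
    print(name, evaluate(name, LAB, DAILY_TRAIN, DAILY_TEST))
```

The toy values are invented and do not reproduce the accuracies reported above. They only show how a mismatch between the training label type (known context) and the test label type (self-report) can shift the decision boundary, which is one way to read the lower single-modality LDKC results.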

**Author Contributions:** Y.S.C. was the main editor of this work and made major contributions to the data collection, analysis, and manuscript writing. D.G. and D.R.K. made valuable contributions to both data collection and manuscript writing. They designed the experiment and contributed to the related sections regarding data collection. D.E. and N.C. contributed equally to this work in the design, implementation, data analysis, and writing the manuscript. C.E. provided invaluable feedback and technical guidance to interpret the design and the detail of the field study. He also performed comprehensive critical editing to increase the overall quality of the manuscript. All authors have read and agreed to the published version of the manuscript.

**Funding:** This work is supported by AffecTech: Personal Technologies for Affective Health, Innovative Training Network funded by the H2020 People Programme under Marie Skłodowska-Curie Grant Agreement No. 722022, and by the Turkish Directorate of Strategy and Budget under the TAM Project Number DPT2007K120610.

**Conflicts of Interest:** The authors declare no conflict of interest.
