*4.2. Results*

According to the questionnaire, all participants reported feeling stressed (*M* = 5.36; *SD* = 0.92) during the mental arithmetic task and relaxed (*M* = 1.72; *SD* = 0.64) during the relaxation task (see Figure 6). A paired *t*-test (*t* = 11.74; *p* < 0.05) confirmed that participants were significantly more stressed during stress Task 5 than during relaxation Task 6. Participants also stated that they felt more energetic during stress Task 5 (*M* = 5.18; *SD* = 1.47) than during relaxation Task 6 (*M* = 3.45; *SD* = 2.01), a difference that a paired *t*-test confirmed as statistically significant (*t* = 2.55; *p* < 0.05). Additionally, participants rated the relaxation task (*M* = 6.36; *SD* = 0.67) as more pleasant than the stress task (*M* = 4.18; *SD* = 2.13), which a paired *t*-test also showed to be significant (*t* = 3.54; *p* < 0.05). Most crucially, in terms of Task Load (equally weighted), a paired comparison between the two tasks (*t* = 9.49; *p* < 0.05) indicated that stress Task 5 induced a significantly higher task load than relaxation Task 6.
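The paired comparisons reported above can be illustrated with a minimal sketch of a paired-samples *t* statistic. The function name `paired_t` and the ratings below are hypothetical and are not the study's data; the sketch only assumes that each participant contributes one rating per task.

```python
import math
import statistics


def paired_t(a, b):
    """Return (t, df) for a paired-samples t-test on two equal-length lists."""
    if len(a) != len(b) or len(a) < 2:
        raise ValueError("need two equal-length samples with at least 2 pairs")
    # Work on the per-participant differences between the two conditions.
    diffs = [x - y for x, y in zip(a, b)]
    n = len(diffs)
    mean_d = statistics.mean(diffs)
    sd_d = statistics.stdev(diffs)  # sample SD (n - 1 in the denominator)
    if sd_d == 0:
        raise ValueError("all differences are identical; t is undefined")
    t = mean_d / (sd_d / math.sqrt(n))
    return t, n - 1


# Hypothetical ratings for 11 participants (NOT the study's data).
stress_ratings = [6, 5, 6, 4, 5, 7, 5, 6, 5, 4, 6]
relax_ratings = [2, 1, 2, 2, 1, 3, 1, 2, 2, 1, 2]

t_stat, df = paired_t(stress_ratings, relax_ratings)
print(f"t({df}) = {t_stat:.2f}")
```

With SciPy available, `scipy.stats.ttest_rel` computes the same statistic and also returns the *p*-value.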


**Figure 6.** Results of generalisability. The graph of electrodermal activity and accuracy shows the averages of the 11 participants over non-overlapping windows of 10 s.

#### 4.2.1. Electrodermal Activity (EDA)

These subjective ratings are consistent with the physiological data, namely the average EDA response, which showed an overall positive slope during stress Task 5 and a negative slope during relaxation Task 6. A *t*-test comparing the trends of the two tasks confirmed that the average EDA slopes differed significantly (*t* = 6.9; *p* < 0.05).
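One way to obtain such a slope is to average the EDA signal over non-overlapping 10 s windows, as in Figure 6, and fit a least-squares line through the window means. The sketch below does this under stated assumptions: the 4 Hz sampling rate, the synthetic trace, and the helper names `window_means` and `ls_slope` are all hypothetical and are not taken from the study.

```python
def window_means(samples, samples_per_window):
    """Average samples over non-overlapping windows.

    A trailing partial window is dropped.
    """
    if samples_per_window < 1:
        raise ValueError("samples_per_window must be at least 1")
    n_windows = len(samples) // samples_per_window
    return [
        sum(samples[i * samples_per_window:(i + 1) * samples_per_window])
        / samples_per_window
        for i in range(n_windows)
    ]


def ls_slope(xs, ys):
    """Slope of the ordinary least-squares line through the points (xs, ys)."""
    n = len(xs)
    if n != len(ys) or n < 2:
        raise ValueError("need two equal-length sequences with at least 2 points")
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    if sxx == 0:
        raise ValueError("all x values are identical; slope is undefined")
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    return sxy / sxx


ASSUMED_RATE_HZ = 4  # assumption: the sampling rate is not stated in this section
WINDOW_S = 10        # window length used in Figure 6

# Synthetic rising trace in microsiemens, 60 s long (NOT the study's data).
eda = [2.0 + 0.001 * i for i in range(ASSUMED_RATE_HZ * 60)]

means = window_means(eda, ASSUMED_RATE_HZ * WINDOW_S)
times = [WINDOW_S * (k + 0.5) for k in range(len(means))]  # window centres in s
print(f"slope = {ls_slope(times, means):.4f} uS/s")
```

Computing one slope per participant and task in this way yields two paired lists of slopes, which can then be compared with a *t*-test.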

#### 4.2.2. Overall Model Validation

The features were extracted as described before and classified (using *LDA*) with seven previously built models. We chose four single-feature models based on the best performer for each observed characteristic: A1, B2, C3 and D1. The fifth model was a multi-feature model combining all four. The sixth model was generated from previous data by combining the most meaningful features of A1 and C3. The final model was a combination of all computed features. Again, we windowed the accuracy of the models over 10 s to observe any periodic trends (see Figure 6).

A summary of the overall accuracy rates is shown in Table 2. A one-way ANOVA for correlated samples (*F*(6, 339) = 12.75, *p* < 0.05) showed significant differences between the accuracies of the models. Tukey's post-hoc analysis revealed that the significant difference was due to the low mean accuracy of model D1. Since model A1+B2+C3+D1 showed slightly higher accuracy than model C3+A1 and a lower standard deviation (*M* = 87.45; *SD* = 8.5), we suggest it is the best-performing model. Furthermore, it is expected to work for most users, since it considers a feature related to foot tapping such as D1. The model incorporating all features showed an overall accuracy of 85.6%, but with a comparably high standard deviation (*SD* = 12.0). Model C3 showed the highest accuracy among the single-feature models (*M* = 86.7; *SD* = 10.0). In a paired *t*-test, all models showed a sharp separation between stress and relaxation (*p* < 0.0001).
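The windowing of model accuracy described above can be sketched as follows. This is an illustration under assumptions, not the study's pipeline: it assumes one predicted label per sample and a fixed number of samples per 10 s window, and the function name `windowed_accuracy` and the labels are hypothetical.

```python
import statistics


def windowed_accuracy(predicted, actual, samples_per_window):
    """Percentage of correct labels in each non-overlapping window.

    A trailing partial window is dropped.
    """
    if len(predicted) != len(actual):
        raise ValueError("predicted and actual must have the same length")
    if samples_per_window < 1:
        raise ValueError("samples_per_window must be at least 1")
    n_windows = len(actual) // samples_per_window
    accuracies = []
    for w in range(n_windows):
        start = w * samples_per_window
        stop = start + samples_per_window
        correct = sum(
            p == a for p, a in zip(predicted[start:stop], actual[start:stop])
        )
        accuracies.append(100.0 * correct / samples_per_window)
    return accuracies


# Hypothetical labels (NOT the study's data): a stress block, then a relax block.
actual = ["stress"] * 10 + ["relax"] * 10
predicted = ["stress"] * 8 + ["relax"] * 2 + ["relax"] * 9 + ["stress"]

acc = windowed_accuracy(predicted, actual, samples_per_window=5)
print(acc)
print(f"M = {statistics.mean(acc):.1f}; SD = {statistics.stdev(acc):.1f}")
```

Plotting the per-window values over time is what makes periodic trends visible, while the mean and standard deviation across windows or participants give summary figures of the kind reported in Table 2.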

**Table 2.** Model performance for the second set of participants for the different tasks (selected classifier: LDA).


#### **5. External Validity—Study 3**

In this study, the goal is to demonstrate the robustness of our model. Therefore, we conducted an in-field study in which we recorded data over a working day from users performing their usual everyday office tasks. As in the previous studies, the University of Auckland Human Participants Ethics Committee approved the protocol.
