*3.6. Results*

As shown in Figure 5, both stress tasks (Task 1: Stroop Colour and Word Test; Task 3: Minesweeper) induced an elevated stress level according to the users' subjective ratings: Task 1 scored *M* = 5.06 (*SD* = 0.68) and Task 3 scored *M* = 6.05 (*SD* = 0.76). In both relaxation tasks (Task 2: Minesweeper Introduction Video; Task 4: Nature Video), the average rated stress level dropped to *M* = 2 (*SD* = 0.76) and *M* = 1.39 (*SD* = 0.58), respectively. A pairwise *t*-test (*t* = 19.33; *p* < 0.05) showed that this difference is significant, which indicates that we succeeded in inducing stress and relaxation among the users.

**Figure 5.** Summary of the results for each task in study 1. The graph of electrodermal activity was computed using the average of the mean electrodermal activity (EDA) over non-overlapping windows of 10 s of all the participants. Accuracy graphs were computed by calculating the average stress detection accuracy of all the participants over non-overlapping windows of 10 s.

Acute stress raises adrenaline levels, making participants feel energetic. Consistent with this, all participants stated that they felt more energetic during stress tasks 1 (*M* = 5.38; *SD* = 0.89) and 3 (*M* = 5.85; *SD* = 1.04) than during relaxation tasks 2 (*M* = 3.67; *SD* = 1.67) and 4 (*M* = 2.91; *SD* = 1.78). A pairwise comparison between the stress and relaxation tasks showed a statistically significant difference (*t* = 7.09; *p* < 0.05). We hypothesised that a relaxation task would also be more pleasant than a stress task. The perceived pleasure was positive throughout all tasks (T1: *M* = 4.5; *SD* = 1.51, T2: *M* = 4.47; *SD* = 1.6, T3: *M* = 4.95; *SD* = 1.9, T4: *M* = 6.43; *SD* = 0.79). Grouping both relaxation tasks together showed a significant difference from the stress tasks (*t* = 2.84; *p* < 0.05). However, this effect is only significant because the last relaxation task, watching a nature video, was rated as extraordinarily pleasurable. A one-way ANOVA (*F*(3, 71) = 7.88, *p* < 0.05) confirmed that this task was significantly more pleasurable than all other tasks. In addition, the NASA TLX score appears to be correlated with the actual stress level. A pairwise *t*-test (*t* = 13.24; *p* < 0.05) indicated that the two groups of tasks differ significantly: both stress tasks (T1: *M* = 64.03; *SD* = 17.46 and T3: *M* = 74.67; *SD* = 11.32) showed a significantly higher task load than the relaxation tasks (T2: *M* = 33.55; *SD* = 11.78 and T4: *M* = 21.3; *SD* = 8.08). These data agree with the self-perceived stress level.
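The pairwise comparisons reported in this section can be illustrated with a short sketch. It assumes that the test is a paired-samples *t*-test with one value per participant in each condition, which the text does not state explicitly. The function name `paired_t_statistic` and the example ratings are hypothetical and are not data from the study.

```python
import math
import statistics


def paired_t_statistic(condition_a, condition_b):
    """Return (t, degrees_of_freedom) for a paired-samples t-test.

    Both lists hold one value per participant, in the same participant order.
    """
    if len(condition_a) != len(condition_b):
        raise ValueError("both conditions need one value per participant")
    diffs = [a - b for a, b in zip(condition_a, condition_b)]
    n = len(diffs)
    if n < 2:
        raise ValueError("at least two participants are required")
    mean_diff = statistics.mean(diffs)
    sd_diff = statistics.stdev(diffs)  # sample SD (n - 1 in the denominator)
    if sd_diff == 0:
        raise ValueError("differences are constant; t is undefined")
    t = mean_diff / (sd_diff / math.sqrt(n))
    return t, n - 1


# Made-up ratings for four hypothetical participants (not data from the study).
stress_ratings = [5, 6, 5, 6]
relax_ratings = [2, 1, 2, 2]
t_value, dof = paired_t_statistic(stress_ratings, relax_ratings)
print(f"t = {t_value:.2f}, df = {dof}")
```

The *p*-value then follows from the *t* distribution with the returned degrees of freedom, which a statistics package would normally supply.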

#### 3.6.1. Electrodermal Activity (EDA)

As displayed in Figure 5, the average EDA profile of the participants shows a significant increase in both stress tasks (T1 and T3) compared with the relaxation tasks (T2 and T4), as evidenced by a pairwise *t*-test (*t* = 2.05; *p* < 0.05). These findings show that the physiological response also correlates with the self-reported stress level and with the task load of the participants.
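The averaging behind the EDA curve in Figure 5 (mean EDA over non-overlapping 10 s windows, averaged over all participants) might be sketched as follows. The sampling rate is not given in the text, so `rate_hz` is an assumed parameter, and the function names are hypothetical. Dropping a trailing partial window and aligning on the shortest recording are also assumptions of this sketch.

```python
from statistics import mean


def window_means(samples, rate_hz, window_s=10):
    """Mean of each complete, non-overlapping window of `window_s` seconds."""
    size = int(rate_hz * window_s)
    if size <= 0:
        raise ValueError("window must contain at least one sample")
    n_windows = len(samples) // size  # a trailing partial window is dropped
    return [mean(samples[i * size:(i + 1) * size]) for i in range(n_windows)]


def average_profile(per_participant_samples, rate_hz, window_s=10):
    """Average the windowed means across participants, window by window."""
    profiles = [window_means(s, rate_hz, window_s) for s in per_participant_samples]
    n_common = min(len(p) for p in profiles)  # align on the shortest recording
    return [mean(p[w] for p in profiles) for w in range(n_common)]


# Toy example: two hypothetical recordings at 1 Hz, 2 s windows for brevity.
recordings = [[1, 1, 3, 3, 9], [3, 3, 5, 5]]
print(average_profile(recordings, rate_hz=1, window_s=2))
```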

#### 3.6.2. Model Training

As previously stated, we identified four general characteristics that occur at the foot when the user is under pressure. Because our data are low-dimensional, we assessed the quality of these features by developing a model for each individual feature using a machine learning classifier. We tested five classifiers on our data: Random Forest (RF; ntree = 3, mtry = 2), K-Nearest-Neighbour (KNN; k = 9), Support Vector Machine (SVM; sigma = 13.9779, C = 0.25, kernel = radial), Decision Tree (CART; cp = 0.02294894), and Linear Discriminant Analysis (LDA). A one-way ANOVA for correlated samples (*F*(4, 60) = 4.82, *p* < 0.05) showed significant differences between the classifiers. Tukey's post-hoc analysis indicated that the LDA-KNN, LDA-SVM, and LDA-RF pairs differed significantly. We selected the *LDA* classifier because it showed a consistently higher mean performance on our data.

Using a supervised learning approach, we trained 10 single-feature models (A1, A2, ..., D2) based on the annotation of our ground truth data. As ground truth, we used the data gathered from all participants whose stress rating was *M* > 4 for the stress tasks and *M* < 4 for the relaxation tasks. Hence, when training our model, we excluded ambiguous data, that is, data with a low stress level in a stress task (T1: 7/23 participants, T3: 3/23 participants) or an elevated stress level in a relaxation task (T2: 8/23 participants, T4: 0/23 participants). Moreover, we excluded the first 60 s and the last 60 s of each task when training our model. This exclusion was necessary to reduce possible noise caused by familiarisation with the task at the beginning and exhaustion at the end of the task.
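The ground-truth selection described above can be sketched as a filter followed by a trim. This is a minimal illustration under stated assumptions: the record layout (keys `task`, `rating`, `samples`), the function names, and the sampling rate `rate_hz` are all hypothetical and are not specified in the text.

```python
STRESS_TASKS = {"T1", "T3"}
RELAX_TASKS = {"T2", "T4"}
NEUTRAL_RATING = 4  # cut-off from the text; a rating of exactly 4 is excluded


def is_unambiguous(task, rating):
    """True if the self-reported rating agrees with the task's intended label."""
    if task in STRESS_TASKS:
        return rating > NEUTRAL_RATING
    if task in RELAX_TASKS:
        return rating < NEUTRAL_RATING
    raise ValueError(f"unknown task: {task}")


def trim_edges(samples, rate_hz, margin_s=60):
    """Drop the first and last `margin_s` seconds of a recording."""
    margin = int(rate_hz * margin_s)
    if len(samples) <= 2 * margin:
        return []  # recording too short to keep anything
    return samples[margin:len(samples) - margin]


def build_training_set(recordings, rate_hz, margin_s=60):
    """Return (samples, label) pairs; label 1 = stress, 0 = relaxation.

    `recordings` is a list of dicts with the hypothetical keys
    'task', 'rating' and 'samples'.
    """
    training = []
    for rec in recordings:
        if not is_unambiguous(rec["task"], rec["rating"]):
            continue  # ambiguous rating: left out of the ground truth
        label = 1 if rec["task"] in STRESS_TASKS else 0
        training.append((trim_edges(rec["samples"], rate_hz, margin_s), label))
    return training
```

For example, a T2 recording rated 5 would be dropped, while a T1 recording rated 6 would be kept with its first and last 60 s removed.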

#### 3.6.3. Model Validation

Each model (*A: Foot Pressure (A1: Fore Foot, A2: Rear Foot, A3: Total Foot); B: Center of Pressure (B1: CoP X-Axis, B2: CoP Y-Axis); C: Accelerometer (C1: X-Axis, C2: Y-Axis, C3: Z-Axis); D: Foot Tapping (D1: Median Frequency, D2: Dominant Frequency)*) was trained and validated using a leave-one-user-out method. That is, for each user we built a model that was trained on the data of all other users and tested on the data of the held-out user. Figure 5 depicts the accuracy rates for stress detection. Instead of calculating an overall accuracy for each task, we calculated the accuracy for non-overlapping windows of 10 s, which allows us to observe how the confidence progresses throughout the task. Table 1 summarises the overall accuracy rates across all single-feature and multi-feature models. Creating a model using all features provides a reasonable accuracy (*M* = 83.9%). However, its standard deviation (*SD* = 12.01) is higher than that of any other model. The highest accuracy (*M* = 85.32; *SD* = 8.1) was achieved by the combination of the four high performers (A1+B2+C3+D1). Although its standard deviation is relatively high for a single-feature model, C3 showed the highest accuracy (*M* = 83.1; *SD* = 11.9) among the single-feature models. In addition, a pairwise *t*-test was conducted to assess how sharply each model separates the stress and relaxation tasks. All models except A3 showed high distinguishability (see Table 1).
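The leave-one-user-out procedure and the per-window accuracy can be outlined as below. This is a sketch under assumptions, not the implementation used in the study: the classifier is passed in as `fit` and `predict` callables, and the nearest-class-mean classifier shown is only a simple stand-in for LDA on a single feature. All function names, the data layout, and `rate_hz` are hypothetical.

```python
def leave_one_user_out(user_ids):
    """Yield (held_out_user, training_users) once for every distinct user."""
    users = sorted(set(user_ids))
    for held_out in users:
        yield held_out, [u for u in users if u != held_out]


def windowed_accuracy(true_labels, predicted_labels, rate_hz, window_s=10):
    """Fraction of correct predictions in each complete, non-overlapping window."""
    size = int(rate_hz * window_s)
    if size <= 0:
        raise ValueError("window must contain at least one sample")
    n_windows = min(len(true_labels), len(predicted_labels)) // size
    result = []
    for w in range(n_windows):
        lo, hi = w * size, (w + 1) * size
        hits = sum(t == p for t, p in zip(true_labels[lo:hi], predicted_labels[lo:hi]))
        result.append(hits / size)
    return result


def evaluate(data_by_user, fit, predict, rate_hz, window_s=10):
    """Leave-one-user-out evaluation; returns {user: per-window accuracies}.

    data_by_user maps a user id to (features, labels).
    fit(features, labels) returns a model; predict(model, features) returns labels.
    """
    scores = {}
    for held_out, train_users in leave_one_user_out(data_by_user):
        train_x = [x for u in train_users for x in data_by_user[u][0]]
        train_y = [y for u in train_users for y in data_by_user[u][1]]
        model = fit(train_x, train_y)  # the held-out user is never seen here
        test_x, test_y = data_by_user[held_out]
        scores[held_out] = windowed_accuracy(
            test_y, predict(model, test_x), rate_hz, window_s
        )
    return scores


# Stand-in classifier (NOT LDA): assign each sample to the nearest class mean.
def fit_class_means(features, labels):
    means = {}
    for label in set(labels):
        values = [x for x, y in zip(features, labels) if y == label]
        means[label] = sum(values) / len(values)
    return means


def predict_nearest_mean(model, features):
    return [min(model, key=lambda label: abs(x - model[label])) for x in features]
```

Passing the classifier as callables keeps the split logic independent of the model, so the same loop could be reused for each single-feature or multi-feature model.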

**Table 1.** Model performance (selected classifier: Linear Discriminant Analysis (LDA)).


#### **4. Empirical Replicability—Study 2**

To validate the generalisability of our models, we replicated the previous study with different parameters, such as using different users and different tasks. The study protocol was approved by the University of Auckland Human Participants Ethics Committee.

#### *4.1. Study Design*

The apparatus and data gathering remained the same. The procedure was very similar, the only difference being that we deployed a single stress task and a single relaxation task. After each task, the participants were asked to answer the same questionnaire as in the previous study and to complete a NASA TLX. The data analysis remained the same. We recruited 11 new participants (7 males and 4 females), aged between 22 and 34 (*M* = 26.4, *SD* = 3.17) and of different ethnicities. The inclusion/exclusion criteria were similar to those of the previous study. We also used a different stress task (Task 5: Mental Arithmetic Test) and a different relaxation task (Task 6: Nature Video).

#### 4.1.1. Task 5 [Stress]: Mental Arithmetic Test

Participants were asked to complete 20 challenging maths questions based on fundamental mathematics within 5 min. We created additional pressure by informing participants that their performance would be graded. As in the previous tasks, an experimenter observed and commented on their performance.

#### 4.1.2. Task 6 [Relaxation]: Nature Video

To relax the user, we showed a 5 min nature video with soothing music. The video was different to the one in Task 4.
