*3.5. Metrics*

Because this is a binary classification problem (i.e., the subject is stressed or unstressed), we used the following metrics to evaluate both the deep learning network and the machine learning models:

$$\text{Accuracy} = \frac{TP + TN}{TP + TN + FP + FN} \times 100 \text{ (\%)},\tag{2}$$

$$\text{F1 score} = 2 \times \frac{TP}{2 \times TP + FP + FN}.\tag{3}$$

Here, TP (true positive) is the number of cases correctly classified as "stressed," while TN (true negative) is the number of cases correctly classified as "unstressed." Likewise, FP (false positive) is the number of cases that were classified as "stressed" but were actually "unstressed," while FN (false negative) is the number of cases that were classified as "unstressed" but were actually "stressed." The first metric (accuracy) is the percentage of cases that were correctly predicted, while the second (F1 score) is the harmonic mean of the precision, TP / (TP + FP), and the recall, TP / (TP + FN). Because it is a harmonic mean, the F1 score is high only when both precision and recall are high, so it reflects the trade-off between the two.
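As an illustration of Equations (2) and (3), the following minimal Python sketch computes both metrics from the four confusion-matrix counts and checks that the F1 score equals the harmonic mean of precision and recall. The function names and the example counts are hypothetical and are not taken from our experiments; returning 0 when a denominator is zero is an assumed convention.

```python
def accuracy_percent(tp, tn, fp, fn):
    """Accuracy in percent, following Eq. (2)."""
    total = tp + tn + fp + fn
    if total == 0:
        raise ValueError("at least one case is required")
    return (tp + tn) / total * 100.0


def f1_score(tp, fp, fn):
    """F1 score, following Eq. (3)."""
    denom = 2 * tp + fp + fn
    # Assumed convention: F1 is 0 when there are no positives at all.
    return 2 * tp / denom if denom else 0.0


def precision(tp, fp):
    """Fraction of cases predicted "stressed" that were truly stressed."""
    return tp / (tp + fp) if (tp + fp) else 0.0


def recall(tp, fn):
    """Fraction of truly stressed cases that were predicted "stressed"."""
    return tp / (tp + fn) if (tp + fn) else 0.0


# Hypothetical counts, for illustration only.
tp, tn, fp, fn = 40, 45, 5, 10
p, r = precision(tp, fp), recall(tp, fn)
harmonic_mean = 2 * p * r / (p + r)

print(accuracy_percent(tp, tn, fp, fn))   # 85.0
print(round(f1_score(tp, fp, fn), 3))     # 0.842
print(round(harmonic_mean, 3))            # 0.842
```

With these counts, the accuracy is 85% while the F1 score is about 0.84, because the recall (0.80) is lower than the precision (about 0.89).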

In addition to the accuracy and F1 score, we also used the area under the receiver operating characteristic (ROC) curve to evaluate the models. The area under the ROC curve (AUC) is a well-known measure of classifier performance [24]. The ROC curve is obtained by calculating the sensitivity and specificity at each probability threshold between 0 and 1. The AUC therefore does not depend on the choice of any single threshold, and it summarizes the model's performance across all thresholds. Models with AUCs above 0.9 are considered to be accurate [24].
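The threshold sweep described above can be sketched in plain Python as follows. This is an illustrative implementation under stated assumptions, not the code used in our experiments: labels are assumed to be 1 for "stressed" and 0 for "unstressed," the scores are assumed to be predicted probabilities of the "stressed" class, and the area is assumed to be integrated with the trapezoidal rule. The function names and example values are hypothetical.

```python
def roc_points(labels, scores):
    """Return (1 - specificity, sensitivity) pairs, one per distinct threshold."""
    n_pos = sum(labels)
    n_neg = len(labels) - n_pos
    if n_pos == 0 or n_neg == 0:
        raise ValueError("both classes must be present")

    # Lowering the threshold step by step is equivalent to walking through
    # the cases in order of decreasing score.
    ranked = sorted(zip(scores, labels), key=lambda pair: -pair[0])

    points = [(0.0, 0.0)]
    tp = fp = 0
    i = 0
    while i < len(ranked):
        threshold = ranked[i][0]
        # Cases with tied scores cross the threshold together.
        while i < len(ranked) and ranked[i][0] == threshold:
            if ranked[i][1] == 1:
                tp += 1
            else:
                fp += 1
            i += 1
        points.append((fp / n_neg, tp / n_pos))
    return points


def auc(points):
    """Area under the ROC curve by the trapezoidal rule."""
    area = 0.0
    for (x0, y0), (x1, y1) in zip(points, points[1:]):
        area += (x1 - x0) * (y0 + y1) / 2.0
    return area


# Hypothetical labels and predicted probabilities, for illustration only.
labels = [0, 0, 1, 1]
scores = [0.1, 0.4, 0.35, 0.8]
print(auc(roc_points(labels, scores)))  # 0.75
```

In this example one unstressed case (score 0.4) is ranked above one stressed case (score 0.35), so three of the four stressed/unstressed pairs are ordered correctly and the AUC is 0.75.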

## **4. Results**

In our experiments, we collected a total of 144 VAS scores for individual tasks from 16 subjects (nine scores per subject). These scores were recorded after each relaxation and stressor task (Figure 1). To reduce inter-subject variability, the scores were normalized separately for each subject. Table 2 shows the average normalized scores.
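The per-subject normalization and averaging can be sketched as follows. This is a minimal illustration that assumes min-max scaling is applied to each subject's own scores before the per-task averages are taken; the record layout, function names, and example values are hypothetical and do not reproduce our data.

```python
from collections import defaultdict


def normalize_per_subject(records):
    """Min-max scale raw VAS scores to [0, 1] within each subject.

    records: list of (subject_id, task_name, raw_score) tuples.
    """
    by_subject = defaultdict(list)
    for subject, _task, score in records:
        by_subject[subject].append(score)

    normalized = []
    for subject, task, score in records:
        lo, hi = min(by_subject[subject]), max(by_subject[subject])
        # Assumed convention: a subject whose scores are all equal maps to 0.
        value = 0.0 if hi == lo else (score - lo) / (hi - lo)
        normalized.append((subject, task, value))
    return normalized


def mean_by_task(normalized):
    """Average the normalized scores over subjects for each task."""
    per_task = defaultdict(list)
    for _subject, task, value in normalized:
        per_task[task].append(value)
    return {task: sum(vals) / len(vals) for task, vals in per_task.items()}


# Hypothetical raw scores for two subjects, for illustration only.
records = [
    ("s1", "relax", 10), ("s1", "easy_math", 40), ("s1", "hard_math", 70),
    ("s2", "relax", 0), ("s2", "easy_math", 25), ("s2", "hard_math", 50),
]
print(mean_by_task(normalize_per_subject(records)))
```

Scaling within each subject means that a subject who uses only a narrow part of the VAS range contributes on the same 0–1 scale as one who uses the full range.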

As Table 2 shows, the scores are significantly lower for the relaxation tasks than for the stressor ones. Among the stressor tasks, the hard and easy math tasks yielded the highest and lowest average scores, respectively. Contrary to our expectations, the average score was lower for the hard Stroop task than for the easy one, possibly because easy but tedious tasks are more stressful than moderately difficult ones. However, if a task is too difficult, as with the hard math task, it appears to be more stressful than a tedious task.


**Table 2.** Average normalized visual analogue scale (VAS) scores for all tasks. These have been normalized to a range of 0–1 with a MinMax scaler.

Because our experiments involved alternating relaxation and stressor tasks, we also calculated the average difference between the normalized VAS score recorded immediately before a stressor task (i.e., after a relaxation task) and that recorded immediately after the stressor (Table 3). Here, it is clear that all the stressor tasks induced stress, and that the most and least stressful tasks were the hard and easy math tasks, respectively, as in Table 2. Again, the easy Stroop task was a stronger stressor than the hard one.

**Table 3.** Average differences between the normalized VAS scores before and after each task. Here, the relaxation tasks were used as a baseline before stressor tasks.
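The before/after differences summarized in Table 3 can be computed along the following lines. This sketch assumes that each subject's tasks are stored in the order they were performed and that every stressor task is paired with the relaxation task immediately preceding it; the tuple layout, function name, and example values are hypothetical.

```python
def stress_deltas(sequence):
    """Difference between each stressor score and the preceding relaxation score.

    sequence: one subject's tasks in order, as
              (task_name, kind, normalized_vas) with kind "relax" or "stressor".
    """
    deltas = {}
    for previous, current in zip(sequence, sequence[1:]):
        if previous[1] == "relax" and current[1] == "stressor":
            deltas[current[0]] = current[2] - previous[2]
    return deltas


# Hypothetical normalized scores for one subject, for illustration only.
sequence = [
    ("relax_1", "relax", 0.1),
    ("easy_math", "stressor", 0.4),
    ("relax_2", "relax", 0.2),
    ("hard_math", "stressor", 0.9),
]
for task, delta in stress_deltas(sequence).items():
    print(task, round(delta, 2))
```

Averaging these per-subject differences over all subjects for each stressor task gives one value per task; a positive value indicates that the task raised the reported stress relative to the relaxation baseline.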

