**3. Results**

In the following, the resulting database, the validation of the induction via questionnaires and subjective feedback, and the data annotation results are presented.

#### *3.1. The Database*

In total, three subjects were excluded from the analysis: two subjects from the first sample, because of missing biosignal data (ID-04) and missing logger data (ID-40), and one subject from the second sample, because of missing sequences due to a technical error (*Underload* and *Frustration* for ID-90). Because ID-90 represents the second measurement of a participant from the second sample, the first and third measurement data of that subject (ID-80 and ID-100) were also excluded from the analysis. Consequently, the final dataset *uulmMAC* consists of 95 recording sessions from 57 subjects, presented for the following groups and subgroups:


While both groups underwent exactly the same experiment, they differ slightly in the acquisition of one modality: The EMG data of Group A include only musculus trapezius activity measurements (thus, without facial electrodes, which allows a better analysis of facial expressions from the video data). For Group B, the EMG data include activity measurements of three muscles: musculus trapezius, musculus corrugator, and musculus zygomaticus. In the following, the results of Group A, Group B1, Group B2, and Group B3 are separately analyzed and presented.

#### *3.2. Evaluation via Questionnaires*

The three questionnaires TEIQue-SF, ERQ, and TIPI, collected from all participants prior to the experiment, are first evaluated for Group A, Group B1, Group B2, and Group B3. For all questionnaire items, the possible score values range between 1 (minimum) and 7 (maximum).

#### 3.2.1. TEIQue-SF Questionnaire

In Figure 6, the four dimensions of the TEIQue-SF, consisting of the Well-Being, Self-Control, Emotionality, and Sociability factors, are presented for the different groups. The mean values vary between 5.61 and 5.82 for the Well-Being factor, between 5.04 and 5.26 for the Self-Control factor, between 4.80 and 5.06 for the Emotionality factor, and between 5.02 and 5.34 for the Sociability factor. The standard deviations (SD) range between 0.55 and 1.28 for all groups and all factors. The first three factors, Well-Being, Self-Control, and Emotionality, show a slightly decreasing tendency across Group B1, Group B2, and Group B3, with the highest value obtained for Group B1. Only for Sociability does the mean value of Group B3 slightly increase compared to the value of Group B2. Nevertheless, overall the mean values of all factors present a homogeneous distribution across all groups, showing minimal deviations and, thus, stable results.

**Figure 6.** Mean values of the four dimensions of the TEIQue-SF questionnaire for the different groups. The error bars represent the corresponding standard deviations.

#### 3.2.2. ERQ and TIPI Questionnaires

In Figure 7, the results of the ERQ and TIPI questionnaires are presented for Group A, Group B1, Group B2, and Group B3. Reappraisal and Suppression are the factors related to the ERQ, while Extraversion, Agreeableness, Conscientiousness, Emotional Stability, and Openness to Experience are the factors related to the TIPI questionnaire. The Reappraisal mean values vary between 2.94 and 3.05 (SD range: 0.81 to 1.19), while the Suppression mean values vary between 4.35 and 4.46 (SD range: 1.14 to 1.43). For the TIPI questionnaire, Extraversion has values between 4.61 and 4.76, Agreeableness between 5.11 and 5.29, Conscientiousness between 5.53 and 5.79, Emotional Stability between 5.39 and 5.63, and Openness to Experience between 5.13 and 5.37. The standard deviations of the five TIPI factors range from 0.74 to 1.64. In summary, similar to the TEIQue-SF questionnaire, the mean values of all factors present a homogeneous distribution across all groups/subgroups, showing minimal deviations and, thus, stable results.

**Figure 7.** Mean values of the ERQ and TIPI questionnaires for the different groups: Reappraisal and Suppression are the ERQ dimensions, while Extraversion, Agreeableness, Conscientiousness, Emotional Stability, and Openness to Experience are the TIPI dimensions.

#### *3.3. Validation via Subjective Feedback*

In the following, the evaluations obtained from the subjective feedback of the participants are presented for Group A, Group B1, Group B2, and Group B3. They include the analysis of the SAM Ratings and the Direct Questions. The evaluation of the subjective feedback is necessary to provide ground truth and validation of the dataset, which is in turn essential for further analysis and applications. The Free Speech data are not analyzed here but are part of the dataset in their raw state.

#### 3.3.1. SAM Ratings

The SAM Ratings were collected from every subject after each accomplished sequence during the experiment. Using the three dimensions Valence, Arousal, and Dominance, the induction of the different sequence levels of cognitive load and affective states is evaluated. First, the evaluation of the ratings of Group A is presented. Then, the ratings of the three different measurements of Group B are separately analyzed (Group B1, Group B2, and Group B3). Finally, repeated measures ANOVAs and post-hoc corrections were performed to examine the significance of the variations between the different sequences.

In Figure 8, the mean SAM Ratings for Group A are presented. The highest Valence values are found for the sequences *Easy* (7.32), *Interest* (6.92), and *Underload* (6.84), while the lowest values are found for the *Overload* (5.13) and *Frustration* (5.68) sequences. Conversely, the highest Arousal was perceived for these two latter sequences, *Overload* (5.18) and *Frustration* (4.37), while the lowest Arousal was registered for *Underload* (2.11) and *Easy* (2.39). As for Dominance, the highest mean values were obtained for *Underload* and *Easy* (7.03 each), while the lowest values were registered for *Overload* (3.66) and *Frustration* (4.26).

**Figure 8.** SAM Ratings of Group A for all sequences with mean Valence, Arousal, and Dominance.

Figures 9–11 illustrate the SAM Ratings results for the first (Group B1), second (Group B2), and third (Group B3) measurement time of Group B, respectively. Here as well, the mean SAM Ratings are consistent with each other across all three measurement times, showing transtemporal stability in the subjective evaluation. Additionally, compared to the rating results of the one-measurement group (Group A) illustrated in Figure 8, the mean Valence, Arousal, and Dominance values show similar rating tendencies.

In summary, the SAM Ratings show an overall stable course, with the highest Valence, lowest Arousal, and highest Dominance values for the *Easy* and *Underload* sequences, and the lowest Valence, highest Arousal, and lowest Dominance values for the *Overload* and *Frustration* sequences. Both the transtemporal stability and the similar rating tendencies between the subjects of the first and second samples demonstrate the robust quality of the induction as evaluated by the subjects.

**Figure 9.** SAM Ratings of Group B1 for all sequences with mean Valence, Arousal, and Dominance.

**Figure 10.** SAM Ratings of Group B2 for all sequences with mean Valence, Arousal, and Dominance.

**Figure 11.** SAM Ratings of Group B3 for all sequences with mean Valence, Arousal, and Dominance.

The distributions of the SAM Ratings for all the measurements are presented as scatter plots in Appendix A in Figure A1a (Valence), Figure A1b (Arousal), and Figure A1c (Dominance).

In order to examine whether the differences in the SAM Ratings between the different sequences are statistically significant in the VAD space, further statistical analysis was carried out. To analyze the ratings for Group A and Group A + Group B1, we conducted separate repeated measures ANOVAs with the factors Sequence and VAD (for Valence, Arousal, and Dominance, respectively). Post-hoc Newman-Keuls corrections were carried out to compare the mean differences between the sequences. For Group A, the repeated measures ANOVA revealed a significant effect of Sequence (F(5, 185) = 16.866, *p* < 0.001, ηp² = 0.313), VAD (F(2, 74) = 57.996, *p* < 0.001, ηp² = 0.611), and the interaction (F(10, 370) = 30.748, *p* < 0.001, ηp² = 0.454). Additionally, post-hoc tests using Newman-Keuls correction revealed significant differences (see Table A1a in Appendix A). For the combined Group A + Group B1, the repeated measures ANOVA revealed a significant effect of Sequence (F(5, 280) = 30.190, *p* < 0.001, ηp² = 0.353), VAD (F(2, 112) = 106.429, *p* < 0.001, ηp² = 0.774), and the interaction (F(10, 560) = 51.405, *p* < 0.001, ηp² = 0.447). Again, post-hoc tests using Newman-Keuls correction revealed significant differences (see Table A1b in Appendix A).
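The within-subject test described above can be illustrated with a minimal sketch: a one-way repeated-measures ANOVA over the factor Sequence, computed directly from the sum-of-squares decomposition. This is a simplified stand-in for the study's full two-factor (Sequence × VAD) design, the Newman-Keuls post-hoc step is not shown, and all ratings below are synthetic, not the study's data.

```python
import numpy as np
from scipy import stats

def rm_anova_1way(data):
    """One-way repeated-measures ANOVA.
    data: (n_subjects, k_conditions) array of ratings."""
    n, k = data.shape
    grand = data.mean()
    ss_cond = n * ((data.mean(axis=0) - grand) ** 2).sum()   # condition effect
    ss_subj = k * ((data.mean(axis=1) - grand) ** 2).sum()   # subject effect
    ss_total = ((data - grand) ** 2).sum()
    ss_err = ss_total - ss_cond - ss_subj                    # subject x condition residual
    df_cond, df_err = k - 1, (k - 1) * (n - 1)
    f = (ss_cond / df_cond) / (ss_err / df_err)
    p = stats.f.sf(f, df_cond, df_err)
    return f, df_cond, df_err, p

# Hypothetical valence ratings: 6 subjects x 6 sequences
rng = np.random.default_rng(0)
seq_means = np.array([5.5, 7.3, 5.1, 5.7, 6.9, 6.8])  # assumed per-sequence means
ratings = seq_means + rng.normal(0, 0.5, size=(6, 6))
f, df1, df2, p = rm_anova_1way(ratings)
print(f"F({df1}, {df2}) = {f:.2f}, p = {p:.4f}")
```

The same decomposition underlies the reported F(df1, df2) values; with 38 subjects and 6 sequences, the degrees of freedom become (5, 185), matching the Group A result.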

A direct analysis of the SAM Ratings of all sequences in comparison to the *Normal* sequence as baseline is presented in Table 3, while the results of the SAM Ratings between the *Overload* vs. *Underload* sequences and between the *Interest* vs. *Frustration* sequences are presented in Table 4 (the entire results can be found in the Appendix A as Tables A1a and A1b).

**Table 3.** Post-hoc Newman-Keuls corrections for the Valence (V), Arousal (A), and Dominance (D) ratings between all sequences compared to *Normal*. Mean-Differences (Mean-Diff.) and *p*-values are presented.
\**p* < 0.05, \*\**p* < 0.01, \*\*\**p* < 0.001.


**Table 4.** Post-hoc Newman-Keuls corrections for the Valence (V), Arousal (A), and Dominance (D) ratings between *Overload* vs. *Underload* and between *Interest* vs. *Frustration*. Mean-Differences (Mean-Diff.) and *p*-values are presented.

\**p* < 0.05, \*\**p* < 0.01, \*\*\**p* < 0.001.

Further, in order to justify the combination of Group A and Group B1 (all first measurements) in the statistical analysis, an ANOVA was additionally computed for the Valence, Arousal, and Dominance scores of the SAM Ratings between Group A and Group B1. Based on a one-way ANOVA, we found no statistically significant difference between Group A and Group B1 in the Valence scores (F(2, 6) = 1.650, *p* = 0.153), the Arousal scores (F(2, 6) = 0.978, *p* = 0.450), or the Dominance scores (F(2, 6) = 0.376, *p* = 0.891).
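A between-group check of this kind can be run with `scipy.stats.f_oneway`; the sketch below uses synthetic per-subject scores (the group means and SDs are assumptions, not the study's values), with a non-significant *p*-value supporting the pooling of the two groups.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(1)
# Hypothetical per-subject mean valence scores, not the actual study data
valence_a = rng.normal(6.2, 0.8, size=38)    # Group A: 38 subjects
valence_b1 = rng.normal(6.0, 0.8, size=19)   # Group B1: 19 subjects

# One-way ANOVA across the two independent groups
f, p = stats.f_oneway(valence_a, valence_b1)
print(f"F = {f:.3f}, p = {p:.3f}")
```

A *p*-value above 0.05 here would indicate no detectable group difference, which is the justification used in the text for analyzing Group A + Group B1 jointly.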

According to Table 3, most of the Valence, Arousal, and Dominance values of the SAM Ratings can be significantly distinguished from each other for all sequences compared to *Normal*. Exceptions for Group A + Group B1 are the Valence, Arousal, and Dominance of *Interest* and the Valence of *Underload*. For Group A, more exceptions could be observed, especially on the Valence dimension. Of greater contextual relevance are the results in Table 4, showing that the states *Overload* vs. *Underload* and *Interest* vs. *Frustration* can be significantly distinguished from each other on all SAM dimensions for both Group A and Group A + Group B1, except for Arousal between *Interest* and *Frustration*.

#### 3.3.2. Direct Questions

A further subjective feedback evaluation was carried out in terms of Direct Questions. To this end, after each sequence, the subjects were asked to answer Direct Questions related to the assessment of their own perception. Four questions related to "Difficulty", "Performance", "Stress", and "Motivation" were processed: With the first question, the subjects described how difficult the sequence was (very easy = 1; very difficult = 10). The second question is a personal performance assessment (performed very badly = 1; performed very well = 10). For the first sequence, *Interest*, this "Performance" question was adapted to assess the subjects' interest. The third question describes the individually experienced stress level (very relaxed = 1; very stressed = 10), and the fourth question reflects the motivation of the participant (not motivated = 1; very motivated = 10).

In Figure 12, the results of the Direct Questions are shown for Group A. It can be seen that for the first question, "Difficulty", *Overload* has the highest rating (9.18), while *Easy* and *Underload* have the lowest ratings (1.66 and 1.68, respectively). As expected, the sequences *Interest*, *Normal*, and *Frustration* present middle ratings (5.24, 5.50, and 4.58, respectively). As for the second question, "Performance", the lowest rating is observed for *Overload* (2.13), while the highest ratings were obtained for *Easy* and *Underload* (8.34 and 8.50, respectively). The "Interest" rating for the first sequence, *Interest*, was 7.63. Further, the third question, "Stress", shows a similar course to the first question, "Difficulty". The "Stress" ratings for the sequences *Interest*, *Normal*, and *Frustration* are in the same range (5.13, 5.03, and 5.34, respectively), while *Overload* has the highest rating (7.00) and *Easy* and *Underload* the lowest ones (2.32 and 2.58, respectively). An interesting observation here is the slightly increasing stress from *Easy* to *Underload*. The last question, "Motivation", shows the highest rating for *Interest* (9.13) and the lowest ratings for *Overload* (7.13), *Underload* (7.55), and *Frustration* (7.11).

With regard to Group B, whose subjects underwent three measurements each, some changes over time can be observed. Figures 13–15 illustrate the Direct Questions ratings for the first (Group B1), second (Group B2), and third (Group B3) measurement, respectively. The mean rating distributions for each sequence for Group B1 (first measurement) presented in Figure 13 are comparable to the results obtained for Group A (single measurement) presented in Figure 12.

**Figure 13.** Direct Questions of Group B1 for all the sequences with mean values.

Comparing Group B1 and Group B2, the mean rating values of the first question, "Difficulty", for the sequences *Interest*, *Easy*, and *Underload* decrease from the first to the second measurement. On the other hand, the mean rating values for *Normal* increase from 4.37 to 5.16. As for the second question, "Performance", the mean rating values for *Overload* (2.53 vs. 3.58) and *Underload* (8.32 vs. 8.68) increase from the first to the second measurement, while the rating related to *Normal* decreases (5.53 vs. 4.89). As for the third question, "Stress", larger differences are observed for the *Interest* (4.26 vs. 3.00), *Normal* (4.00 vs. 4.89), and *Frustration* (5.89 vs. 5.32) sequences. Finally, the last question, "Motivation", shows comparable tendencies and values for the first and second measurements.

**Figure 14.** Direct Questions of Group B2 for all the sequences with mean values.

**Figure 15.** Direct Questions of Group B3 for all the sequences with mean values.

Finally, comparing Group B2 and Group B3, illustrated in Figures 14 and 15, the mean values of the "Difficulty" question for the sequences *Interest*, *Overload*, and *Frustration* increase (3.84 vs. 4.68, 8.95 vs. 9.11, and 4.32 vs. 4.79, respectively), while the values for *Normal*, *Easy*, and *Underload* decrease (5.16 vs. 4.95, 1.21 vs. 1.00, and 1.11 vs. 1.00, respectively). The highest rating in the third measurement is again obtained for the sequence *Overload* (9.11), while the lowest values are observed for the sequences *Easy* and *Underload*, with values of 1.00 each. Furthermore, the "Performance" question shows increasing values for *Interest* from the second to the third measurement time (7.74 vs. 8.16), while the related mean ratings of the remaining sequences have nearly the same values with small variations. As for the "Stress" question, the highest mean values are also observed for the *Overload* and *Frustration* sequences and show a decreasing tendency compared to the second measurement (5.74 vs. 4.84 and 5.32 vs. 4.16, respectively). The "Motivation" question for all sequences results in mean ratings around the value of 8, with the lowest values obtained for the sequences *Overload* (7.50 vs. 7.74) and *Frustration* (7.60 vs. 7.95).

The rating distributions of the Direct Questions for all the measurements are presented as scatter plots in Appendix A in Figure A2a ("Difficulty"), Figure A2b ("Performance"), Figure A2c ("Stress"), and Figure A2d ("Motivation").

In order to examine whether the differences in the Direct Questions evaluations between the different sequences are statistically significant, further statistical analysis was carried out. Similar to the SAM Ratings, we conducted separate repeated measures ANOVAs with post-hoc Newman-Keuls corrections to analyze differences between the respective ratings of the individual questions "Difficulty" (Dif), "Performance" (Per), "Stress" (Str), and "Motivation" (Mot) for Group A and Group A + Group B1. For Group A, the repeated measures ANOVA revealed a significant effect of Sequence (F(5, 185) = 43.379, *p* < 0.001, ηp² = 0.540), Question (F(3, 111) = 74.360, *p* < 0.001, ηp² = 0.668), and the interaction (F(15, 555) = 81.485, *p* < 0.001, ηp² = 0.688). Additionally, post-hoc tests using Newman-Keuls correction revealed significant differences (see Table A2a in Appendix A). For the combined Group A + Group B1, the repeated measures ANOVA revealed a significant effect of Sequence (F(5, 280) = 35.164, *p* < 0.001, ηp² = 0.386), Question (F(3, 168) = 126.204, *p* < 0.001, ηp² = 0.693), and the interaction (F(15, 840) = 111.873, *p* < 0.001, ηp² = 0.666). Again, post-hoc tests using Newman-Keuls correction revealed significant differences (see Table A2b in Appendix A).

A direct analysis of the Direct Questions of all sequences in comparison to the *Normal* sequence as baseline is presented in Table 5, while the results of the Direct Questions between the *Overload* vs. *Underload* sequences and between the *Interest* vs. *Frustration* sequences are presented in Table 6 (the entire results can be found in the Appendix A as Tables A2a and A2b).

Further, in order to justify the combination of Group A and Group B1 (all first measurements) in the statistical analysis, an ANOVA was additionally computed for the individual ratings of the Direct Questions between Group A and Group B1. Based on a one-way ANOVA, we found no statistically significant difference between Group A and Group B1 in the "Difficulty" scores (F(2, 6) = 1.333, *p* = 0.260), the "Performance" scores (F(2, 6) = 0.6778, *p* = 0.668), the "Stress" scores (F(2, 6) = 1.740, *p* = 0.131), or the "Motivation" scores (F(2, 6) = 1.072, *p* = 0.392).


**Table 5.** Post-hoc Newman-Keuls corrections for the "Difficulty" (Dif), "Performance" (Per), "Stress" (Str) and "Motivation" (Mot) questions between all sequences compared to *Normal*. Mean-Differences (Mean-Diff.) and *p*-values are presented.


\**p* < 0.05, \*\**p* < 0.01, \*\*\**p* < 0.001.

**Table 6.** Post-hoc Newman-Keuls corrections for the "Difficulty" (Dif), "Performance" (Per), "Stress" (Str) and "Motivation" (Mot) questions between *Overload* vs. *Underload* and *Interest* vs. *Frustration*. Mean-Differences (Mean-Diff.) and *p-*values are presented.


\**p* < 0.05, \*\**p* < 0.01, \*\*\**p* < 0.001.

According to Table 5, most of the Direct Questions can be significantly distinguished from each other for all sequences compared to *Normal*. Exceptions are the "Difficulty" and "Stress" questions of *Interest* and *Frustration*, as well as the "Motivation" question of *Interest*, *Easy*, and *Underload*, for both Group A and Group A + Group B1, in addition to the "Performance" question of *Frustration* for Group A. Of greater contextual relevance are the results in Table 6, which show that the states *Overload* vs. *Underload* and *Interest* vs. *Frustration* can be significantly distinguished from each other for all Direct Questions, except for the "Difficulty" and "Stress" questions between *Interest* vs. *Frustration* and the "Motivation" question between *Overload* vs. *Underload*, for both Group A and Group A + Group B1.

#### *3.4. Data Annotation*

In addition to the basic annotation derived from the experimental design and application log files, the dataset is enhanced by various semi-automatically generated labels. The basic annotation contains the exact timing information, at millisecond level, of the beginning and end of all sequences: timestamps of when each search item was presented, whether and when a subject pronounced the solution, including whether the solution was correct or wrong, and all information on the given subjective feedback. These mostly technical annotations do not necessarily contain emotional information, although they can give hints on situations where the probability of emotional reactions rises, for instance in the case of timeouts or wrong answers, or during maximum load phases.
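A basic annotation stream of this kind can be represented as timestamped events; the sketch below is a hypothetical structure (field names, event types, and log entries are illustrative assumptions, not the dataset's actual format) showing how timeouts and wrong answers could be flagged as candidate moments for emotional reactions.

```python
from dataclasses import dataclass
from typing import Optional

@dataclass
class AnnotationEvent:
    """One basic-annotation entry (hypothetical structure)."""
    t_ms: int                        # timestamp in milliseconds
    sequence: str                    # e.g. "Overload"
    event: str                       # "item_shown", "answer", "timeout", ...
    correct: Optional[bool] = None   # only meaningful for "answer" events

def high_load_candidates(events):
    """Flag events likely to co-occur with emotional reactions:
    wrong answers and timeouts, as hinted in the text."""
    return [e for e in events
            if e.event == "timeout"
            or (e.event == "answer" and e.correct is False)]

log = [
    AnnotationEvent(1000, "Overload", "item_shown"),
    AnnotationEvent(4200, "Overload", "answer", correct=False),
    AnnotationEvent(15000, "Overload", "timeout"),
]
print([e.t_ms for e in high_load_candidates(log)])  # [4200, 15000]
```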

The semi-automatic labels are generated by our data-driven active learning approach, presented in [62,63]. The basic assumption of this approach is the sparseness of emotional reactions in the audio and video modalities. In several pre-studies, we found that in HCI scenarios, users tend to experience emotions only in a few situations, or at least show them only sparsely [64]. This leads to the assumption that most of the recorded data represent neutral emotional content. Based on this assumption, we train different density estimation models, such as One-Class SVM, SVDD, or GMM, on the whole dataset (ignoring the underlying experimental structure) and then compare each feature vector instance with this neutral or background model. If a specific feature vector has a large distance to the background model, the probability that it represents an emotional instance increases. The least-fitting points are then presented to experts, who rate them with respect to emotional content. After the first points are labeled (emotional or neutral), these labels are used to improve the background model iteratively, until most of the outlier data points are labeled. Details of this active learning-based process can be found in the cited papers. The main conclusion of our active learning algorithms is the dramatic reduction of annotation effort for affective datasets like the one presented here. In most cases, only 10% of a naturalistic HCI dataset has to be annotated in order to achieve the same classification results as the baseline classifier using the full dataset. The semi-automatically generated, active learning-based labels are part of the dataset. Furthermore, we manually labeled nine participants in order to evaluate our active learning approach. These manual labels are also part of the dataset.

Additionally, we provide further manually created labels regarding body pose information. As described in [65], we annotated several body poses based on distance measures of the skeleton provided by the Kinect sensor. Static poses include the onsets and offsets of: arms crossed, hands behind back, hands on hips, legs crossed, and legs in step position. Dynamic poses include: sideways moving hands away from the body, facial hand touch, and quick movement of the feet.
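A distance-based pose check of this kind can be sketched as follows; the joint names, coordinates, and the 0.18 m threshold are illustrative assumptions, not the actual annotation scheme of [65]. The second helper shows how per-frame pose flags translate into the onsets and offsets mentioned above.

```python
import numpy as np

def arms_crossed(joints, thresh=0.18):
    """Heuristic static-pose check: each hand close to the opposite elbow."""
    d_lh = np.linalg.norm(joints["hand_left"] - joints["elbow_right"])
    d_rh = np.linalg.norm(joints["hand_right"] - joints["elbow_left"])
    return bool(d_lh < thresh and d_rh < thresh)

def onsets_offsets(flags):
    """Frame indices where a pose starts and ends (boundaries of True runs)."""
    f = np.asarray(flags, dtype=int)
    d = np.diff(np.concatenate([[0], f, [0]]))
    return np.where(d == 1)[0], np.where(d == -1)[0]

# One hypothetical Kinect-style frame of 3-D joint positions (metres)
frame = {
    "hand_left":   np.array([0.15, 1.10, 2.00]),
    "elbow_right": np.array([0.18, 1.05, 2.00]),
    "hand_right":  np.array([-0.15, 1.10, 2.00]),
    "elbow_left":  np.array([-0.20, 1.06, 2.00]),
}
print(arms_crossed(frame))              # True for this crossed-arms frame
print(onsets_offsets([0, 1, 1, 0, 1]))  # pose active in frames 1-2 and 4
```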

#### **4. Discussion and Summary**

The resulting multimodal *uulmMAC* database from our emotional and cognitive load scenario, conducted in a mobile interactive HCI setting, is a valuable contribution to research fields related to multimodal affective computing and machine learning applications in HCI. In summary, the main contributions of our work include the following:


Depending on the focus of the research question or modality, the recording sessions of 38 subjects, 19 subjects, or all 57 subjects can be analyzed: For instance, when focusing on physiological reactions with no specific interest in facial EMG, all 57 subjects (Group A + Group B1) can be analyzed, while when focusing on facial expressions from the video data, 38 subjects (Group A) can be analyzed.


Overall, we created a dataset for various applications in the fields of affective computing and machine learning, including classification, feature analysis, multimodal fusion, and transtemporal investigations. The dataset includes multimodal sensor data as well as various annotations and extracted labels. Limitations of this work include the relatively limited amount of transtemporal data (57 measurements from 19 subjects) as well as the absence of electroencephalography (EEG) and electrooculography (EOG) data for brain and eye movement analysis, both relevant for cognitive reactions. Finally, the experiment was conducted in a laboratory setting designed to be close to real HCI; the next step would be to transfer our settings and findings into the wild for induction and recognition research closer to real life.

Future work will include numerical evaluations based on classification models using machine learning for the full dataset. Thereby, standardized sets of feature extraction techniques for each recorded modality will be generated, and standard features for each emotional and cognitive state will be defined. A multimodal fusion analysis will be conducted to investigate the effect of each modality on the recognition rates of the different states. Further, a transtemporal analysis of the Group B data will be conducted to investigate the changes over time, including features and classifications. In addition, investigations related to the analysis of human-computer dialogs could be conducted, for instance to investigate the effects of computer feedback on human performance and psychophysiological responses. Similarly, a gender analysis could be conducted to investigate differences in the elicitation levels, in the emotional-cognitive psychophysiological responses, or in the recognition rates and individual performance.

Finally, considering the relevance of emotional *Frustration* and cognitive *Overload* in the emergence of stress, which has been investigated in many studies [70–73], we believe that our *uulmMAC* database on emotional and cognitive load states can also be used for affective computing and machine learning applications in the field of stress research. The well-established TSST—Trier Social Stress Test [70]—employs a mental arithmetic task to induce high cognitive load (besides a social-evaluative part based on a public speaking task). The Stroop Color Test [71] employs a word-color task to induce high cognitive load and was adopted by Choi et al. [74] in their experiments to develop a wearable stress monitoring system. Additionally, Wijsman et al. employed computer tasks (calculation, puzzle, memorization) under time pressure to induce stress [72]. In a similar context, a multimodal dataset was recently collected within the SWELL project [73] to induce stress by manipulating the working conditions of the subjects through email interruptions and time pressure. Based on these studies, we will investigate in our future work the application of our database to the field of stress recognition research. It would be of interest whether specialized machine learning techniques, such as transfer learning and/or deep learning approaches, can be applied to transfer features and classifiers created on the *uulmMAC* dataset into the stress classification scenario.

**Author Contributions:** Conceptualization: D.H.-R., H.H., and H.C.T.; methodology: D.H.-R. and S.M.; software: D.H.-R., S.M., and A.D.; validation: D.H.-R., S.M., A.D., H.H., and H.C.T.; formal analysis: D.H.-R., S.M., A.D., and J.S.; investigation: D.H.-R. and A.D.; resources: H.C.T. and F.S.; data curation: D.H.-R., S.M., and A.D.; writing—original draft preparation: D.H.-R., S.M., and A.D.; writing—review and editing: D.H.-R., F.S., J.S., and H.C.T.; visualization: D.H.-R., S.M., A.D., and J.S.; supervision: D.H.-R., H.H., H.C.T., and F.S.; project administration: D.H.-R., H.H., H.C.T., and F.S.; funding acquisition: D.H.-R., H.H., H.C.T., and F.S. All authors have read and agreed to the published version of the manuscript.

**Funding:** This research was supported by grants from the Transregional Collaborative Research Center SFB/TRR 62 Companion Technology for Cognitive Technical Systems funded by the German Research Foundation (DFG). It is also supported by a Margarete von Wrangell (MvW) habilitation scholarship funded by the Ministry of Science, Research and Arts (MWK) of the state of Baden-Württemberg for Dilana Hazer-Rau.

**Conflicts of Interest:** The authors declare no conflict of interest.
