#### **1. Introduction**

The rapid pace of technological advancement and the expectation of fast adaptation place high pressure on humans to deliver maximum effort under the stressful constraints and multitasking situations of human-computer interaction (HCI). Among the variety of emotional and cognitive states in HCI, cognitive load is a prominent "multi-dimensional construct representing the load imposed on the working memory during performance of a cognitive task" [1]. It is highly associated with human effort and with the efficiency of cognitive technical systems during HCI [2]. Following Sweller [3], whose work focuses on human learning, the intensity of cognitive load experienced for a specific mental task varies between individuals depending on their working memory capacity. Individuals can raise their cognitive effort to adapt to increasing difficulties until their mental capacity limit is reached. Above this limit, human performance decreases, with an increase in errors, the emergence of stress, and negative affect [2]. An adequate level of cognitive load is therefore desirable for an individual to perform a task in an optimal manner. Indeed, results from our transsituational study show the existence of a biological basis for success in human-computer interaction [4]. Therefore, particularly in the context of HCI, knowledge about cognitive load is essential in order to intelligently match the level and nature of the interaction in such systems. The recognition of cognitive load in HCI can enable real-time monitoring of the user's state and adaptation to individual users. Relevant fields of application include individual content generation for distance learning and adaptive learning systems [5], practical training sessions [6], monitoring pilots [7] and truck drivers [8], usability testing and evaluation of user interfaces and mobile applications [9], and digital assistance providing personalized advice for stress reduction and health risk prevention strategies [10].

Cognitive load can be estimated via various measuring approaches, including subjective measures, performance measures, physiological measures, and behavioral measures [11–15]. Traditional simple measures are based on subjective ratings, asking the users to perform a self-assessment of their mental state. These measures lack objectivity and are not reliable for computational recognition techniques. They are generally used as ground truth in experiments, however, with the disadvantage of being acquired only after the event. Performance measures can be acquired in parallel, but are difficult to evaluate in real-life applications and generally insensitive to load capacity variations. Physiological and behavioral procedures are non-intrusive methods providing more reliable and direct access to cognitive load in an objective way. Cognitive load recognition using multimodal sensors has the potential to increase robustness and accuracy compared to estimation from single-modality data. Unlike the subjective measurements prevalent in psychological research, cognitive load estimation based on measured human responses is necessary for advanced computational techniques. Further, real-life investigation requires the implementation of mobile measurements "in-the-wild" [16]. Despite all technological advancements, mobile measurements still represent a challenge and are only realistic if the measuring devices and sensor techniques are reliable, robust to unconstrained movements, low-cost, and easy to wear.

Various datasets have been specifically collected for the study of cognitive load. While most studies are based on statistical approaches or functional magnetic resonance imaging (fMRI) [17,18], alternative methods including physiological [19,20], text [21,22], speech [23,24], brain [25,26], and pupil change [27,28] analyses are used to detect cognitive load. The relationship between cognitive load and writing behavior was examined using the CLTex (Cognitive Load via Text), CLSkt (Cognitive Load Sketching), and CLDgt (Cognitive Load via Digits) datasets [29]. These datasets comprise writing samples of 20 subjects under three cognitive load levels, induced in a writing task experiment. Speech-based cognitive load examination is supported by the Cognitive Load with Speech and Electroglottography (CLSE) dataset [30]. It includes recordings of 26 subjects for the determination of a speaker's cognitive load during speech based on acoustic features. Mattys et al. developed an experiment that induces cognitive load via a concurrent visual search task to investigate the impact of cognitive load on the Ganong effect [24]. The effect of visual presentation was also investigated for the detection of cognitive load: Liu et al. present a contact-free method to improve cognitive load recognition from eye movement signals and, for this purpose, designed an experiment to induce cognitive load [31]. In their final project report for an AOARD grant, Chen et al. summarize research activities and issues related to multimodal cognitive load recognition in the real world. They examine the use of various electroencephalography (EEG) features, eye activities, linguistic features, skin conductance response, facial activities, and writing behavior. An extended version of the report is their book "Robust multimodal cognitive load measurement", which presents all the related issues in detail [29].

As for the induction of emotional states, many studies exist focusing on basic emotions in either discrete (i.e., fear, anger, joy, sadness, surprise, or disgust) or dimensional (i.e., valence, arousal, dominance) models. These emotional states are typically induced using standardized pictures [32,33], for instance from the International Affective Picture System (IAPS) [34], or using audiovisual stimuli [35] such as movie clips [36,37] or music clips [38]. Emotional states can also be induced using game scenarios by asking the user to perform a certain task [39]. This elicitation method is especially useful for the induction of HCI-relevant emotional states such as *Frustration* and *Interest* [40]. These states are relevant in designing efficient and easy-to-use interactive systems [41], in interactive educational and social applications [42], and in therapeutic settings, for instance by providing tailored feedback to reduce *Frustration* states [43].

Taylor et al. conducted a study to induce *Frustration* in subjects by introducing latency between the user's touch and the reaction of the breakout engine [44]. A more recent study on *Frustration* is given by Aslam et al., who examine the effects of annoying factors in HCI on feelings of *Frustration* and disappointment [45]. To induce *Frustration*, they asked the subjects to fill in a registration form, which fails twice due to intentional system errors before succeeding on the third attempt. Additionally, Lisetti et al. designed an experiment for the elicitation of six emotions, including *Frustration*, in the context of HCI [46]. They collected physiological data via wearable computers and included classification results of three different supervised learning algorithms. In their paper on human-robot interaction, Liu et al. present a comparative study of four machine learning methods using physiological signals for the recognition of five different emotions, including *Frustration* [47].

In her article "Interest—the curious emotion", Silvia focuses on the role of *Interest* in learning and motivation and describes its central role in cultivating knowledge and expertise [48]. Additionally, Reeve et al. present a concept of *Interest* in three ways: as a basic emotion, as an affect, and as an emotion schema [49]. They explain the importance of *Interest* in educational settings as a means to motivate high-quality engagement that leads to positive learning outcomes, and as an enrichment of motivational and cognitive resources that leads to a high-vitality experience rather than exhaustion. According to Ellsworth, *Interest* can be related to the uncertainty of a positive event, which may also lead to curiosity and hope, while lack of control often results in *Frustration*, which, if sustained, can lead to desperation and resignation [50]. Thus, in an HCI context, providing excitement through an appropriate degree of uncertainty might increase *Interest*, while providing a certain level of controllability, for instance by preventing unexplained system errors, can reduce *Frustration*. The recognition of *Frustration* and the system's reaction to turn it into a positive *Interest* state are critical for avoiding negative affective consequences and valuable for enhancing positive interaction effects.

Despite the many studies investigating emotional and cognitive states, particularly *Overload*, *Underload*, *Frustration*, and *Interest*, their measurement still poses many challenging issues, especially with respect to multimodal, mobile, and transtemporal acquisition. Additionally, regarding the validation of the experimental induction, most studies limit their validation to one subjective modality. Further, previous studies restrict their induction to either cognitive or emotional elicitation and rarely include both kinds of states in one single dataset. In this paper, we address these issues and present a database for affective computing research, based on the systematic induction of cognitive load (*Overload*, *Underload*) and specific emotions relevant to HCI (*Interest*, *Frustration*) as well as a neutral and a transition state (*Normal*, *Easy*) (see Section 2.2). The database is (1) designed and acquired in a mobile interactive HCI setting, (2) based on multimodal sensor data, (3) acquired transtemporally at different recording times, and (4) validated via three different subjective modalities. Combining these challenging issues related to mobile, interactive, multimodal, transtemporal, and validated acquisition into one large dataset for both cognitive and emotional states is the main contribution of this work.

In the next section (Section 2), the methods are described, including the participants and cohorts, the interaction scheme, the experiment structure, the technical implementation, and the multimodal sensor infrastructure. Subsequently (Section 3), the results are presented, including the generated *uulmMAC* database, the validation via questionnaires and subjective feedback, as well as the data annotation. Finally (Section 4), we conclude with a discussion and a summary of the results.

#### **2. Materials and Methods**

An experimental mobile, interactive, and multimodal emotional-cognitive load scenario was designed and implemented for the induction of various cognitive and emotional states in an HCI setting. Based on this mobile and interactive scenario, multimodal data were acquired, generating the *University of Ulm Multimodal Affective Corpus (uulmMAC)*. The basic concept of our cognitive load scenario follows a generic scheme from Schüssel et al., who proposed a gamified setup for the exploration of various aspects with potential influence on users' way of interaction [51]. The generic scheme is, however, an abstract foundation for HCI exploration with no specific application field. The induction of emotional and cognitive states depends on various factors related to the specific nature of human reactions [52]. Therefore, for our research question focusing on emotional and cognitive state induction in real-life HCI, the development of the current experiment required an in-depth adjustment and re-implementation of the original generic paradigm to comply with the induction requirements of cognitive load and affective states. The main development contributions include the design of the interaction sequence scheme inducing cognitive, emotional, and neutral states (Section 2.2), the development of the experiment structure (Section 2.3), and the software implementation and platform embedding (Section 2.4). Furthermore, for the experimental data acquisition, we developed and implemented a technical infrastructure with a multimodal sensor system for the distributed experimental and recording setup (Section 2.5).

#### *2.1. Participants and Cohort Description*

The *uulmMAC* dataset consists of two homogeneous samples totaling 60 participants (30 females, 30 males; 17–27 years; mean age = 21.65 years, SD = 2.65) with a total of 100 recording sessions (N = 100) of about 45 minutes each. The 60 subjects are medical students and were recruited through bulletin notices distributed on the campus of Ulm University. The first sample includes 40 subjects who underwent one measurement each, while the second sample consists of 20 subjects who underwent three measurements each. The three measurements were acquired at three different times, with a time interval of one week in between. The second sample thus allows, for instance, the investigation of additional transtemporal research questions. While both samples underwent exactly the same experiment, they differ slightly in one acquired modality: the first sample does not include facial electromyography (EMG) measurements, allowing better conditions for the analysis of facial expressions via video data. Both samples are evenly balanced between males and females. All subjects gave their informed consent for inclusion before they participated in the experiment, and the study was approved by the Ethics Committee of Ulm University (Project: C4 - SFB TRR62).

In summary, the original dataset of *uulmMAC* consists of 100 individual recording sessions: The first sample with 40 recording sessions (40 subjects × 1 measurement) and the second sample with 60 recording sessions (20 subjects × 3 measurements).

#### *2.2. The Interaction Scheme*

The goal of the experiment was the induction of various dialog-based cognitive and emotional states in a real HCI environment. The participants were therefore asked by the system to solve a series of cognitive games in order to investigate their reactions to cognitive tasks of varying difficulty, ranging from interesting and overwhelming to boring and frustrating levels. The aim of each game task was to identify, in a visual search task, the single item that is unique in shape and color (i.e., the number 36 and the number 2 in Figure 1). The difficulty was set by adjusting the number of objects, shapes, and colors shown per task as well as the time available to solve that task. Thus, cognitive *Overload* was induced by increasing the number of task field objects and decreasing the available time, while cognitive *Underload* was induced by decreasing the number of task field objects and increasing the available time. Further, for each individual task, the subject could earn a certain amount of money (up to ten cents) according to the speed of the given response. The amount of reward money earned for solving a task was progressively reduced the longer the subject needed to answer. If the given answer was incorrect, the participant received no reward at all for that particular task. Figure 1 shows screenshots of the visual search task.

**Figure 1.** The visual search task on the example of *Overload* (**left**) and *Underload* (**right**) induction scheme. The user has to spot the single unique object. The correct answers are 36 for *Overload* (unique blue and square object) and 2 for *Underload* (unique red and pentagon object).
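The reward rule can be summarized in a few lines of code. The following C# sketch illustrates it under the assumption of a linear decay of the reward with response time; the exact decay curve used in the experiment is not specified here, so the function and its parameters are purely illustrative.

```csharp
public static class RewardSketch
{
    // Illustrative only: up to ten cents per task, decreasing with response time,
    // and no reward for incorrect answers. The linear decay is an assumption.
    public static double ComputeRewardCents(bool answerCorrect, double responseTimeSec, double timeLimitSec)
    {
        if (!answerCorrect || responseTimeSec > timeLimitSec)
            return 0.0;                                          // wrong or too late: no reward
        double remainingFraction = (timeLimitSec - responseTimeSec) / timeLimitSec;
        return 10.0 * remainingFraction;                         // fastest possible answer earns 10 cents
    }
}
```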

#### *2.3. Experiment Structure*

The experiment structure consists of six *induction sequences*, each followed by *subjective feedback* related to the actual sequence, a respiration *baseline*, and a *summary* of the results achieved in that sequence. While the induction sequences are used to induce various cognitive and emotional states, the subjective feedback is used for the validation of the induction. These parts are described in detail in the following subsections. Further, prior to the experiment, each subject received an introduction and instructions on the experimental steps in the form of a short PowerPoint presentation and was afterwards asked to fill in three questionnaires related to: (1) emotion regulation, based on the Emotion Regulation Questionnaire (ERQ) [53,54]; (2) emotional control, based on the Trait Emotional Intelligence Questionnaire Short Form (TEIQue-SF) [55,56]; and (3) personality traits, based on the Ten Item Personality Measure (TIPI) [57,58]. These questionnaires are also used as a further subjective evaluation of the stability of the induction paradigm.

#### 2.3.1. Induction Sequences

Six consecutive sequences of different difficulties, with 40 single tasks each, are implemented for the induction of six different emotional and cognitive load states. All tasks within a sequence thereby have the same or comparable difficulty levels. The first, introductory sequence is designed to induce *Interest* and is of moderate difficulty to gain the users' interest and familiarize them with the visual search task procedure. The *Interest* sequence has 40 tasks with a mix of 3 × 3 and 4 × 4 matrices and 10 s per task to give the right answer. The second sequence is designed to induce *Overload* and consists of 40 difficult tasks with a 6 × 6 matrix each and a short time of 6 s per task to provide an answer. The third sequence has a moderate *Normal* difficulty and is defined by 40 tasks with 4 × 4 matrices and a moderate time of 10 s to respond per task. This *Normal* sequence is the neutral (cognitive and emotional) state that serves as a baseline between the sequences. The fourth sequence is implemented as an *Easy* sequence with 40 tasks with 3 × 3 matrices and a very long time of 100 s for responding. In order to induce *Underload*, the fifth sequence is defined as a repetition of the previous *Easy* scheme of low difficulty, with again 40 tasks with 3 × 3 matrices and 100 s to provide an answer. This builds on the simple idea that repeating an easy, well-known task in the same way twice in a row generates a state of boredom and leads to *Underload*. Based on this idea, the *Easy* sequence is considered a transition state used as a means to induce *Underload*. Finally, the sixth and last sequence is intended to induce *Frustration* by deliberately logging a wrong answer at randomly distributed tasks (eight wrong out of 40), even when the subject provides a right answer. This *Frustration* sequence has 40 tasks with a mix of 3 × 3 and 4 × 4 matrices and 10 s to provide an answer. Table 1 summarizes the experimental procedure.


**Table 1.** Illustration of the experimental procedure and sequences description \*.

| Sequence | Induced State | Tasks | Matrix Size | Time per Task | Remarks |
|---|---|---|---|---|---|
| 1 | *Interest* | 40 | mix of 3 × 3 and 4 × 4 | 10 s | introductory, moderate difficulty |
| 2 | *Overload* | 40 | 6 × 6 | 6 s | difficult tasks, short response time |
| 3 | *Normal* | 40 | 4 × 4 | 10 s | neutral baseline state |
| 4 | *Easy* | 40 | 3 × 3 | 100 s | transition state of low difficulty |
| 5 | *Underload* | 40 | 3 × 3 | 100 s | repetition of the *Easy* scheme |
| 6 | *Frustration* | 40 | mix of 3 × 3 and 4 × 4 | 10 s | 8 of 40 answers deliberately logged as wrong |

\* Every sequence is followed by subjective feedback, a respiration baseline, and a results' summary.

The user-system interaction during all tasks is a mobile interaction conducted via natural speech, while the participants can freely move and walk around the room in a standing position. The walking area is limited to a field of 1 m × 3 m, defined by an antistatic floor mat to prevent signal disturbances caused by electrostatic charges.

#### 2.3.2. Subjective Feedback

In order to evaluate the validity of the induction paradigm, several kinds of subjective feedback are implemented, comprising *Free Speech*, *SAM Ratings*, and *Direct Questions* parts. These are presented to the subjects on the screen as illustrated in Figure 2. After each of the six accomplished sequences, the participants provided information about their current emotional state in three different ways: (1) expressing in their own words, via a *Free Speech* feedback of 12 s duration, how they felt during that particular sequence; (2) rating their emotions via Self-Assessment Manikin *SAM Ratings* on the Valence-Arousal-Dominance (VAD) scales; and (3) answering *Direct Questions* related to the assessment of their own performance. The aim of this subjective feedback is to determine the current subjective emotional state experienced in that particular sequence, which, in turn, can be used as ground truth to evaluate and validate the induction paradigm. While the Free Speech feedback is given via natural speech, the SAM Ratings and Direct Questions are logged via mouse click to ensure correct logging documentation. The user is thereby guided and instructed by the system via speech output. The user-system interaction modality (mouse, speech, or both) within the experiment is part of the technical implementation as described in Section 2.4.

**Figure 2.** Illustration of the subjective feedback screens including Free Speech (**left**), SAM Ratings (**middle**) and Direct Questions (**right**) parts. Note that the SAM Ratings are scored on a nine-point Likert scale (represented by both the big labeled fields and the small empty fields).
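As a sketch of how one per-sequence feedback entry could be represented when working with these annotations, the following C# class collects the three feedback parts described above. The field names and the keyed representation of the Direct Questions are assumptions, not the official uulmMAC schema; only the nine-point SAM scales and the 12 s free-speech recording follow the description in the text.

```csharp
using System.Collections.Generic;

// Hypothetical container for the subjective feedback given after one sequence.
public sealed class SequenceFeedback
{
    public string SequenceName { get; set; }            // e.g., "Overload", "Underload"
    public string FreeSpeechAudioPath { get; set; }      // 12 s free-speech recording
    public int SamValence { get; set; }                  // 1..9 on the SAM scale
    public int SamArousal { get; set; }                  // 1..9 on the SAM scale
    public int SamDominance { get; set; }                // 1..9 on the SAM scale
    public Dictionary<string, int> DirectQuestionAnswers { get; set; } = new Dictionary<string, int>();
}
```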

#### 2.3.3. Respiration Baseline and Results' Summary

Following the subjective feedback, the subjects perform a baseline phase consisting of a breathing exercise to level off the physiological reactions related to that particular sequence. The users are thereby guided by the system via speech to first breathe in deeply, then hold their breath for a few seconds, and finally breathe out. The exercise is repeated three times in succession. Finally, after the baseline phase, the system informs the user via speech about their performance during the last sequence, and the related results achieved, including the earned money, are presented on the screen.

#### *2.4. Technical Implementation*

The further development of the generic paradigm and the software implementation of the interaction scheme and experiment structure for the induction of various cognitive, stress, and affective states were realized in C# and integrated into the Semaine platform [59].

The workflow of the experiment, including the structure, order, and content of the different sequences as well as the subjective feedback and baseline sections in between, is defined in an external *taskset* file which is imported at the beginning of the experiment. Within a *taskset*, the course of the sequences can be defined individually for every task and every subject, allowing a high flexibility and an easy-to-handle workflow setup. Additionally, the user-system interaction modality (mouse, speech, or both) for every part of the experiment is predefined in this file. This also includes the text content (spoken and written) given by the system. The *taskset* describes the course of events of the entire experiment and is consistent for all participants, except for the second sample, whose subjects underwent three repeated measurements at three different times. For this group, the content of the speech output given by the system is slightly modified for the second and third measurements by using alternative synonyms while keeping the content the same. The intention here is to keep the interaction as natural as possible by preventing a repetition of exactly the same words every time.
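As a rough illustration of how such an externally defined workflow could be modeled in code, the following C# sketch declares a taskset structure and loads it from an XML file. The element names, attributes, and the assumption of an XML serialization are hypothetical; the actual uulmMAC taskset format is not specified here.

```csharp
using System.Collections.Generic;
using System.IO;
using System.Xml.Serialization;

[XmlRoot("Taskset")]
public class Taskset
{
    [XmlElement("Sequence")]
    public List<SequenceDefinition> Sequences { get; set; } = new List<SequenceDefinition>();
}

public class SequenceDefinition
{
    [XmlAttribute] public string Name { get; set; }                 // e.g., "Overload"
    [XmlAttribute] public int TaskCount { get; set; }               // 40 tasks per sequence
    [XmlAttribute] public int MatrixSize { get; set; }              // 3, 4, or 6
    [XmlAttribute] public int TimeLimitSeconds { get; set; }        // 6, 10, or 100 s
    [XmlAttribute] public string InteractionModality { get; set; }  // "speech", "mouse", or "both"
    [XmlElement("SystemPrompt")] public string SystemPrompt { get; set; }  // spoken/written system text
}

public static class TasksetLoader
{
    public static Taskset Load(string path)
    {
        var serializer = new XmlSerializer(typeof(Taskset));
        using (var stream = File.OpenRead(path))
        {
            return (Taskset)serializer.Deserialize(stream);
        }
    }
}
```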

During the visual search task, the user is instructed to give the answer via speech command. To recognize the speech content, our experimental implementation includes an integrated automated speech recognition algorithm. If well trained in advance, the speech recognition works properly in most cases. Nevertheless, in order to ensure a smooth interaction between the user and the system, a "Wizard of Oz" (WOZ) scenario was also implemented and used to support the integrated automated speech recognition algorithm. This was especially useful when the automated recognition failed, for instance because of the dialect of specific subjects that strongly diverged from the standard language on which the recognition algorithm was trained. Within the WOZ scenario, the experiment was observed on an external monitor in a separate room by the experimenter, who controlled and adjusted the (correct) logging of the given answers, if necessary.

Finally, the behavior of the subjects and all their actions, as well as the whole course of the experiment, are logged as time-stamped events. As a result, for every individual subject, a .log file containing all details of the experimental course is generated after every experiment and can be used for the later processing and analysis of the signal data.
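As a minimal sketch of how such a per-session .log file could be read into time-stamped events for aligning the signal data, the snippet below assumes a simple semicolon-separated line format; the actual uulmMAC log format and field layout are assumptions and may differ.

```csharp
using System;
using System.Collections.Generic;
using System.Globalization;
using System.IO;

public sealed class LogEvent
{
    public DateTime Timestamp { get; set; }
    public string Module { get; set; }
    public string Name { get; set; }
    public string Details { get; set; }
}

public static class SessionLogReader
{
    // Assumed line format: "timestamp;module;event;details" (hypothetical).
    public static IEnumerable<LogEvent> Read(string logPath)
    {
        foreach (var line in File.ReadLines(logPath))
        {
            var parts = line.Split(new[] { ';' }, 4);
            if (parts.Length < 3) continue;                       // skip malformed lines
            yield return new LogEvent
            {
                Timestamp = DateTime.Parse(parts[0], CultureInfo.InvariantCulture),
                Module = parts[1],
                Name = parts[2],
                Details = parts.Length > 3 ? parts[3] : string.Empty
            };
        }
    }
}
```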

#### *2.5. Multimodal Sensors for Data Acquisition*

In order to collect high-quality data for a wide range of multimodal analyses, two issues are particularly important regarding the technical data acquisition. First, a wide set of different modalities with maximum data quality for each sensor needs to be ensured. Second, the synchronization between all sensors and the user interface components has to be as congruent as possible. The sensors used here can be divided into two kinds: sensors attached to the participant and sensors mounted in the environment. To ensure high mobility of the participant, and, therefore, less influence on the participant's natural behavior, wireless sensors were used.

In particular, they include a small theatre stereo headset microphone with a frequency range of 20 Hz to 20 kHz, sampled at 48 kHz and transmitted via digital radio, and a g.tec g.MOBIlab+ Bluetooth amplifier for biophysiological sensors. The bioamplifier was equipped with sensors for electromyography (EMG), electrocardiography (ECG), skin conductance level (SCL), respiration, and body temperature at a sampling rate of 256 Hz. To ensure accurate recordings free of motion artifacts, the signals from the physiological sensors underwent an online monitoring check adapted for our experiment using Simulink® software. This online signal quality check was conducted during an initial baseline recorded at rest in a sitting position, prior to the first sequence of the experiment.

A stationary frontal webcam with an HD resolution of 1920 × 1080 pixels at 30 frames per second was used. Further, a Microsoft Kinect v2 was also mounted at the front. The Kinect provides a full HD RGB color video stream (1080p @ 30 Hz), an infrared (IR) video stream (512 × 424 @ 30 Hz), a depth stream (512 × 424 @ 30 Hz), a directed audio stream (virtual beam forming by a microphone array), and a pose estimation stream with skeleton information of 25 joints. The Kinect and the primary webcam were placed on top of the interaction screen in front of the scenery, looking towards the participant's face. Finally, a second webcam with a resolution of 1280 × 720 @ 30 fps was placed at the rear of the experimental setting in order to monitor the scenery overview and to capture the atmosphere sounds. Figure 3 shows the views from the frontal and rear cameras and the acquired depth information.

**Figure 3.** View of the frontal camera (**left**), depth information (**middle**) and scenery overview of the rear camera (**right**).

In summary, we recorded 16 sensor modalities, including four video streams (front/rear/Kinect RGB/Kinect IR), three audio streams (headset/directed array/atmosphere), seven biophysiological streams (3 × EMG/ECG/SCL/respiration/temperature), a depth stream, and a pose stream. Further, several label information streams extracted from an application log file, described later, were also recorded. After recording, all data were post-processed in order to verify high quality with respect to technical and signal quality issues. As a visualization tool, we used ATLAS [60,61] to present (and play back) all recorded data to the experts. Only sessions which passed all technical and manual quality checks belong to the final dataset of 100 (40 × 1 + 20 × 3) sessions. These are described in Section 3. In addition to the annotations extracted from the log file entries and from the structure of the experimental design, some additional labels are obtained by a semi-automatic active learning procedure as described in the Annotation section (Section 3.4) of this work.

Figure 4 shows an overview of the collected data of a single session displayed in the visualization tool ATLAS. All video streams, time-series data, and some label information are illustrated. The timescale is at minimum zoom, so the structure of the experimental phases can be seen in the upper annotation line. It is not possible to record this massive amount of data on a single PC, so we developed a modular network-based recording infrastructure called MAR2S (Multimodal Activity Recognition and Recording System). It contains a specific recording module for each sensor, which on the one hand controls that sensor according to its specific API. This can include preparation and initialization commands, trigger and timing control, data format transformations, disk read/write control of the streams, etc. On the other hand, each module implements the defined network commands and synchronization protocols. The modules are mostly written in C#, but since the inter-module communication runs over the network, there is no technical limitation to a specific programming language, operating system, or hardware type. Depending on the sensors' hardware and software requirements, in most cases more than one sensor can be grouped on a PC without the sensors influencing each other.

**Figure 4.** Overview of a whole recording session displayed in the multimodal annotation tool ATLAS: Kinect video (**top left**); Front webcam view, infrared and pose (first video row); rear camera, depth images (second video row); face position and simple facial estimations (**top right**). The data window contains from top to bottom: Sequence start and end information from logfile, audio, speech recognition information from logfile, stereo audio, front webcam, search, and answer phases including hit and miss from logfile, ECG, EMG, respiration, SCL.
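To make the role of these recording modules more concrete, the following C# sketch outlines the kind of contract each module has to fulfill. The interface and method names are hypothetical illustrations of the behavior described above, not the actual MAR2S API.

```csharp
using System;

// Hypothetical module contract: each sensor-specific recording module controls its sensor
// via the sensor's own API and obeys the common network timing/control protocol.
public interface IRecordingModule
{
    void Prepare();                                   // sensor-specific preparation and initialization
    void StartRecording(DateTime sessionStartUtc);    // start writing the stream after the synchronized "start"
    void StopRecording();                             // flush buffers and close the stream files
    void HandleNetworkCommand(string command);        // react to record timing and control messages
    TimeSpan MeasureRoundtripLatency();               // self-addressed roundtrip at the start of each session
}
```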

In addition to the sensor modules, the user interface (UI) and WOZ modules were also encapsulated in such network modules in order to control and monitor their behavior in the same synchronous manner. Finally, a logging module was established that acts like a sensor, recording not physical data but the whole system behavior. This includes exact time stamps of all participant and WOZ inputs, global information on the internal and external system states, information about the sensor states, any network communication, etc. With this log file and the recorded sensor streams, it is possible to reconstruct the whole experimental procedure in detail, up to a virtual playback without a real participant. Therefore, the data can not only be used for numerous offline analyses, but also for the development of real-time capable online recognition systems.

Finally, each involved PC had a network monitoring module measuring the current network latency to ensure synchronous recording. Due to the usage of off-the-shelf sensors like the Kinect sensor and webcams, which do not include physical trigger input capabilities, and the complex multi-PC network environment, we are not able to ensure synchronicity on a nanosecond level, as highly specialized, expensive, hardware-triggered setups do. In return, our setup is much more flexible and a great deal more realistic with regard to future end-user implementations on custom hardware and smart devices. The recorded emotions and mental states typically evolve over longer time spans, and multimodal recognition approaches typically use time windows from 50 ms up to several seconds. Thus, inter-modality delays of under one millisecond are acceptable. To ensure this, each involved PC was directly attached to a separate recording control sub-network containing just one switch transmitting only record timing and control information (no sensor streams; these are processed locally). Figure 5 shows the technical infrastructure of the distributed experimental and recording setup.

**Figure 5.** Technical infrastructure of the distributed experimental and recording setup.

The module which initiates the recording start also listens to its own sent "start" message and starts recording only after the message has returned to itself, to prevent the initiator module from leading in time. Each module further sends a roundtrip message to itself to measure the network latency at the beginning of each recording session. The roundtrip times can be seen in Table 2. Thus, we can assume that the average delay or desynchronization is within an acceptable range. Additionally, the synchronicity can be improved by taking the individual delays into account and shifting the timestamps after recording in the post-processing step. This correction is not applied to the raw data.


**Table 2.** Average network latency between recording modules and estimated maximum delay between modalities.
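A minimal sketch of this optional post-processing correction is given below: each modality's timestamps are shifted by its measured one-way delay, here approximated as half the roundtrip time. The variable names and the half-roundtrip assumption are illustrative only; as stated above, the raw data are left uncorrected.

```csharp
using System;
using System.Linq;

public static class SyncCorrection
{
    // Shift all timestamps of one modality by its estimated one-way network delay,
    // assuming a symmetric link (one-way delay is approximately half the measured roundtrip time).
    public static DateTime[] ShiftTimestamps(DateTime[] timestamps, TimeSpan roundtrip)
    {
        TimeSpan oneWayDelay = TimeSpan.FromTicks(roundtrip.Ticks / 2);
        return timestamps.Select(t => t - oneWayDelay).ToArray();
    }
}
```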
