# **Emotion Intelligence Based on Smart Sensing**

Edited by Mincheol Whang and Sung Park

Printed Edition of the Special Issue Published in *Sensors*

www.mdpi.com/journal/sensors

## **Emotion Intelligence Based on Smart Sensing**

Editors

**Mincheol Whang Sung Park**

MDPI • Basel • Beijing • Wuhan • Barcelona • Belgrade • Manchester • Tokyo • Cluj • Tianjin

*Editors* Mincheol Whang, Sangmyung University, Seoul, Korea

Sung Park, Sangmyung University, Seoul, Korea

*Editorial Office* MDPI, St. Alban-Anlage 66, 4052 Basel, Switzerland

This is a reprint of articles from the Special Issue published online in the open access journal *Sensors* (ISSN 1424-8220) (available at: https://www.mdpi.com/journal/sensors/special_issues/smart_sens).

For citation purposes, cite each article independently as indicated on the article page online and as indicated below:

LastName, A.A.; LastName, B.B.; LastName, C.C. Article Title. *Journal Name* **Year**, *Volume Number*, Page Range.

**ISBN 978-3-0365-6646-7 (Hbk) ISBN 978-3-0365-6647-4 (PDF)**

© 2023 by the authors. Articles in this book are Open Access and distributed under the Creative Commons Attribution (CC BY) license, which allows users to download, copy and build upon published articles, as long as the author and publisher are properly credited, which ensures maximum dissemination and a wider impact of our publications.

The book as a whole is distributed by MDPI under the terms and conditions of the Creative Commons license CC BY-NC-ND.

## **Contents**


Reprinted from: *Sensors* **2022**, *22*, 4023, doi:10.3390/s22114023


## **About the Editors**

#### **Mincheol Whang**

Mincheol Whang received the M.S. and Ph.D. degrees in biomedical engineering from the Georgia Institute of Technology, Atlanta, GA, USA, in 1990 and 1994, respectively. Since March 1998, he has been a Professor in the Department of Emotion Engineering, Graduate School, Sangmyung University, Seoul, South Korea. He has published more than 400 academic papers in human–computer interaction, emotion engineering, human factors, and bioengineering and is currently the dean of College of Convergence Engineering, Sangmyung University.

#### **Sung Park**

Sung Park received the M.S. degree in HCI from the University of Michigan, Ann Arbor in 2003 and the Ph.D. degree in engineering psychology from the Georgia Institute of Technology, Atlanta in 2009. Since 2009, he has been a senior UX designer and UX group leader at Samsung Electronics and SK Telecom. He has led the UX design of voice recognition AI speakers and social robots. He was a Professor at SCAD (Savannah College of Art and Design) from 2019 before joining Sangmyung University in 2022. His research interests include human–computer interaction, emotion engineering, human factors, and artificial intelligence.

## *Editorial* **Special Issue "Emotion Intelligence Based on Smart Sensing"**

**Sung Park <sup>1</sup> and Mincheol Whang <sup>2,\*</sup>**


Emotional intelligence is essential to maintaining human relationships in communities, organizations, and societies. By definition, emotional intelligence refers to how well emotion is recognized and expressed. The level of emotional intelligence of an AI is mainly determined by its ability to accurately and reliably recognize its human counterpart; that is, all next-generation AI devices and services involving VR, AR, and social robots are able to quantitatively track and recognize emotion in real-time during an interaction with a human.

Emotion has been quantified by sensing facial expressions, gestures, and physiological signals such as EEG, ECG, and EDA. In addition, emotion could be more accurately recognized by considering the emotional context, including spatiotemporal variability, the congruency of implicit and explicit responses, the consistency of human action, and human relationships in society. Human emotion includes not only short-term but also long-term responses to patterns and trends in daily life. Lab studies with the aim of sensing emotion should extend to smart sensing, which monitors and tracks emotional variation with a predictable pattern.

This Special Issue explores empirical studies of emotional mechanisms, qualitative and quantitative measurements of emotion, the recognition of emotional contexts, and the application of emotion. Fourteen papers were accepted for publication in this Special Issue entitled "Emotion Intelligence Based on Smart Sensing", which includes papers ranging from lab-based studies aimed at understanding emotional mechanisms to applying emotion recognition in the real world (e.g., in driving, games, education, and virtual avatars). They are summarized below.

The review paper in [1] presents a detailed analysis of over 600 papers related to sensors and methods to understand affective-, emotional-, and physiological-state recognition. Facial action coding and facial expression analysis are long-studied fields, as represented by four articles in our SI. While facial recognition systems in the real world (i.e., in an uncontrolled environment) have evolved with performance improvements, [2] proposed a multi-spectral facial recognition system that overcomes the limitation of a single spectral band in the visible spectrum. The multi-spectral facial recognition system is robust to occlusions (e.g., fog or plastic materials) and low- or no-light environments. The authors of [2] achieved 99.5% (pose variation) and 99.6% (expression variation) Rank-1 scores in the TUFTS multi-spectral database. As AI technology evolves rapidly, so does the facial expression recognition algorithm. The authors of [3] proposed a multi-depth network that classifies facial expressions by being fed reinforced features. A multi-rate-based 3D convolutional neural network (CNN) built on a multi-rate signal process scheme was suggested, and they achieved 96.23% accuracy with the CK+ dataset.

Building an emotionally intelligent system requires a better understanding of human facial expression characteristics. The authors of [4] investigated the differences in the intensity of facial expressions between older (n = 56) and younger adults (n = 113). The participants' facial expressions were elicited using facial expression stimuli. The results indicated that the older adults strongly expressed some negative and neutral emotions. In addition, older adults used more facial muscles than younger adults across emotions.

**Citation:** Park, S.; Whang, M. Special Issue: "Emotion Intelligence Based on Smart Sensing". *Sensors* **2023**, *23*, 1098. https://doi.org/10.3390/s23031098

Received: 13 December 2022 Accepted: 12 January 2023 Published: 18 January 2023

**Copyright:** © 2023 by the authors. Licensee MDPI, Basel, Switzerland. This article is an open access article distributed under the terms and conditions of the Creative Commons Attribution (CC BY) license (https://creativecommons.org/licenses/by/4.0/).

Human facial expressions include facial micromovements, which provide insights into fake expressions. The authors of [5] investigated the characteristics of real and fake facial expressions representing emotions by analyzing participants' facial micromovements. The results indicated significant differences in the micromovement feature variables between the real and fake expression conditions. The differences varied according to facial regions as a function of emotions.

This issue also includes a speech-emotion-recognition study [6] that proposed a multipath and group-loss-based network (MPGLN) for emotion recognition to support multidomain adaptation. The authors proposed a model that includes a bidirectional long short-term memory-based temporal feature generator and a transferred feature extractor from the pre-trained VGG-like audio classification model (VGGish). The model learns simultaneously based on multiple losses according to the association of emotion labels in the discrete and dimensional models.

The simultaneous activation of brain regions (i.e., brain connection features) is an essential mechanism of brain activity in emotion recognition, and this issue presents three EEG-based studies that advance this line of research. The authors of [7] investigated the relationship between brain connectivity (strength and directionality) and eye movement features (left and right pupils, saccades, and fixations) when participants (n = 47) viewed emotion-eliciting content. They found that the connectivity eigenvalues of the long-distance prefrontal lobe, temporal lobe, parietal lobe, and center were related to cognitive activity involving high valence. In addition, saccade movement was correlated with long-distance occipital–frontal connectivity. The authors of [8] investigated model-free functional connectivity metrics along with deep learning to efficiently classify human cognitive workload. They achieved state-of-the-art multi-class classification accuracy of 80.87% using a combination of MI (Mutual Information) and CNN, followed by 75.88% using a combination of PLV (Phase Locking Value) and CNN, and 71.87% using MI with LSTM. The authors of [9] constructed a learning emotion EEG dataset (LE-EEG) which captures physiological signals reflecting the emotions of boredom, neutrality, and engagement during learning, and proposed an EEG emotion classification network based on attention fusion (ECN-AF). On the LE-EEG dataset, the proposed model achieved the highest accuracy of 95.87%, demonstrating a 21.49% increase compared to the baseline models.
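To make the Phase Locking Value mentioned above more concrete, the following is a minimal sketch of its standard definition: the length of the mean unit vector of the phase differences between two signals. It is an illustration only and not the pipeline used in [8]. The function name and the synthetic phase series are assumptions; in practice the instantaneous phases would first be extracted from band-pass-filtered EEG (for example with a Hilbert transform), a step that is omitted here.

```python
import cmath
import math


def phase_locking_value(phases_a, phases_b):
    """Return the PLV of two equally long sequences of instantaneous phases (radians)."""
    if len(phases_a) != len(phases_b) or not phases_a:
        raise ValueError("phase sequences must be non-empty and equally long")
    # Average the unit vectors of the phase differences. The length of the mean
    # vector is 1 for a constant phase lag and close to 0 when the lag keeps changing.
    mean_vector = sum(
        cmath.exp(1j * (a - b)) for a, b in zip(phases_a, phases_b)
    ) / len(phases_a)
    return abs(mean_vector)


# Hypothetical data: phases of a 10 Hz oscillation sampled 200 times over one second.
n = 200
locked_a = [2 * math.pi * 10 * (i / n) for i in range(n)]
# Same oscillation with a fixed lag of pi/4: perfectly phase locked.
locked_b = [p - math.pi / 4 for p in locked_a]
# Lag that sweeps evenly around the full circle: no consistent phase relation.
drifting_b = [p - 2 * math.pi * i / n for i, p in enumerate(locked_a)]

plv_locked = phase_locking_value(locked_a, locked_b)
plv_drifting = phase_locking_value(locked_a, drifting_b)
```

With these synthetic inputs, `plv_locked` is close to 1 and `plv_drifting` is close to 0, which are the two extremes of the measure.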

Biological hormones are relatively less explored, but could provide insights into negative emotions such as fear or panic. The authors of [10] investigated catecholamines, which are hormones released in the body in response to physical or emotional stress. They analyzed physiological signals in reference to catecholamine through an experimental task whereby 21 female volunteers received audiovisual stimuli through an immersive virtual-reality environment.

The essence of emotional intelligence overlaps with empathy, a psychological construct. A system that analyzes whether a human is empathizing is paramount. The authors of [11] suggested a non-contact method for measuring empathy by evaluating the synchronization of facial micro-movements between consumers and people in the media. Their study shows that the non-contact ballistocardiography (BCG) method can be complementary to subjective empathy scales.

Finally, this issue also extends to studies applicable to the real world (e.g., in driving, games, and virtual agents). The authors of [12] proposed a data collection system that collects multimodal emotion datasets during real-world driving. The proposed system includes a self-reportable HMI application into which a driver directly inputs their current emotion state. To demonstrate the collected dataset's validity, the paper provides case studies for statistical analysis, driver face detection, and personalized driver emotion recognition. The authors of [13] used electrocardiograms (ECGs) to investigate heart rate variability (HRV) parameters that can quantitatively characterize game addiction. The participants played the game *League of Legends*, and the experimenter performed ECG measurements during the game at various window sizes and specific events. Correlation and factor analyses were used to find the most effective parameters. The most accurate set of parameters was found to be pNNI20, RMSSD, and LF within 30 s after the "being killed" event. The authors of [14] investigated elements that may affect a participant's social perceptions (similarity, familiarity, attraction, liking, and involvement) of customized virtual avatars engineered considering the user's facial characteristics. The results indicated that participants felt that the avatar that embodied their habitual expressions was more similar to them than the avatar that did not.
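Two of the HRV parameters named above, RMSSD and pNNI20, are time-domain measures that can be computed directly from successive RR (NN) intervals. The sketch below uses their common textbook definitions and is not the analysis code of [13]. The function names and the sample RR window are hypothetical, and the denominator used for pNNI20 (here, the number of successive differences) is an assumption, since conventions differ between toolkits. LF is a spectral measure and is not covered.

```python
import math


def successive_differences(rr_ms):
    """Differences between consecutive RR intervals, in milliseconds."""
    if len(rr_ms) < 2:
        raise ValueError("at least two RR intervals are required")
    return [later - earlier for earlier, later in zip(rr_ms, rr_ms[1:])]


def rmssd(rr_ms):
    """Root mean square of successive RR-interval differences."""
    diffs = successive_differences(rr_ms)
    return math.sqrt(sum(d * d for d in diffs) / len(diffs))


def pnni(rr_ms, threshold_ms=20):
    """Percentage of successive differences whose magnitude exceeds threshold_ms."""
    diffs = successive_differences(rr_ms)
    return 100.0 * sum(1 for d in diffs if abs(d) > threshold_ms) / len(diffs)


# Hypothetical RR intervals (ms) from a short analysis window.
rr_window = [800, 810, 790, 835, 830, 860]

window_rmssd = rmssd(rr_window)    # differences are 10, -20, 45, -5, 30
window_pnni20 = pnni(rr_window)    # only 45 and 30 exceed 20 ms in magnitude
```

For this window, two of the five successive differences exceed 20 ms, so `window_pnni20` is 40.0, and `window_rmssd` equals the square root of 690.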

**Funding:** This research received no external funding.

**Conflicts of Interest:** The authors declare no conflict of interest.

#### **References**


**Disclaimer/Publisher's Note:** The statements, opinions and data contained in all publications are solely those of the individual author(s) and contributor(s) and not of MDPI and/or the editor(s). MDPI and/or the editor(s) disclaim responsibility for any injury to people or property resulting from any ideas, methods, instructions or products referred to in the content.

## *Review* **A Review of AI Cloud and Edge Sensors, Methods, and Applications for the Recognition of Emotional, Affective and Physiological States**

**Arturas Kaklauskas <sup>1,\*</sup>, Ajith Abraham <sup>2</sup>, Ieva Ubarte <sup>3</sup>, Romualdas Kliukas <sup>4</sup>, Vaida Luksaite <sup>1</sup>, Arune Binkyte-Veliene <sup>3</sup>, Ingrida Vetloviene <sup>1</sup> and Loreta Kaklauskiene <sup>1</sup>**


**Abstract:** The detection and recognition of affective, emotional, and physiological states (AFFECT) by capturing human signals is a fast-growing area that has been applied across numerous domains. The research aim is to review publications on how techniques that use brain and biometric sensors can be used for AFFECT recognition, consolidate the findings, provide a rationale for the current methods, compare the effectiveness of existing methods, and quantify how likely they are to address the issues/challenges in the field. In efforts to better achieve the key goals of Society 5.0, Industry 5.0, and human-centered design, the recognition of emotional, affective, and physiological states is becoming increasingly important and offers tremendous growth of knowledge and progress in these and other related fields. In this research, a review of AFFECT recognition brain and biometric sensors, methods, and applications was performed, based on Plutchik's wheel of emotions. Due to the immense variety of existing sensors and sensing systems, this study aimed to provide an analysis of the available sensors that can be used to define human AFFECT, and to classify them based on the type of sensing area and their efficiency in real implementations. Based on statistical and multiple criteria analysis across 169 nations, our outcomes introduce a connection between a nation's success, its number of Web of Science articles published, and its frequency of citation on AFFECT recognition. The principal conclusions present how this research contributes to the big picture in the field under analysis and explore forthcoming study trends.

**Keywords:** review; human emotions; affective and physiological states; Plutchik's wheel of emotions; sensors; methods and applications; statistical and multiple criteria analysis; country success and publications maps of the world

#### **1. Introduction**

Global research in the field of neuroscience and biometrics is shifting toward the widespread adoption of technology for the detection, processing, recognition, interpretation and imitation of human emotions and affective attitudes. Due to their ability to capture and analyze a wide range of human gestures, affective attitudes, emotions and physiological changes, these innovative research models could play a vital role in areas such as Industry 5.0, Society 5.0, the Internet of Things (IoT), and affective computing, among others.

For hundreds of years, researchers have been interested in human emotions. Reviews on the applications of affective neuroscience include numerous related topics, such as the mirror mechanism and its role in action and emotion [1], the neuroscience of understanding emotions [2], consumer neuroscience [3], the role of positive emotions in education [4], mapping the brain as the basis of feelings and emotions [5], the neuroscience of positive emotions and affect [6], the cognitive neuroscience of music perception [7], and social cognition in schizophrenia [8]. Applications in neuroscience also include the analysis of cognitive neuroscience [9–11], and brain sensors [12,13], and works in the literature also discuss the recognition of basic emotions using brain sensors [14].

**Citation:** Kaklauskas, A.; Abraham, A.; Ubarte, I.; Kliukas, R.; Luksaite, V.; Binkyte-Veliene, A.; Vetloviene, I.; Kaklauskiene, L. A Review of AI Cloud and Edge Sensors, Methods, and Applications for the Recognition of Emotional, Affective and Physiological States. *Sensors* **2022**, *22*, 7824. https://doi.org/10.3390/s22207824

Academic Editors: Mario Munoz-Organero, Mincheol Whang and Sung Park

Received: 18 August 2022 Accepted: 12 October 2022 Published: 14 October 2022

**Publisher's Note:** MDPI stays neutral with regard to jurisdictional claims in published maps and institutional affiliations.

**Copyright:** © 2022 by the authors. Licensee MDPI, Basel, Switzerland. This article is an open access article distributed under the terms and conditions of the Creative Commons Attribution (CC BY) license (https://creativecommons.org/licenses/by/4.0/).

Studies of the applications of affective biometrics can be found in the literature in the fields of brain biometric analysis [15], predictive biometrics [16], keystroke dynamics [17], applications in education [18], consumer neuroscience [19], adaptive biometric systems [20], emotion recognition from gait analyses [21], ECG databases [22], and others. Several works on affective states have integrated multiple biometric and neuroscience methods, but none have included an integrated review of the application of neuroscience and biometrics and an analysis of all of the emotions and affective attitudes in Plutchik's wheel of emotions.

Scientists analyzed various brain and biometric sensors in the reviews [23–26]. Curtin et al. [23], for instance, state that both fNIRS and rTMS sensors have changed significantly over the past decade and have been improved (their hardware, neuronavigated targeting, sensors, and signal processing), thus clinicians and researchers now have more granular control over the stimulation systems they use. Krugliak and Clarke [26], da Silva [24], and Gramann et al. [27] analyzed the use of EEG and MEG sensors to measure functional and effective connectivity in the brain. Khushaba et al. [25] used brain and biometric sensors to integrate EEG and eye tracking for assessing the brain response. Other scientists [28–33] used the following biometric sensors in their studies: heart rate, pulse rate variability, odor, pupil dilation and contraction, skin temperature, face recognition, voice, signature, gestures, and others.

Indeed, the biometrics and neuroscience field has been the focus of studies by many researchers who have achieved significant results. A number of neuroscience studies have analyzed the detection and recognition of human arousal [34], valence [35,36], affective attitudes [36,37], emotional [38–41], and physiological [42] states (AFFECT) by capturing human signals.

Though most neuroimaging approaches disregard context, the hypothesis behind situated models of emotion is that emotions are honed for the current context [43]. According to the theory of constructed emotion, the construction of emotions should be holistic, as a complete phenomenon of brain and body in the context of the moment [44]. Barrett [45] argues that rather than being universal, emotions differ across cultures. Emotions are not triggered—they are created by the person who experiences them. The combination of the body's physical characteristics, the brain (which is flexible enough to adapt to whatever environment it is in), and the culture and upbringing that create that environment, is what causes emotions to surface [45]. Recently, there have been attempts in the academic community to supply contextual (from cultural and other circumstances) analysis [46,47].

Various theories and approaches (positive psychology [48–50], environmental psychology [51–53], ergonomics—human factors science [54–56], environment–behavior studies, environmental design [57–59], ecological psychology [60,61], person–environment–behavior [62], behavioral geography [63], and social ecology research [64]) also emphasize emotion context sensitivity.

The objective of this research is to provide an overview of the sensors and methods used in AFFECT (affective, emotional, and physiological states) recognition, in order to outline studies that discuss trends in brain and biometric sensors, and give an integrated review of AFFECT recognition analysis using Plutchik's [65] wheel of emotions as the basis. Furthermore, the research aim is to review publications on how techniques that use brain and biometric sensors can be used for AFFECT recognition. In addition, this is a quantitative study to assess how the success of the 169 countries impacted the number of Web of Science articles on AFFECT recognition techniques that use brain and biometric sensors that were published in 2020 (or the latest figures available).

In this paper, we identify the critical changes in this field over the past 32 years by applying text analytics to 21,397 articles indexed by Web of Science from 1990 to 2022. For this review, we examined 634 publications in detail. We have analyzed the global gap in the area of neuroscience and affective biometric sensors and have aimed to update the current big picture. The aforementioned research findings are the result of this work.

When emotions as well as affective and physiological states are determined by recognition sensors and methods—and, later, when such studies are put to practice—a number of issues arise, and we have addressed these issues in this review. Moreover, our research has filled several research gaps and contributes to the big picture as outlined below:


The following sections present the results of this study, a discussion, the conclusions we can draw, and avenues for future research. The method is presented in Section 2. Section 3 summarizes the emotion models. In Section 4, we discuss brain and biometric AFFECT sensors, classifications of biometric and neuroscience methods and technologies, and emotions, and explore the use of traditional, non-invasive neuroscience methods and widely used and advanced physiological and behavioral biometrics. Section 4 also summarizes prior research and studies techniques for the recognition of arousal, valence, affective attitudes, and emotional and physiological states (AFFECT) in more detail. We summarize existing research on users' demographic and cultural backgrounds, socioeconomic status, diversity attitudes, and the context in Section 5. We present our research results in Section 6, the evaluation of biometric systems in Section 7, and finally, a discussion and our conclusions in Section 8.

#### **2. Method**

The research method we used can be broken down as follows: (1) formulating the research problem; (2) examining the most popular emotion models, identifying the best option among them for our research (Section 3), and creating the Big Picture for the model; (3) carrying out a review of publications in the field (Section 4); (4) raising and confirming two hypotheses; (5) collecting data; (6) using the INVAR method for multiple criteria analysis of 169 countries; (7) determining correlations; (8) developing three maps to illustrate the way the success of the 169 countries impacts the number of Web of Science articles on AFFECT (emotional, affective, and physiological states) recognition and their citation rates; (9) developing two regression models; and (10) consolidating the findings, providing a rationale for the current methods, comparing the effectiveness of existing methods, and quantifying how likely they are to address the issues and challenges in the field. The following ten steps of the method describe the proposed algorithm and its experimental evaluation in detail.

Furthermore, the research aim is to review publications on how techniques that use brain and biometric sensors can be used for AFFECT recognition, consolidate the findings, provide a rationale for the current methods, compare the effectiveness of existing methods, and quantify how likely they are to address the issues/challenges in the field (Step 1). We have analyzed the global gap in the area of neuroscience and affective biometric sensors and have set the goal of updating the current big picture. The findings of the research above framed the problem.

Step 2 of the research was to examine the most popular emotion models (Section 3) and identify the best option among them for our research. We chose Plutchik's wheel of emotions; one of the main reasons is that the model enables integrated analysis of human emotional, affective, and physiological states.

Step 3 was to review sensors, methods, and applications that can be used in the recognition of emotional, affective, and physiological states (Section 4). We have identified the major changes in the field over the past 32 years through a text analysis of 21,397 articles indexed by Web of Science from 1990 to 2022. We searched for keywords in three databases (Web of Science, ScienceDirect, Google Scholar) to identify studies investigating the use of both neuroscience and affective biometric sensors. A total of 634 studies that used both neuroscience and affective biometric sensor techniques in the study methodology were included, and no restrictions were placed on the date of publication. Studies investigating any population group, of any age or gender, were considered in this work.

A set of keywords related to biometric and neuroscience sensors was used for the above search of three databases. Two main sets of keywords, "sensors + biometrics + emotions" and "sensors + neuroscience/brain + emotions", were used in our main search. More specific search terms related to biometrics (e.g., eye tracking, blinking, iris, odor, heart rate), neuroscience/brain techniques (e.g., EEG, MEG, TMS, NIRS, SST) and their components (e.g., algorithms, functionality, performance) were also used to refine the search. For each candidate article, the full text was accessed and reviewed to determine its eligibility. The primary results and article conclusions were identified, and discrepancies were resolved by way of discussion. The studies differed significantly in terms of protocol design, signal processing, stimulation methods, the equipment used, the study population, and statistical methods.

In Step 4, two central hypotheses were raised and confirmed:

**Hypothesis 1.** *There is an interconnection between a country's success, its number of Web of Science articles published, and its citation frequency on AFFECT recognition. When there are changes in the country's success, its number of Web of Science articles published, and its citation times on AFFECT recognition, the countries' 7 cluster boundaries remain roughly the same (*Section 6*).*

**Hypothesis 2.** *Increases in a country's success usually go hand in hand with a jump in its number of Web of Science articles published and its citation times on AFFECT recognition.*

Next, in Step 5, we collected data. The determination of the success of 169 countries and the results obtained are described in detail in a study by Kaklauskas et al. [66]. This study used data [66] from the framework of variables taken from a number of databases and websites, such as the World Bank, Eurostat-OECD, the World Health Organization, Global Data, Global Finance, Transparency International, Freedom House, Knoema, Socioeconomic Data and Applications Center, Heritage, the Global Footprint Network, Climate Change Knowledge Portal (World Bank Group, Washington, DC, USA), the Institute for Economics and Peace, and Our World in Data; global and national statistics and publications were also used. We based our research calculations on publicly available data from 2020 (or the latest available).

We used the INVAR method [67] to conduct a multi-criteria examination of the 169 nations; the outcomes can be found in Section 6 (Step 6). This method determines a combined indicator for whole nation success. This combined indicator is in direct proportion to the corresponding impact of the values and significances of the specified indicators on a nation's success. The INVAR method was used to conduct multiple criteria analyses of different groups of countries, such as the former Soviet Union [68], Asian countries [69], and the global analysis of 169 [66] and 173 [70] countries.
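The idea of a combined indicator that grows with both the values and the significances (weights) of the individual indicators can be illustrated with a generic weighted-sum score. The sketch below is not the INVAR procedure described in [67]; it is a simplified stand-in under stated assumptions: all indicator values are positive, each indicator is normalized by its sum across countries, and indicators for which smaller is better are subtracted. The function name, country labels, indicators, and weights are all hypothetical.

```python
def weighted_sum_scores(values, weights, maximize):
    """Score each alternative as the weighted, sum-normalized total of its indicators.

    values:   {name: [indicator values]}, same indicator order for every name
    weights:  significance of each indicator (assumed to sum to 1)
    maximize: True where a larger value is better, False where smaller is better
    """
    column_sums = [
        sum(row[i] for row in values.values()) for i in range(len(weights))
    ]
    scores = {}
    for name, row in values.items():
        score = 0.0
        for value, weight, total, larger_is_better in zip(
            row, weights, column_sums, maximize
        ):
            contribution = weight * value / total
            # Beneficial indicators raise the score, detrimental ones lower it.
            score += contribution if larger_is_better else -contribution
        scores[name] = score
    return scores


# Hypothetical example: two countries, one beneficial and one detrimental indicator.
indicator_values = {"Country A": [30.0, 2.0], "Country B": [10.0, 6.0]}
scores = weighted_sum_scores(
    indicator_values, weights=[0.6, 0.4], maximize=[True, False]
)
```

In this toy example, Country A scores higher than Country B because it has the larger share of the beneficial indicator and the smaller share of the detrimental one.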

Step 7 of the study presents the median values of the correlations for the 169 countries, their publications, and their citations (Section 6). It was found that the median correlation of the dependent variable of the Publications—Country Success model with the independent variables (0.6626) is higher than in the Times Cited—Country Success model (0.5331). Therefore, it can be concluded that the independent variables in the Publications—Country Success model are more closely related to the dependent variable than in the Times Cited—Country Success model.
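As an illustration of how a median correlation of this kind can be obtained, the sketch below computes the Pearson correlation between a dependent variable and each of several independent indicators and then takes the median of those coefficients. The choice of Pearson correlation and of signed (rather than absolute) coefficients is an assumption, as the text does not specify either; the data and indicator names are hypothetical.

```python
import math
import statistics


def pearson(xs, ys):
    """Pearson correlation coefficient of two equally long numeric sequences."""
    mean_x, mean_y = statistics.fmean(xs), statistics.fmean(ys)
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / math.sqrt(var_x * var_y)


def median_correlation(dependent, indicators):
    """Median of the correlations between the dependent variable and each indicator."""
    correlations = [pearson(values, dependent) for values in indicators.values()]
    return statistics.median(correlations)


# Hypothetical data for five countries.
publications = [1, 2, 3, 4, 5]
success_indicators = {
    "indicator_a": [2, 4, 6, 8, 10],   # perfectly aligned with publications
    "indicator_b": [5, 4, 3, 2, 1],    # perfectly opposed
    "indicator_c": [1, 3, 2, 5, 4],    # partially aligned
}

median_r = median_correlation(publications, success_indicators)
```

Here the three coefficients are 1.0, -1.0, and 0.8, so `median_r` is 0.8. Comparing such medians across two models is the kind of comparison reported above for the Publications and Times Cited models.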

In Step 8, we developed three maps that illustrate the way the success of the 169 countries impacts the number of Web of Science articles on AFFECT (emotional, affective, and physiological states) recognition and their citation rates. The Country's Success and AFFECT Recognition Publications (CSP) Maps of the World are a convenient way to illustrate how the three predominant CSP dimensions (a country's success, the numbers of publications, and the frequency of articles being cited) are interconnected for the 169 countries, while the CSP models allow for these connections to be statistically analyzed from various perspectives. It also allows for CSP dimensions to be forecast based on the country's success criteria. In other words, the CSP models give us a more detailed analysis of the CSP dimensions through statistical and multi-criteria analysis, while the CSP maps (Section 6) are more of a way to present the results in a visual manner. The amount of data available is gradually increasing, as is the knowledge gained from research conducted around the world. As a result, the CSP models are becoming better and better, and providing a clearer reflection of the actual picture. This means that they can effectively facilitate research and innovation policy decisions.

In Step 9, we created two regression models (Section 6). For the multiple linear regressions, we used IBM SPSS V.26 to build two regression models on 15 indicators of country success [66] and the three predominant CSP dimensions (Section 6). Step 9 entailed the construction of regression models for the number of publications and their citation rates, and the calculation of the effect size indicators describing them. Two dependent variables and 15 independent variables were analyzed to construct these regression models. The process was as follows:


It was found that changes in the values of the Country Success variable explain 89.5% of the variance of the Publications variable and 54.0% of the variance of the Times Cited variable. Additionally, when the value of the Country Success variable increases by 1%, the number of Web of Science articles published increases by 1.962% and their citations increase by 2.101%. A reliability analysis of the compiled regression models allows us to conclude that the models are suitable for analysis (*p* < 0.05). The 15 country success indicators explained 69.4% and 51.18% of the number of Web of Science articles published and their citations, respectively.
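A statement of the form "a 1% increase in one variable goes with a 1.962% increase in another" is how a slope is usually read when both variables enter a regression in logarithmic form. The text does not state which transformation was used, so the log-log form in the sketch below is an assumption. The code fits a simple ordinary least squares line to hypothetical, noise-free data and reports the slope together with the share of variance explained (R²); it is not a reproduction of the SPSS models.

```python
import math


def fit_simple_ols(xs, ys):
    """Fit y = intercept + slope * x by least squares; return slope, intercept, R^2."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    ss_res = sum((y - (intercept + slope * x)) ** 2 for x, y in zip(xs, ys))
    ss_tot = sum((y - mean_y) ** 2 for y in ys)
    return slope, intercept, 1.0 - ss_res / ss_tot


# Hypothetical data generated so that publications = 3 * success ** 1.962.
success = [10.0, 20.0, 40.0, 80.0]
publications = [3.0 * s ** 1.962 for s in success]

# Regressing log(publications) on log(success) recovers the exponent as the slope.
slope, intercept, r_squared = fit_simple_ols(
    [math.log(s) for s in success],
    [math.log(p) for p in publications],
)
```

Because the synthetic data contain no noise, the fitted slope equals 1.962 and R² equals 1 up to rounding; with real country data the slope would be estimated with error and R² would be lower, as in the percentages reported above.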

Step 10 was to assess the biometric systems under analysis: the rationale behind the available biometric and brain approaches was outlined, the efficacy of existing methods compared, and their ability to address issues and challenges present in the field determined (Section 7).

#### **3. Emotion Models**

First, this chapter will discuss emotion models in more detail. Then, we will choose the best option for our research and look at the Big Picture, i.e., the links between the selected emotion model and biometric and brain sensors, and the trends.

Emotional responses are natural to humans, and evidence shows they influence thoughts, behavior, and actions. Emotions fall into different groups related to various affects, corresponding to the situation currently being experienced [71]. People encounter complex interactions in real life, and respond to them with complex emotions that are often blends [72]. Emotional responses are the way our brain and body deal with our environment, which is why they are fluid and depend on the context around us [73].

Two fundamental viewpoints underlie approaches to the classification of emotions: (a) emotions are discrete constructs with fundamental differences, and (b) emotions can be grouped and characterized on a dimensional basis [74]. These classifications (emotions as discrete categories and dimensional models of emotion) are briefly analyzed next.

In word recognition, alternative models have so far received little interest, and one example is the discrete emotion theory [75]. This theory posits that there is a limited set of universal basic emotions hardwired through evolution, and that each of the wide variety of affective experiences can essentially be categorized into this limited set [76,77]. The discrete emotion theory states that many emotions can be distinguished on the basis of expressive, behavioral, physiological, and neural features [78]. According to the definition provided by Fox [79], emotions are consistent and discrete response processes that can include verbal, physiological, behavioral, and neural mechanisms. They are triggered and changed by external or internal stimuli or events and respond to the environment. Russell and Barrett [80] argue that, unlike the discrete emotion theory, their alternative models can account for the rich context-sensitivity and diversity of emotions. Emotion blends can be of three kinds: (a) positive-blended emotions, which are blends of only positive emotions; (b) negative-blended emotions, which are blends of only negative emotions; and (c) mixed emotions, which are blends of both positive and negative emotions, as well as neutral ones. For example, the way teachers have described blended emotions reflects that mathematics teaching involves many complex tasks, where the teacher has to continuously gauge the level of progress [81].

Emotional dimensions represent the classes of emotion. Categorized emotions can be characterized in a dimensional form, with each emotion located in a different location in space, for example in 2D (the circumplex model, "consensual" model of emotion, and vector model) or 3D (the Lövheim cube, the pleasure–arousal–dominance (PAD) emotional state model, and Plutchik's model) [82].

The circumplex model [83] proposes that affective states arise from two independent neurophysiological systems: one related to valence (a pleasure–displeasure continuum) and the other to arousal (activation–deactivation) [84]. Each emotion can be understood as a linear combination of these two dimensions, that is, as varying degrees of valence and arousal [83,85]. We have already applied Russell's circumplex model of emotions to perform a review of the sensors and methods used in human emotion recognition [85].
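To make the two-dimensional reading concrete, the sketch below places an affective state on the valence–arousal plane and reports its angle, its distance from the neutral origin, and its quadrant. It is an illustration only: the function name, the [-1, 1] scaling of both axes, and the quadrant labels are our assumptions and are not taken from [83–85].

```python
import math


def circumplex_position(valence, arousal):
    """Locate a state on the valence-arousal plane.

    Both inputs are assumed to be scaled to [-1, 1]. Returns
    (angle_degrees, intensity, quadrant), where the angle is measured
    counter-clockwise from the positive valence axis.
    """
    angle = math.degrees(math.atan2(arousal, valence)) % 360.0
    intensity = math.hypot(valence, arousal)
    if valence >= 0 and arousal >= 0:
        quadrant = "pleasant-activated"
    elif valence < 0 and arousal >= 0:
        quadrant = "unpleasant-activated"
    elif valence < 0:
        quadrant = "unpleasant-deactivated"
    else:
        quadrant = "pleasant-deactivated"
    return angle, intensity, quadrant


# Hypothetical readings: a pleasant, aroused state and an unpleasant, calm one.
pleasant_aroused = circumplex_position(1.0, 1.0)
unpleasant_calm = circumplex_position(-0.5, -0.5)
```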

The vector model comprises two vectors. The model holds that there is an underlying dimension of arousal, with a binary choice of valence that determines direction. This results in two vectors that both start at zero arousal and neutral valence and proceed as straight lines, one in the direction of positive valence and the other in the direction of negative valence. Typically, the vector model uses direct scaling of the dimensions of each individual stimulus [86,87].

The positive activation–negative activation (PANA) or "consensual" model of emotion assumes that there are two separate systems: positive affect and negative affect. In the PANA model, the vertical axis represents low to high positive affect, and the horizontal axis represents low to high negative affect [88]. Positive Affect (PA) represents the extent (from low to high) to which a person shows enthusiasm for life, while Negative Affect (NA) represents the extent to which a person is feeling upset or unpleasantly aroused. The two dimensions are independent and uncorrelated [89].

The Pleasure–Arousal–Dominance (PAD) Emotional-State Model offers a general three-dimensional approach to measuring emotions [90]. This 3D model includes the three dimensions of pleasure–displeasure (P), arousal–nonarousal (A), and dominance–submissiveness (D) as basic factors of emotional response [91], and these dimensions span different emotions. For instance, pleasure can be happy/unhappy, hopeful/despairing, satisfied/unsatisfied, pleased/annoyed, content/melancholic, and relaxed/bored. Arousal can be excited/calm, stimulated/relaxed, wide-awake/sleepy, jittery/dull, frenzied/sluggish, and aroused/unaroused. Dominance can be important/awed, dominant/submissive, influential/influenced, controlling/controlled, in control/cared-for, and autonomous/guided [92]. The neuro-decision and neuro-correlation tables, the inverted U-curve theory, the PAD emotional state model, and neuro-decision making are used to evaluate the impact of digital twin smart spaces (such as indoor air quality, lighting intensity and colors, learning materials, images, smells, music, pollution, and others) on users, to track their response dynamics in real time, and then to react to this response [93].

The PAD is thus composed of three subscales, reflecting pleasure, arousal, and dominance, which together can represent a wide range of emotional states [92]. The affective space model, in turn, makes it possible to visualize the distribution of emotions along the two axes of valence (V) and arousal (A). Using this model, different emotions can be identified, such as happiness, calmness, fear, and sadness [94].

The Swedish neurophysiologist Lövheim proposed a direct relation between specific combinations of the levels of three signal substances (serotonin, noradrenaline, and dopamine) and eight basic emotions [95]. In the resulting three-dimensional model, the Lövheim cube of emotion, the signal substances form the axes of a coordinate system, and the eight basic emotions are placed in the eight corners of this 3D space. In this model, anger is produced by the combination of high noradrenaline, high dopamine, and low serotonin [96].
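The cube can be written down as a lookup from the three high/low levels to a corner emotion, as in the sketch below. Only the anger corner is stated in the text above; the remaining seven corners follow our reading of Lövheim's original proposal and should be checked against [95]. The [0, 1] scaling of the levels, the 0.5 threshold, and the function name are hypothetical.

```python
# Keys are (serotonin_high, dopamine_high, noradrenaline_high).
# Only the anger corner is given in the text; the other corners are our
# reading of Lövheim's proposal and should be verified against [95].
LOVHEIM_CUBE = {
    (False, False, False): "shame/humiliation",
    (False, False, True): "distress/anguish",
    (False, True, False): "fear/terror",
    (False, True, True): "anger/rage",
    (True, False, False): "contempt/disgust",
    (True, False, True): "surprise",
    (True, True, False): "enjoyment/joy",
    (True, True, True): "interest/excitement",
}


def lovheim_emotion(serotonin, dopamine, noradrenaline, threshold=0.5):
    """Map three levels (assumed normalized to [0, 1]) to the nearest corner."""
    key = (
        serotonin >= threshold,
        dopamine >= threshold,
        noradrenaline >= threshold,
    )
    return LOVHEIM_CUBE[key]


# Low serotonin with high dopamine and high noradrenaline.
example = lovheim_emotion(serotonin=0.1, dopamine=0.9, noradrenaline=0.9)
```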

The eight main categories of emotions defined by Robert Plutchik in the 1980s include two equal groups opposite to each other: half are positive emotions and the other half are negative ones [97]. To visualize the eight primary emotion dimensions (fear, trust, surprise, anticipation, anger, joy, disgust, and sadness), eight sectors have been isolated [98]. The Emotion Wheel shows each of the eight basic emotions highlighted with a recognizable color [99]. When we add another dimension, the Wheel of Emotions becomes a cone, with its vertical dimension representing intensity. Moving from the outside towards the wheel's center, emotions intensify, and this is highlighted by the indicator color. The intensity of emotions decreases towards the outer edge and the color, correspondingly, becomes less intense [98,99]. When feelings intensify, one feeling can turn into another: annoyance into rage, serenity into ecstasy, interest into vigilance, apprehension into terror, acceptance into admiration, pensiveness into grief, distraction into amazement, and, if left unchecked, boredom can become loathing [98]. Some emotions have no color marking; they are a mix of two primary emotions [98,99]. Joy and anticipation, for instance, combine to become optimism. When anticipation combines with anger, it becomes aggressiveness. The combination of trust and fear is submission, joy and trust combine to become love, surprise and fear become awe, the pair of disgust and anger becomes contempt, sadness and disgust combine to become remorse, and surprise and sadness become disapproval [100].
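The intensity gradations and the two-emotion mixes listed above lend themselves to simple lookup tables. The sketch below encodes them in that form; the data structure and function names are ours, and the assignment of each mild/intense pair to its basic emotion follows the usual layout of the wheel rather than an explicit statement in the text.

```python
# (mild, basic, intense) for each of the eight sectors of the wheel.
INTENSITY_LEVELS = {
    "anger": ("annoyance", "anger", "rage"),
    "joy": ("serenity", "joy", "ecstasy"),
    "anticipation": ("interest", "anticipation", "vigilance"),
    "fear": ("apprehension", "fear", "terror"),
    "trust": ("acceptance", "trust", "admiration"),
    "sadness": ("pensiveness", "sadness", "grief"),
    "surprise": ("distraction", "surprise", "amazement"),
    "disgust": ("boredom", "disgust", "loathing"),
}

# Mixes of two primary emotions, as listed in the text.
# frozenset keys make the lookup independent of argument order.
PRIMARY_DYADS = {
    frozenset({"joy", "anticipation"}): "optimism",
    frozenset({"anticipation", "anger"}): "aggressiveness",
    frozenset({"trust", "fear"}): "submission",
    frozenset({"joy", "trust"}): "love",
    frozenset({"surprise", "fear"}): "awe",
    frozenset({"disgust", "anger"}): "contempt",
    frozenset({"sadness", "disgust"}): "remorse",
    frozenset({"surprise", "sadness"}): "disapproval",
}


def blend(first, second):
    """Return the mixed emotion for two primaries, or None if none is listed."""
    return PRIMARY_DYADS.get(frozenset({first, second}))


def intensify(basic_emotion):
    """Return the most intense form of a basic emotion (the wheel's center)."""
    return INTENSITY_LEVELS[basic_emotion][2]
```

For instance, `blend("trust", "joy")` gives `"love"` regardless of argument order, while a pair that the text does not list, such as joy and sadness, gives `None`.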

After analyzing these emotion models, we decided to choose Plutchik's wheel of emotions for our research. The ability to analyze human emotional, affective, and physiological states in an integrated manner offered by this model is one of the main reasons for our choice. The wheel is briefly discussed below.

Several ways to classify emotions have been proposed in the field of psychology. For that purpose, the basic emotions are first identified, and any other, more complex emotion can then be clustered with them [101]. Plutchik [65] proposed a classification scheme based on eight basic emotions arranged in a wheel of emotions, similar to a color wheel. Just as with complementary colors, this setup allows the conceptualization of primary emotions by placing similar emotions next to each other and opposites 180 degrees apart. Plutchik's wheel of emotions classifies these eight basic emotions on the basis of their physiological purpose [102]. Emotions are coordinated with the body's physiological responses. For example, when you are scared, your heart rate typically increases and your palms become sweaty. There is ample empirical evidence that physiological responses accompany emotion [103]. Another parallel with colors is the fact that some emotions are primary emotions and other emotions are derived by combining these primary emotions. The two models share important similarities, and such modelling can also serve as an analytical tool to understand personality. In this case, a third dimension has been added to the circumplex model to represent the intensity of emotions. The structural model of emotions is, therefore, shaped like a cone [104].

Figure 1 demonstrates Plutchik's wheel of emotions, biometrics and brain sensors, and trends and interdependence in this Big Picture stage. At the center of the circles is Plutchik's wheel of emotions, which also includes affective attitudes (interest, boredom). Plutchik [65] notes that the same instinctual source of energy is discharged as part of the emotion felt and the underlying peripheral physiological process. Emotions can be of various levels of arousal or degrees of intensity [105]. Looking at the intensity of Plutchik's eight basic emotions, Kušen et al. [106] identified variations in emotional valence. The first circle, therefore, analyses, directly or indirectly, human arousal, valence, affective attitudes, and emotional and physiological states (AFFECT). Human AFFECT can be measured by means of neuroscience and biometric techniques. The market and global trends are a constant force affecting neuroscience and biometric technologies and their improvement. Based on the analysis of global sources [107–110] and our experience, Figure 1 presents brain and biometric sensors, as well as technique trends. Sensors will be able to integrate more and more new technologies and collect a greater variety of data, as they become more accurate, more flexible, cheaper, smaller, greener, and more energy-efficient [108–110]. Network neuroscience, a new, explicitly integrative approach towards brain structure and function, seeks new ways to record, map, model, and analyze what constitutes neurobiological systems and what interactions happen inside them. Two trends enable and drive this approach: the computational tools and theoretical framework of modern network science, and the availability of new empirical tools to extensively map and record the way shifting patterns link molecules, neurons, brain areas, and social systems [107].

**Figure 1.** Plutchik's wheel of emotions, biometrics and neuroscience sensors, and trends.

Figure 2 shows numerous sciences and areas in which neuroscience and biometrics analyze AFFECT. According to Sebastian [111], neuroeconomics is the study of how anticipating monetary decisions affects our brain. It has solidified as an academic, unifying field that seeks to describe the mechanisms of the decision-making process and links economic behavior and the decision-making process with economic disposition. The procedure of neuroeconomics involves the integration of behavioral experiments and brain imaging in order to better understand the workings behind individual and collective decision-making [112]. Serra [113] reported that neuroeconomics researchers utilize neuroimaging devices such as functional magnetic resonance imaging (fMRI), magnetic resonance imaging (MRI), repetitive transcranial magnetic stimulation (rTMS), transcranial direct-current stimulation (tDCS), positron emission tomography (PET), and electroencephalography (EEG). The majority of challenges probed by neuroeconomics researchers are basically similar to the problems a marketing researcher would recognize as part of their functional domain [114]. Kenning and Plassmann [115] have also defined neuroeconomics as the application of neuroscientific methods to the analysis and understanding of economically relevant behavior.

**Figure 2.** Neuroscience and biometric branches analyzing AFFECT in various sciences and fields.

According to Wirdayanti and Ghoni [116], neuromanagement draws on psychology and human biology to study decision-making in the management sciences. As stated by Teacu Parincu et al. [117], neuromanagement is aimed at investigating the activity of the human brain and mental processes when people are confronted with management challenges, using cognitive neuroscience, in addition to other scientific disciplines and technology, to evaluate economic and managerial problems. It focuses on neurological activities that are related to decision-making and develops personal as well as organizational intelligence (team intelligence). It also centers on the planning and management of people (for example, selection, training, group interaction, and leadership) [118].

Neuro-Information Science can be defined as the science that observes neurophysiological reactions that are connected with the peripheral nervous system and are, in turn, connected to conventional cognitive activities. Michalczyk et al. [119] stated that neuro-information-systems research has developed into a conventional approach in the information systems (IS) discipline for evaluating and understanding user behavior. Riedl et al. [120] and Michalczyk et al. [119] concluded that neuro-information-systems research comprises studies that are centered on all types of neurophysiological techniques, such as functional magnetic resonance imaging (fMRI), electroencephalography (EEG), functional near-infrared spectroscopy (fNIRS), electromyography (EMG), hormone studies, or skin conductance and heart rate evaluations, as well as magnetoencephalography (MEG) and eye-tracking (ET).

Neuro-Industrial Engineering (NeuroIE), brought about by the synergy between neuroscience and industrial engineering, has afforded solutions centered on the physiological status of people. Ma et al. [121] reported that NeuroIE obtains objective and real data by analyzing the human brain and physiological indexes with advanced brain AFFECT devices and biofeedback technology. It then evaluates the data, adds neural activities and physiological status to the evaluation process as new constituents of operations management, and finally achieves better human–machine integration by modifying the work environment and production system in line with people's reactions to the system, thereby preventing mishaps and enhancing efficiency and quality. According to Ma et al. [121], Neuro-Industrial Engineering is centered on humans and relies on human physiological status data (e.g., EEG, EMG, GSR, and Temp). Zev Rymer [122] also stated that the application of Neuro-Industrial Engineering is multidisciplinary in that it cuts across the neurological sciences (particularly neurology and neurobiology) in addition to different engineering disciplines such as simulation, systems modeling, robotics, signal processing, material sciences, and computer sciences. The area encompasses a range of topics and applications, for example, neurorobotics, neuroinformatics, neuroimaging, neural tissue engineering, and brain–computer interfaces.

As soon as a user contacts an insurer, a bank, or any other call center, a version of Cogito's software known as Dialog could be active in the background, assisting the client service agent in dealing with the client. Should the user become upset or angry, the client service agent can ensure that the necessary actions are taken to satisfy the client. Cogito calls this service "digital intuition". It is particularly useful in call centers, as it can give feedback about communications in real time. The software can also analyze the speed at which callers speak and the dynamic range of their voices. For example, significant variations in pitch and stress in a caller's tone could signify excitement or anger, while less dynamism, i.e., a monotonous flat tone, could imply a lack of interest or unconcern. Some companies use the software to help their employees engage new patients in healthcare projects that help control health challenges such as obesity or asthma. Cogito is among the recent profit-based research companies whose focus is on the evaluation of signals that people give off subconsciously and that expose their mindset. The evaluation of these kinds of social signals is beneficial beyond call centers and meeting rooms. According to Hodson [123], keeping track of conversations in operating theaters or plane cockpits could help surgeons and pilots to be aware of whether their colleagues are really attentive to their directives, possibly saving lives.
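The pitch-variation cue described above can be illustrated with a toy heuristic. The sketch below is not Cogito's method, which is not described here; it simply measures how widely a caller's fundamental frequency ranges, in semitones, and labels the contour as flat or dynamic. The input format (a list of per-frame pitch estimates in Hz from some pitch tracker, with 0 marking unvoiced frames), the thresholds, and all names are hypothetical.

```python
import math
import statistics


def pitch_dynamics(f0_hz, flat_below=2.0, dynamic_above=8.0):
    """Label a pitch contour by its range in semitones.

    f0_hz: per-frame fundamental frequency estimates in Hz (0 = unvoiced).
    The two thresholds are illustrative placeholders, not validated values.
    Returns (span_semitones, label).
    """
    voiced = [f for f in f0_hz if f > 0]
    if len(voiced) < 2:
        raise ValueError("need at least two voiced frames")
    reference = statistics.median(voiced)
    # Express each frame relative to the speaker's median pitch, so the
    # measure does not depend on whether the voice is high or low overall.
    semitones = [12.0 * math.log2(f / reference) for f in voiced]
    span = max(semitones) - min(semitones)
    if span < flat_below:
        label = "flat"
    elif span > dynamic_above:
        label = "dynamic"
    else:
        label = "moderate"
    return span, label


monotone = pitch_dynamics([120.0, 121.0, 0.0, 119.0, 120.0])
animated = pitch_dynamics([110.0, 180.0, 95.0, 0.0, 220.0, 140.0])
```

A flat contour on its own does not establish disinterest, and a dynamic one does not distinguish excitement from anger; a practical system would combine such features with speaking rate and other cues.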

Several areas where we can apply the technology of recognizing emotions from speech include human–computer interactions and call centers [124].

#### **4. Brain and Biometric AFFECT Sensors**

#### *4.1. Classifications*

Globally, several classifications of biometric and neuroscience methods and technologies are used. Our research focuses on neuroscience methods that are non-invasive. The use of non-invasive brain stimulation is widespread in studies of neuroscience [125]. The non-invasive neuroscience methods are: transcranial magnetic stimulation (TMS), electroencephalography (EEG), magnetoencephalography (MEG), positron emission tomography (PET), functional magnetic resonance imaging (fMRI), near infrared spectroscopy (NIRS), diffusion tensor imaging (DTI), steady-state topography (SST), and others [126–134]. These non-invasive neuroscience methods are described in detail in Section 3. In the future, the authors of this article plan to analyze invasive neuroscience methods, too.

Biometrics can be physical or behavioral. In the first case, emotions can be identified by physical features, including the face, and in the second case by behavioral characteristics, including gait, voice, signature, and typing patterns [135]. Various sensors can measure physiological signals, known as biometrics, capturing the response of bodily systems to things that are experienced through our senses, as well as things imagined, by tracking sleep architecture, heart rate variability (HRV), respiratory rate (RR), and resting heart rate (RHR) [136].

The scientific literature classifies biometrics into certain types. Stephen and Reddy [137] and Banirostam et al. [138], for instance, classify biometrics into three categories: physiological, behavioral, and chemical/biological. Yang et al. [139] distinguish physiological and behavioral traits. Kodituwakku [140] believes biometric technology can be classified into two general categories: physiological biometric techniques and behavioral biometric techniques. Jain et al. [141] and Choudhary and Naik [142] also classify biometrics into two categories: physiological and behavioral. In the literature, not only signature, voice, and gait are considered behavioral biometric features, but also ECG, EMG, and EEG [143], while other authors distinguish cognitive biometrics [144,145], including electroencephalography (EEG), electrocardiography (ECG), electrodermal response (EDR), blood volume pulse (BVP), near-infrared spectroscopy (NIRS), electromyography (EMG), eye trackers (pupillometry), hemoencephalography (HEG), and related technologies [145]. Some scientific sources claim that eye tracking is a behavioral biometric [146], while others claim that it is a measurement in physiological computing [147]. Physiological biometrics measures physiological signals to determine identity, as well as to authenticate users and analyze their emotions. Respiration, perspiration, heartbeat, eye reactions to light, brain activity, emotions, and even body odor can be measured for numerous purposes, including physical and logical access control, payments, health monitoring, liveness detection, and neuromarketing [136].

Scientists identify the following AFFECT biometric types [139–142,148–150]:


Biometric technologies are usually divided into those of the first and second generation [151]. First-generation biometrics can confirm a person's identity in a quick and reliable way, or authenticate them in different contexts, and law enforcement is one of the areas where such solutions are employed in practice [152]. The primary purpose of first-generation biometrics is identity verification, such as facial recognition, and the technology is built around simple sensors that capture physical features and store them for later use [153]. Second-generation biometrics can also be used to detect emotions, with electro-physiologic and behavioral biometrics (e.g., based on ECG, EEG, and EMG) as examples of such technologies [154]. Second-generation biometrics measure individual patterns of learned behavior or physiological processes, rather than physical traits, and are also known as behavioral biometrics [155]. Second-generation biometrics can analyze and evaluate emotions and detect intentions [156]. The use of second-generation biometrics enables wireless collection of data about the body. The data can then be used to infer an individual's intent and emotions, as well as to track emotions across spaces [151,157]. We examine only physiological effects affected by emotional reactions (i.e., second-generation biometrics), and the use of biometric patterns for the identification of individuals is not discussed in this study.

A diverse range of AI algorithms have been applied for AFFECT recognition, for example machine learning, artificial neural networks, search algorithms, expert systems, evolutionary computing, natural language processing, metaheuristics, fuzzy logic, genetic algorithms, and others. Some of the most important supervised (classification, regression), unsupervised (clustering), and reinforcement learning algorithms of machine learning are common as tools in biometrics or neuroscience research to detect emotions and affective attitudes, and are listed below:


#### *4.2. Brain AFFECT Devices and Sensors*

Neuroscience is associated with multiple fields of science, for example chemistry, computation, psychology, philosophy, and linguistics. Research areas of neuroscience include the behavioral, molecular, operative, evolutionary, cellular, and therapeutic features of the nervous system. The neuroscience market encompasses technology (electrophysiology, neuro-microscopy, whole-brain imaging, neuroproteomics analysis, animal behavior analysis, neuro-functional study, etc.), components (services, instruments, and software), and end-users (healthcare centers, academic and research institutions, diagnostic laboratories, etc.) [197]. Global Industry Analysts Inc. (San Jose, CA, USA) [197] has previously grouped the global neuroscience market into instruments, software, and services based on components.

Neuroscience provides valuable insights into the structural design of the brain and into neurological, physical, and psychological activities. It helps neurologists to understand the various components of the brain, which can assist in the development of medications and techniques to treat and prevent many neurological anomalies. The rising death rate resulting from several neurological disorders, such as Parkinson's disease, Alzheimer's, schizophrenia, and other brain-related health challenges, is the basic factor driving the growth of the neuroscience market [198]. According to Neuroscience Market [198], the increasing demand for neuroimaging devices and the ongoing brain mapping research and evaluation projects are other crucial growth-inducing factors.

Neuroscience covers a whole range of branches, such as neuroevolution, neuroanatomy, developmental neuroscience, neuroimmunology, cellular neuroscience, neuropharmacology, clinical neuroscience, cognitive neuroscience, nanoneuroscience, molecular neuroscience, neurogenetics, neuroethology, neurochemistry, neurophysics, paleoneurobiology, neurology, and neuro-ophthalmology.

Other branches of neuroscience analyze AFFECT in various related sciences and fields, such as affective neuroscience [199,200], neuroinformatics [201,202], neuroimaging [203,204], systems neuroscience [205,206], computational neuroscience [207,208], neurophysiology [51,209], behavioral neuroscience [210,211], neural engineering [212,213], neuroeconomics [214,215], neurolinguistics [216,217], neuropsychology [218–220], neurophilosophy [221–223], neuroaesthetics [224–226], neurotheology [227–229], neuropolitics [230–232], neurolaw [233–235], social neuroscience [236,237], cultural neuroscience [238,239], neuroliterature [240–242], neurocinema [243–245], neuromusicology [246–248], and neurogastronomy [249,250].

For example, Lim [251] identifies the following neuroscientific techniques for neuromarketing:

• Electromagnetic methods, including magnetoencephalography (MEG), electroencephalography (EEG), and steady-state topography (SST). MEG measures the magnetic fields produced by the brain's natural electrical currents and is used to track the changes that occur when participants see or interact with various presentation outputs. EEG is related to the ways in which brainwaves change and is used to detect changes when participants see or interact with various promotional outputs (an electrode band or helmet is used for this purpose). SST measures a steady-state visually evoked potential, and is used to determine how brain activities change depending on the task;


Table 1 demonstrates traditional non-invasive neuroscience methods.


**Table 1.** Traditional non-invasive neuroscience methods.


For clarity, several descriptions of traditional neuroscience methods are presented below. Wearable healthcare devices store a lot of sensitive personal information, which makes the security of these devices essential. Sun et al. [272] proposed an acceleration-based gait recognition method to improve gait-based recognition of the elderly. Gait is also a good indicator in health assessment, and Majumder et al. [273] created a simple wearable gait analyzer for the elderly to support healthcare needs.

Lim [251] states that neuroscientific methods and tools include those that track, chart, and record the activity of a person's neural system and brain in relation to a certain behavior, and neurological representations of this activity can then be generated to shed light on how an individual's brain and nervous system respond when the person is exposed to a stimulus. In this way, neuroscientists can observe the neural processes as they happen in real time. There are three main types of neuroscientific method: those that track what is happening inside the brain (metabolic and electromagnetic activity); those that track what is happening at the neural level outside the brain; and those that can influence neural activity (Table 1, Figure 1).

Technical details on non-invasive neuroscience methods, including the origin of the measured signal and the engineering/physical principles of the sensors, are provided in the research literature for EEG [274–276], MEG [277–279], TMS [280–282], etc.

Gannouni et al. [283] have proposed a new approach to emotion recognition using EEG signals. To achieve better emotion recognition using brain signals, they applied a novel adaptive channel selection method. The basis of this method is the acknowledgment that different persons have unique brain activity that also differs from one emotional state to another. Gannouni et al. [283] argue that emotion recognition using EEG signals needs a multi-disciplinary approach, encompassing areas such as psychology, engineering, neuroscience, and computer science. With the aim of improving the reproducibility of emotion measurement based on EEG, Apicella et al. [35] have proposed an EEG-based emotional valence detection method, and their experiments showed an accuracy of 80.2% in cross-subject analysis and 96.1% in within-subject analysis. Dixson et al. [284] have pointed out that facial hair may interfere with detection of emotional expressions in a visual search. However, facial hair may also interfere with the detection of happy expressions within the face in the crowd paradigm, rather than facilitating an effect of anger superiority as a potential system for threat detection.

Wang et al. [285] introduced an EEG-based emotion recognition system to classify four emotion states (joy, sadness, fear, and relaxed). Their experiments used movie elicitation to acquire EEG signals from their subjects [285]. The way in which meditation influences emotional response was investigated via EEG functional connectivity of selected brain regions as the subjects experienced happiness, anger, sadness or were relaxed, before and after meditation.
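Functional connectivity can be quantified in several ways, and the text does not state which measure was used in the meditation study. As one simple illustration, the sketch below computes the Pearson correlation between every pair of channel time series; the channel names and sample values are hypothetical.

```python
import math


def pearson(x, y):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)


def connectivity_matrix(channels):
    """Correlate every pair of channels.

    channels: dict mapping a channel name to its time series.
    Returns a dict keyed by (name_a, name_b).
    """
    names = list(channels)
    return {
        (a, b): pearson(channels[a], channels[b])
        for a in names
        for b in names
    }


# Toy signals: F4 follows F3 exactly (scaled), Pz moves in the opposite direction.
eeg = {
    "F3": [1.0, 2.0, 3.0, 4.0],
    "F4": [2.0, 4.0, 6.0, 8.0],
    "Pz": [4.0, 3.0, 2.0, 1.0],
}
connectivity = connectivity_matrix(eeg)
```

Comparing such matrices before and after an intervention, for each elicited emotion, is one way to express a change in connectivity between selected regions.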

Neurometrics is a quantitative EEG method. Looking at individual records, this method provides a reproducible, precise estimate of deviations from normal. Meaningful and reliable results are ensured only by a sufficient amount of good-quality raw data that has been transformed to Gaussian distributions, correlated with age, and corrected for intercorrelations among measures [286]. Businesses, government agencies, and individuals apply techniques based on neurometric information when they need to make timely and profitable decisions. These techniques are based on biometric information, eye tracking, facial action coding, and implicit response testing, and are used to understand and record human sentiments and other related feedback [161].
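The idea of expressing an individual record as a deviation from age-appropriate norms can be sketched as follows. This is a simplified illustration, not the procedure of [286]: it assumes a log transform to approximate a Gaussian distribution and a straight-line age trend, it omits the correction for intercorrelations among measures, and the normative values and function names are hypothetical.

```python
import math


def fit_age_norm(ages, powers):
    """Fit ln(power) = intercept + slope * age on a normative sample.

    Returns (intercept, slope, residual_sd).
    """
    logs = [math.log(p) for p in powers]
    n = len(ages)
    mean_age = sum(ages) / n
    mean_log = sum(logs) / n
    sxx = sum((a - mean_age) ** 2 for a in ages)
    sxy = sum((a - mean_age) * (v - mean_log) for a, v in zip(ages, logs))
    slope = sxy / sxx
    intercept = mean_log - slope * mean_age
    residuals = [v - (intercept + slope * a) for a, v in zip(ages, logs)]
    # Two parameters were estimated, hence n - 2 degrees of freedom.
    residual_sd = math.sqrt(sum(r * r for r in residuals) / (n - 2))
    return intercept, slope, residual_sd


def neurometric_z(age, power, norm):
    """Z-score of one measurement relative to the age-expected value."""
    intercept, slope, residual_sd = norm
    expected = intercept + slope * age
    return (math.log(power) - expected) / residual_sd


# Hypothetical normative sample: a band power that declines with age.
norm = fit_age_norm(
    ages=[20, 30, 40, 50, 60],
    powers=[10.0, 9.0, 8.5, 7.0, 6.8],
)
expected_at_45 = math.exp(norm[0] + norm[1] * 45)
z_typical = neurometric_z(45, expected_at_45, norm)
z_elevated = neurometric_z(45, 2.0 * expected_at_45, norm)
```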

The fronto-striatal network is involved in a range of cognitive, emotional, and motor processes, such as decision-making, working memory, emotion regulation, and spatial attention. Practice shows that intermittent theta burst transcranial magnetic stimulation (iTBS) modulates the functional connectivity of brain networks. Treatments of mood disorders usually involve high stimulation intensities and long stimulation intervals in transcranial magnetic stimulation (TMS) (Figure 3) therapy [287].

**Figure 3.** Resting state TMS brain scan image [287].

One imaging technique is FDG-PET/fMRI (simultaneous [18F]-fluorodeoxyglucose positron emission tomography and functional magnetic resonance imaging). This technique makes it possible to image the cerebrovascular hemodynamic response and cerebral glucose uptake. These two sources of energy dynamics in the brain can provide useful information. Another very useful technique for characterizing interactions between distributed brain regions in humans has been resting-state fMRI connectivity, while metabolic connectivity can be a complementary measure to investigate the dynamics of the brain network. Functional PET (fPET), a new approach with high temporal resolution, can be used to measure fluoro-D-glucose (FDG) uptake and looks like a promising method to assess the dynamics of neural metabolism [288]. Figure 4 shows raw images of signal intensity variation across the brain for one individual subject.

**Figure 4.** Raw images of fPET and fMRI scans [288].

Many biological tissues composed of fibers, which are groups of cells aligned in a uniform direction, have anisotropic properties. In the human brain, for instance, axons within the white matter regions usually form complex fiber tracts that enable anatomical communication and connectivity. Non-invasive tools can show the groups of axonal fibers visually. One of them is diffusion tensor imaging (DTI), a magnetic resonance method that is one particular application of the broader diffusion-weighted imaging (DWI). The basic principle behind this technique is that water diffuses more slowly as it moves perpendicular to the preferred direction, whereas in the direction aligned with the internal structure the diffusion is more rapid. The DTI outputs can be further used to compute diffusion anisotropy measures such as the fractional anisotropy (FA). The principal direction of the diffusion tensor can also be used to obtain estimates related to the white matter connectivity in the brain. Figure 5 shows an example of DTI tractography, or visualization of the white matter connectivity [289].

**Figure 5.** DTI can be used to construct a transversely isotropic model by overlaying axonal fiber tractography on a finite element mesh: (**a**) DTI-informed Finite Element Model; tractography shows complex fibers from (**b**) the dorsal view, (**c**) the right lateral side view, and (**d**) the posterior view. Cartography of the tracts' position, direction by color: red for right-left, blue for foot-head, green for anterior-posterior [289].
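As an illustration of the anisotropy measure mentioned above, the following minimal sketch computes fractional anisotropy from the three eigenvalues of a diffusion tensor using the standard textbook definition. The function name and the example eigenvalues are hypothetical and are not taken from [289].

```python
import math

def fractional_anisotropy(l1, l2, l3):
    """Return FA for the three eigenvalues of a diffusion tensor.

    FA = sqrt(3/2) * sqrt(sum((l_i - mean)^2)) / sqrt(sum(l_i^2)),
    which is 0 for isotropic diffusion and approaches 1 when diffusion
    happens along a single direction.
    """
    mean = (l1 + l2 + l3) / 3.0
    num = (l1 - mean) ** 2 + (l2 - mean) ** 2 + (l3 - mean) ** 2
    den = l1 ** 2 + l2 ** 2 + l3 ** 2
    if den == 0.0:
        return 0.0  # no diffusion at all: treat as isotropic
    return math.sqrt(1.5 * num / den)

# Hypothetical eigenvalues (arbitrary units), chosen only for illustration.
isotropic = fractional_anisotropy(1.0, 1.0, 1.0)    # equal diffusion in all directions
fiber_like = fractional_anisotropy(1.7, 0.3, 0.3)   # fast along one axis, slow across it
```

With these inputs the isotropic case gives an FA of zero, while the fiber-like case gives a value close to 0.8, reflecting faster diffusion along the fiber direction than across it.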

#### *4.3. Physiological and Behavioral Biometrics*

Physiological biometrics (as opposed to behavioral biometrics) is a category of approaches that refers to physical measurements of the human body, including face, pupil constriction and dilation [290]. When a recognition system is based on physiological characteristics it can ensure a comparatively high accuracy [291]. The ubiquity of electronics such as cell phones and computers, and evolving sensor technology offer human beings new possibilities to track their behavioral and physiological features and evaluate the associated biometric results. Advances in mobile devices mean they now have many efficient and complex sensors. Biometric technology often contributes to mobile application growth, including online transaction efficiency, mobile banking, and voting. The global market for biometric systems is wide and comprises many different segments such as healthcare, transportation and logistics, security, military and defense, government, consumer electronics, and banking and finance [292].

Table 2 presents widely used physiological and behavioral biometrics.


#### **Table 2.** Physiological and behavioral biometrics.


Most of today's eye tracking systems are video-based, with an eye video camera and infrared illumination. Eye tracking systems can be categorized as tower-mounted, mobile, or remote based on how they interface with the environment and the user (Figure 6) and different video-based eye tracking systems are required depending on the experiment, the environment, and the type of activity to be studied [313]. Researchers have used eye-tracking for behavioral research.

**Figure 6.** Sample of various kinds of eye-tracking tools: (**a**) eye-tracking glasses [314]; (**b**) helmet-mounted [315]; (**c**) remote or table [316].

The left image in Figure 7 shows the last frame of an expression of surprise on a sample face from the Cohn–Kanade database and highlights the trajectories (the bright lines that change color from darker to brighter from their start to end) followed by each tracked feature point. The left image is the result of applying feature optical flow to a subset of 15 points, and the right image shows the application of the dense flow method [317].

**Figure 7.** Facial expression recognition: (**a**) feature point tracking; (**b**) dense flow tracking [317].

A group of participants was tested to record facial EMG (fEMG) activity. Following the guidelines for fEMG placement recommended by Fridlund and Cacioppo, two 4-mm bipolar miniature silver/silver chloride (Ag/AgCl) skin electrodes were placed on their left corrugator supercilii and zygomaticus major muscle regions (Figure 8) [318]. To avoid bad signals or other unwanted influences, the BioTrace software (on NeXus-32) was used to visualize and, if necessary, correct the biosignals before each recording. Figure 8 shows the arrangement of fEMG electrodes on the M. zygomaticus major and M. corrugator supercilii. An example of a filtered electromyography (EMG) signal is shown on the right side [319].

**Figure 8.** Placement of fEMG electrodes and a sample of a filtered EMG signal [319].

Humans have a range of biometric traits that can be a basis for various biometric recognition systems (Figure 9). Other biometric traits include iris, face thermogram, gait, keystroke pattern, voice, face, and signature. Their significance differs. For example, an iris scan has high accuracy, medium long-term stability, and a medium security level, while voice recognition has low accuracy, low long-term stability, and a low security level [320]. The choice of biometric traits, however, invariably depends on the availability of dataset samples, the application, the accepted tolerance, and the level of complexity [150].

**Figure 9.** Other examples of biometric traits.

Biometric sensors are transducers that convert a person's biometric traits, such as face, voice, and other characteristics, into an electrical signal. These sensors read or measure speed, temperature, electrical capacity, light, and other types of energy. Different technologies are available, including digital cameras, sensor networks, and complex combinations. Every biometric device requires some type of sensor, and biometric sensors are a key feature of emotion recognition technology. A biometric sensor can be, for example, a microphone for voice capture or a high-definition camera for facial recognition [321].

Jain et al. [141] state that enrolment and emotions recognition are two main phases in biometric emotions recognition systems. The enrolment phase means acquiring an individual's biometric data to be stored in the database along with the emotions recognition details. The recognition phase uses the stored data to compare the data with the re-acquired biometric data of the same individual, to determine emotions. A biometric system is, therefore, a pattern recognition system consisting of a database, sensors, a feature extractor, and a matcher.
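The two phases and four components described above can be sketched as a toy program. This is only an illustrative sketch under simplifying assumptions: the class name, the two features (mean and standard deviation of a raw trace), the nearest-template matcher, and the heart-rate values are all hypothetical and are not taken from Jain et al. [141].

```python
import math

class BiometricSystem:
    """Toy pattern recognition system: database, feature extractor, matcher."""

    def __init__(self):
        # Database: emotion label -> list of stored feature templates.
        self.database = {}

    @staticmethod
    def extract_features(samples):
        """Feature extractor: mean and standard deviation of a sensor trace."""
        n = len(samples)
        mean = sum(samples) / n
        var = sum((s - mean) ** 2 for s in samples) / n
        return (mean, math.sqrt(var))

    def enrol(self, label, samples):
        """Enrolment phase: store the template together with its label."""
        self.database.setdefault(label, []).append(self.extract_features(samples))

    def recognize(self, samples):
        """Recognition phase: return the label of the nearest stored template."""
        probe = self.extract_features(samples)
        best_label, best_dist = None, math.inf
        for label, templates in self.database.items():
            for template in templates:
                d = math.dist(probe, template)  # Euclidean distance
                if d < best_dist:
                    best_label, best_dist = label, d
        return best_label

# Hypothetical heart-rate traces (beats per minute) standing in for sensor data.
system = BiometricSystem()
system.enrol("calm", [62, 64, 63, 61])
system.enrol("stressed", [95, 99, 92, 101])
result = system.recognize([90, 97, 94, 99])
```

A real system would replace the toy feature extractor and matcher with components suited to the sensor in use; the sketch only shows how the enrolment and recognition phases share the same database and feature extractor.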

Loaiza [322] states that overall physiological effects related to emotional reactions depend on three types of autonomic variables: (1) the cardiac system, including blood pressure, cardiac cycles, and heart rate variability; (2) respiration, including amplitude, respiration period, and respiratory cycles; and (3) electrodermal activity, including resistance, responses, and skin conductance levels. Ekman [77] reports that different emotions can have very different autonomic variables. For instance, in contrast to someone in a happy state, an angry person had a higher heart rate and temperature. Furthermore, the feeling of fear was also accompanied by higher heart rate. Pace-Schott et al. [323] argue that the ability to regulate physiological state and the regulation of emotion are two inseparable features. Physiological feelings contribute to emotion regulation, reproduction, and survival.

Many works have focused on emotion detection using different techniques [35,283,284,324–327]. Specific tasks (e.g., WASSA-2017, SemEval) have also included emotion detection tasks that cover four categories of emotions (anger, fear, sadness, and joy) [320]. According to Saganowski et al. [326], the most common approach to the use of physiological signals in emotion recognition is to (1) collect and clean data; (2) preprocess, synchronize, and integrate signals; (3) extract and select features; and (4) train and validate machine learning models.
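The four steps listed by Saganowski et al. [326] can be illustrated with a deliberately small sketch. Everything in it is an assumption made for illustration: a single signal per recording (so no synchronization or integration is needed), two hand-picked features, a nearest-centroid classifier, leave-one-out validation, and invented skin conductance values.

```python
from statistics import mean

def clean(trace):
    """Step 1: drop missing samples (encoded here as None)."""
    return [x for x in trace if x is not None]

def preprocess(trace):
    """Step 2: remove the per-recording baseline by subtracting the mean."""
    m = mean(trace)
    return [x - m for x in trace]

def extract(trace):
    """Step 3: two simple features, peak-to-peak range and mean absolute value."""
    return (max(trace) - min(trace), mean(abs(x) for x in trace))

def nearest_centroid(train, probe):
    """Step 4a: classify a feature vector by the closest class centroid."""
    centroids = {}
    for label in {lab for _, lab in train}:
        feats = [f for f, lab in train if lab == label]
        centroids[label] = tuple(mean(col) for col in zip(*feats))
    return min(
        centroids,
        key=lambda lab: sum((a - b) ** 2 for a, b in zip(centroids[lab], probe)),
    )

def leave_one_out_accuracy(dataset):
    """Step 4b: validate by holding out each recording in turn."""
    hits = 0
    for i, (feat, label) in enumerate(dataset):
        train = dataset[:i] + dataset[i + 1:]
        hits += nearest_centroid(train, feat) == label
    return hits / len(dataset)

# Invented skin conductance traces with labels, for illustration only.
raw = [
    ([0.10, None, 0.20, 0.10, 0.15], "calm"),
    ([0.20, 0.25, 0.20, None, 0.22], "calm"),
    ([0.10, 0.90, 0.30, 1.20, None], "aroused"),
    ([0.20, 1.10, 0.10, 0.80, 0.50], "aroused"),
]
dataset = [(extract(preprocess(clean(trace))), label) for trace, label in raw]
accuracy = leave_one_out_accuracy(dataset)
```

In practice each step is far more involved (artifact removal, multi-sensor alignment, feature selection, cross-subject validation), but the order of operations is the same.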

Physiological signals are a natural expression of the human body and can be used with great success in the classification of emotional states. EEGs, temperature measurements, or electrocardiograms (ECGs) are examples of such physiological signals. They can help us to classify emotional states such as anger, sadness, or happiness, and can be captured by different sensors to identify individual differences. The goal of all of these physiological methods is to evaluate consumer attention and to get a particular message noticed, and their performance in this area is commendable. The advantages of these techniques include their creative and versatile placement, the stimulation of interest through novel means that capture attention, the ability to directly target and personalize messages, and lower implementation costs [328]. To study marketing trends, Singh et al. [328] recommend avoiding costly research methods such as fMRI and EEG, and instead using smaller and cheaper galvanic readings and eye tracking (ET) to investigate brain responses. These authors also propose a fuzzy rule-based algorithm to anticipate consumer behavior by detecting six facial expressions from still images.

Various organizations are contributing to the progress of biometric standards, such as international standards organizations (International Electrotechnical Commission, ISO-JTC1/SC37, London, UK), national standards bodies (American National Standards Institute, New York, NY, USA), standards-developing organizations (International Committee for Information Technology Standards, American National Institute of Standards and Technology, Information Technology Laboratory), and other related organizations (International Biometrics and Identification Association, International Biometric Group, Biometric Consortium, Biometric Center of Excellence) [329]. De Angel et al. [330] offer numerous recommendations for improving the generalizability of the research and generating a more standardized approach to sensing in depression.


standardized assessment and analysis tools and reliable feature extraction and missing data descriptions, and has been tested in more representative populations.

Studies in neuromarketing, neuroeconomics, neuromanagement, neuro-information systems, neuro-industrial engineering, products, services, and call centers use various instruments and techniques to measure users' psychological states. Some of these tools are more complex than others, and the results that are produced can vary widely [331]. They fall into three major categories: the first two contain tools used for neuroimaging (medical devices offering in vivo information on the nervous system) and use techniques that measure brain electrical activity and neuronal metabolism, while the third contains tools used to evaluate neurophysiological indicators of the mental states of an individual. Leading neuroimaging tools such as fMRI and PET fall into the first category, while EEG, MEG, and other less invasive and cheaper neuroimaging devices that measure electrical activity in the brain [332] fall into the second category, and tools that track and record individual signals of broader physiological reaction and response measurements (e.g., electro-dermal activity, ET, etc.) fall into the third category.

Next, we overview the literature and examine the various types of arousal, valence, affective attitudes, and emotional and physiological states (AFFECT) recognition methods in more detail. A summary of the outcomes is provided in Table 3.

The combination of several different approaches to the recognition and classification of emotional state (also known as multimodal emotion recognition) is currently a research area of great interest, especially since the use of different physiological signals can provide huge amounts of data and each physiological signal can make a significant impact on the ability to classify emotions [333]. Table 3 presents an overview of studies related to the recognition of valence, arousal, emotional states, physiological states, and affective attitudes (affect). A brief overview of some of these studies follows.

**Table 3.** An overview of studies on arousal, valence, affective attitudes, and emotional and physiological states (AFFECT) recognition.







Many scientists and practitioners have earned acclaim and honor for their research in areas such as diagnostics, large-scale screening, analysis, monitoring, and categorization of people by COVID-19 symptoms. Their work relied on early warning systems, wearable technologies, the Internet of Medical Things, IoT-based systems, biometric monitoring technologies, and other tools that can assist in the COVID-19 pandemic. Javaid et al. [438] review how different Industry 4.0 technologies (e.g., AI, IoT, Big data, Virtual Reality, etc.) can help reduce the spread of disease. Kalhori et al. [439] and Rahman et al. [440] discuss the digital health tools to fight COVID-19. Various sensors and mobile devices to detect the disease, reduce its spread, and measure different symptoms are also widely discussed. Rajeesh Kumar et al. [441] propose a system to identify asymptomatic patients using IoT-based sensors, measuring blood oxygen level, body temperature, blood pressure, and heartbeat. Stojanović et al. [442] propose a phone headset to collect information about respiratory rate and cough, and Xian et al. [443] present a portable biosensor to test saliva. Chamberlain et al. [444] presented distributed networks of smart thermometers that track COVID-19 transmission epicenters in real time.

Neurotransmitters (NT) are billions of molecules constantly needed to keep human brains functioning. They are chemical messengers that carry, balance, and boost signals travelling between nerve cells (neurons) and other cells in the body. Many different psychological and physical functions can be affected by these chemical messengers, including fear, appetite, mood, sleep, heart rate, breathing rate, concentration and learning [445]. Lim [251] has also outlined new ways of exploiting neuromarketing research to achieve a better understanding of the brain and neural activity and hence advance marketing science. Lim [251] highlighted three main aspects: (i) antecedents (such as the product, physical evidence, the price of the product, the place where everything is happening, promotion, the process involved, people); (ii) the process; and (iii) the consequences for the target market (behavioral outcomes before, during and after the act of buying) and the marketing organization (visits, sales, awareness, equity). Agarwal and Xavier [253] described the most popular neuromarketing tools, including event-related potential (ERP) (P300), EEG, and fMRI, and explained how these tools could be applied in marketing. A business and marketing article [256] lists the three categories of neuroscientific techniques that are applied in business and advertising research (Tables 1 and 2, Figure 1) as follows:


Ganapathy [260] groups neuromarketing tools into three categories (Tables 1 and 2). Farnsworth [258] gives information that can be essential when deciding on the best neuromarketing method or technique to help stakeholders understand research methods relating to human behavior at a glance, while Saltini [264] gives a short list of neuromarketing tools (Tables 1 and 2). A system developed by CoolTool [257] allows several neuromarketing tools to be used separately or combined.

Although individual neuroscientific tools for neuromarketing, neuroeconomics, neuromanagement, neuro-information systems, neuro-industrial engineering, products, services, and call centers have been developed by many researchers (for example [111,251,253–270,293,298–300,303,309,311,312,328,446–448]), a review and analysis of the complete range of tools used in research in these fields has not yet been carried out. Thorough examinations of the range of research tool alternatives that are available for neuroscience are also often missing from research in this area. We have therefore compiled a complete list of neuroscience techniques for neuromarketing, neuroeconomics, neuromanagement, neuro-information systems, neuro-industrial engineering, products, services, and call centers. Humans experience emotions and their associated feelings (e.g., gratitude, curiosity, fear, sadness, disgust, happiness, and pride) on a daily basis. Yet, in the case of affective disorders such as depression and anxiety, emotions can become destructive. Thus the focus on understanding emotional responsiveness is not surprising in neuroscience and psychological science [449]. Neuroscience techniques therefore analyze emotional, affective and physiological states by tracking neural/electrical activity [335–340,450,451] or neural/metabolic activity [341–344,349,447,452,453] within the brain. This is also presented in Table 3.

For example, neuromarketing techniques can complement business decisions and make them more profitable, using the automated mining of opinions, attitudes, emotions and expressions from speech, text, emotions, neuron activity and other database-fed sources. Advertisements that are adjusted based on such information can engage the target audience more effectively and make a better impact on the audience, and this may translate into better sales and higher margins. In an attempt to enhance corporate branding and advertising routines, various factors have been studied, such as emotional appeal and sensory branding, to ensure that companies deliver the right message and that customers perceive the right message [171].

Affect recognition is widely used in gaming to create affect-aware video games and other software. Alhargan et al. [454] present affect recognition in an interactive gaming environment using eye-tracking. Szwoch and Szwoch [455] give a review of automatic multimodal affect recognition of facial expressions and emotions. Krol et al. [456] combined eye-tracking and a brain–computer interface (BCI) and created a completely hands-free Tetris clone in which traditional actions (i.e., block manipulation) are performed using gaze control. Elor et al. [457] measure heart rate and galvanic skin response (GSR) with Immersive Virtual Reality (iVR) Head-Mounted Display (HMD) systems paired with exercise games to show how exercise games can positively affect physical rehabilitation.

Stress is a relevant health problem among students, so Tiwari and Agarwal [458] present a stress analysis system that detects stressful conditions in students by measuring GSR and electrocardiogram (ECG) data. Nakayama et al. [459] suggest measuring heart rate variability as a method to evaluate nursing students' stress during simulation in order to provide a better way to learn.
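Heart rate variability is usually summarized with a few standard indices. The sketch below computes two common time-domain indices, SDNN and RMSSD, from a list of RR intervals. The source does not state which index Nakayama et al. [459] used, so the choice of indices, the function names, and the interval values are assumptions made for illustration.

```python
import math

def sdnn(rr_ms):
    """Sample standard deviation of RR intervals, in milliseconds."""
    m = sum(rr_ms) / len(rr_ms)
    return math.sqrt(sum((x - m) ** 2 for x in rr_ms) / (len(rr_ms) - 1))

def rmssd(rr_ms):
    """Root mean square of successive RR interval differences, in milliseconds."""
    diffs = [b - a for a, b in zip(rr_ms, rr_ms[1:])]
    return math.sqrt(sum(d * d for d in diffs) / len(diffs))

# Hypothetical RR intervals (ms) between consecutive heartbeats.
rr = [800, 810, 790, 805, 795]
print(f"SDNN = {sdnn(rr):.2f} ms, RMSSD = {rmssd(rr):.2f} ms")
```

Both functions expect at least two intervals; real recordings also need ectopic-beat and artifact correction before such indices are meaningful.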

A literature review can reveal the most popular types of traditional and non-traditional neuromarketing methods. According to Sebastian [111], focus groups are one of the more traditional marketing methods, while various neuroscience techniques have also been applied to record the metabolic activity of the body and the electrical activity of the brain (transcranial magnetic stimulation (TMS), electroencephalography (EEG), functional magnetic resonance imaging, magnetoencephalography (MEG), and positron-emission tomography (PET)).

Electronic platforms are not the only possibility for non-traditional marketing, and Tautchin and Dussome [460] believe that traditional media can also be reimagined in new forms, such as guerrilla marketing, local displays, vehicle wraps, scaffolding, and even bubble cloud ads or aerial banners. In addition to giving high-quality feedback data, non-traditional techniques can also help in the evaluation of business decisions and conclusions [328].

Based on factors such as skin texture, gender, and SC, wearable biometric GSR sensors could be used to identify whether a person is in a sad, neutral, or happy emotional state. To understand marketing strategies better and to improve ads, other biometric sensors such as pulse oximeters and health bands could be used in the future to make automated predictions of emotions [461]. The galvanic skin response (GSR) method has an important limitation—it does not provide information on valence. The usual way to address this issue is to use other emotion recognition methods. They provide additional details and thus enable detailed analysis. Table 3 lists studies where GSR is used to measure emotions.

Eye tracking (ET) is used to record the frequencies of choices; sensor features are extracted and matched with certain preference labels to determine mutual dependences and to discover which brain regions are active when a certain choice task is performed. High values for alpha, beta and theta waves have been reported in the occipital and frontal brain regions, with a high degree of synchronization. A hidden Markov model is a popular tool for time-series data modeling, and researchers have successfully used this approach to build brain–computer-interface tools with EEG signals, including mental task classification, medical applications and eye movement tracking [462].

A classification model based on SVM architecture, developed by Lakhan et al. [463], can predict the level of arousal and valence in recorded EEG data. Its core is a feature extraction algorithm based on power spectral density (PSD).
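To show what power spectral density features look like in practice, the sketch below estimates band power for one EEG channel with a naive periodogram. It is not the algorithm of Lakhan et al. [463]: the band limits are conventional approximate values, the function names are hypothetical, the SVM stage is omitted, and the test signal is a synthetic 10 Hz sine wave.

```python
import cmath
import math

# Conventional approximate EEG band limits in Hz (an assumption, not from [463]).
BANDS = {"theta": (4, 8), "alpha": (8, 13), "beta": (13, 30)}

def periodogram(signal, fs):
    """Return (freqs, power) for non-negative frequencies using a naive DFT.

    Scaling is a simple |X|^2 / (fs * n); doubling for a one-sided
    spectrum is omitted because only relative band power is compared here.
    """
    n = len(signal)
    mean = sum(signal) / n
    x = [s - mean for s in signal]  # remove the DC offset
    freqs, power = [], []
    for k in range(n // 2 + 1):
        coeff = sum(x[t] * cmath.exp(-2j * math.pi * k * t / n) for t in range(n))
        freqs.append(k * fs / n)
        power.append(abs(coeff) ** 2 / (fs * n))
    return freqs, power

def band_power(freqs, power, lo, hi):
    """Sum the periodogram values whose frequency lies in [lo, hi)."""
    return sum(p for f, p in zip(freqs, power) if lo <= f < hi)

def psd_features(signal, fs):
    """Return a dict of band powers that could feed a classifier."""
    freqs, power = periodogram(signal, fs)
    return {name: band_power(freqs, power, lo, hi) for name, (lo, hi) in BANDS.items()}

# One second of a synthetic 10 Hz sine sampled at 128 Hz.
fs = 128
signal = [math.sin(2 * math.pi * 10 * t / fs) for t in range(fs)]
features = psd_features(signal, fs)
dominant = max(features, key=features.get)
```

The naive DFT is quadratic in the signal length and is used only to keep the sketch dependency-free; an FFT-based estimator such as Welch's method would be the usual choice for real recordings.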

Multimodal frameworks that combine several modalities to improve results have recently become popular in the domain of human–computer interaction. A combination of modalities can give a more efficient user experience since the strengths of one modality can offset the weaknesses of another and the usability can be increased. These systems recognize and combine different inputs, taking into account certain contextual and temporal constraints and thus facilitating interpretation. Kong et al. [464] created a way of using two different sensors and calibrating them to achieve simultaneous gesture recording. A hidden Markov model (HMM) was used for all single- and double-handed gesture recognition. Multimodality means that several unimodal solutions are combined into a system, meaning that multiple solutions can be combined into a single best solution using optimization algorithms [464].

The automatic emotion recognition system proposed by El-Amir et al. [465] uses a combination of four fractal dimensions and detrended fluctuation analysis, and is based on three bio-signals, GSR, EMG, and EEG. Using two emotional dimensions, the signals were passed to three supervised classifiers and assigned to three different emotional groups, with a maximum accuracy for the valence dimension of 94.3% and a maximum accuracy for the arousal dimension of 94%. One family of approaches is based on external signals such as facial expressions and speech recognition, which means that it is simple and that no special equipment is required. The limitations of this approach are that emotions can be faked, and that these types of recognition methods fail with disabled people and people with certain diseases. Other approaches are based on electromyography, ECGs, SC, EEGs, and other physiological signals that are spontaneous and cannot be consciously controlled [465].

Plassmann et al. [466] as well as Perrachione and Perrachione [467] carried out exciting studies in an attempt to determine how marketing stimuli lead to buying decisions. They applied neurosciences to marketing in order to create better models and to understand how a buyer's brain and emotions operate. Gruter [468] states that a wide range of techniques and tools are used to measure consumer responses and behavior. Three approaches that are used in neuromarketing can give access to the brain: input and output models, internal reflexes, and external reflexes.

Leon et al. [469] present a real-time recognition and classification method based on physiological signals to track and detect changes in emotions from a neutral state to either a positive or negative (i.e., non-neutral) state. They used the residual values of autoassociative neural networks and the statistical probability ratio test in their approach. When the proposed methodology was implemented, a recognition level of 71.4% was achieved [469]. Monajati et al. [470] also investigated the recognition of negative emotional states, using the three physiological signals of galvanic skin response, respiratory rate and heart rate. Fuzzy-ART was applied to analyze the physiological responses and to recognize negative emotions. An overall accuracy of 94% was achieved in determining which emotions were negative as opposed to neutral [470].
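A probability ratio test applied to a stream of residuals can be sketched as follows. The sketch assumes a Wald-style sequential probability ratio test on Gaussian residuals with known variance; this is an interpretation of the description above, not the procedure of Leon et al. [469], and the means, variance, error rates, and residual values are all hypothetical.

```python
import math

def sprt(residuals, mu0=0.0, mu1=1.0, sigma=1.0, alpha=0.05, beta=0.05):
    """Sequential test of mean mu0 (neutral) against mean mu1 (non-neutral).

    Returns (decision, n_samples_used), where decision is "neutral",
    "non-neutral", or "undecided" if the data run out first.
    """
    upper = math.log((1 - beta) / alpha)   # crossing it accepts the non-neutral state
    lower = math.log(beta / (1 - alpha))   # crossing it accepts the neutral state
    llr = 0.0                              # cumulative log-likelihood ratio
    for n, r in enumerate(residuals, start=1):
        # Log-likelihood ratio increment for two Gaussians with equal variance.
        llr += (mu1 - mu0) / sigma ** 2 * (r - (mu0 + mu1) / 2)
        if llr >= upper:
            return "non-neutral", n
        if llr <= lower:
            return "neutral", n
    return "undecided", len(residuals)

# Hypothetical residual streams: large residuals suggest a departure from neutral.
shifted = sprt([1.2] * 10)
baseline = sprt([0.0] * 10)
```

The appeal of a sequential test in a real-time setting is that it stops as soon as the evidence is strong enough, rather than waiting for a fixed number of samples.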

Andrew et al. [471] described investigations of brain responses to modern outdoor advertising, focusing on memorability, visual attention, desirability, and emotional intensity. They also described ways in which the latest imaging tools and methods could be applied to monitor subconscious emotional responses to outdoor media in many forms, from multisensory advertising screens to simple paper posters. Andrew et al. [471] explained the cognitive processes behind their success, not solely in the context of the advertising to which people are typically exposed outside their homes, but also in the broader digital world. Their findings have fundamental implications for media campaign planning, design, and development, identifying the possible role of outdoor advertising compared to other media, and possible ways of combining different media platforms and making them work for the benefit of advertisers.

Kaklauskas et al. [472] integrated Damasio's somatic marker hypothesis with biometric systems, multi-criteria analysis techniques, statistical investigation, a neuro-questionnaire, and intelligent systems to produce the INVAR neuromarketing system and method. INVAR can measure the efficiency of both a complete video advertisement and its separate frames. This system can also determine which frames make viewers interested, confused, disgusted, happy, scared, surprised, angry, sad, or bored; can identify the most positive or negative video advertisement; can measure the effect of a video advertisement on long-term and short-term memory; and can perform other functions.

Lajante and Ladhari [473] applied peripheral psychophysiology measures in their research, based on the assumption that measures of emotion and cognition such as SC responses and facial EMGs could make a significant contribution to new ideas about consumer decision making, judgments and behaviors. These authors believe that their approach can help in applying affective neuroscience to the field of consumer services and retailing.

Michael et al. [474] aimed to understand the ways in which unconscious and direct cognitive and emotional responses underlie preferences for particular travel destinations. A 3×5 factorial design was run in order to better understand the unconscious responses of consumers to possible travel destinations. The factors considered in this study were the type of stimulus (videos, printed names, and images) and the travel destination (New York, London, Hong Kong, Abu Dhabi, and Dubai). ET can provide reliable tracking of cognitive and emotional responses over time. The authors suggested that decisions on travel destinations have both a direct and an unconscious component, which may affect or drive overt preferences and actual choices.

Harris et al. [448] investigated ways of measuring the effectiveness of social ads of the emotion/action type, and then of making these ads more effective using consumer neuroscience. Their research offers insights into changes in behavioral intent brought about by effective ads and gives an improved understanding of ways of making good use of social messages regarding a certain action, challenge or emotion that may be needed to help save lives. It can also reduce spending on social marketing campaigns that end up being ineffectual.

Libert and Van Hulle [475] argue that the development of economically practicable solutions involving human–machine interactions (HMI) and mental state monitoring, and neuromarketing that can benefit severely disabled patients has put brain–computer interfacing (BCI) in the spotlight. The monitoring of a customer's mental state in response to watching an ad is interesting, at least from the perspective of neuromarketing managers. The authors propose a method of monitoring EEGs and predicting whether a viewer will show interest in watching a video trailer or will show no interest, skipping it prematurely. They also trained a k-nearest neighbor (kNN), a support vector machine (SVM), and a random forest (RF) classifier to carry out the prediction task. The average single-subject classification accuracy of the model was as follows: 73.3% for viewer interest and 75.803% for skipping using SVM; 78.333% for viewer interest and 82.223% for skipping using kNN; and 75.555% for interest and 80.003% for skipping using RF.
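Of the three classifiers mentioned, k-nearest neighbor is the simplest to illustrate. The sketch below classifies a feature vector by majority vote among its k closest training examples. The two-dimensional feature vectors, labels, and value of k are hypothetical and are not the EEG features or settings used by Libert and Van Hulle [475].

```python
from collections import Counter
import math

def knn_predict(train, probe, k=3):
    """Return the majority label among the k training points nearest to probe."""
    neighbours = sorted(train, key=lambda item: math.dist(item[0], probe))[:k]
    votes = Counter(label for _, label in neighbours)
    return votes.most_common(1)[0][0]

# Hypothetical feature vectors (e.g., two normalized band powers) with labels.
train = [
    ((0.90, 0.20), "interest"),
    ((0.80, 0.30), "interest"),
    ((0.85, 0.25), "interest"),
    ((0.20, 0.90), "skip"),
    ((0.30, 0.80), "skip"),
    ((0.25, 0.70), "skip"),
]
prediction = knn_predict(train, (0.82, 0.28))
```

An odd k avoids tied votes in a two-class problem; single-subject accuracy, as reported above, would be estimated by training and testing such a classifier separately for each viewer.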

Jiménez-Marín et al. [476] showed that sensory marketing tends to accumulate user experiences and then exploit them to bring the users closer to the product they are evaluating, thus motivating the final purchase. However, several issues need to be considered when these techniques are applied to reach the desired outcomes, and it is important to be aware of recent advances in neuroscience. The authors explore the concept of sensory marketing, pointing out its possibilities for application and its various typologies.

Cherubino et al. [477] highlighted the new technological advances that have been achieved over the last decade, which mean that research settings are now not the only scenarios in which neurophysiological measures can be employed and that it is possible to study human behavior in everyday situations. Their review aimed to discover effective ways to employ neuroscience technologies to gain better insights into human behavior related to decision making in real-life situations, and to determine whether such applications are possible.

Monica et al. [478] explored the cognitive understanding and usability of banking web pages. They reviewed the theoretical literature on user experience in online banking services research, with a focus on ET as a research tool, and then selected two Romanian banking websites to study consumer attention, while consumers were navigating the sites, and memory, after their visits. The research findings showed that the layout and information display can make web pages more or less usable and can have an effect on cognitive understanding.

Singh et al. [328] discussed various methods of feature extraction for facial emotion detection. The algorithm they proposed could detect a total of six facial emotions, using a fuzzy rule-based system. During their experiment, neurometrics were recorded using a system comprising MegaMatcher software, Grove-GSR Sensor V1.2, and a 12-megapixel Hikvision IP camera. The participants were asked to watch a set of video ads for a range of well-known cosmetic products and wore SC sensors and sat in front of a camera that monitored their responses. Singh et al. [328] analyzed the cognitive processes of university students in relation to advertising and compliance with the code of self-regulation. A quantitative and qualitative methodology based on facial expressions, ET techniques and focus groups was used for this purpose. The results suggested that online game operators could be clearly identified. A high interaction of the public within the exhibition of supposed skills of the successful player and welcome bonuses also exists, and there was shown to be a lack of knowledge of the visual elements of awareness, a trivialization of compulsive gambling, and sexist attitudes towards women attracting public attention. A positive public attitude towards gaming was also observed by Singh et al. [328]; it was seen as a healthy form of leisure that was compatible with family and social relationships.

Goyal and Singh [461] proposed the use of research-based approaches for the automatic recognition of human affective facial expressions. These authors created an intelligent neural network-based system for the classification of expressions from extracted facial images. Several basic and specialized neural networks for the detection of facial expressions were used for image extraction.

Electromyography (EMG) measures and assesses electric potentials in muscle cells. In medical settings, this method is used to identify nerve and muscle lesions, while in emotion recognition it is used to look for correlations between emotions and physiological responses. Most EMG-based studies examine facial expressions, drawing on the hypothesis that facial expressions take part in emotional responses to various stimuli. The hypothesis was first proposed by Ekman and Friesen in 1978; they described the relationships between basic emotions, facial muscles, and the actions they trigger. Morillo et al. [479] used low-cost EEG headsets and applied discrete classification techniques to analyze scores given by subjects to individual TV ads, using artificial neural networks, the C4.5 algorithm and the Ameva discretization algorithm. A sample of 1400 effective advertising campaigns was studied by Pringle et al. [480], who determined that promotions with exclusively emotional content were about twice as successful (31% vs. 16%) as those with only rational content, while compared to campaigns with mixed emotional and rational content, the exclusively emotional campaigns performed only slightly better (31% vs. 26%).

According to Takahashi [481], some of the available systems for emotion recognition from facial expressions or speech look at several emotional states such as fear, teasing, sadness, joy, surprise, anger, disgust, and neutral. Takahashi [481] investigated emotion recognition based on five emotional states (fear, anger, sadness, joy, and relaxed).

The authors [353,355–357,359,360,371–374] carried out an in-depth analysis of how blood pressure, SC, heart rate and body temperature depend on stress and emotions. Figures suggest that work-related stress costs the EU countries at least EUR 20 billion annually. Stress experienced at work can cause anxiety, depression, heart disease and increased chronic fatigue which can have a considerable negative impact on creativity, competitiveness and work productivity.

Research worldwide shows that people exposed to stress can experience higher blood pressure and heart rate. Light et al. [482] analyzed cases of daily elevated stress levels and looked at the effects on fluctuations in systolic and diastolic blood pressure. Gray et al. [483] investigated how systolic and diastolic blood pressure can be affected by psychological stress, while Adrogué and Madias [484] described the effects of chronic, emotional and psychological stress on blood pressure. The unanimous conclusion of research in this area is that diastolic and systolic blood pressure and heart rate depend on stress and can increase depending on the level of stress.

Blair et al. [485] analyzed the effect of stress on heart rate and concluded that heart rate rises sharply within three minutes of the onset of stress and starts to fall only after another five to six minutes. Gasperin et al. [486] concluded that high blood pressure was affected by chronic stress. A number of studies have shown that patients with heart rates higher than 70 beats per minute are more likely to develop cardiovascular diseases and to die from them; tests show that a rapid heartbeat increases the risk of heart attack by 46%, heart insufficiency by 56% and death by 34%.

Sun et al. [487] proposed an activity-aware detection scheme for mental stress. Twenty participants took part in their experiment, and galvanic skin response, ECG, and accelerometer data were recorded while they were sitting, standing, and walking. Baseline physiological measurements were first taken for each activity, and then for participants exposed to mental stressors. The accelerometer was used to track activity, and the data gave a classification accuracy between subjects of 80.9%, while the 10-fold cross-validation accuracy for the classification of mental stress reached 92.4%. Another study focused on physiological signals such as photoplethysmography and galvanic skin response. Both recurrent and feed-forward neural network configurations were examined, and a comprehensive performance analysis showed that layer recurrent neural networks were the best option for stress level detection. For a sample of 19 automotive drivers, this evaluation achieved an average sensitivity of 88.83%, a precision of 89.23% and a specificity of 94.92% [488].
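The sensitivity, precision, and specificity figures reported above are standard confusion-matrix ratios. The following minimal sketch shows how they are computed for a binary stress/no-stress classifier. The function name `binary_metrics` and the counts are hypothetical and chosen only for illustration; they are not data from the cited studies.

```python
def binary_metrics(tp, fp, tn, fn):
    """Compute sensitivity, precision, and specificity from confusion-matrix counts.

    tp/fp/tn/fn are the numbers of true positives, false positives,
    true negatives, and false negatives for the "stressed" class.
    """
    nan = float("nan")
    return {
        # Share of truly stressed samples that were detected.
        "sensitivity": tp / (tp + fn) if (tp + fn) else nan,
        # Share of "stressed" predictions that were correct.
        "precision": tp / (tp + fp) if (tp + fp) else nan,
        # Share of truly non-stressed samples that were recognized as such.
        "specificity": tn / (tn + fp) if (tn + fp) else nan,
    }


# Hypothetical counts, for illustration only.
example = binary_metrics(tp=80, fp=10, tn=190, fn=20)
```

Reporting all three ratios together is useful because a detector can reach high specificity simply by rarely predicting stress, which would show up as low sensitivity.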

Palacios et al. [489] applied a new process involving two databases containing utterances under stress by men and women. Four classification methods were used to identify these utterances and to organize them into groups. The methods were then compared in terms of their final scores and quality performance.

Fever occurs when the body's thermoregulatory set point increases, and many findings suggest that the rise in core temperature induced by psychological stress can be seen as fever. A fever of psychological origin in humans might then be a result of this mechanism [490].

Wu and Liang [491] presented a training and testing procedure for emotion recognition based on semantic labels, acoustic prosodic information and personality traits. A recognition process based on semantic labels was applied, using a speech recognizer to identify word sequences, and HowNet, a Chinese knowledge base, was used as the source for deriving the semantic word sequence labels. The emotion association rules (EARs) of the word sequences were then mined by applying a text-based mining method, and the relationships between the EARs and emotional states were characterized using the MaxEnt model. In a second approach based on acoustic prosodic information, emotional salient segments (ESSs) were detected in utterances and their prosodic and acoustic features were extracted, including pitch-related, formant, and spectrum attributes. The next step was the construction of base-level classifiers using SVM, Gaussian mixture models (GMM) and MLP, which were then combined (using MDT) by selecting the most promising option for emotion recognition based on acoustic prosodic information. A weighted product fusion method was applied to combine the outputs produced by the two types of recognizers, and the process ended when the final emotional state was determined. The personality traits of the specific speaker, as determined from the Eysenck personality questionnaire, were then taken into consideration to examine their impact and personalize the emotion recognition scheme [491].

A hybrid analysis method for online reviews proposed by Nilashi et al. [492] allows for the ranking of factors affecting the decisions of travelers in their choice of green hotels with spa services. This method combined text mining, predictive learning techniques and multiple criteria decision-making methods, and was proposed for the first time in the context of hospitality and tourism, with an emphasis on green hotel customer grouping based on online customer feedback. Nilashi et al. [492] used the latent Dirichlet allocation method to analyze textual reviews, a self-organizing map for cluster analysis, the neuro-fuzzy method to measure customer satisfaction, and the TOPSIS method to rank the features of hotels. The proposed method was tested by analyzing travelers' reviews of 152 Malaysian hotels. The findings of this research offer travelers an important method of hotel selection by means of user-generated content (UGC), while hotel managers can use this approach to improve their marketing strategies and service quality.

A neuromarketing method for green, energy-efficient and multisensory homes, proposed by Kaklauskas et al. [493], can be used to determine the conditions that are required. The multisensory dataset (physiological and emotional states) collected as part of this research contained about 200 million data points, and the analysis also included noise pollution and outdoor air pollution (volatile organic compounds, CO, NO2, and PM10). This article discussed specific case studies of energy-efficient and green buildings as a demonstration of the proposed method. The results matched findings from both current and previous studies, showing that the correlation between age and environmental responsiveness has an inverse U shape and that age is an important factor affecting interest in eco-friendly, energy-efficient homes.

The VINERS method and biometric techniques developed by Kaklauskas et al. [494] for the analysis of emotional states, physiological reactions and affective attitudes were used to determine which locations are the best choice and then to show neuro ads of available homes offered for sale. Homebuyers were grouped into rational segments, taking into account consumer psychographics and behavior (happy, angry or sad, and valence and heart rate) and their demographic profiles (age, gender, marital status, children or no children, education, main source of income). A rational video ad for the respective rational segment was then selected. This study aimed to combine the somatic marker hypothesis, neuromarketing, biometrics and the COPRAS method, and to develop the VINERS method for use with multi-criteria analysis and the neuromarketing of the best places to live. The case study presented in the article demonstrated the VINERS method in practice.

Etzold et al. [495] examined the case of users booking appointments online, and the ways in which they interacted with the webpage interface and visualizations. The main point was to determine whether a new interface for online booking was easy to navigate and successful in attracting user attention. In this study, the authors particularly wanted to determine whether a new, more expensive customer website was seen as more user-friendly and supportive than the older, cheaper alternative. An empirical study was carried out by tracking users' eye movements as they navigated the existing website of Mercedes-Benz, a car manufacturer, and then a new, updated version of the same company's website. A total of 20 people were observed, and evaluations of their ET data suggested that the new service appointment booking interface could be further improved. Scan-paths and heatmaps demonstrated that the old website was superior [495].

In recent years, many different emotional values, such as the net emotional value (NEV), the service encounter emotional value (SEEVal), and others, have been analyzed. Attempts have also been made to put them into practice [496–503]. These studies are overviewed below. To calculate NEV, the average score for negative emotions (stressed, dissatisfied, frustrated, unhappy, irritated, hurried, disappointed, neglected) is subtracted from the average score for positive emotions (cared for, stimulated, happy, pleased, trusting, valued, focused, safe, interested, indulgent, energetic, exploratory). The score obtained this way can be used to characterize a client's feelings about a service or a product [499]. A higher value of NEV indicates that the relationships forged by a business are more reliable. One advantage of the NEV is that it characterizes the total balance of a consumer's feelings related to products or services, and thus reveals the value drivers. The relationship between NEV and client satisfaction is linear [500].

The NEV can be used to highlight both aspects that need to be improved, and those that are positive. Since the NEV is calculated based on a subtraction, the result may be either a negative or a positive number. The overall score can indicate what is happening with the client at an emotional level, and suggest ways to use this to gain competitive advantage [501].
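The NEV calculation described above can be sketched in a few lines. The emotion lists follow the text. The function name `net_emotional_value`, the rating scale (0 to 10), and the example scores are assumptions made only for illustration.

```python
from statistics import fmean

# Emotion lists as given in the text.
POSITIVE = ["cared for", "stimulated", "happy", "pleased", "trusting", "valued",
            "focused", "safe", "interested", "indulgent", "energetic", "exploratory"]
NEGATIVE = ["stressed", "dissatisfied", "frustrated", "unhappy", "irritated",
            "hurried", "disappointed", "neglected"]


def net_emotional_value(scores):
    """Return mean(positive scores) - mean(negative scores).

    `scores` maps each emotion name to a numeric rating; the scale
    itself is an assumption and does not affect the formula.
    """
    positive_mean = fmean(scores[name] for name in POSITIVE)
    negative_mean = fmean(scores[name] for name in NEGATIVE)
    return positive_mean - negative_mean


# Hypothetical client ratings on an assumed 0-10 scale.
ratings = {name: 7.0 for name in POSITIVE}
ratings.update({name: 3.0 for name in NEGATIVE})
nev = net_emotional_value(ratings)  # positive result: favorable emotional balance
```

Because the two group means are subtracted, a client whose negative emotions outweigh the positive ones yields a negative NEV, which flags aspects of the service that need improvement.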

The SEEVal is another measure proposed by Bailey et al. [504], and is the sum of the NEV experienced by the client and the NEV experienced by the product or service provider's employee. The client's end results linked to SEEVal are typically loyalty, satisfaction, pleasure, and voluntary benevolence [504]. The IGI Global Dictionary defines an emotional value as a set of positive moods (feeling good or being happy) resulting from products or services and contained in the value gain from the customers' emotional states or feelings when using the products or services (IGI Global Dictionary). Emotional value acts as a moderator, and has significant effects on the roles of social, functional, epistemic, conditional and environmental values [497].

Zavadskas et al. [505] examined data on potential buyers to analyze the hedonic value in one-to-one marketing situations. They used the neutrosophic PROMETHEE technique to examine arousal, valence, affective attitudes, emotional and physiological states (AFFECT), and argued that hedonic value is tied to several factors including customers' social and psychological data, client satisfaction, criteria of attractiveness, aesthetics, and economy, the sales site rental price, emotional factors, and indicators of the purchasing process. Their research showed that an analysis of the aforementioned data on potential buyers can make an important contribution to more effective one-to-one marketing. The case study cited in this work concerned two sites in Vilnius and intended to calculate the hedonic value of these sites during the Kaziukas Fair.

The ROCK Video Neuroanalytics and associated e-infrastructure were established as part of the H2020 ROCK project. This project tracked passers-by at ten locations across Vilnius. One of our outputs is the real-time Vilnius Happiness Index (Figure 10 and https://api.vilnius.lt/happiness-index, accessed on 5 September 2022). The project also involved a number of additional actions (https://Vilnius.lt/en/category/rock-project/, accessed on 5 September 2022).

Valence equals the intensity of "happiness" minus the intensity of the strongest negative emotion (scared, disgusted, sad, angry) [430]. In this way, a single valence score combines both positive and negative emotions. Our pool of data comprised 208 million data points analyzed using SPSS Statistics, a statistical software suite. Figure 10b presents the average values of valence per hour on weekdays. Every hour, the changes of average valence among Vilnius passers-by were recorded. Valence was measured every second and these values were accumulated by weekdays (marked in the chart with specific colors) at 95% confidence intervals. The y-axis shows the average values of valence (which fluctuates between −1 and 1) for each full day, for seven days, and the x-axis shows the hour starting at midnight [348].

**Figure 10.** Real-time Vilnius Happiness Index (**a**) and the mean magnitudes of valence, by the hour, on weekdays (**b**).
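The valence score and its hourly averaging, as described for Figure 10b, can be illustrated with a short sketch. It assumes that each emotion intensity lies in [0, 1], so that valence falls between −1 and 1. The function names and the sample readings are hypothetical and do not reproduce the project's actual pipeline.

```python
from collections import defaultdict
from statistics import fmean

NEGATIVE_EMOTIONS = ("scared", "disgusted", "sad", "angry")


def valence(intensities):
    """Happiness intensity minus the strongest negative emotion intensity."""
    strongest_negative = max(intensities[name] for name in NEGATIVE_EMOTIONS)
    return intensities["happiness"] - strongest_negative


def hourly_mean_valence(readings):
    """Average per-second valence readings by hour of day.

    `readings` is an iterable of (hour, valence) pairs; returns {hour: mean}.
    """
    buckets = defaultdict(list)
    for hour, value in readings:
        buckets[hour].append(value)
    return {hour: fmean(values) for hour, values in sorted(buckets.items())}


# Hypothetical single reading and a few hypothetical hourly samples.
sample = {"happiness": 0.6, "scared": 0.1, "disgusted": 0.05, "sad": 0.25, "angry": 0.1}
score = valence(sample)
by_hour = hourly_mean_valence([(9, 0.2), (9, 0.4), (10, -0.1)])
```

Using only the strongest negative emotion, instead of the sum of all negative intensities, keeps the score bounded within the stated range.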

#### **5. Users' Demographic and Cultural Background, Socioeconomic Status, Diversity Attitudes, and Context**

Emotions are a means to engage in a relationship with others: Anger means that the person refuses to accept a specific treatment from others and expresses that they feel entitled to something more. Anger is expressed with the aim of influencing, controlling, and fixing the behavior of others [506].

Through emotions, people can adaptively respond to opportunities and demands they face around them [507–509]. When people face everyday stressors, stressful transitions, ongoing challenges, and acute crises, the adaptive function of emotions is evident in all of these situations. Emotions also depend on context [510]. This means that emotions are most effective when people express them in the situational contexts for which the emotions most likely evolved. In addition, they are specifically most likely to promote adaptation in such scenarios. The experience of anger, for instance, is adaptive because it motivates the focus of energies and the mobilization of resources toward an effective response. When a person expresses anger, adaptive mechanisms are also at work because it shows the person's willingness, and perhaps even ability, to defend themselves. Emotional responses are sensitive to contexts, and are therefore an integral part of our ways to adapt to daily life and the environment [511].

The ability to modify emotion responses according to changing context may be an important element of psychological adjustment [510]. An individual's capacity to modify emotion responses taking into account the demands of changing contexts (i.e., environmental or interpersonal) is particularly relevant. This mechanism is known as emotion context sensitivity [511].

Cultural and gender differences in emotional experiences have been identified in previous research [512]. For instance, these authors used the Granger causality test to establish how a person's cultural background and situation affect emotion. The conclusions drawn by [513] propose a top-down mechanism where gender and age can impact the brain mechanisms behind emotive imagery, either directly or by interacting with bottom-up stimuli.

Cultural neuroscientists are studying how cultural traits such as values, beliefs, and practices shape human affective, emotional, and physiological states (AFFECT) and behavior. Hampton and Varnum [514] have reviewed theoretical accounts on how culture impacts internal experiences and outward expressions of emotion, as well as how people opt to regulate them. They also analyze cultural neuroscience research that investigates how emotion regulation varies in different cultural groups.

Thus far, differences between nations have largely been the focus in studies of culture in social neuroscience. Culture impacts more than just our behavior—it also plays a role in how we see and interpret the world [515]. For instance, socioeconomic factors such as education, occupation, and income have a significant impact on how a person thinks. In one study, working-class Americans were shown to exhibit a more context-dependent thought process, similar to the collectivist patterns seen in other countries. Individuals of a lower social class in terms of their socio-economic status agreed with contextual explanations of economic trends, broad social outcomes, and emotions [516].

Gallo and Matthews [517] looked at the indirect evidence that socioeconomic status is associated with negative emotions and cognition, and that negative emotions and cognition are associated with target health status. They also proposed a general framework for understanding the roles of cognitive–emotional factors, arguing that low socioeconomic status causes stress, and impairs a person's reserve capacity for managing it, thus heightening emotional and cognitive vulnerability.

Choudhury et al. [518] explore critical neuroscience, a field of inquiry that probes the social, cultural, political, and economic contexts and assumptions that form the basis for behavioral and brain science research.

Numerous studies have illustrated that depending on the specific demographic background, there are major differences between users' emotions, behavior, and perceived usability. According to Goldfarb and Brown [519], scientific research is characterized by racial, cultural, and socioeconomic prejudices, which lead to demographic homogeneity in participation. This in turn spurs inaccurate representations of neurological normalcy and leads to poor replication and generalization.

According to Freud, the unconscious is a depository for socially unacceptable ideas, wishes or desires, traumatic memories, and painful emotions that psychological repression had pushed out of consciousness [520]. HireVue, which is a global front-runner in AI technologies, is one of the top emotional AI companies that is now turning to biosensors that read non-conscious data in lieu of facial coding methods to measure emotions [521].

The ideas of what it means to have good relationships and to be a good person differ in different cultural contexts [522]. People's emotional lives are closely related to these different ideas of how people see themselves and their relationships: Emotions usually match the cultural model [523,524]. Therefore, rather than being random, cultural variation in emotions matches the cultural ideals of ways to be a good person and to maintain good relationships with other people [506].

Aside from being biologically driven, emotion is also influenced by environment, as well as cultural or social situations. Culture can constrain or enhance the way emotions are felt and are expressed in different cultural contexts, and it can influence emotions in other ways. Studies have consistently shown cross-cultural differences in the levels of emotional arousal. Eastern culture, for instance, is related to low arousal emotions, whereas Western culture is related to high arousal emotions [525]. Many findings in cross-cultural research suggest that decoding rules and cultural norms influence the perception of anger [526]. Scollon et al. [527] look at five cultures (Asian American, European American, Hispanic, Indian, and Japanese) to assess the way emotions are experienced in these cultures. Pride shows the greatest cultural differences [527]. As emotions are fundamentally genetically determined, different ones are perceived in similar ways throughout most nations or cultures [528].

#### **6. Results**

The present article aims to bridge the affective biometrics and neuroscience gap in existing knowledge, in order to contribute to the overall knowledge in this area. We also aim to provide information on the knowledge gaps in this area and to chart directions for future research.

We conclude this review by discussing unanswered questions related to the next generation of AFFECT detection techniques that use brain and biometric sensors.

By performing text analytics of 21,397 articles that were indexed by Web of Science from 1990 to 2022, we examined the key changes in this area within the last 32 years. Scientific output relating to AFFECT detection techniques using brain and biometric sensors is steadily increasing. As this trend suggests, there has been continuous growth in the number of papers published in the field, with the total number of articles appearing between 2015 and 2021 nearing the total number of articles published over the previous 25 years (1990 to 2014). In light of the increasing commercial and political interest in brain and biometric sensor applications, this trend is likely to continue.

With ground-breaking emerging technologies and the growing spread of Industry 5.0 and Society 5.0, AFFECT should be analyzed by taking into account demographic and cultural background, socioeconomic status, diversity attitudes, and context. Advanced computational models will be needed for this approach.

Quite a few biometric and neuroscience studies have been performed in the world, where AFFECT detection takes into account demographic and cultural background (age, gender, ethnicity, race, major diagnoses, and major medical history); socioeconomic status (education, income, and occupation); diversity attitudes; and context. Yet, to the best of our knowledge, none of the technologies available in the world offer AFFECT detection that incorporates political views, personality traits, gender, race, diversity attitudes, and cross-cultural differences in emotion.

Some research confuses the physiological effects of emotional reactions with the biometric patterns used for individual identification. To avoid this confusion, the part of this review that discusses biometrics analyzes only physiological effects caused by emotional reactions (i.e., second generation biometrics; Section 3). Biometric patterns for individual identification are not analyzed in this research.

Human emotions can be determined by physiological signals, facial expressions, speech, and physical clues, such as posture and gestures. However, social masking (when people either consciously or unconsciously hide their true emotions) often renders the latter three ineffective. Physiological signals are therefore often a more accurate and objective gauge of emotions [529]. For instance, researchers [530,531] performed many studies to analyze physiological signals and unconscious emotion recognition. Nonetheless, our years of research experience have shown that in public spaces, facial expressions, speech, and physical clues, such as posture and gestures, are much more convenient and effective.

Emotion recognition can be more accurate when human expressions are analyzed looking at multimodal sources such as texts, physiological signals, videos, or audio content [532]. Integrated information from signals such as gestures, body movements, speech, and facial expressions helps detect various emotion types [533]. Statistical methods, knowledge-based techniques, and hybrid approaches are three main emotion classification approaches in emotion recognition [534].

Emotion classes can also be represented dimensionally. Categorized emotions are placed at distinct positions in a space that is either 2D (Circumplex model, "Consensual" Model of Emotion, Vector Model) or 3D (Lövheim Cube, Pleasure-Arousal-Dominance [PAD] Emotional-State Model, Plutchik's model). Most dimensional models include valence and arousal (or intensity) dimensions: the valence dimension indicates the degree to which an emotion is pleasant or unpleasant, whereas the arousal dimension distinguishes between states of activation and deactivation [82]. The objectives of our study were most in line with Plutchik's 'wheel of emotions' model, which we used in this research.

The use of artificial intelligence to recognize emotions and affective attitudes is a comparatively promising field of investigation. To make the most of artificial intelligence, multiple modalities in context should be generally used. Artificial intelligence has enabled biometric recognition and the efficient unpacking of human emotions and affective and physiological responses and has contributed considerably to advances in the field of pattern recognition in biometrics, emotions, and affective attitudes. Many different AI algorithms are used in the world, such as machine learning, artificial neural networks [535–537], search algorithms [166,538,539], expert systems [540,541], evolutionary computing [542,543], natural language processing [544,545], metaheuristics, fuzzy logic [546–548], genetic algorithm [549–551], and others.

Based on our review, presented in Sections 1–5, we find that investigators should develop procedures to guarantee that AI models are appropriately used and that their specifications and results are reported consistently. There is a need to create innovative AI and machine learning techniques.


The existing emotion recognition approaches all need data, but the training of machine learning algorithms requires annotated data, and obtaining such data is usually a challenge [552]. AI models may become less complex to use, and AI algorithms faster, when certain database techniques are applied. These techniques can also provide AI capability inside databases. Supporting AI training inside databases is a challenging task. One of the challenges is to store a model in databases so that it can be trained in parallel, with multiple tenants involved in its training and use, while security and privacy issues are also taken care of. Another challenge is to update a model, especially in the case of dynamic data updates [553]. The following datasets can help with the task of classifying different emotion types from multimodal sources such as physiological signals, audio content, or videos: BED [554], MuSe [555], MELD [544,556], UIT-VSMEC [411], HUMAINE [557], IEMOCAP [558], Belfast database [559], SEMAINE [560], DEAP [561], eNTERFACE [384], and DREAMER [562]. GitHub [563], for instance, provides a list of public EEG datasets such as the High-Gamma Dataset (128-electrode dataset from 14 healthy subjects with about 1000 four-second trials of executed movements, 13 runs per subject), the Motor Movement/Imagery Dataset (2 baseline tasks, 64 electrodes, 109 volunteers), and Left/Right Hand MI (52 subjects).

The findings also suggest that the development of more powerful algorithms cannot address the perception, reading, and evaluation of the complexity of human emotions, by making an integrated analysis of users' demographic and cultural background (age, gender, ethnicity, race, major diagnoses, and major medical history); socioeconomic status (education, income, and occupation); diversity attitudes; and context. We can only hope that the future will bring further research to address this issue and help to develop more advanced AFFECT technologies that can better cope with issues such as demographic and cultural background (age, gender, ethnicity, race, major diagnoses and major medical history); socioeconomic status (education, income and occupation); diversity attitudes; and context (weather conditions, pollution, etc.).

Worldwide research has yet to resolve several problems, and additional research areas have arisen, such as missing data analysis, potential bias reduction, a lack of stringent data collection and privacy laws, application of elicitation techniques in practice, open data and other data-related issues. Olivas et al. [564] for instance, analyze various methods for handling missing data:


It was found that the median correlation of the dependent variable of the Publications—Country Success model with the independent variables (0.6626) is higher than in the Times Cited—Country Success model (0.5331). Therefore, it can be concluded that the independent variables in the Publications—Country Success model are more closely related to the dependent variable than in the Times Cited—Country Success model (Figure 11).

**Figure 11.** Distribution of correlations based on 15 criteria applied to 169 countries, their publications, and citations, as a CSP map.

The CSP maps of the world that have been compiled for this research provide a visualization of two aspects. In Figures 12 and 13, a country's success (x-axis) is one aspect, while the publications dimensions (CSPN and CSPC; y-axis) are the other. In Figure 14, the publications (x-axis) are one aspect, while the publications' times cited dimension (y-axis) is the other. The CSP maps group the countries into the same eight clusters as the Inglehart–Welzel 2020 Cultural Map of the World (English-speaking, Catholic Europe, Protestant Europe, Orthodox Europe, West and South Asia, African-Islamic, Confucian, and Latin America) [565]. Two clusters, English-speaking and Protestant Europe, have been merged into one because of their shared history, religion, cultures, and degree of economic development. The parallels between these two clusters have been confirmed by numerous studies [566]. The Inglehart–Welzel 2020 Cultural Map of the World includes many institutional, technological, psychological, and economic variables that demonstrate strong perceptible correlations [567]. The country success indicators in the CSP maps can be characterized as a large set of variables within the criteria system, such as politics, human development and well-being, the environment, macroeconomics, quality of life, and values.

**Figure 12.** CSP map showing the success of countries in terms of the numbers of publications on AFFECT recognition (CSPN) in Web of Science journals with impact factor.

**Figure 13.** CSP map showing the success of countries in terms of the number of citations of their publications on AFFECT recognition (CSPC) in Web of Science journals with impact factor.

**Figure 14.** CSP map showing the number of articles on AFFECT recognition and the numbers of citations in Web of Science journals with impact factor.

In addition, this is a quantitative study to assess how the success of the 169 countries impacted the number of Web of Science articles published in 2020 on AFFECT recognition techniques that use brain and biometric sensors (or the latest figures available).

For the multiple linear regressions, we used IBM SPSS V.26 to build two regression models on 15 indicators of country success and the two predominant CSP dimensions, that is, 15 independent variables and two dependent variables. The independent variables and the two regression models are summarized in Tables 4–8. Table 4 contains descriptive statistics for the two CSP models. The minimum and maximum values indicate the range of values that each variable takes. The mean is the arithmetic average of the values of each variable. The standard deviation is a measure of the dispersion of the values around the mean. Kurtosis is a measure of whether the values are heavy-tailed or light-tailed relative to a normal distribution, whereas skewness is a measure of the asymmetry of the distribution of the values. Acceptable values are considered to be between −3 and +3 for skewness, and between −10 and +10 for kurtosis. When the skewness is close to zero and the kurtosis is close to three, the distribution of the values is in line with a normal distribution.
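The descriptive statistics described above can be illustrated with a minimal Python sketch. It is not the SPSS procedure used in the study: the function name `describe` and the sample values are hypothetical, and the kurtosis is computed in its Pearson form (equal to three for a normal distribution) to match the text. Statistical packages often report bias-corrected or excess kurtosis instead, so their output may differ.

```python
import math

def describe(values):
    """Return min, max, mean, sample standard deviation, skewness, and kurtosis."""
    n = len(values)
    mean = sum(values) / n
    # Central moments in population form (dividing by n).
    m2 = sum((v - mean) ** 2 for v in values) / n
    m3 = sum((v - mean) ** 3 for v in values) / n
    m4 = sum((v - mean) ** 4 for v in values) / n
    return {
        "min": min(values),
        "max": max(values),
        "mean": mean,
        "std": math.sqrt(m2 * n / (n - 1)),  # sample standard deviation
        "skewness": m3 / m2 ** 1.5,          # 0 for a symmetric distribution
        "kurtosis": m4 / m2 ** 2,            # Pearson kurtosis; 3 for a normal distribution
    }

# Hypothetical values, for illustration only (not data from Table 4).
stats = describe([1, 2, 3, 4, 5])
print(stats)
```

A symmetric sample such as the one above yields a skewness of zero; whether a variable passes the acceptance ranges quoted in the text can then be checked directly against the returned values.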

Step 9 entailed the construction of regression models for the number of publications and their citation rates, and the calculation of the effect size (ES) indicators describing them. Two dependent variables and 15 independent variables were analyzed to construct these regression models. The following indicators were used:

- Pearson correlation coefficient (r): Beta weights and structure coefficients r are the two sets of coefficients that can provide a more perceptive stereoscopic view of the dynamics of the data [571]. Interpretation may also be improved through the use of other results (e.g., [572]).
- Standardized beta coefficient (β): Theoretically, the highest-ranking variable is the one with the largest total effect, since β is a measure of the total effect of the predictor variables [573].
- Coefficient of determination (R2): This is a measurement of the accuracy of a CSP model. The outcome is represented by the dependent variables of the model. The closer the coefficient of determination is to one, the more variability the model explains. R2 can therefore be used to determine the proportion of the variation in the dependent variable that can be predicted by examining the independent variables [573].
- Standard deviation: If this is too high, it will render the measurement virtually meaningless [574].
- *p*-values: There is no direct relationship between the *p*-value and the effect size, and a small *p*-value may be associated with a small, medium, or large effect. There is also no direct relationship between the ES and its practical or clinical significance: a lower ES for one outcome may be more important than a higher ES for another outcome, depending on the circumstances [570].
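To make the relationship between these indicators concrete, the following Python sketch computes Pearson correlations, standardized beta coefficients, and R2 for a small multiple regression. It is an illustration under stated assumptions, not the SPSS analysis used in the study: the function names and the data are hypothetical. It relies on the standard result that, for z-scored variables, the betas solve R_xx β = r_xy and R2 equals the sum of β_j r_jy.

```python
import math

def pearson_r(x, y):
    """Pearson correlation coefficient between two equal-length sequences."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)

def solve(a, b):
    """Solve the linear system a x = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    m = [row[:] + [rhs] for row, rhs in zip(a, b)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(col + 1, n):
            f = m[r][col] / m[col][col]
            for c in range(col, n + 1):
                m[r][c] -= f * m[col][c]
    x = [0.0] * n
    for r in reversed(range(n)):
        x[r] = (m[r][n] - sum(m[r][c] * x[c] for c in range(r + 1, n))) / m[r][r]
    return x

def standardized_regression(predictors, y):
    """Return (standardized betas, R^2) for an ordinary least squares fit."""
    r_xx = [[pearson_r(p, q) for q in predictors] for p in predictors]
    r_xy = [pearson_r(p, y) for p in predictors]
    betas = solve(r_xx, r_xy)
    r_squared = sum(b * r for b, r in zip(betas, r_xy))
    return betas, r_squared

# Hypothetical data: two predictors and an outcome that is an exact linear combination.
x1 = [1, 2, 3, 4, 5]
x2 = [2, 1, 4, 3, 6]
y = [2 * a + 3 * b for a, b in zip(x1, x2)]
betas, r_squared = standardized_regression([x1, x2], y)
print(betas, r_squared)
```

Because the hypothetical outcome is an exact linear combination of the predictors, R2 comes out as one (up to rounding); with real indicators it would be lower, as in the 69.4% and 51.1% reported for the two models.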


**Table 4.** Descriptive statistics for the dependent variables of two models.

Based on the results of descriptive statistics, it can be concluded that the values of the dependent variables of the models used in the study demonstrate normal distribution (skewness < 10 and kurtosis < 10), which allows for the use of parametric analysis methods in the analysis.

**Table 5.** Goodness-of-fit testing for two models.


Standardized beta coefficients: \*\*\* significant at *p* < 0.001.

A correlation analysis found that the strongest relationship in the Publications—Country Success model is between the dependent variable Publications and the independent variable GDP per Capita. Meanwhile, in the Times Cited—Country Success model, the strongest relationship is between the variables of Times Cited and GDP per Capita in PPP. It was also found that in both models, the relationships between the dependent variables and the independent variables are statistically significant (*p* < 0.001), except for the relationships between the dependent variables and the Unemployment Rate variable.


**Table 6.** Descriptive statistics for two models.

A reliability analysis of the compiled regression models allows us to conclude that the models are suitable for analysis (*p* < 0.05). The independent variables used in the models explain 69.4% of the variance of the Publications variable and 51.1% of the variance of the Times Cited variable.

**Table 7.** Standardized beta coefficient values of the dependent variables.


Standardized beta coefficients: \* significant at *p* < 0.1, \*\* significant at *p* < 0.01.

An analysis of the standardized coefficients of the model allows us to conclude that changes in the GDP per Capita variable have the biggest impact on changes in the Publications variable. The GDP per Capita in PPP variable also has a significant impact. Meanwhile, the Times Cited variable is most affected by the GDP per Capita in PPP variable, which has a statistically significant effect on the dependent variable.


**Table 8.** How country success and its factors influence the two indicators.

To confirm Hypothesis 1, we built two CSP models, which are formal representations of the CSP maps. These models demonstrate that on average, an increase of 1% in a country's success leads to an average improvement by 0.203% in the country's two CSPN and CSPC dimensions. As the success of a country increased by 1%, the numbers of Web of Science articles published and their citations grew by 1.962% and 2.101%, respectively. Figures 12 and 13 also illustrate that an increase in a country's success goes hand in hand with a jump in its CSPN and CSPC dimensions, thus confirming Hypothesis 1.

Hypothesis 2 was based on the results of the analysis pertinent to the CSP models, as well as on the correlations found between the 169 countries and the 15 indicators [66]. A clear visual confirmation of Hypotheses 1 and 2 is also provided by Figures 12 and 13, which show the specific groupings of countries in the seven clusters examined in this study. These models may be of major significance for policy makers, R&D legislators, businesses, and communities.

#### **7. Evaluation of Biometric Systems**

In this section, we outline the rationale behind the current biometrics and brain approaches, compare the efficacy of existing methods, and determine whether or not they are capable of addressing the kinds of issues and challenges associated with the field (with figures). Biometric systems have several drawbacks in terms of their precision, acceptability, quality, and security. They are generally evaluated based on aspects such as (1) data quality; (2) usability; (3) security; (4) efficiency; (5) effectiveness; (6) user acceptance and satisfaction; (7) privacy; and (8) performance.

Data quality measures the quality of biometric raw data [576,577]. This type of assessment is generally used to quantify biometric sensors and can also be used to enhance the system performance. According to the International Organization for Standardization ISO 13407:1999 [578], usability is defined as "[t]he extent to which a product can be used by specified users to achieve specified goals with effectiveness, efficiency, and satisfaction in a specified context of use" [579]:


Security measures the robustness of a biometric system (including algorithms, architectures, and devices) against attack. The International Organization for Standardization ISO/IEC FCD 19792 [581] specifically addresses processes for evaluating the security of such systems [579].

Unlike traditional methods, biometric systems do not provide a 100% reliable answer, and it is almost impossible to obtain such a response. In a secure biometric system, there is a trade-off between recognition performance and protection performance (security and privacy). This trade-off arises from the unclear concept of security, which requires a more standardized framework for evaluation purposes. If this gap can be closed, an algorithm could be developed that would jointly reduce both of them. ISO 19795 contained standards for performance metrics and evaluation methodologies for traditional biometric systems. In addition to performance testing, it provided metrics related to the storage and processing of biometric information [582]. ISO/IEC 24745 specifies that, unlike privacy, security is delivered at the system level. In general, the ability of a system to maintain the confidentiality of information with the use of the provided countermeasures (such as access control, integrity of biometric references, renewability, and revocability) is referred to as its security factor. When seeking to bypass the security of a biometric system, an attacker may impersonate a genuine user to gain access to and control over various services and sensitive data. Privacy refers to secrecy at the information level. The following criteria were proposed in ISO/IEC 24745 for the purpose of evaluating the privacy offered by biometric protection algorithms: irreversibility, unlinkability, and confidentiality [583].

The discriminating powers of all biometric technologies rely on the extent of entropy, with the following used as performance indicators for biometric systems [584–587]: False match rate (FMR); False non-match rate (FNMR); Relative operating characteristic or receiver operating characteristic (ROC); Crossover error rate or equal error rate (CER or EER); Failure to enroll rate (FER or FTE), and Failure to capture rate (FTC).
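As an illustration of how some of these indicators relate to one another, the Python sketch below computes the FMR and FNMR at a given decision threshold and approximates the EER by scanning candidate thresholds. It is a minimal sketch under stated assumptions: the function names and the similarity scores are hypothetical, and a higher score is assumed to mean a better match.

```python
def error_rates(genuine, impostor, threshold):
    """Return (FMR, FNMR) at a threshold, where a score >= threshold counts as a match."""
    fmr = sum(s >= threshold for s in impostor) / len(impostor)  # impostors accepted
    fnmr = sum(s < threshold for s in genuine) / len(genuine)    # genuine users rejected
    return fmr, fnmr

def equal_error_rate(genuine, impostor):
    """Approximate the EER by trying every observed score as a candidate threshold."""
    best = None
    for t in sorted(set(genuine) | set(impostor)):
        fmr, fnmr = error_rates(genuine, impostor, t)
        gap = abs(fmr - fnmr)
        if best is None or gap < best[0]:
            best = (gap, (fmr + fnmr) / 2, t)
    return best[1], best[2]  # (EER estimate, threshold at which it occurs)

# Hypothetical similarity scores, for illustration only.
genuine_scores = [0.9, 0.8, 0.7, 0.4]
impostor_scores = [0.1, 0.2, 0.3, 0.6]
eer, threshold = equal_error_rate(genuine_scores, impostor_scores)
print(eer, threshold)
```

Raising the threshold lowers the FMR at the cost of a higher FNMR; plotting the two rates over all thresholds gives the ROC curve, and the EER is the point at which they coincide.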

Each biometric technology has its own specific advantages and disadvantages. Table 9 shows these comparisons.


**Table 9.** Benefits and limitations of biometric technologies.


Upon completing the literature analysis, we compared biometric technologies on the following seven parameters: universality, distinctiveness/uniqueness, permanence, collectability, performance, acceptability, and circumvention (Table 10). Another set of comparisons covered the strengths and weaknesses characteristic of biometric technologies, relating to their ease of use, error incidence, accuracy, user acceptance, long-term stability, cost, template sizes, security, social acceptability, popularity, speed, and whether or not they have been socially introduced (Table 11). The working characteristics of various biometrics differ, as does their accuracy, and depend on the design of their operation. The level of security and the kinds of possible errors are also different in each biometric approach; biometric sample holders may be denied access because of various factors such as aging, cold, weather conditions, physical damage, and so on [600,601]. Other researchers also look at FAR, FRR, CER, and FTE in their comparisons of biometric technologies (Table 12).


**Table 10.** Comparison of biometric technologies by seven characteristics (traits).


**Table 11.** Comparison of biometric technologies by various attributes.


**Table 12.** Comparison of performance metrics for biometric technologies by various authors.

Multimodal biometric systems take advantage of multiple sensors or biometrics to remove the restrictions of unimodal biometric systems [616]. While unimodal biometric systems are restricted by the integrity of their identifier, the chance of several unimodal systems having the same restrictions is low [617]. Multimodal biometric systems can fuse these unimodal systems sequentially, simultaneously, both ways, or in series, corresponding to the sequential, parallel, hierarchical, and serial integration modes, respectively. For instance, in decision-level fusion, the final results of multiple classifiers are joined using methods such as majority voting [616]. This multimodal analysis will assist in identifying the actual reasons for such issues with the current biometrics and brain approaches, as well as the restrictions of the existing state-of-the-art approaches and technologies.

An efficient way to combine multiple classifiers is needed when an array of classifier outputs is developed. Various architectures and schemes have been proposed for joining multiple classifiers. The most popular methods are majority vote and weighted majority vote. In majority vote, the right class is the one most selected by the various classifiers. If all the classifiers show different classes or in the event of a tie, then the one with the highest overall output is chosen to be the right class. The vote averaging method averages the confidence output by each classifier for every class over the entire ensemble. The class output with the highest average value is selected to be the right class [618]. The vote averaging method has been used to measure the efficacy of existing biometrics methods (Tables 10 and 11). In our case, High (Very High) was assigned 3 points, Medium was assigned 2, and Low was assigned 1. The calculations did not evaluate some qualitative indicators, such as error incidence and socially introduced. Additionally, not all biometrics technologies had data on the analyzed indicators. As a result, eye tracking was not evaluated in this case due to a lack of data. The highest average number of points was collected by Skin temperature-thermogram (2.57), Iris/pupil (2.43), Face (2.30), and Signature (2.09). Many of the metrics for biometric technologies in Tables 9–12 are analyzed in detail throughout the article.
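The combination schemes and the point-based scoring described above can be sketched in Python as follows. This is an illustrative sketch rather than the procedure used to produce the reported averages: the function names, the class labels, and the example ratings are hypothetical, and missing indicators are assumed to be skipped when averaging.

```python
from collections import Counter

def majority_vote(labels):
    """Return the most frequent label, or None on a tie (a tie-break rule is then needed)."""
    counts = Counter(labels).most_common()
    if len(counts) > 1 and counts[0][1] == counts[1][1]:
        return None
    return counts[0][0]

def vote_average(confidences):
    """Average per-class confidences across classifiers and pick the top class.

    `confidences` is a list of dicts, one per classifier, mapping class -> confidence.
    """
    classes = confidences[0].keys()
    averages = {c: sum(d[c] for d in confidences) / len(confidences) for c in classes}
    return max(averages, key=averages.get), averages

# Point values as stated in the text: High (Very High) = 3, Medium = 2, Low = 1.
POINTS = {"Very High": 3, "High": 3, "Medium": 2, "Low": 1}

def average_points(ratings):
    """Average the point values of the ratings, skipping missing (None) entries."""
    scored = [POINTS[r] for r in ratings if r is not None]
    return sum(scored) / len(scored) if scored else None

# Hypothetical classifier outputs and ratings, for illustration only.
print(majority_vote(["face", "iris", "face"]))
print(vote_average([{"face": 0.6, "iris": 0.4}, {"face": 0.2, "iris": 0.8}])[0])
print(average_points(["High", "Medium", "Low", None]))
```

Returning `None` on a tie keeps the tie-break explicit; in the scheme described in the text, the tie would be resolved by choosing the class with the highest overall output, which is what `vote_average` computes.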

#### **8. Discussion and Conclusions**

Nevertheless, there are still unanswered questions that need to be addressed. We evaluated the evidence available to find a relationship between brain and biometric sensor data and AFFECT in order to determine the primary digital signals for AFFECT. The multidisciplinary literature used was from the disciplines of engineering, computer science, neuroscience, physiology, psychology, mathematical modeling, and cognitive science. The distinct conventions of these disciplines resulted in certain variegations, depending on the features and characteristics of the research results being focused on. The literature under analysis has small sample sizes, short follow-up times, and significant differences in the quality of the reports, which limits the interpretability of the pooled results. On average, the current AFFECT detection techniques that use brain and biometric sensors achieved a classification accuracy greater than 70%, which seems sufficient for practical applications. As part of this review, several issues that need to be addressed were identified, and numerous recommendations and directions for future AFFECT detection and recognition research were suggested. They are listed below:


important aspects of making data findable, accessible, interoperable, and reusable, or FAIR. Open data analysis should also include recognized and validated scales for AFFECT evaluation; any accessible confirmation on the reliability and validity of the AFFECT device and sensor applied should be presented. The open datasets have usually sought to obtain higher accuracy by using different sets of stimuli and groups of participants.

Emotional acculturation happens when people, on contact with a different culture, learn new ways to express their emotions [619], incorporate new cultural values in their existing set, and then adjust their emotions to suit these new values [620–623]. This may be a research area in affective computing that needs more studies and focus. With growing global integration, emotional acculturation will become increasingly important, and advanced computational models will be needed to simulate the related processes. M.-T. Ho et al. [624] believe that this may be a key thematic change in the decades to come. The findings also suggest that developing more powerful algorithms cannot solve the problem of perceiving, reading and evaluating the complexity of human emotions. Instead, the complex modulators from which affective and emotional states stem need to be better understood by the scientific community. We can only hope that the future will bring further research that will remedy this and help develop more advanced technologies that can better cope with issues such as gender, race, diversity attitudes, and cross-cultural differences in emotion [624].

The substantial improvements in the development of affordable and simple to utilize sensors for recognizing AFFECT have resulted in numerous studies being conducted. For this review, we studied in detail 634 articles. We focused on recent state-of-the-art AFFECT detection techniques. We also took existing data sets into account. As this review illustrates, exploring the relationship between brain and biometric signals and AFFECT is a formidable undertaking, and novel approaches and implementations are continually being expanded.

The evaluation of the intensity of human AFFECT is a complex process which requires the use of a multidirectional approach. The main difficulties of this process include variations in the nature of human beings, social aspects, etc. As a result, methods that suit the average evaluation of the majority of customers show poor results in personalized cases, and vice versa. Moreover, the reliability of evaluations of human emotions strongly depends on the number of biometric parameters used, and the measurement methods and sensors applied. It is well known that a higher reliability of recognition can be achieved by increasing the number of parameters, but this will also increase the need for certain equipment and will slow down the evaluation process. The selection of measurement methods and sensors is no less important in the successful recognition of emotions. Contact measurement methods give the most reliable results, but their implementation is relatively complicated and may even be frightening for potential customers. The best solution in this case is non-contact measurement methods, that is, methods which do not require special preparation and allow measurements to be taken without the knowledge of the customer.

Future research could focus on reactions at the stage when an emotion is still developing, where sensing and evaluation become faster than the person's own recognition of the emotion.

This research has addressed the various issues that emerge when affective and physiological states, as well as emotions, are determined by recognition methods and sensors and when such studies are later applied in practice. The manuscript presents the key results on the contribution of this research to the big picture. These results are summarized below:


Information on diversity attitudes, socioeconomic status, demographic and cultural background, and context is missing in many studies. In this study, we have identified real-time context [347] data and have integrated them with AFFECT data. For example, the ROCK Video Neuroanalytics system and associated e-infrastructure were established as part of the H2020 ROCK project, in which passers-by were tracked at 10 locations across Vilnius [348]. One of the outputs was the real-time Vilnius Happiness Index (Figure 10 and https://api.vilnius.lt/happiness-index, accessed on 5 September 2022), and the project also involved a number of additional activities (https://Vilnius.lt/en/category/rock-project/, accessed on 5 September 2022) [625,626].

The analysis of the global gap in the area of affective biometric and brain sensors presented in this study and our aim of contributing to the current state of research in this area have led to the aforementioned research results.

Based on the evaluation of biometric systems performed in Section 7 and the conclusions presented in Section 8, future AFFECT biometrics and neuroscience development directions and guidelines become apparent. We performed the above analysis by extensively discussing biometric and neuroscience methods and domains in the article.

Additionally, Sections 2 and 6 present statistical and multiple criteria analyses across 169 nations. Our outcomes demonstrate a connection between a nation's success, its number of Web of Science articles published, and its frequency of citation on AFFECT recognition. This analysis demonstrates which country success metrics significantly influence future AFFECT biometrics and neuroscience development.

Advancements in the development of biometric and neuroscience sensors and their applications are summarized in this review. Regardless of the encouraging progress and new applications, the lack of replicated work and the widely divergent methodological approaches suggest the need for further research. The interpretation of current research directions, the technical challenges of integrated neuroscience and affective biometric sensors, and recommendations for future works are discussed. The reviewed literature revealed a host of traditional and recent challenges in the field, which were examined in this article and are presented below.

Biometric research aims to provide computers with advanced intelligence so that they can automatically detect, capture, process, analyze, and identify digital biometric signals—in other words, so they can "see and hear". In addition to being one of the basic functions of machine intelligence, this is also one of the most significant challenges that we face in theoretical and applied research [627].

There are still many challenging issues in terms of improving the accuracy, efficiency, and usability of EEG-based biometric systems. There are also problems concerning the design, development and deployment of new security-related BCI applications, such as personal authentication for mobile devices, augmented and virtual reality, headsets and the Internet [628]. Albuquerque et al. [628] have presented the recent advances in EEG-based biometrics and addressed the challenges in developing EEG-based biometry systems for various practical applications. They have also put forth new ideas and directions for future development, such as signal processing and machine learning techniques; multimodal (EEG, EMG, ECG, and other biosignals) biometrics; pattern recognition techniques; preprocessing, feature extraction, recognition and matching; protocols, standards and interfaces; cancellable EEG biometrics; security and privacy; and information fusion for biometrics involving EEG data, virtual environment applications, stimuli sets and passive BCI technology.

Some of these challenges (accuracy, efficiency, usability, etc.) are analyzed in the article. Each of these features can be examined in more detail. For example, Fierrez et al. [629] analyzed five challenges in multiple classifiers in biometrics: design of robust algorithms from uncooperative users in unconstrained and varying scenarios; better understanding about the nature of biometrics; understanding and improving the security; integration with end applications; understanding and improving the usability. "Design of robust algorithms from uncooperative users in unconstrained and varying scenarios" is a challenge that has been a major focus of biometrics research for the past 50 years [2], but the performance level for many biometric applications in realistic scenarios is still not adequate [629].

Recently, new challenges have been appearing in the field, some of which are presented below as examples. Sivaraman [630] argues that in the age of AI and machine learning, cyberattacks are more powerful and are sometimes able to crack biometric systems. Additionally, these attacks will become more frequent. Multimodal biometrics, where a combination of biometrics is used for greater security, are increasingly important. The pandemic has resulted in changes to the biometric algorithms of various modalities. Facial recognition algorithms have been improved to recognize people wearing masks and cosmetics. Updates like these may improve the accuracy of biometrics systems. Biometric devices will take web and cloud-based applications to the next level, as many organizations will continue to operate remotely [630].

Furthermore, a few problems have not been solved, and additional research fields have emerged. Biometric and neuroscience technologies lack privacy and are invasive, and people do not like to share their personal data or be identified. Other problems include a lack of protection from hacking; a lack of accuracy; a fairly expensive life cycle (brief, design, development, set up, running, operation, etc.); a lack of capability to read some human features; customer satisfaction that is not always guaranteed; and the inefficiency of human figure form recognition, examination of figure fragments, examination of head vibrations, and examination of human electrical fields.

**Author Contributions:** Conceptualization and methodology, A.K.; investigation, A.K., A.A., I.U., R.K., V.L., A.B.-V., I.V. and L.K.; resources, writing—review, editing and visualization, A.K., A.A., I.U., R.K., V.L., A.B.-V., I.V. and L.K.; supervision, A.K. All authors have read and agreed to the published version of the manuscript.

**Funding:** This work was supported as part of the 'Building information modeling-based tools and technologies toward fast and efficient RENovation of residential buildings—BIM4REN' project, which received funding from the European Union's Horizon 2020 research and innovation program under grant agreement No. 820773. This research was also supported via Project No. 2020-1-LT01-KA203- 078100 "Minimizing the influence of coronavirus in a built environment" (MICROBE) from the European Union's Erasmus+ program.

**Institutional Review Board Statement:** Not applicable.

**Informed Consent Statement:** Not applicable.

**Data Availability Statement:** All extracted data are included in the manuscript.

**Conflicts of Interest:** The authors declare no conflict of interest.

#### **Abbreviations**




#### **References**


Howlett, R.J., Jain, L.C., Eds.; Smart Innovation, Systems and Technologies; Springer: Singapore, 2019; Volume 143, pp. 167–182. [CrossRef]


### *Article* **Multispectral Facial Recognition in the Wild**

**Pedro Martins 1,2, José Silvestre Silva 1,3,4,\* and Alexandre Bernardino 2,5**


**\*** Correspondence: jose.silva@academiamilitar.pt

**Abstract:** This work proposes a multi-spectral face recognition system in an uncontrolled environment, aiming to identify or authenticate identities (people) through their facial images. Face recognition systems in uncontrolled environments have shown impressive performance improvements over recent decades. However, most are limited to the use of a single spectral band in the visible spectrum. The use of multi-spectral images makes it possible to collect information that is not obtainable in the visible spectrum when certain occlusions exist (e.g., fog or plastic materials) and in low- or no-light environments. The proposed work uses the scores obtained by face recognition systems in different spectral bands to make a joint final decision in identification. The evaluation of different methods for each of the components of a face recognition system allowed the most suitable ones for a multi-spectral face recognition system in an uncontrolled environment to be selected. The experimental results, expressed in Rank-1 scores, were 99.5% and 99.6% in the TUFTS multi-spectral database with pose variation and expression variation, respectively, and 100.0% in the CASIA NIR-VIS 2.0 database, indicating that the use of multi-spectral images in an uncontrolled environment is advantageous when compared with the use of single spectral band images.

**Keywords:** deep neural networks; multispectral imaging; face recognition; in the wild

#### **1. Introduction**

The sense of sight allows us to observe dangers, identify objects, and recognize people. This last task is fundamental for human beings as social beings. It enables us to differentiate the level of trust someone can give to a specific person, with this being at the base of the construction of communities. Such is the importance of this task that it has become one of the main topics of research with the emergence of machine learning, thus allowing machines to incorporate this biological capacity.

Multi-spectral images have several military applications, including the detection of camouflaged people [1], classification of vegetation types in military regions [2], landmine detection [3] and face recognition [4]. Current face recognition systems operating in the visible (VIS) domain have reached a significant level of maturity. They are widely used nowadays, in applications ranging from security mechanisms and the unlocking of electronic devices such as smartphones and personal computers to population control systems [5].

However, most current face recognition systems [6] require the cooperation of the user to ensure that pictures are taken in favorable conditions (frontal postures, good illumination, no occlusion) and have trouble dealing with uncontrolled scenarios. Uncontrolled environment scenarios, such as riots and violent demonstrations, can often be used by criminals and terrorist cell members to move around and cause damage to Homeland Security, as this type of environment adds difficulty to their detection. The uncontrolled environment is mainly characterized by a variety of lighting, pose, facial expressions and the existence of occlusions [5]. These features are challenges to face recognition systems due to the multiple intrapersonal variations they provide, making it difficult to correctly identify an individual's identity based on a collaborative image of the individual.

**Citation:** Martins, P.; Silva, J.S.; Bernardino, A. Multispectral Facial Recognition in the Wild. *Sensors* **2022**, *22*, 4219. https://doi.org/10.3390/s22114219

Academic Editor: Mincheol Whang

Received: 5 April 2022 Accepted: 30 May 2022 Published: 1 June 2022

**Publisher's Note:** MDPI stays neutral with regard to jurisdictional claims in published maps and institutional affiliations.

**Copyright:** © 2022 by the authors. Licensee MDPI, Basel, Switzerland. This article is an open access article distributed under the terms and conditions of the Creative Commons Attribution (CC BY) license (https://creativecommons.org/licenses/by/4.0/).

This work has as its main objective the development of a multi-spectral face recognition system in an uncontrolled environment. To achieve this goal, the solutions used by current recognition systems and the evaluation of the benefits of using multi-spectral images are explored. The developed face recognition system is evaluated in public multi-spectral image datasets with pose and expression variability.

This paper is organized into six sections. The Introduction section describes the motivation for the work, the objectives and the structure of the paper. The Background section explains important concepts, such as how a face recognition system works, what multispectral images are and what their advantages are. The Related Work section presents the state of the art of multispectral face recognition methods in an uncontrolled environment and of public multispectral databases. The Methodology section defines the proposed method for achieving the objectives. The Results and Discussion section describes the multispectral databases used and reports several experiments performed with the various modules proposed in the methodology, each accompanied by its respective analysis and discussion. The Conclusions section presents the conclusions of this work, thus consolidating the proposed objectives.

#### **2. Background**

#### *2.1. Face Recognition*

In general, a face recognition system is described in several phases. The first phase consists of acquiring the facial images and pre-processing them, such as locating the faces and cropping them. In a second phase, the features are extracted from the facial image, for instance, the position of facial landmarks, eye distance or even the face tones. Finally, these features are used in a classifier for identification or verification purposes.

Face recognition can be performed in a controlled or uncontrolled environment. The controlled environment, also known as consent recognition, is one in which the user cooperates in the recognition by facilitating it through correct and static posture in a place with good lighting. In the uncontrolled environment, recognition is dynamic, without the user cooperating in acquiring an image, making the face recognition process very difficult due to the diversity of the surrounding environment (e.g., low visibility), facial poses and expressions.

#### *2.2. Multispectral Imaging in an Uncontrolled Environment*

The databases of the VIS domain and the use of image synthesizers, which generate multiple poses and facial expressions from the obtained images, have allowed the difficulties associated with the variety of poses and facial expressions to be circumvented. However, two points have proved more difficult to overcome: the change of illumination and occlusions. This has led to the use of multiple spectral bands, with particular emphasis on the infrared (IR) spectral band, which can acquire images in environments with little or no brightness and overcome occlusions such as smoke and fog. In short, multispectral analysis allows a face recognition system to extract facial features that would be impossible to obtain with images from the VIS spectral band.

The IR bands can be categorized according to several spectral bands [7]. The active bands are the near-infrared (NIR) and short-wavelength infrared (SWIR). To acquire images in these bands, the object must receive illumination, even if scarce, because it is through reflection that the image is acquired. Such a fact means these images are commonly used in night vision devices. The NIR band allows the difficulties posed by the variation of illumination to be overcome, while the SWIR has the advantage of obtaining images through smoke and fog. The passive bands are the mid-wavelength infrared (MWIR) and long-wavelength infrared (LWIR). Unlike the active bands, the passive bands allow us to acquire images using only the thermal radiation emitted by a body, commonly known as thermal images.

The use of IR images for automatic face recognition is not without challenges, as these images are sensitive to the emotional, physical and health conditions of the individual, as well as the surroundings, and do not serve as an absolute alternative to the use of the VIS spectrum, but rather as a complement [8]. Another difficulty arises from the low number of public databases with images from both spectral ranges and in an uncontrolled environment [9], which limits the creation of rich classification models and the ability to characterize the performance of those systems in realistic conditions.

#### **3. Related Work**

Multi-spectral face recognition in an uncontrolled environment can be subdivided into two areas. The first is face recognition in an uncontrolled environment, which is already challenging. The second is multi-spectral face recognition, i.e., using different spectral bands in face recognition. This section briefly reviews the progress made in these two areas.

#### *3.1. Face Recognition in an Uncontrolled Environment*

The uncontrolled environment, strongly characterized by pose-light-expression factors, emerges as a problem for current recognition systems. A significant step was taken towards solving this type of problem by introducing very large databases to train Deep Convolutional Neural Networks (DCNN) in combination with the emergence of image synthesis methods [5]. The two main image synthesis methods are: (i) one-to-many augmentation, which consists of generating different poses of a face from a canonical face image; (ii) many-to-one normalization, which consists of normalizing any pose of the face to a canonical face pose [5]. The use of Generative Adversarial Networks (GAN), introduced by Goodfellow et al. [10], is characterized by the use of a generator and a discriminator (see Figure 1). The generator is responsible for producing samples given an input image so that the discriminator cannot discern which of the samples is real and which is false.

**Figure 1.** Schematic of the training of a GAN. The dashed line shows the process of sample generation.

Since their appearance in face normalization, with DR-GAN [11], GANs have taken the lead in solving the problem of pose and facial expression variation. As for one-to-many augmentation using GANs, as is the case with the DA-GAN network [12], their image production power also gives them an advantage compared to other algorithms.

Normalization of many-to-one images is an extreme image synthesis problem due to the pose differences of a face. Cao et al. [13] proposed HF-PIM, normalizing the face to a frontal pose through a texture fusion deformation procedure leveraging a dense matching field to interconnect the 2D and 3D surface spaces. Qian et al. [14] presented Face Normalization Module (FNM), which encodes images using a pre-trained network for feature extraction and generates realistic images.

One-to-many augmentation is another approach to achieve face recognition regardless of the pose. Tran et al. [15] synthesized different poses through 3D modeling and then trained a DCNN to perform face recognition with varied poses. The DA-GAN proposed by Zhao et al. [12] created 2D images through 3D modeling and then refined the obtained 2D images to be as realistic as possible, using a GAN to try to preserve the identity of the face. Thus, the DA-GAN network was also used to augment the training data.

#### *3.2. Multispectral Face Recognition*

The main multi-spectral face recognition methods can be characterized by three important features: Image Synthesis Methods, Fusion Methods and Loss Functions.

Fusion methods are subdivided into feature fusion and score fusion. In the first, the features from the different spectral bands of the facial image are fused, allowing the most relevant features to be extracted from the different bands and joined in a single vector. The second method combines the scores obtained from each single-band classifier (e.g., a classifier operating only in the LWIR band and another operating only in the NIR band) [16].

The image synthesis methods allow an image of a spectral band to be transformed into another, helping to compare two images. The main advantage of image synthesis is that it enables an image to be passed from any spectral band to the VIS band, making it possible to use classifiers implemented to process images of the VIS spectrum [17]. One of the most recent works in this area synthesizes VIS images from NIR images using GANs [18].

Finally, all neural networks use cost functions during training to update the network weights. However, certain cost functions have been proposed specifically for the classification of multi-spectral images. Examples of these cost functions are the Scatter Loss [19] and the Wasserstein Distance [20].

#### *3.3. Gaps*

Although several scientific works address multi-spectral face recognition, few of these demonstrate its power in an uncontrolled environment due to the limitations in current databases of multi-spectral face images. In existing datasets, the variations of conditions are not extreme, as they are usually semi-controlled environments and not *in the wild* (uncontrolled environment). For example, the most studied database in multi-spectral face recognition, CASIA NIR-VIS 2.0 [21], uses images in which the pose has few deviations from the frontal position, which does not reliably characterize the uncontrolled environment. Thus, the fact that these databases are incomplete (compared to those of the VIS band) is still a barrier to improving the capability of multi-spectral face recognition systems in an uncontrolled environment.

The present work proposes a system that integrates the capabilities of current face recognition systems in an uncontrolled environment in the VIS spectrum at the pose variation level and the capabilities of multi-spectral face recognition systems to surpass illumination variation.

#### **4. Methodology**

The proposed multi-spectral face recognition system consists of three tasks: Face Detection and Alignment, Image Synthesis and Face Recognition. In Figure 2, the general operation of the proposed face recognition system is shown, including the steps performed in each task.

**Figure 2.** Schematic of the operation of the proposed face recognition system.

In the initial phase of the system, it is necessary to acquire multi-spectral images, which can be obtained through mono-spectral equipment (collects the image in only one spectral band) or multi-spectral (collects the image in different spectral bands). After image acquisition, the Face Detection and Alignment module aims to obtain an aligned and centered facial image with predefined dimensions. To achieve this goal, it is necessary to detect the presence of human faces and then perform a face marking, detecting essential landmarks of the face, such as eyes and nose, allowing a correct alignment of the face and clipping around it. The following task is Image Synthesis, which aims to obtain a frontal facial image. The next task is Face Recognition, where facial image features are extracted through a CNN and a one-shot learning methodology is followed for the classification task, obtaining similarity scores for each spectral band. These scores are combined using a score fusion method, and the predicted identity is the one with the highest combined score.

#### *4.1. Face Detection and Alignment*

Face detection, in conjunction with face alignment, aims to detect the faces present in the input image and identify facial landmarks so that faces are centered, aligned and equally sized. Since face detection algorithms detect faces in rectangular areas without rotating the image, a face landmark detection algorithm is needed to apply a rotation so that the face is aligned on the horizontal plane, using the imaginary eye line. Thus, given an image, the face detection and alignment module (see Figure 3) identifies the different faces present, extracts the facial landmarks and processes the image to produce facial images where the face is centered and aligned.

**Figure 3.** Flowchart of the steps of a facial detection and alignment module.

The face detection algorithms explored in this work are based on SSD (single-shot multibox detector), a deep learning architecture for object detection [22]. The basic idea of the SSD is to generate scores for the presence of each object category in each predefined box and produce adjustments to the box to match the shape of the object. In this work, three SSD based methods are tested: (i) the S3FD algorithm [23], (ii) the facial detection deep neural network of OpenCV [24] and (iii) the DSFD algorithm [25]. The S3FD has contributions to better cope with scaling variations with a single deep network. The DSFD uses a feature enhancement module to extend the single-shot detector to a dual-shot detector, obtaining more robust and discriminable features.

As for the facial landmark detection algorithms, the DLIB library's 68 landmark network, adapted from Kazemi and Sullivan [26], and Bulat's 2D-FAN [27], also with 68 landmarks, were tested. The latter uses an Hour-Glass [28] based architecture, originally designed to estimate the human pose. Both networks receive an image of a person and produce, as output, the position of the different facial landmarks around the face.

All the algorithms addressed in this subsection were trained in databases that only contain images in the spectral band of the VIS. To achieve data normalization, it is necessary to (i) rotate the image to align the eye line with the horizontal, (ii) crop the image to center the face image, and (iii) resize the image so that all output images have the specified dimensions.
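The three normalization steps can be expressed as a single affine transform computed from the two eye centers. The following is a minimal sketch, not the authors' implementation: the function names, the output size of 128 pixels and the layout parameters `eye_y` and `eye_dist` are assumptions chosen only for illustration.

```python
import numpy as np

def eye_alignment_transform(left_eye, right_eye, out_size=128, eye_y=0.4, eye_dist=0.35):
    """Return a 2x3 affine matrix that maps input pixels to an aligned crop.

    left_eye / right_eye: (x, y) eye centers in the input image, where
    left_eye is the eye with the smaller x coordinate.
    eye_y and eye_dist are hypothetical layout choices: after the transform
    the eyes lie on row eye_y * out_size, eye_dist * out_size pixels apart.
    """
    left_eye = np.asarray(left_eye, dtype=float)
    right_eye = np.asarray(right_eye, dtype=float)
    dx, dy = right_eye - left_eye
    angle = np.arctan2(dy, dx)                        # tilt of the eye line
    scale = (eye_dist * out_size) / np.hypot(dx, dy)  # (iii) resize factor

    # (i) rotate by -angle so that the eye line becomes horizontal, then scale
    c, s = np.cos(-angle), np.sin(-angle)
    rot = scale * np.array([[c, -s], [s, c]])

    # (ii) translate so that the midpoint between the eyes is centered
    eyes_center = (left_eye + right_eye) / 2.0
    target = np.array([out_size / 2.0, eye_y * out_size])
    shift = target - rot @ eyes_center
    return np.hstack([rot, shift[:, None]])

def apply_affine(matrix, point):
    """Map a single (x, y) point with a 2x3 affine matrix."""
    return matrix[:, :2] @ np.asarray(point, dtype=float) + matrix[:, 2]

# Toy landmarks: a face whose eye line is tilted in the input image
M = eye_alignment_transform((40, 60), (80, 90))
print(apply_affine(M, (40, 60)), apply_affine(M, (80, 90)))
```

The resulting matrix is a forward map from input to output coordinates, so it could be passed to an image warping routine such as OpenCV's `cv2.warpAffine(image, M, (out_size, out_size))` to produce the aligned, cropped and resized face.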

#### *4.2. Image Synthesis*

To overcome the problems related with image acquisition in an uncontrolled environment, such as variation in lighting, occlusions and changes of poses, a face normalization module is used. This module aims to synthesize (create) an image of a face with frontal pose and neutral expression from a non-frontal face image.

To exemplify the expected behavior, Figure 4 shows an input face image in a non-frontal pose, from which the image synthesis module produces a frontal face image. Thus, it is intended that the image acquired helps obtain the identity features present in the facial image. The models FNM [14] and FFWM [29] are analyzed.

**Figure 4.** Input and output of the Image Synthesis module (intended function, not the result of a real experiment).

FNM is a GAN with two new features. First, it uses a network specialized in obtaining facial features to build the generator and provide the ability to preserve facial identity. Second, facial discriminators are used to refine local textures. Their authors claim that this model produces a face in the canonical pose without expression, which directly improves the performance of a face recognition system.

The normalization method of the FFWM model consists of using a deformation module, aiming to synthesize realistic frontal images with illumination preservation. For frontal image synthesis, it presents a module responsible for reducing pose discrepancy at the facial features level, thus preserving more details of profile images. The FFWM model uses pairs of face images for the training phase: one with a non-frontal pose and another with a frontal pose of the same person in the same conditions. Differently, the FNM model uses non-pair face images, where the images are not of the same person.

#### *4.3. Face Recognition*

This last module aims to identify the person present in an input face image, following the flowchart presented in Figure 5. For this purpose, it is necessary to perform two tasks: feature extraction and classification.

**Figure 5.** Schematic of the Face Recognition Module.

The extraction of representative features from a facial image is performed through a version of Light CNN [30] with 29 convolutional layers (Light CNN-29). To use this network for feature extraction in spectra other than VIS, transfer learning is used. According to [31], several models for biometric recognition are based on transfer learning when the databases are limited. Thus, one should use the Light CNN-29 model with the weights obtained by training on the VIS databases and fine tune with the facial image databases in spectra other than the VIS. At the end of the feature extraction phase, *B* vectors of 256 dimensions are generated, with *B* being the number of spectral bands in which the facial image was acquired.

The classification process applied by the one-shot learning technique determines the degree of similarity of the feature set extracted from the input image with the feature sets extracted from the images of each class present in the support set, which consists of one example per class. The similarity functions to be used are the Euclidean distance and the cosine similarity. After obtaining the similarity values for each identity in the different spectral bands, a fusion of the obtained scores is performed, inspired by [27]:

$$S_{ic} = \sum_{b=1}^{B} S_{ib} W_b \tag{1}$$

where *Sic* is the combined score for each identity *i* and *Sib* is the score obtained for each band *b* for each identity *i*. *Wb* is the weight of each spectral band. The weights associated with each band are fixed, determined by the accuracy obtained when classifying with only that band. In this way, the band that usually obtains the most reliable similarity scores to classify will have a greater weight in the fusion of scores. The prediction is then made by choosing the identity *i* of the support set that has the highest combined similarity score:

$$\text{prediction} = \underset{i \in \{1, \dots, N\}}{\arg\max}\; S_{ic} \tag{2}$$

where *N* is the number of identities in the support set.
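The one-shot classification with score fusion described above can be illustrated with a short sketch. It assumes cosine similarity as the per-band score; the 4-dimensional toy feature vectors (instead of 256 dimensions), the three identities and the band weights are made-up values for illustration, not results from this work.

```python
import numpy as np

def cosine_similarity(query, support):
    """Cosine similarity between one query vector and each row of support."""
    query = query / np.linalg.norm(query)
    support = support / np.linalg.norm(support, axis=1, keepdims=True)
    return support @ query

def fuse_and_predict(queries, supports, weights):
    """Weighted score fusion followed by selection of the best identity.

    queries:  dict band -> (D,) feature vector of the input face
    supports: dict band -> (N, D) matrix, one feature vector per identity
    weights:  dict band -> fixed band weight W_b
    """
    fused = sum(weights[b] * cosine_similarity(queries[b], supports[b])
                for b in weights)             # combined score per identity
    return int(np.argmax(fused)), fused       # identity with the highest score

# Toy support set: 3 identities, one example per identity and band
supports = {band: np.eye(3, 4) for band in ("VIS", "NIR", "LWIR")}
# Toy query of identity 1; the LWIR features are misleading on purpose
queries = {
    "VIS": np.array([0.1, 1.0, 0.1, 0.0]),
    "NIR": np.array([0.0, 1.0, 0.2, 0.0]),
    "LWIR": np.array([0.0, 0.2, 1.0, 0.0]),
}
weights = {"VIS": 1.0, "NIR": 1.0, "LWIR": 0.6}  # hypothetical weights

predicted, fused_scores = fuse_and_predict(queries, supports, weights)
print(predicted, fused_scores)
```

In this toy case the LWIR band alone would favor the wrong identity, but the fused score still selects identity 1 because the two other bands agree. If the Euclidean distance were used instead, the scores would first have to be converted so that larger values mean greater similarity (for example by negating them).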

#### **5. Results and Discussion**

#### *5.1. Databases*

We performed both qualitative and quantitative evaluations of the proposed methods, using facial images in the VIS, NIR and LWIR bands. Two multi-spectral databases were used for quantitative evaluation: TUFTS [9] and CASIA NIR-VIS 2.0 [19]. The TUFTS database has facial images in the VIS, NIR and LWIR bands of 113 people with different poses and different illumination conditions. The TUFTS database has different subsets, divided into TUFTS-Pose (facial images with nine different poses per individual, in VIS, NIR and LWIR) and TUFTS-Exp (four facial images with different expressions and one with sunglasses per individual, in VIS and LWIR) to study pose variation and expression variation separately. CASIA NIR-VIS 2.0 comprises 17,489 facial images of 715 people in VIS and NIR spectral bands under different light conditions.

#### *5.2. Metrics*

The metrics used are Rank-1, Rank-5 and TAR@FAR = 0.001. For a generic Rank-n metric, given an image of a face as input, the classifier returns the n most probable identities, and the prediction counts as correct if the true identity is among them. TAR (true accept rate) is defined as the percentage of faces that, compared to the corresponding gallery identity, are identified as matches, while FAR (false accept rate) is the percentage of incorrect identities to which a face is matched. TAR@FAR = 0.001 is therefore the TAR measured at the decision threshold for which the FAR equals 0.001.
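As an illustration of these definitions, the sketch below computes Rank-n and TAR at a fixed FAR from a matrix of similarity scores. It shows one common way of computing the metrics, assuming one gallery entry per identity; the function names and the toy score matrix are hypothetical and do not come from the experiments.

```python
import numpy as np

def rank_n_accuracy(scores, true_ids, n):
    """scores: (Q, N) similarity of each of Q probes to N gallery identities."""
    top_n = np.argsort(-np.asarray(scores), axis=1)[:, :n]   # n best identities
    hits = (top_n == np.asarray(true_ids)[:, None]).any(axis=1)
    return hits.mean()

def tar_at_far(scores, true_ids, far=0.001):
    """TAR at the strictest threshold that lets at most `far` of impostor pairs pass."""
    scores = np.asarray(scores, dtype=float)
    genuine_mask = np.zeros(scores.shape, dtype=bool)
    genuine_mask[np.arange(len(true_ids)), true_ids] = True
    genuine, impostor = scores[genuine_mask], scores[~genuine_mask]

    allowed = int(np.floor(far * impostor.size))    # impostor pairs allowed to pass
    threshold = np.sort(impostor)[::-1][allowed]    # accept only scores above this
    return (genuine > threshold).mean()

# Toy example: 3 probes, 3 gallery identities
scores = np.array([[0.9, 0.1, 0.2],
                   [0.3, 0.4, 0.8],
                   [0.2, 0.1, 0.7]])
true_ids = [0, 1, 2]
print(rank_n_accuracy(scores, true_ids, 1))
print(rank_n_accuracy(scores, true_ids, 2))
print(tar_at_far(scores, true_ids, far=0.001))
```

In the toy matrix, the second probe is ranked second for its true identity, so it counts as a miss for Rank-1 and as a hit for Rank-2.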

#### *5.3. Face Detection and Alignment*

#### 5.3.1. Face Detection

Regarding the qualitative results presented in Figure 6, all algorithms produced similar results in the VIS band. This was expected since they were all trained in databases of the spectral band of the VIS. In the LWIR spectral band, a failure of the OpenCV network was observed in the second facial pose, where it cannot detect any face. In addition, when OpenCV and S3FD detect the faces, there is a variation in the rectangle area compared to the VIS spectral band. The DSFD maintained the same results, which is a good indicator of its ability to extract characteristics even in the LWIR spectral band.

**Figure 6.** Results obtained by facial detection methods in the spectral bands of VIS (**a**–**e**), NIR (**f**–**j**) and LWIR (**k**–**o**). S3FD—red, DSFD—blue, OpenCV—green.

The quantitative results are presented in Table 1. It can be observed that the OpenCV network results are lower than the others, especially in the infrared bands. Comparing the S3FD network with the DSFD, very similar results are observed in the VIS and NIR spectral bands. However, the DSFD results in LWIR are about 8 percentage points better. We observe that the DSFD maintains a very high accuracy for the different spectral bands, thus being the best network for face detection in a multispectral facial analysis system.


**Table 1.** Accuracy of the different face detection algorithms in the TUFTS database.

#### 5.3.2. Landmark Detection and Facial Alignment

The results for face landmark detection are shown in Figures 7 and 8. For the more challenging poses, we can see that the DLIB network fails, even in the VIS band (right eye, in Figure 7c), as it tends to maintain the shape of a near-frontal face. One possible cause of this behavior is that the face landmark detection model was trained in a dataset without significant variations at the pose level. The DLIB network reveals even more difficulties in the spectral band of LWIR.

2D-FAN reveals a good extraction of landmarks in any of the poses, including the LWIR band, where the results are somewhat like those obtained in the VIS band (Figure 8). In the case of Figure 8n, although it looks like there was a total failure, it is possible to observe that the eyes are correctly identified. 2D-FAN, unlike DLIB, was trained on a database with pronounced pose variations (including profile images), which is the justification for achieving better results.

Given the previous considerations, we decided to use the 2D-FAN over the DLIB's network due to two factors: (i) it shows better results with face pose variation, and (ii) it is the only one capable of producing positive results in the LWIR spectral band. After the face detection with DSFD and face landmark detection with 2D-FAN, the align, crop and resize phase took place, which aligned the imaginary eye line of all detected faces with the horizontal, centered the faces in the images, cropped them and resized them to the same size, producing the images presented in Figure 9. The alignment effect is strongly noticeable on the rightmost facial image. This normalization of the facial images can help a multispectral face recognition system in an uncontrolled environment, where faces can be presented in several poses.

**Figure 7.** Results achieved by DLIB in the spectral bands of VIS (**a**–**e**), NIR (**f**–**j**) and LWIR (**k**–**o**). Yellow—jawline, green—eyes and mouth, purple—nose, blue—eyebrows.

**Figure 8.** Results achieved by 2D-FAN in the spectral bands of VIS (**a**–**e**), NIR (**f**–**j**) and LWIR (**k**–**o**). Yellow—jawline, green—eyes and mouth, purple—nose, blue—eyebrows.

**Figure 9.** Results achieved by the proposed facial detection and alignment module in the different spectral bands. The images on the top are the originals in the VIS, before processing. The remaining images correspond to facial detection and alignment in the spectral bands of VIS (**a**–**e**), NIR (**f**–**j**) and LWIR (**k**–**o**).

#### *5.4. Image Synthesis*

For all images used in the qualitative and quantitative evaluations, the images were previously processed to be properly centered, aligned and scaled. The FFWM model needs to receive the facial images with certain facial landmarks always in the same coordinates. Therefore, the face detection and alignment module provided by the authors of FFWM was used to obtain the results. The images used by the FNM model were processed by the face detection and alignment module developed by the authors of this work. The rightmost images used in the previous tasks were replaced by ones with a strong expression, to evaluate the capacity of the models to normalize expressions.

#### 5.4.1. Selecting the Best Model

In Figure 10, the results obtained by the FFWM are shown. One of the images of the dataset could not be detected by the module provided by the authors of FFWM (see Figure 10n). It is possible to see that the performance of FFWM has a sharp drop as it moves away from the VIS band. Analyzing only the spectral band of the VIS and the images with pose variation (Figure 10b,c), a suitable normalization of the pose in Figure 10c is present.

**Figure 10.** Results achieved by the FFWM in the different spectral bands. The images on the top are the originals in the VIS. The images (**a**–**e**), (**f**–**j**), and (**k**–**o**) were generated by the proposed methodology when it receives as input the images from the VIS, NIR and LWIR bands, respectively.

In Figure 10b, the FFWM produces a deformed face when the person looks upwards. The exclusive use of the Multi-PIE database [32] in training the FFWM means that it can only normalize the face where the pose varies along the horizontal plane.

The FNM presents more satisfactory results (see Figure 11) in the NIR spectral band, where the facial images are more realistic than those of the FFWM. It should be noted that with the FNM model, identities change, i.e., the person in the output face image appears to be different from the person in the input face image. However, the use of a face feature extractor by the FNM model allows the most relevant features in the output face image to be kept. It is also relevant to point out that the FNM normalizes pose and expression, eliminates face masks, as is the case of the surgical mask, and normalizes to the VIS spectral band. However, this normalization does not produce realistic results with the LWIR images due to the difference between the LWIR and VIS spectral bands.

**Figure 11.** Results achieved by the FNM in the different spectral bands. The images on the top are the originals in the VIS. The images (**a**–**e**), (**f**–**j**), and (**k**–**o**) were generated by the proposed methodology when it receives as input the images from the VIS, NIR and LWIR bands, respectively.

Given the previous considerations, we decided to use the FNM instead of the FFWM due to two factors: (i) the FFWM requires a specific face detection and alignment module and that the face is perpendicular to the horizontal, while the FNM is more robust to pose variations in the input image; (ii) all images normalized by the FNM tend to maintain the face proportions, without deforming them, in the NIR and VIS spectral bands.

#### 5.4.2. Evaluation of Selected Model

Identification with and without the use of FNM was performed to verify its advantage. For this purpose, the Light CNN-29 was used for feature extraction, and the identification was performed based on the score obtained by cosine similarity.

The results presented in Table 2 show that, without using the FNM, the use of the NIR spectral band produces better results than the VIS band in all metrics analyzed. A possible explanation is that the images obtained in the NIR band are not so affected by the illumination variation (due to pose variation), thus not causing as many occlusions as in the VIS band. The results improve with the use of the FNM in the VIS and NIR spectral bands, with increases in performance in Rank-1 of 15.9% and 0.7%, respectively. In the remaining metrics, better values are also observed with the use of the normalization model. This shows that the apparent identity change in the qualitative tests (see Figure 11) does not have a negative impact. The results in the LWIR spectral band indicate that using the FNM does not improve the performance in any of the metrics.

**Table 2.** Results (in %) with and without FNM on the TUFTS-Pose database, using the Light CNN-29 and cosine similarity score.


Due to FNM's ability to normalize facial expression, tests were performed with TUFTS-Exp to verify whether normalization of expression allowed Light CNN-29 to extract more representative facial features. The results presented in Table 3 show that the sets of features extracted by Light CNN-29 without facial expression normalization are already representative enough, obtaining a Rank-1 of 99.6% in the VIS and 67.5% in the LWIR and a TAR@FAR = 0.001 of 99.4% in the VIS band and 57.0% in the LWIR band. The use of FNM impairs the feature extraction and consequently the results, especially in the LWIR spectral band, where FNM has more difficulties in generating realistic images. Analyzing the results obtained, the FNM model is used only to normalize facial images from the TUFTS-Pose database in the VIS and NIR spectral bands.

**Table 3.** Results (in %) with and without FNM on the TUFTS-Exp database, using the Light CNN-29 and cosine similarity score.


Table 4 presents the results obtained for Rank-1 with the variation of the quantized pose. The values achieved in the VIS band show a significant improvement in the Rank-1 metric with the use of the FNM, resulting in an increase from 77.5% to 97.7% with pose variations of 45° and from 43.3% to 87.4% with pose variations of 60°. In the NIR, there is only an improvement when the pose variation is 60°, where the results go from 93.4% to 96.5%. The results obtained demonstrate the ability of the FNM network to normalize pose, where a higher pose variation results in a higher benefit of using it.

**Table 4.** Results (in %) of rank-1 with and without FNM on TUFTS-Pose database with quantification of pose variation, using the Light CNN-29 and cosine similarity score.


#### *5.5. Face Recognition*

#### 5.5.1. Network Training

For the training phase, and considering the results presented above, we decided to fine-tune only the feature extraction network of the LWIR band, because the results obtained in this band are considerably lower, as the network was trained in the VIS band. Thus, the fine-tuning aims for the network to learn to extract more representative features from facial images in the LWIR spectral band. In order to train the Light CNN-29 with identities (people) different from the test ones, a final fully connected layer was added for training purposes and the LWIR spectral band images from the IRIS database [33] were used. The output of this last layer is used as the input of the softmax cost function, and its size is simply set to the number of training set identities, as proposed by [30].

The optimization algorithms SGD and SGD with Nesterov were used, along with the Cross-Entropy loss function. Table 5 summarizes the parameters used during the training phase.


**Table 5.** Parameters used in the training procedure.
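To make the training setup more concrete, the sketch below trains only a final fully connected layer on top of fixed 256-dimensional features, using the cross-entropy loss and SGD with or without Nesterov momentum. It is a simplified NumPy illustration under stated assumptions: the random toy features, the five identities, the learning rate, the momentum value and the number of steps are all hypothetical and are not the parameters of Table 5. The momentum update follows one common formulation (velocity accumulation, with the Nesterov variant adding a look-ahead term).

```python
import numpy as np

def softmax_cross_entropy(logits, labels):
    """Mean cross-entropy loss and its gradient with respect to the logits."""
    shifted = logits - logits.max(axis=1, keepdims=True)   # numerical stability
    probs = np.exp(shifted)
    probs /= probs.sum(axis=1, keepdims=True)
    n = len(labels)
    loss = -np.log(probs[np.arange(n), labels]).mean()
    grad = probs.copy()
    grad[np.arange(n), labels] -= 1.0
    return loss, grad / n

def train_head(features, labels, num_ids, lr=0.5, momentum=0.9,
               nesterov=False, steps=100, seed=0):
    """Train a linear classification layer whose size equals the number of identities."""
    rng = np.random.default_rng(seed)
    weights = 0.01 * rng.standard_normal((features.shape[1], num_ids))
    velocity = np.zeros_like(weights)
    losses = []
    for _ in range(steps):
        loss, grad_logits = softmax_cross_entropy(features @ weights, labels)
        grad_w = features.T @ grad_logits
        velocity = momentum * velocity + grad_w
        # Nesterov adds a look-ahead term to the plain momentum step
        step = grad_w + momentum * velocity if nesterov else velocity
        weights -= lr * step
        losses.append(loss)
    return weights, losses

# Toy data standing in for 256-dimensional features of 5 training identities
rng = np.random.default_rng(42)
features = rng.standard_normal((40, 256)) / 16.0
labels = np.arange(40) % 5

_, losses_sgd = train_head(features, labels, num_ids=5, nesterov=False)
_, losses_nag = train_head(features, labels, num_ids=5, nesterov=True)
print(losses_sgd[0], losses_sgd[-1], losses_nag[-1])
```

In the actual system the whole Light CNN-29 would be updated during fine-tuning and the added layer discarded afterwards, so that only the 256-dimensional features are used for the similarity functions.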

The objective of the training is that Light CNN-29 learns to extract representative features from facial images and not only to classify them. In this way, Light CNN-29 can be applied to other databases to extract features from facial images to be used as input for similarity functions. Thus, all the following processes make use of the 256-dimensional feature set obtained by Light CNN-29. Table 6 shows the results achieved by the original model and the models trained on the LWIR spectral band, using as similarity function the cosine similarity.

**Table 6.** Rank-1 results (in %) achieved by different models for extraction of LWIR band features.


With the results achieved, it is seen that the fine-tuning allowed the network to learn to extract more representative features of facial images of the LWIR spectral band. It is also noticeable that the model that achieved the best results was the SGD without Nesterov, which was chosen for the remaining experiments.

#### 5.5.2. Similarity Functions and Score-Level Fusion

At this stage, we have three Light CNN-29 models, each responsible for extracting features from a specific band. Only the Light CNN-29 responsible for the extraction of features from the LWIR spectral band underwent a fine-tuning. To proceed with classification, it was necessary to find the similarity function that best fits the face recognition task.

Table 7 presents the results achieved with the cosine similarity and Euclidean distance similarity functions. The results show that the cosine similarity function is the one that obtains the best score, which is in agreement with [34,35].


**Table 7.** Rank-1 results (in %) achieved in the face recognition task with the cosine similarity (CSim) and Euclidean Distance (EDis).

It is now possible to use the scores obtained by each spectral band to proceed to the final classification. A fusion of the achieved scores was performed using Equation (1). Two studies were conducted, with different weights for each band (*Wb* of Equation (1)), as shown in Tables 8 and 9.


**Table 8.** *Wb* values to be used for each spectral band in the different studies.

**Table 9.** Results (in %) obtained in the face recognition task, in the TUFTS-Pose database.


In study 1, the previously obtained test results are not considered; thus, the same weight is used in all spectral bands. The final score is a simple arithmetic mean of the scores of the individual bands, which assumes that all spectral bands have the same classification capacity.

The *Wb* values in study 2 are derived from the mean Rank-1 accuracy of each of the spectral bands in the tests performed on the TUFTS-Pose, TUFTS-Exp and CASIA NIR-VIS 2.0 databases (results obtained with the cosine similarity function in Table 7), rounded to tenths. Thus, in study 2, the final score was obtained as a weighted arithmetic mean, where each band has a different weight reflecting its classification accuracy.
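The derivation of the band weights can be sketched as follows. The Rank-1 values in the example are placeholders chosen for illustration only, not the values of Table 7, and expressing the weights as fractions between 0 and 1 before rounding is an assumption.

```python
def band_weights(rank1_by_band):
    """Mean Rank-1 accuracy per band (given in %), as a fraction rounded to tenths.

    rank1_by_band: dict band -> list of Rank-1 accuracies over the databases
    in which that band is available.
    """
    return {band: round(sum(values) / len(values) / 100.0, 1)
            for band, values in rank1_by_band.items()}

def weighted_mean_score(scores_by_band, weights):
    """Weighted arithmetic mean of the per-band scores of one identity."""
    total = sum(weights[band] * score for band, score in scores_by_band.items())
    return total / sum(weights[band] for band in scores_by_band)

# Placeholder accuracies (NOT taken from the experiments)
rank1 = {"VIS": [95.0, 99.0], "NIR": [97.0, 99.0], "LWIR": [55.0, 65.0]}
weights = band_weights(rank1)
print(weights)
print(weighted_mean_score({"VIS": 0.9, "NIR": 0.8, "LWIR": 0.2}, weights))
```

Dividing by the sum of the weights turns the weighted sum of Equation (1) into a weighted mean; since the divisor is the same for every identity, it does not change which identity obtains the highest combined score.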

Tables 9–11 show our final face recognition results using both the individual bands and the combination of bands with the two different weight sets (Study 1 and Study 2).




Table 9 presents the results obtained with the TUFTS-Pose database. These results show that study 2 achieved better results than study 1, in the Rank-1 and Rank-3 metrics by 0.1 percentage points, and the TAR@FAR = 0.001 metric by 3 percentage points. The superiority of the results obtained by study 2 compared to study 1 shows that the weight assigned to the LWIR spectral band should be lower than the weight assigned to the others because the characteristics obtained in the LWIR spectral band are the least representative of the identity.

Analyzing the results of the different spectral bands separately, the NIR spectral band achieved the best results due to its robustness towards the variation of illumination present in the TUFTS-Pose database. Despite the promising results of the NIR band when used solo, study 2 obtained superior results in all metrics, with particular emphasis on Rank-1 (from 99.0% to 99.5%) and TAR@FAR = 0.001 (from 93.1% to 93.5%). It is relevant to point out that only the results obtained with score fusion reached the 100% accuracy rate in the assessed Ranks (Rank-4 for study 1 and Rank-3 for study 2).
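For readers unfamiliar with the Rank-k metric used in these tables, the sketch below shows one common way to compute it from a matrix of similarity scores. The function name and the toy scores are hypothetical; the evaluation code used to produce the reported results is not reproduced here.

```python
def rank_k_accuracy(score_rows, true_ids, k):
    """Fraction of probes whose true identity is among the k best-scoring gallery identities."""
    hits = 0
    for scores, true_id in zip(score_rows, true_ids):
        ranked = sorted(range(len(scores)), key=lambda j: scores[j], reverse=True)
        if true_id in ranked[:k]:
            hits += 1
    return hits / len(true_ids)


# Hypothetical scores: one row per probe, one column per gallery identity.
score_rows = [
    [0.9, 0.2, 0.1],  # true identity 0 is ranked first
    [0.3, 0.5, 0.4],  # true identity 2 is ranked second
    [0.1, 0.8, 0.3],  # true identity 1 is ranked first
]
true_ids = [0, 2, 1]

rank1 = rank_k_accuracy(score_rows, true_ids, k=1)
rank2 = rank_k_accuracy(score_rows, true_ids, k=2)
print(rank1, rank2)
```

By construction, Rank-k accuracy can only stay equal or increase as k grows, which is why the tables report the first rank at which 100% is reached.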

Table 10 shows the results obtained with the TUFTS-Exp database. An analysis of the results allows us to see that the face recognition results obtained are better with score fusion, where both studies obtained the same result as the VIS spectral band in Rank-1 (99.6%) but managed to achieve a higher result in Rank-2 (100% against 99.6% of the VIS spectral band). However, the best result for TAR@FAR = 0.001 is obtained using only the VIS spectral band, with 99.4%, while the second-best result was obtained in study 2, with 99.3%.

The results achieved using the CASIA NIR-VIS 2.0 database (Table 11) show that study 1 reached a value of 100% in Rank-1. Using the VIS and NIR spectral bands separately, the results were 99.9% and 99.6%, respectively, using the same metric. It should be noted that study 2 was not performed for the CASIA NIR-VIS 2.0 database, as the difference between study 1 and study 2 is the weight assigned to the LWIR spectral band, which this database does not include. In the TAR@FAR = 0.001 metric, study 1 matches the result for the VIS spectral band with 100%.

A global analysis of all results shows that score fusion mainly favors cases where the results obtained by the individual spectral bands were less satisfactory. In the results obtained with the TUFTS-Exp and CASIA NIR-VIS 2.0 databases (Tables 10 and 11), the VIS spectral band already obtains very high values in all metrics, which makes score fusion less effective. However, despite a decrease in TAR@FAR = 0.001 in Table 10, the results obtained by score fusion were, in general, higher than those obtained by the spectral bands separately. The results thus demonstrate the benefit of using multi-spectral images in a face recognition system.

#### **6. Conclusions**

In this paper, a multi-spectral face recognition system in an uncontrolled environment has been proposed, aiming to make a decision with the largest amount of data available, i.e., using the facial images obtained by the different spectral bands. The system is composed of three modules: (i) face detection and alignment, (ii) image synthesis and (iii) face recognition.

The state of the art regarding face recognition systems in an uncontrolled environment has led to the conclusion that image synthesis methods, mainly with GANs, have been used to combat intrapersonal variations, such as the difference in pose and facial expression. On the other hand, in the area of multispectral face recognition, with a plurality of solutions presented by the use of multispectral images, fusion methods are those that make the most use of images captured in different spectral bands in order to make a decision. The main problem encountered is the limited number of images (and people) in multispectral databases in an uncontrolled environment, which makes it challenging to train convolutional neural networks, which are the most used method for feature extraction.

Several techniques were implemented to validate them in different multi-spectral bands, since all of them were trained on visible databases, as well as to analyze the influence of facial image features (pose, illumination and expression). This analysis aimed to select the most appropriate technique for each module of the proposed face recognition system.

For the face detection task, three networks were evaluated qualitatively and quantitatively, which allowed us to conclude that the DSFD network was the most appropriate since it maintained a high accuracy in the different spectral bands. For the landmark detection task, three networks were evaluated qualitatively, and it was concluded that the 2D-FAN network was the best fit due to its ability to correctly identify facial landmarks in different spectral bands with a diversity of facial poses. Such evaluations allowed us to select the methods that are best suited for these tasks with multispectral images in an uncontrolled environment. Thus, this work presents an efficient face detection and face alignment module for a multispectral face recognition system in an uncontrolled environment.

The present work also evaluated different face normalization methods, based on image synthesis, to produce face images with a frontal pose. The FFWM and FNM models were analyzed, and the FNM model produced the most realistic facial images for the visible and NIR spectral bands, maintaining the proportions of the face and the most relevant facial features. Further analysis of the FNM model allowed us to conclude that: (i) the greater the pose variation, the greater the advantage in using the FNM model and (ii) the NIR images allow us to obtain a better identification/verification than the visible images because pose variation can entail variations in illumination, to which the NIR band is resistant.

The analysis of the performance of the different models allowed the selection of the most suitable one for a multispectral face recognition system in an uncontrolled environment, as well as the identification of the most advantageous situations for its use.

The extraction of the feature sets of the facial images from the different spectral bands is performed using Light CNN-29 [30], with a fine adjustment to the network weights for the LWIR spectral band since it was trained on the visible spectral band. For the classification phase, identification is performed in the different spectral bands, each producing different scores for each identity. These scores are computed by the similarity between the feature sets of each identity and the feature set of the input facial image. In this work, two different studies were performed for score fusion, which allowed us to conclude that: (i) simply using the different spectral bands to identify is advantageous (study 1) and (ii) a weighted average is beneficial when the different classifiers (of each spectral band) have different levels of reliability (study 2).

On the multi-spectral TUFTS database, with pose variation and expression variation, the results obtained in Rank-1 by the proposed system and with score fusion with a weighted average (study 2) were 99.5% and 99.6%, with the best results obtained using only one spectral band being 99.0% and 99.6%. On the TAR@FAR = 0.001 metric, the results obtained by weighted average are 93.5% and 99.3%, while with only one spectral band 93.1% and 99.4% were obtained. In the CASIA NIR-VIS 2.0 database, score fusion achieved the results of 100.0% in the Rank-1 and TAR@FAR = 0.001 metrics, where without score fusion, 99.9% and 100.0% in Rank-1 and TAR@FAR = 0.001, respectively, are obtained as the best result.

The original contributions of this work include the analysis of several techniques for different tasks, which allowed: (i) the presentation of an efficient face detection and alignment module to be used by any multi-spectral face analysis system, (ii) the identification of the situations in which the FNM model should be used to normalize facial images and (iii) the selection of a similarity function and the weights to be used in the fusion of scores to identify/verify identities. From the experimental results, it is also concluded that the proposed system allows us to obtain high results in multi-spectral face recognition in an uncontrolled environment, where the use of the scores obtained from different spectral bands allows us, in general, to achieve results that are superior to using only the scores obtained by one spectral band.

Following the work described in this paper, the authors suggest several relevant directions for future work. The first suggestion consists of the creation of a multispectral database to overcome the limitations of the public multispectral databases that currently exist. The second suggestion is to create a prototype and put it to work for access control in high security areas. The third suggestion consists of the adaptation of the image input so that it can process images obtained by drones with cameras in the visible, NIR, SWIR and LWIR spectra, with the objective of processing images in real time.

**Author Contributions:** J.S.S. and A.B. proposed the idea and concept; P.M. developed the software under the supervision of J.S.S. and A.B.; all authors revised and edited the manuscript. All authors have read and agreed to the published version of the manuscript.

**Funding:** This research was supported in part by the Military Academy Research Center (CINAMIL) under the project Multi-Spectral Facial Recognition, and by FCT with the projects UID/FIS/04559/2019, HAVATAR (PTDC/EEI-ROB/1155/2020) and LARSyS (UIDB/50009/2020).

**Institutional Review Board Statement:** Not applicable.

**Informed Consent Statement:** The Portuguese Military Academy (AM) database is a private database; the reproduction of the images present in this database without the explicit authorization of the authors is not allowed. During the image acquisition, a declaration of consent was presented to all participants for the use of the multispectral facial images in scientific works. As such, all persons present in the AM database gave informed consent for publication of identifying images in an online open-access publication. The image acquisition at the Military Academy was authorized by the Major-General (OF-7) Commander of the Military Academy according to the Information No. CINAMIL-2020-000224, Proc. 00.020.0181, of 28 February 2020. All methods were carried out following relevant guidelines and regulations and authorized by the Military Academy Commander.

**Data Availability Statement:** Not applicable.

**Conflicts of Interest:** The authors declare no conflict of interest.

#### **References**


## *Article* **A Robust Facial Expression Recognition Algorithm Based on Multi-Rate Feature Fusion Scheme**

**Seo-Jeon Park <sup>1</sup>, Byung-Gyu Kim <sup>1,</sup>\* and Naveen Chilamkurti <sup>2</sup>**


**Abstract:** In recent years, capturing human emotions has become increasingly important as the field of artificial intelligence (AI) develops. Facial expression recognition (FER) is a part of understanding human emotion through facial expressions. We proposed a robust multi-depth network that can efficiently classify facial expressions by feeding it various reinforced features. We designed the inputs for the multi-depth network as minimum overlapped frames so as to provide more spatio-temporal information to the network. To exploit the multi-depth structure, a multirate-based 3D convolutional neural network (CNN) built on a multirate signal processing scheme was suggested. In addition, we normalized the input images adaptively based on the intensity of the given image and reinforced the output features from all depth networks with a self-attention module. Then, we concatenated the reinforced features and classified the expression with a joint fusion classifier. On the CK+ database, the proposed scheme showed a comparable accuracy of 96.23%. On the MMI and GEMEP-FERA databases, it outperformed other state-of-the-art models with accuracies of 96.69% and 99.79%, respectively. On the AFEW database, which is known for its very wild environment, the proposed algorithm achieved an accuracy of 31.02%.

**Keywords:** deep learning; facial expression recognition (FER); 3D convolutional neural network (3D CNN); multirate signal processing; minimum overlapped frame structure; self-attention; multi-depth network

#### **1. Introduction**

Communication skills have been developed based on the senses that play an important role in human interaction. There are five human senses: sight, hearing, touch, taste, and smell. There is no doubt that sight is the most important of the five senses for most people, since up to 80% of all sensory impressions are perceived through sight [1].

In recent years, the importance of human–computer interaction (HCI) has grown as the artificial intelligence (AI) field develops. The basic goal of the HCI field is to improve the interaction between humans and computer systems by making computers more useful and accessible to humans. Additionally, the ultimate goal of AI technology is to allow the machine to catch the user's intentions or emotions by itself, thereby reducing the burden on the user and making the interaction more enjoyable. Therefore, understanding human feelings and actions becomes important in various human-centric services. The technology that does this based on the human face is called facial expression recognition (FER) technology.

FER technology has many applications in customer service [2], the automotive industry [3], entertainment [4], and home appliances [5]. Good examples include: games with different modes based on classifications of the user's facial expression [6], identifying the driver's drowsiness and instructing an appropriate response [7,8], automatically collecting vast amounts of data necessary for the study of human emotional behavior patterns [9], detecting the emotional state of the patient and predicting the situation in need of help [10,11], and establishing an adaptive learning guidance strategy by grasping a student's psychological state using facial expressions and words that are used [12–14]. In recent years, research interest has focused on the development of intelligent home appliances and software that respond to the user's emotional state.

**Citation:** Park, S.-J.; Kim, B.-G.; Chilamkurti, N. A Robust Facial Expression Recognition Algorithm Based on a Multi-Rate Feature Fusion Scheme. *Sensors* **2021**, *21*, 6954. https://doi.org/10.3390/s21216954

Academic Editors: Mincheol Whang and Sung Park

Received: 27 July 2021 Accepted: 14 October 2021 Published: 20 October 2021

**Publisher's Note:** MDPI stays neutral with regard to jurisdictional claims in published maps and institutional affiliations.

**Copyright:** © 2021 by the authors. Licensee MDPI, Basel, Switzerland. This article is an open access article distributed under the terms and conditions of the Creative Commons Attribution (CC BY) license (https://creativecommons.org/licenses/by/4.0/).

One of the main technologies for emotion recognition is recognizing a user's emotional state from the facial expression captured by an image sensor. Among the various fields of biometrics, the face is a very important element that is easily encountered in daily life. The emotional state that appears in a facial expression is sufficient to serve as a human interface when sharing opinions or conveying one's feelings while communicating with other people. Reflecting this importance, many studies related to FER have been conducted. In the field of psychology, facial analysis and recognition have been studied for many years.

According to a study by psychologists Ekman and Friesen, six human emotions (happiness, sadness, anger, surprise, disgust, and fear) are classified as basic emotions that are perceived in common regardless of culture [15,16]. Based on this, many studies have classified six emotions, or seven emotions with the addition of a neutral expression, to identify emotional states. In recent years, research targets have been expanding to include not only depression, pain, and sleepiness but also expressions representing mental states such as agreement, concentration, interest, thinking, and confusion. In addition, research is being conducted on the recognition of natural facial expressions, rather than only on ideal databases containing exaggerated expressions captured in limited environments. However, despite these efforts, FER technology is still at a level that can be applied only in limited circumstances.

An FER system consists of three steps. The first step is to detect a person's face: a face area is detected in the input image, together with face elements such as the eyes, nose, or mouth. Representative algorithms include Adaboost [17], Haar-cascade [18,19], and the histogram of oriented gradients (HOG) [20]. Second, facial features are extracted from the detected face using a geometric feature-based method or an appearance feature-based method. Finally, in the classification step, emotions are classified based on the extracted features.

Facial expression recognition is a field with a high dependency on datasets. Two types of external factors influence facial expression recognition. The first type is the uniqueness of each person, such as gender, race, and age. The second type is the environment, such as lighting, pose, resolution, and noise. However, many facial-expression datasets were created in controlled environments, so they reflect the second type of factor less than the first. To overcome this problem, the dataset must be rich enough to accommodate these factors. Therefore, we used data augmentation to supply more varied information. Another method is to create a cross dataset from multiple datasets, that is, to train and test by combining different datasets under the same conditions. This has the advantage that facial expressions can be generalized to more diverse environments.

Datasets used for FER are largely divided into two types. A static dataset consists of static images, and a dynamic dataset consists of dynamic images, that is, videos. In order to apply FER in practice, we need to use dynamic datasets, which correspond to what is found in real life. In general, accuracy on a dynamic dataset is lower than on a static dataset, because dynamic images have additional features such as facial movements over time. Therefore, temporal dynamics must be considered. A 2D convolutional neural network (2D CNN) [21] can identify only spatial features within an image. Therefore, a 2D CNN is limited in processing temporal motion when classifying facial expressions in dynamic images.

To solve this problem, a network dealing with the time axis is needed. Recurrent neural networks (RNN) are a type of artificial neural network that forms a circular structure in which hidden nodes are connected by directional edges. Data appearing sequentially can be usefully processed through an RNN [22]. However, if the distance between the relevant information and the point where the information is used is long, the gradient gradually decreases during back-propagation, so that the learning ability is greatly degraded. This is called the vanishing gradient problem. The long short-term memory (LSTM) [23] was devised to overcome this problem. LSTM is a structure in which a cell state is added to the hidden state of the RNN. Another method is the use of 3-dimensional convolutional neural networks (3D CNN) [24]. Unlike a conventional 2D CNN, a 3D CNN uses a 3D convolution kernel to extract features not only in the space domain but also in the time domain.

In [25,26], geometric features such as landmarks were used, and a reference facial expression, such as a neutral expression, was required while extracting features. However, in real-life FER, no reference is given, and it cannot be guaranteed that a face with a neutral expression will be available. Therefore, a model that can recognize facial expressions without a reference is needed to use FER in practice.

We proposed a new facial expression recognition model to solve these problems. First, a 3D CNN structure that can simultaneously extract spatial and temporal features was used to obtain more accurate facial expression recognition results. Second, we used multiple networks with different frame rates to extract various features. The frames used as inputs to each network should overlap as little as possible, so that more spatio-temporal information can be utilized. Third, we applied self-attention to the features extracted from each network to obtain more reinforced features. The facial expressions were classified by combining these features through a joint fusion classifier.

In order to make a facial expression recognition model, the most relevant contributions are as follows:


The rest of the article is organized as follows. The related works for facial expression recognition are introduced in Section 2. Section 3 introduces a detailed description of the facial expression recognition algorithm composed of five main steps. Section 4 provides several experimental results and the performance comparison results with the latest models. Finally, the concluding remarks of this article are given in Section 5.

#### **2. Related Works**

#### *2.1. The Facial Expression Recognition Methods*

#### 2.1.1. Classical Feature-Based Approaches

Features representing facial expressions are divided into permanent facial features (PFF), such as the eyes and nose, and transient facial features (TFF), such as wrinkles or protrusions that occur temporarily as facial muscles move [27]. In face recognition, the proportion of the PFF is large, but in facial expression recognition, the TFF play as important a role as the PFF. Representative methods of expressing these facial features in an image include the geometric feature-based method and the appearance feature-based method. Existing studies are analyzed below in terms of how they express facial features:

#### Geometric Feature-Based Method

Systems based on geometric features express changes in the shape and expression of a face by using the positions of various facial elements and the relationships between them. Since the positions and movements of the mechanical features of the face change according to the shape of the face and the facial expression, an intuitive expression recognition method can be built on dynamic information obtained by tracking these features in a video image. The difficulty of the geometric feature-based method is that, because each person has a different face shape, the location of a feature cannot be used as it is. To solve this problem, the facial parts are modeled with the active appearance model (AAM) [28] or action unit (AU) [29] according to facial expressions and, based on the information extracted from the image, they are tracked to obtain the relative distance between the parts.

The geometric feature has the advantage that it enables a system that requires less memory and can easily adapt to changes in the position, size, and orientation of the face, because the motion of a feature can be expressed simply with a few factors. On the other hand, because it is difficult to express the TFF that appear temporarily while an expression occurs, the method is limited in distinguishing expressions whose geometric features are similar but whose facial textures, such as wrinkles, differ.

#### Appearance Feature-Based Method

The facial expression recognition method based on appearance features can accommodate both permanent features such as the eyes and mouth according to facial expressions and temporary features such as wrinkles for the entire image or the regional image. The appearance feature-based method is divided into a holistic image-based approach and a local image-based approach according to the size of the image used for feature extraction.

**Holistic Feature-Based Method** The holistic feature-based method considers each pixel constituting a face image as one feature element and expresses the entire image as one feature vector. Therefore, when the number of pixels constituting the face image is large, the size of the feature vector becomes excessively large, and the amount of calculation increases accordingly. As a solution to this problem, the linear subspace method (LSM) was proposed. LSM [30] improved the overall processing speed and accuracy by expressing the feature vector composed of the pixels of the face image as a low-dimensional spatial vector through linear transformation. Representative LSMs include principal component analysis (PCA) [31], linear discriminant analysis (LDA) [32], and independent component analysis (ICA) [33].

This holistic feature-based method is simple because it targets the entire image without going through a separate feature extraction process, but it has the disadvantage that its performance is poor in a dynamic environment in which the face pose, lighting, and facial expressions change.

**Local Feature-Based Method** The local feature-based method constructs a feature vector representing the overall face shape by setting a local window in a region of the face image where changes can occur due to facial expressions and extracting features based on the brightness distribution within the window. In general, since lighting changes or changes in facial expression appear in only a part of the facial image, the local feature-based method sets a local window only in the area where changes in the face can occur. Therefore, it has the advantage of being relatively less sensitive to these changes than the holistic feature-based method. Representative methods based on local features include the Gabor filter [34], the Haar-like feature [18], and the local binary pattern (LBP) [35].
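As a concrete example of a feature based on the brightness distribution within a local window, the sketch below computes the basic 8-neighbour LBP code of a single 3 × 3 patch. The function name, the bit ordering (clockwise from the top-left pixel) and the `>=` comparison are assumptions of this sketch; conventions vary between implementations.

```python
def lbp_code(patch):
    """Basic 8-neighbour local binary pattern code of a 3x3 grey-level patch."""
    center = patch[1][1]
    # Neighbours visited clockwise, starting at the top-left pixel.
    offsets = [(0, 0), (0, 1), (0, 2), (1, 2), (2, 2), (2, 1), (2, 0), (1, 0)]
    code = 0
    for bit, (r, c) in enumerate(offsets):
        # A neighbour at least as bright as the centre sets its bit.
        if patch[r][c] >= center:
            code |= 1 << bit
    return code


# Hypothetical grey levels.
patch = [
    [6, 5, 2],
    [7, 6, 1],
    [9, 8, 7],
]
print(lbp_code(patch))
```

A full LBP descriptor would compute this code for every pixel of the window and summarize the codes in a histogram, which is omitted here for brevity.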

#### 2.1.2. Deep-Learning-Based Approaches

Most facial expression recognition algorithms in recent studies use deep learning-based methods. When AlexNet showed a performance improvement in the ImageNet challenge [36], many researchers began to apply the 2D CNN structure to various fields, and it was also applied to FER [37,38]. There have been many attempts to apply 2D CNNs to video frames. However, 2D CNNs have structural limitations because they cannot provide temporal information to the neural network.

Many studies use two architectures to overcome this problem. First, the 3D CNN was designed by transforming the structure of the 2D CNN [24]. A 3D CNN uses a 3D convolution operation, which has three-dimensional convolution filters. Therefore, the feature map generated by one filter is also three-dimensional, and the 3D CNN can learn temporal features of successive frames through the convolution filter. This structure enables spatio-temporal feature learning for short-term input frames. Second, a hybrid method that combines multiple networks was also used. A CNN-RNN or CNN-LSTM [39,40] structure is one example. It learns spatial features with a CNN and then learns temporal features with an RNN or LSTM.
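The following sketch illustrates why the feature map produced by a single 3D filter is itself three-dimensional. It implements a naive single-channel "valid" 3D convolution (strictly, a cross-correlation, as in most deep learning frameworks) in plain Python. The function name, the toy clip and the temporal-difference kernel are hypothetical and far smaller than anything used in practice.

```python
def conv3d_valid(volume, kernel):
    """Naive 'valid' 3D cross-correlation of a T x H x W volume with a t x h x w kernel."""
    T, H, W = len(volume), len(volume[0]), len(volume[0][0])
    t, h, w = len(kernel), len(kernel[0]), len(kernel[0][0])
    out = []
    for i in range(T - t + 1):
        plane = []
        for j in range(H - h + 1):
            row = []
            for k in range(W - w + 1):
                acc = 0.0
                for a in range(t):
                    for b in range(h):
                        for c in range(w):
                            acc += volume[i + a][j + b][k + c] * kernel[a][b][c]
                row.append(acc)
            plane.append(row)
        out.append(plane)
    return out


# Hypothetical clip: 4 frames of 3 x 3 pixels whose brightness rises by 1 per frame.
clip = [[[float(f)] * 3 for _ in range(3)] for f in range(4)]
# Temporal-difference kernel spanning 2 frames and 1 x 1 pixels.
kernel = [[[-1.0]], [[1.0]]]

features = conv3d_valid(clip, kernel)
print(len(features), len(features[0]), len(features[0][0]))  # time x height x width
```

Because the kernel spans two consecutive frames, every output value here encodes a change over time, which a 2D kernel applied to one frame at a time cannot capture.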

A hybrid method is also used to improve accuracy as well as to solve the problems of the 2D CNN. In [26,41], two networks were used to extract temporal appearance and geometric features from image sequences and facial landmark points. These two networks were combined with a new integration method so that the two models cooperate with each other and improve the performance. Based on these methods, a hybrid method that combines multiple-depth networks based on 3D CNN is proposed in this work.

#### *2.2. Multirate Filter Bank*

As described in [42], multi-rate filter banks produce multiple output signals by filtering and subsampling a single input signal or, conversely, generate a single output by up-sampling and interpolating multiple inputs. An analysis filter bank divides the signal into *M* filtered and sub-sampled versions. A synthesis filter bank generates a single signal from *M* up-sampled and interpolated signals. The proposed algorithm resembles a sub-band coder, which combines an analysis filter bank with a synthesis filter bank.

We divided the input video (dynamic image) into multiple outputs, which have different frame rates, and put them into networks, which have different network-depth models. By using this structure, we could construct various spatio-temporal features. These features were combined into one feature, and we classified it by a joint fusion classifier.
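The idea of splitting one frame sequence into several sub-sampled streams can be sketched as below. The function names, sub-sampling factors and offsets are hypothetical choices made for illustration, and no filtering is applied before sub-sampling; the actual frame selection used by the proposed scheme is described in Section 3 and is not reproduced here.

```python
def subsample(frames, factor, offset=0):
    """Keep every `factor`-th frame, starting at index `offset`."""
    return frames[offset::factor]


def polyphase_split(frames, m):
    """Split a sequence into m disjoint streams, each sub-sampled by a factor of m."""
    return [subsample(frames, m, offset=p) for p in range(m)]


frames = list(range(12))  # stand-in for 12 video frames (indices only)

# Streams with different frame rates (illustrative factors and offsets).
streams = {
    "half_rate": subsample(frames, 2, offset=0),
    "third_rate": subsample(frames, 3, offset=1),
    "quarter_rate": subsample(frames, 4, offset=3),
}
print(streams)

# Disjoint decomposition: every frame appears in exactly one of the m streams.
print(polyphase_split(frames, 3))
```

Choosing different offsets for the streams is one simple way to reduce the number of frames that several streams share.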

#### *2.3. Self-Attention*

Attention is a methodology that started from the perspective of "let the model learn even the parts that need to be learned intensively for better performance." It makes the network weight the features and uses the weighted features to help achieve the task. It is widely used in natural language processing (NLP), multivariate time series, and machine translation.

The attention mechanism was first devised for sequence learning [43]. It figures out which output sequence of the encoder is most associated with the particular output sequence of the decoder.

Attention itself is closely related to the transformer [44]. The transformer can be divided into self-supervision and self-attention. With self-supervision, it is possible to train a model on an unlabeled dataset and learn generalizable representations. Self-attention computes attention of the input over itself, and it assumes a minimal inductive bias, unlike models such as CNN and RNN.
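A minimal sketch of self-attention is given below, using the scaled dot-product form commonly associated with transformers and, as a simplifying assumption, no learned projections (queries, keys and values are all the input itself). The function names and the toy features are hypothetical, and the self-attention module of the proposed scheme may differ from this sketch.

```python
import math


def softmax(values):
    """Numerically stable softmax over a list of floats."""
    peak = max(values)
    exps = [math.exp(v - peak) for v in values]
    total = sum(exps)
    return [e / total for e in exps]


def self_attention(x):
    """Scaled dot-product self-attention with queries = keys = values = x."""
    d = len(x[0])
    outputs, weights = [], []
    for query in x:
        # Similarity of this element to every element of the same sequence.
        scores = [sum(q * k for q, k in zip(query, key)) / math.sqrt(d) for key in x]
        attn = softmax(scores)
        weights.append(attn)
        # Output is the attention-weighted sum of all elements.
        outputs.append([sum(a * value[j] for a, value in zip(attn, x)) for j in range(d)])
    return outputs, weights


features = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]  # three 2-D feature vectors
outputs, weights = self_attention(features)
print(weights[0])
```

Each output vector is a weighted combination of all input vectors, with the weights computed from the input itself rather than from a separate sequence.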

The self-attention method has been applied in computer vision tasks such as [45–49]. In [45], they inserted an attention block between convolutional layers to improve image feature creation performance. In [46], the attention was performed per channel through a dot product on the channel characteristic vector, and the authors used a channel and spatial attention block in [47]. Figure 1 shows some examples of the visual attention.

**Figure 1.** The examples of attending to the correct object (*white* indicates the attended regions, and *underlining* indicates the corresponding objects).

#### **3. Proposed Scheme**

This section introduces the proposed method in detail. Section 3.1 introduces the method of how we pre-processed the input images before feeding them into the networks. Additionally, we describe the data augmentation process in Section 3.2. Section 3.3 elaborates the network that was used to extract the feature maps. Section 3.4 goes into detail about how to reinforce the features and the joint fusion classifier, which classifies the facial expressions with the reinforced features.

Figure 2 shows the overall structure of the proposed algorithm based on multirate inputs and multi-depth networks to make a robust scheme.

**Figure 2.** The process of the proposed facial expression recognition scheme.

#### *3.1. Data Pre-Processing*

The conditions of each database, such as resolution, brightness, and pose, vary. To obtain a common condition, a data pre-processing step is required; Figure 3 shows the entire process for one input sequence.

**Figure 3.** The entire process of making input dataset with a sequence.

We augmented the pre-processed dataset to avoid overfitting. Each network then received these data as input at a fixed size, as a CNN requires. Through this process, unnecessary parts of the sequence were removed and important features were highlighted, so that the network could extract informative features efficiently.

#### 3.1.1. Image Pre-Processing

To bring the input to a common condition, we applied four steps. The flowchart of the image pre-processing algorithm is shown in Figure 4.

**Figure 4.** Architecture of data pre-processing algorithm.

#### 3.1.1.1. Face Detection

For FER, we first needed to detect the face area. We then cropped the detected face area so that the result would not be affected by unnecessary parts such as hair or accessories.

We used the FaceBoxes module [50] to detect the face region. It consists of the rapidly digested convolution layers (RDCL), the multiple scale convolution layers (MSCL), and the divide and conquer Head (DCH).

#### 3.1.1.2. Face Alignment

Using facial landmarks, we checked whether the face was frontal and straightened a tilted frontal face in order to fix the posture. We used the style-aggregated network (SAN) module [51] to extract the landmarks of the face. We rotated the face so that the *x* coordinates of the tip of the nose and of the center of the eyes were aligned vertically. The tip of the nose was the 34th landmark, and the center of the eyes was the average of the 37th to 46th landmarks (see Figure 5a).

After alignment, the face was judged to be frontal if the 34th landmark, which is the tip of the nose, lay between the 40th and 43rd landmarks, which are the points of the left and right eyes nearest to the nose. An example is shown in Figure 5b. After the face alignment process, we again cropped the face to the minimal area containing no empty data. Then, we resized the image to 128 × 128 to obtain a uniform resolution. This alignment process can be regarded as a kind of affine transformation based on two points, which has two properties: (1) the images of two parallel lines are also parallel, and (2) translations are isometries.
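The alignment rule and the frontal check described above can be sketched as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: the landmarks are assumed to be given as a 68 × 2 array of (x, y) image coordinates, the function names are hypothetical, and the rotation is applied to the landmark coordinates only (the same angle would be applied to the image itself).

```python
import math
import numpy as np

NOSE_TIP = 34            # 1-based landmark numbers, as used in the text
EYE_RANGE = (37, 46)     # landmarks averaged to obtain the center of the eyes
INNER_EYES = (40, 43)    # eye points nearest to the nose

def eye_center(pts):
    """Mean position of landmarks 37..46 (1-based, inclusive)."""
    first, last = EYE_RANGE
    return pts[first - 1:last].mean(axis=0)

def roll_angle_deg(landmarks):
    """Angle between the eye-center-to-nose line and the vertical image axis."""
    pts = np.asarray(landmarks, dtype=float)      # shape (68, 2), rows are (x, y)
    dx, dy = pts[NOSE_TIP - 1] - eye_center(pts)
    return math.degrees(math.atan2(dx, dy))       # 0 when the nose is straight below the eyes

def align(landmarks):
    """Rotate all landmarks about the eye center so that the roll angle becomes 0."""
    pts = np.asarray(landmarks, dtype=float)
    a = math.radians(roll_angle_deg(pts))
    rot = np.array([[math.cos(a), -math.sin(a)],
                    [math.sin(a),  math.cos(a)]])
    c = eye_center(pts)
    return (pts - c) @ rot.T + c

def is_frontal(landmarks):
    """True if the nose tip lies horizontally between landmarks 40 and 43."""
    pts = np.asarray(landmarks, dtype=float)
    lo, hi = sorted(pts[i - 1, 0] for i in INNER_EYES)
    return bool(lo <= pts[NOSE_TIP - 1, 0] <= hi)
```

Because a rotation about the eye center leaves that center unchanged, the roll angle of the aligned landmarks is zero up to floating-point error, and the frontal test then reduces to a comparison of *x* coordinates.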

**Figure 5.** Aligning the face: (**a**) 68 facial landmarks; (**b**) face alignment with the landmarks.

#### 3.1.1.3. Image Normalization

There are two ways to normalize an image. The first is to normalize the size of the image. In general, when using a CNN, the dimensions of an input image or feature need to be fixed. Therefore, we resized all the input images to the same size, 128 × 128. This accelerates the convergence of the network. The second is to normalize the image numerically, that is, to normalize the pixel distribution of the original image. Through Equation (1), which has been reported in [35], the pixel values are standardized by the Z-score. The standard conversion formula is as follows:

$$
\mathbf{x}' = \frac{\mathbf{x} - \mu}{\sigma}. \tag{1}
$$

Here, *x* is the pixel value of the original image, and *x′* is the new value of the converted image. In addition, *μ* is the mean pixel value of the image, and *σ* is the standard deviation of the pixel values of the image. After Z-score standardization, the data have a mean of approximately 0 and a standard deviation of approximately 1. This intensity normalization can give better features than dividing by 255.

In most deep learning approaches, an input image is fed into the deep neural network after dividing it by 255 to gain robustness to illumination changes. However, this always yields an intensity range of (0, 1.0); that is, normalization by 255 compresses the intensities into too small a range. In contrast, the suggested Z-score maintains a larger range, approximately (−1.0, +1.0), determined by the standard deviation of the illumination in the given image. Through experiments, we verified that the suggested normalization is more effective for producing features in convolutional neural networks.
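A minimal NumPy sketch of Equation (1), next to the division by 255, is given below. The function names and the small constant `eps` (which guards against division by zero for a completely flat image) are assumptions made for illustration.

```python
import numpy as np

def zscore_normalize(image, eps=1e-8):
    """Equation (1): subtract the image mean and divide by the image standard deviation."""
    img = np.asarray(image, dtype=np.float64)
    mu = img.mean()        # mean pixel value of this image
    sigma = img.std()      # standard deviation of the pixel values of this image
    return (img - mu) / (sigma + eps)   # eps is an assumption, not part of Equation (1)

def scale_by_255(image):
    """Conventional normalization: maps 8-bit intensities into [0, 1]."""
    return np.asarray(image, dtype=np.float64) / 255.0
```

Since *μ* and *σ* are computed per image, a globally darker or brighter copy of the same image (for example, all intensities multiplied by a constant) maps to essentially the same normalized output, whereas division by 255 preserves the brightness difference.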

#### 3.1.1.4. Feature Extraction Using LBP

We extracted features from the resized image to reduce the computational complexity and to emphasize facial characteristics. In this study, facial features were extracted through an LBP. The LBP classifies the texture of the image and is widely used in fields such as facial recognition and gender, race, and age classification [52,53]. Additionally, the LBP function was used to eliminate the effect of lighting.

In [54], Timo et al. proposed a method of applying LBP to facial recognition problems for the first time. This showed a better result than many of the existing approaches.

In order to have the LBP feature value for one pixel, a 3 × 3 size block was used, and it is shown in Figure 6. Each pixel value except the center was compared with the pixel value located in the center, and if it was brighter than the center, it was encoded as 1; if it was darker than the center, it was encoded as 0. The formula is as follows:

$$LBP(x, y) = \sum_{n=0}^{N-1} s(p_o - p_c) \times 2^n,\tag{2}$$

where

$$s(x) = \begin{cases} 1, & \text{if } x > 0, \\ 0, & \text{otherwise.} \end{cases} \tag{3}$$

Through Equation (3), each comparison *s*(*x*) yields a binary value of 0 or 1, where *x* is the difference between a neighboring pixel *p<sub>o</sub>* and the center pixel *p<sub>c</sub>*. When *N* is 8, an 8-bit binary string is thus generated for each center pixel *p<sub>c</sub>*. The binary code is then converted to the decimal value *LBP*(*x*, *y*) by Equation (2). The LBP representation helps reduce computational complexity compared with the original image. It also emphasizes the main texture of the face in the image.
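Equations (2) and (3) can be illustrated with a short NumPy sketch for the 3 × 3 case (*N* = 8). The order in which the eight neighbors are assigned to bits 0 to 7 is an assumption here (clockwise from the top-left corner), since the text does not fix it; border pixels, which lack a full neighborhood, are simply dropped.

```python
import numpy as np

# Offsets (row, column) of the 8 neighbours, clockwise from the top-left corner.
# The bit order is an assumption; any fixed order gives a valid LBP code.
OFFSETS = [(-1, -1), (-1, 0), (-1, 1), (0, 1), (1, 1), (1, 0), (1, -1), (0, -1)]

def lbp_image(gray):
    """Return the LBP code of every interior pixel of a 2D grayscale image."""
    g = np.asarray(gray, dtype=np.int32)
    h, w = g.shape
    center = g[1:h - 1, 1:w - 1]
    codes = np.zeros_like(center)
    for n, (dy, dx) in enumerate(OFFSETS):
        neighbour = g[1 + dy:h - 1 + dy, 1 + dx:w - 1 + dx]
        # s(p_o - p_c) from Equation (3), weighted by 2^n as in Equation (2)
        codes += (neighbour - center > 0).astype(np.int32) << n
    return codes.astype(np.uint8)
```

For the block `[[6, 5, 2], [7, 6, 1], [9, 8, 7]]` with center value 6, only the bottom-right, bottom, bottom-left, and left neighbors are brighter than the center, so with this bit order the code is 16 + 32 + 64 + 128 = 240.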

**Figure 6.** The LBP feature extraction.

#### 3.1.2. Minimum Overlapped Frame Structure

The proposed model extracts features from multiple networks whose inputs differ in frame rate and classifies facial expressions by combining the extracted features. We therefore expected learning to be more efficient if the networks are given diverse information.

In the conventional structure, frames are extracted at regular intervals. This assumes that the expression in the sequence goes from neutral to the peak. When the sequence *X* contains *n* frames, the structure *S*(*N*) of *N* input frames is built from *X* as follows:

$$S(N) = \{X[1], X[2], \dots, X[N]\},\tag{4}$$

where

$$X[i] = X[round(\frac{n-1}{N-1} \times (i-1))].\tag{5}$$

However, in this case, the first image *X*[1] and the last image *X*[*N*] are always given as input to every network, and the middle parts of the inputs can also overlap. The same information is then fed into each network and, as a result, the same spatial features are extracted, which is not a good condition for learning from the given input sequences. An example of the original structure for selecting 3, 5, and 7 input frames is shown in Figure 7.

**Figure 7.** The example of the original structure of selecting input frames.

As shown in Figure 7, when *n* = 22, that is, when the sequence has 22 image frames, the 3-frame input is selected as S(3) = {X[0], X[11], X[21]}. The 5-frame input is S(5) = {X[0], X[5], X[11], X[16], X[21]}, and the 7-frame input is S(7) = {X[0], X[4], X[7], X[11], X[14], X[18], X[21]}. All of them contain the same images X[0], X[11], and X[21]. In terms of information, this overlap is undesirable for reliable learning.

To solve this problem, we designed an input frame structure that minimizes the overlap between the generated input sequences. As in the existing structure, frames are extracted at regular intervals, but the start and end points differ for each input. The structure *S*(*N*) of 3, 5, and 7 input frames from the original sequence *X* with *n* frames is as follows:

$$S(N) = \{X[1], X[2], \dots, X[N]\},\tag{6}$$

where

$$X[i] = \begin{cases} X[0 + round(\frac{(n-1)-2}{3-1} \times (i-1))], & N = 3 \text{ input frames}, \\ X[2 + round(\frac{(n-1)-2}{5-1} \times (i-1))], & N = 5 \text{ input frames}, \\ X[1 + round(\frac{(n-1)-2}{7-1} \times (i-1))], & N = 7 \text{ input frames}. \end{cases} \tag{7}$$

The start and end points of the seven-frame input were set between the start and end points of the three- and five-frame inputs. In our example, seven was the largest number of frames selected from a sequence of *n* frames. If the start and end points of the seven-frame input are shifted by one position relative to the other inputs, the probability of overlap may decrease. An example of the designed minimum overlapped frame structure, which selects three, five, and seven input frames, is shown in Figure 8.

**Figure 8.** The example of selecting input frames through the minimum overlapped frame structure.

When *n* = 22, the three-frame input is S(3) = {X[0], X[10], X[19]}. The five-frame input is S(5) = {X[2], X[7], X[12], X[16], X[21]}, and the seven-frame input is S(7) = {X[1], X[4], X[8], X[11], X[14], X[17], X[20]}. None of the input images overlap, as shown in Figure 8. The proposed structure can give more spatio-temporal information for feature extraction in the neural network. With the suggested structure of three networks of different depths, not all frames can be utilized. However, adding one or two more networks of different depths would allow almost all frames to be used, with a larger frame rate, for our FER task.
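The index computations of Equations (5) and (7) can be sketched in a few lines of Python. The function names are hypothetical, and the rounding rule is an assumption: rounding halves upward reproduces the indices listed for Figure 7, whereas Python's built-in `round` rounds halves to the nearest even integer, so integer arithmetic is used instead.

```python
def select_frames(n, N, start=0, trim=0):
    """Indices of N frames picked at regular intervals from an n-frame sequence.

    start: index of the first selected frame.
    trim:  number of frames by which the covered span is shortened.
    start=0 and trim=0 correspond to the conventional structure, Equation (5).
    """
    span = (n - 1) - trim
    # Integer form of floor(span * k / (N - 1) + 0.5), i.e., round-half-up
    return [start + (2 * span * k + (N - 1)) // (2 * (N - 1)) for k in range(N)]

# Start offsets taken from Equation (7): 0 for N = 3, 2 for N = 5, 1 for N = 7.
MIN_OVERLAP_START = {3: 0, 5: 2, 7: 1}

def min_overlap_frames(n, N):
    """Minimum overlapped frame structure, read literally from Equation (7)."""
    return select_frames(n, N, start=MIN_OVERLAP_START[N], trim=2)
```

For *n* = 22, `select_frames` returns the index sets listed for Figure 7, and `min_overlap_frames` returns {0, 10, 19} and {2, 7, 12, 16, 21} for *N* = 3 and *N* = 5. For *N* = 7, this literal reading of Equation (7) gives index 7 as the third frame, whereas the text lists X[8]; the sketch follows the equation as written.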

#### *3.2. Data Preparation*

#### 3.2.1. Data Augmentation

For FER, a sufficient amount of human face data is needed. However, most FER databases have been collected and labeled in well-controlled environments, which is a high-cost task. Therefore, in most cases there are not enough data for experiments. When a deep network is trained with insufficient data, it can easily overfit, so most researchers use data augmentation to address this overfitting problem.

Data augmentation is largely divided into two types. The first uses deep learning techniques such as the autoencoder (AE) [22,55] or generative adversarial networks (GAN) [56], which can also be used together for input data augmentation. The second is augmentation through image processing such as rotation, skewing, and scaling. Horizontal flipping is also effective: it increases the amount of data while maintaining the geometric relationship between the eyes and other important parts of the face, such as the nose and mouth. Another method is to add noise to the image, including salt-and-pepper noise, speckle noise, and Gaussian noise. In [57], the dataset was enlarged 14 times through horizontal flipping and rotation. Figure 9 shows data augmentation using image pre-processing.

In this experiment, the second data augmentation method was used to increase the amount of the dataset. Table 1 shows the number of the original input dataset in each database. The CK+ database contains images labeled with "contempt," but other databases do not have this label [58,59]. Therefore, we excluded sequences labeled with "contempt" to establish the same experimental conditions. For the MMI dataset [60], we separated frames for each emotion before making inputs.

In the case of the GEMEP-FERA database [61], there were five emotion classes. However, one of them was not "neutral" but "relief." We changed the label "relief" to "neutral." In Table 1, the first row gives abbreviations for the emotion classes neutral, anger, disgust, fear, happiness, sadness, and surprise, in that order. The AFEW database [62] has already been divided into training, validation, and test datasets. However, the test dataset had no annotation of the expression. Therefore, we used the provided training dataset for training, and the validation dataset was used for the test stage.

**Table 1.** The number of the original input datasets.


The expression input dataset was constructed in the following way. First, several frames were extracted from the input sequence through the minimum overlapped frame structure. If there was a separate sequence with the neutral label in the database, the neutral label dataset was also configured in the same way. However, in the case of a database where the neutral label was not specified, the neutral dataset was created from the first three frames of each sequence.

In the case of the CK+ database, there were no neutral-labeled sequences. Therefore, the labeled emotion dataset was created through the minimum overlapped frame structure method, and a neutral dataset was created from the first three frames of each sequence. Because of this, neutral-labeled data existed for all sequences. Since each emotion-labeled dataset can only be created from sequences with that specific label, the difference between the amount of neutral data and the amount of data for the other emotions became large. To avoid the overfitting problem that can be caused by an insufficient and biased distribution of the datasets, it was necessary to increase the emotion-labeled dataset.

Data augmentation was mainly performed to increase the amount of the emotion-labeled dataset so that the dataset was evenly distributed. For the created neutral expression dataset, each image was flipped horizontally, which doubled the data. For the other datasets, each image was flipped horizontally and rotated by {−7.5°, −5°, −2.5°, 2.5°, 5°, 7.5°}. Through this process, the emotion-labeled dataset increased 14 times. Table 2 shows the specific values of the increased datasets for the CK+, MMI, and GEMEP-FERA datasets. In particular, for the MMI dataset, which has no neutral emotion, the neutral dataset was created from all emotion-labeled sequences and augmented two times. The '−' symbol in Table 2 means that the class does not exist in the GEMEP-FERA dataset.
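The augmentation factors can be checked by enumerating the variants generated per image. The sketch below assumes that the factor of 14 arises from combining the unrotated image and the six rotations (seven orientations) with the presence or absence of a horizontal flip; this is the reading consistent with the stated factor, and the names are hypothetical.

```python
from itertools import product

# 0.0 stands for the unrotated image; the other six angles are those listed in the text.
ROTATIONS_DEG = [0.0, -7.5, -5.0, -2.5, 2.5, 5.0, 7.5]

def augmentation_plan(label):
    """Return the (flip, rotation_deg) variants generated for one image of this label."""
    if label == "neutral":
        # Neutral images are only flipped: original + mirrored copy
        return [(flip, 0.0) for flip in (False, True)]
    # Emotion-labeled images: every orientation, with and without flipping
    return list(product((False, True), ROTATIONS_DEG))
```

With this plan, a neutral image yields 2 variants and an emotion-labeled image yields 2 × 7 = 14 variants, which matches the factors reported above.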

**Table 2.** The number of the augmented input datasets.


The provided AFEW training dataset, which was used as the training and validation dataset in our experiment, was augmented four times: we flipped the images horizontally and rotated them by {−2.5°, 2.5°}. The provided AFEW validation dataset, which was used as the test dataset in our experiment, was augmented two times by flipping horizontally. The result of the augmented AFEW dataset is in the fifth and sixth rows of Table 2.

#### 3.2.2. Making Neutral Label of Dataset

The CK+ database is composed of image sequences that go from neutral to the peak of an expression. Thus, the neutral part of a CK+ sequence is at the beginning of the video. To make a three-frame input, the first three consecutive frames were used as the frames labeled neutral. The five-frame input was made from the same three frames by using the first frame once, the second frame twice, and the third frame twice. The seven-frame input was created by using the first frame twice, the second frame twice, and the third frame three times. Figure 10 shows an example of "neutral" label frames extracted from a sequence.
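The repetition scheme for the neutral inputs can be written as a small lookup. The names below are hypothetical; the repeat counts are those stated in the text.

```python
# How often each of the first three frames is used for an N-frame neutral input.
NEUTRAL_REPEATS = {3: (1, 1, 1), 5: (1, 2, 2), 7: (2, 2, 3)}

def neutral_input(sequence, N):
    """Build an N-frame neutral input from the first three frames of a sequence."""
    frames = []
    for frame, count in zip(sequence[:3], NEUTRAL_REPEATS[N]):
        frames.extend([frame] * count)   # repeat the frame `count` times, keeping order
    return frames
```

For a sequence whose first frames are `a`, `b`, `c`, the five-frame input is `[a, b, b, c, c]` and the seven-frame input is `[a, a, b, b, c, c, c]`, so the temporal order of the frames is preserved.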

**Figure 10.** The method of creating a neutral dataset where the neutral label is not specified.

Unlike the CK+ database, the MMI database has an emotion flow that goes from "neutral" to the peak of an expression and then back to "neutral." We judged that the peak of the emotion was in the middle of the video. Therefore, the dataset was created using only the first half of each sequence, from the first frame to the middle frame. It then had the same emotion flow as the CK+ database, from "neutral" to the peak of one expression. The dataset for the neutral expression was made with the same method used to create the neutral dataset from the CK+ database.

On the other hand, the GEMEP-FERA database did not have a label for "neutral" but a label for "relief." In order to match the conditions with other databases, we defined the "relief" as the "neutral" label.

#### *3.3. 3D Convolutional Neural Network (3D CNN)*

Spatial and temporal information was simultaneously captured using a 3D convolution and a 3D input dataset. Unlike the kernel used in 2D CNN, 3D CNN has a 3D cube-shaped convolution kernel, which has one more depth in the time axis. This preserves the time information of the input sequence and creates an output that forms the volume. Therefore, motion information can be obtained by connecting the feature map of the convolutional layer from multi-frames as input. Additionally, it considers adjacent pixels within the frame like the operation of 2D convolution at the same time. Therefore, spatial and temporal information can be simultaneously extracted through 3D convolution.

Shuiwang et al. [24] have explained 3D CNN mathematically. The value at position (*x*, *y*, *z*) on the *j*-th feature map in the *i*-th layer is given by:

$$v_{ij}^{xyz} = \tanh\left(b_{ij} + \sum_{m} \sum_{p=0}^{P_i - 1} \sum_{q=0}^{Q_i - 1} \sum_{r=0}^{R_i - 1} w_{ijm}^{pqr} v_{(i-1)m}^{(x+p)(y+q)(z+r)}\right),\tag{8}$$

where *v* denotes a feature-map value, *b<sub>ij</sub>* is the bias, (*p*, *q*) is the spatial dimension index, *r* is the temporal dimension index of the kernel, *w<sup>pqr</sup><sub>ijm</sub>* is the (*p*, *q*, *r*)-th value of the kernel connected to the *m*-th feature map in the previous layer, *P<sub>i</sub>* and *Q<sub>i</sub>* are the spatial sizes of the kernel, and *R<sub>i</sub>* is the size of the 3D kernel along the temporal dimension. Here, the activation function is assumed to be the hyperbolic tangent, tanh(·), but other activation functions can also be used.
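As an illustration, Equation (8) can be evaluated directly with NumPy. The sketch below is a naive, unoptimized version under assumed array layouts: the previous-layer feature maps are stored as `(M, X, Y, Z)` with the last axis as time, the kernel as `(M, P, Q, R)`, and no padding or stride is applied. These layout choices are assumptions for the example and are not taken from [24] or [26].

```python
import numpy as np

def conv3d_value(prev_maps, kernels, bias, x, y, z):
    """Equation (8) at one output position (x, y, z).

    prev_maps: array (M, X, Y, Z), feature maps of layer i-1
    kernels:   array (M, P, Q, R), one 3D kernel slice per previous map m
    """
    _, P, Q, R = kernels.shape
    patch = prev_maps[:, x:x + P, y:y + Q, z:z + R]
    # Sum over m, p, q, r of kernel value times feature value, then bias and tanh
    return float(np.tanh(bias + np.sum(kernels * patch)))

def conv3d_map(prev_maps, kernels, bias):
    """Apply Equation (8) at every valid position to obtain one output feature map."""
    _, X, Y, Z = prev_maps.shape
    _, P, Q, R = kernels.shape
    out = np.empty((X - P + 1, Y - Q + 1, Z - R + 1))
    for x in range(out.shape[0]):
        for y in range(out.shape[1]):
            for z in range(out.shape[2]):
                out[x, y, z] = conv3d_value(prev_maps, kernels, bias, x, y, z)
    return out
```

Because the kernel also extends along the temporal axis, each output value mixes information from *R* consecutive frames, which is how motion information enters the feature map.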

In this study, we used a 3D CNN from study [26], which is called an "Appearance Network," as a basic model to capture spatio-temporal information. Figure 11 shows the detailed configuration of the network.

**Figure 11.** The architecture of 3D CNN from study [26], which has five 3D convolution layers.

First, the 3D convolutional layer extracts spatial and temporal features. All convolutional layers use a 5 × 5 × 3 kernel and a rectified linear unit (ReLU) activation function. In addition, 3D pooling is applied to reduce the number of parameters and to cope with changes in the position of image elements. In this case, the pooling layer is max pooling, which passes on only the maximum value of each volume region. After the max pooling operation, the size of the feature map is reduced. Due to 3D pooling, dimension reduction on the time axis also occurs. The maximum value of each 2 × 2 × 2 block is mapped to a single element of the output 3D feature map.
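The 2 × 2 × 2 max pooling step can be sketched with a NumPy reshape. This is an illustrative version only: it assumes a single-channel volume and non-overlapping blocks, and it drops a trailing slice when a dimension is odd, which is one common convention and not necessarily the one used in [26].

```python
import numpy as np

def max_pool3d_2x2x2(volume):
    """Non-overlapping 2x2x2 max pooling of a 3D array (time, height, width)."""
    d, h, w = volume.shape
    v = volume[:d - d % 2, :h - h % 2, :w - w % 2]   # drop odd remainders (an assumption)
    # Split every axis into (blocks, 2) and take the maximum inside each block
    return v.reshape(d // 2, 2, h // 2, 2, w // 2, 2).max(axis=(1, 3, 5))
```

Under this convention, a feature volume of shape (3, 128, 128) is reduced to (1, 64, 64), which shows the reduction along the time axis as well as in the spatial dimensions.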

After the max pooling layers, a batch normalization layer follows. Batch normalization is one of the ideas for preventing vanishing or exploding gradients [63]. During training, if the network is deep and the number of epochs increases, the gradient may explode or vanish. This problem arises because the scales of the parameters differ, which means that the distribution of the input to each layer or activation function of the network should be kept under control. To solve this problem, the input distribution needs to be normalized. However, full normalization is very complicated because the covariance matrix and its inverse must be calculated. Instead, in batch normalization, the mean and standard deviation are obtained for each feature over a mini-batch rather than over the entire dataset, and each feature is normalized separately.
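A minimal sketch of the per-feature normalization performed by batch normalization in training mode is shown below. The scale `gamma` and shift `beta` are learnable in practice; here they are fixed defaults, and the running statistics used at inference time are omitted for brevity.

```python
import numpy as np

def batch_norm(x, gamma=1.0, beta=0.0, eps=1e-5):
    """Normalize each feature (column) of a mini-batch x with shape (batch, features)."""
    mean = x.mean(axis=0)                      # per-feature mean over the mini-batch
    var = x.var(axis=0)                        # per-feature variance over the mini-batch
    x_hat = (x - mean) / np.sqrt(var + eps)    # no covariance matrix or inverse is needed
    return gamma * x_hat + beta
```

Each feature is treated independently, so features with very different scales end up with approximately zero mean and unit variance within the mini-batch.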

At the end of the network, emotions are classified through the softmax function, which outputs continuous values. However, this classification module is not used in this study because we designed a different joint fusion classifier based on self-attention.

#### *3.4. Joint Fusion Classifier Using Self-Attention*

In this section, a joint fusion classifier is designed for a combination of multiple networks. This classifier classifies facial expressions based on various pieces of information by combining the features extracted from each different input frame. In other words, it is possible to obtain more accurate results by letting the results of the networks complement each other. Here, feature vectors 1, ..., *N* were extracted from the depth networks in Figure 11 to make the final 3D features. In this study, there were three features since we employed three depth networks. Extending this to *N* depth networks would give *N* features before the classification module.

Additionally, we employed a squeeze-and-excitation network (SENet) for self-attention [46]. For any given transformation *F<sub>tr</sub>* : *X* → *U*, *X* ∈ ℝ<sup>*H′*×*W′*×*C′*</sup>, *U* ∈ ℝ<sup>*H*×*W*×*C*</sup> (e.g., a convolution or a set of convolutions), we employed the squeeze-and-excitation (SE) block [46] to perform feature re-calibration as follows. In this structure, the features *U* are first passed through a squeeze operation, which aggregates the feature maps across the spatial dimensions *H* × *W* to produce a channel descriptor. This descriptor embeds the global distribution of channel-wise feature responses, enabling information from the global receptive field of the network to be leveraged by its lower layers. This is followed by an excitation operation, in which sample-specific activations, learned for each channel by a self-gating mechanism based on channel dependence, govern the excitation of each channel. The feature maps *U* are then re-weighted to generate the output of the SE block, which can then be fed directly into subsequent layers.
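The squeeze, excitation, and re-weighting steps can be sketched as a NumPy forward pass. The gate below uses global average pooling followed by two fully connected layers with ReLU and sigmoid, as in the original SE block design; the reduction ratio, the weight shapes, and the omission of biases are assumptions for this example, and the block is shown for a 2D feature map of shape (*H*, *W*, *C*) rather than the 3D features of the proposed network.

```python
import numpy as np

def channel_gate(U, W1, W2):
    """Per-channel weights in (0, 1) computed from the feature maps U of shape (H, W, C).

    W1: (C, C // r) and W2: (C // r, C), where r is a hypothetical reduction ratio.
    """
    z = U.mean(axis=(0, 1))                    # squeeze: one descriptor per channel
    hidden = np.maximum(z @ W1, 0.0)           # bottleneck layer with ReLU
    return 1.0 / (1.0 + np.exp(-(hidden @ W2)))  # sigmoid self-gating

def se_block(U, W1, W2):
    """Re-weight every channel of U by its gate value (feature re-calibration)."""
    return U * channel_gate(U, W1, W2)
```

The output has the same shape as the input, so the block can be placed between existing layers; channels with a gate value near 1 pass almost unchanged, while channels with a gate value near 0 are suppressed.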

Therefore, we could obtain emphasized and reinforced features through self-attention. Those features were concatenated in one-dimension and fed into the joint fusion classifier, which is depicted in Figure 12.

**Figure 12.** The architecture of the joint fusion classifier using self-attention.

In Figure 12, the joint fusion classifier is composed as follows: fully connected (FC) layers one and two of each network use ReLU, and FC layer three uses softmax as its activation function. Additionally, cross entropy was used as the loss function, and the loss was minimized with the Adam optimizer. The classifier determined the final emotion and was trained with the same training dataset as each network.
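The forward pass of such a classifier can be sketched as follows. The layer widths, the random initialization, and the function names are hypothetical and are not taken from Figure 12; only the structure (concatenation, two ReLU layers, a softmax layer, and the cross-entropy loss) follows the description above, and training with Adam is omitted.

```python
import numpy as np

N_CLASSES = 7  # neutral, anger, disgust, fear, happiness, sadness, surprise

def init_params(in_dim, hidden=(64, 32), n_classes=N_CLASSES, seed=0):
    """Random (weight, bias) pairs for three FC layers; the hidden widths are hypothetical."""
    rng = np.random.default_rng(seed)
    dims = (in_dim, *hidden, n_classes)
    return [(rng.normal(0.0, 0.1, size=(a, b)), np.zeros(b))
            for a, b in zip(dims[:-1], dims[1:])]

def softmax(logits):
    e = np.exp(logits - logits.max())          # shift by the maximum for numerical stability
    return e / e.sum()

def joint_fusion_forward(features, params):
    """Concatenate the per-network features and apply FC1-ReLU, FC2-ReLU, FC3-softmax."""
    x = np.concatenate([np.ravel(f) for f in features])
    (W1, b1), (W2, b2), (W3, b3) = params
    h1 = np.maximum(x @ W1 + b1, 0.0)
    h2 = np.maximum(h1 @ W2 + b2, 0.0)
    return softmax(h2 @ W3 + b3)

def cross_entropy(probs, label):
    """Cross-entropy loss for a single sample with an integer class label."""
    return float(-np.log(probs[label] + 1e-12))
```

The classifier sees the features of all networks at once, so the final decision can draw on whichever frame rate carries the most discriminative information for a given sample.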

As described above, we designed a multi-depth network based on multi-rate feature fusion for efficient facial expression recognition. Additionally, we developed a new image normalization and networks of different depths matched to the frame rates to give more robustness across various datasets. We verified the robustness and effectiveness of the proposed algorithm through experiments.

#### **4. Experimental Results and Discussion**

This section introduces the experiments and their environment in detail. We present and analyze the performance through several experimental results. Additionally, we compare the proposed FER algorithm with other recent algorithms. To train the network, the Adam optimizer was used with the default parameter setting [64]. We implemented all methods on a GPU server with an Intel i7 CPU and a GTX 1080 Ti GPU with 11 GB of memory.

#### *4.1. Ablation Study*

#### 4.1.1. Performance of Image Normalization

This experiment confirmed that performance improved when the image was normalized as described in Section 3.1.1.3. In the AFEW dataset, most of the image sequences were not taken in a controlled environment but under real-life conditions. Therefore, the brightness of the images varied, with some being too dark or too bright. By using image normalization, we could overcome such problems, and the result of using image normalization is shown in Figure 13.

**Figure 13.** Examples of applying image normalization in AFEW dataset.

The results of image normalization on the CK+, MMI, GEMEP-FERA, and AFEW datasets are in Table 3. Bold values in Table 3 indicate an accuracy better than or equal to that obtained without image normalization. In the CK+ database, image normalization was 0.14% better on average. In the MMI, GEMEP-FERA, and AFEW datasets, most of the results using image normalization were also better than those without it, by 0.7%, 0.61%, and 0.23% on average, respectively.


**Table 3.** The results of image normalization.

#### 4.1.2. Correlation between Depth of the Network and Frame Rate of Input

This experiment investigated the correlation between the depth of the network model and the number of input frames. We varied the depth of the base model (5 layers), which is based on the 3D appearance network [26], and fed three-, five-, and seven-frame inputs into 3D CNNs with 5 (base model), 10, 15, 20, and 25 layers.

We used the CK+, MMI, and GEMEP-FERA datasets to deduce the relationship. The results of the experiment for each combination of network depth and input frame rate are shown in Table 4. Bold face denotes the maximum accuracy for each network depth according to the input frame rate. Figure 14 presents the results of Table 4 as a graph, in which the dotted lines indicate the trend lines. The results show that accuracy tends to increase when the depth of the model and the frame rate of the input are proportional. That is, accuracy is higher when a deeper model is given more input frames and, likewise, when a shallower model is given fewer input frames. We use this observation to design our multirate-based network model.


**Table 4.** The results of correlation between the depth of network and frame rate of the input.

#### 4.1.3. Performance of the Minimum Overlapped Frame Structure

In this experiment, we verified that more diverse temporal information helps learning when creating the input dataset structure used in multiple networks. The minimum number of frames per sequence in the dataset was set to nine. Previously, the input entering each network was determined as follows: given an arbitrary image sequence, the sequence was divided at equal intervals to obtain the required number of input frames. In this case, the three-, five-, and seven-frame inputs always began and ended with the same images, and the probability that intermediate images overlapped was also high. To compensate for this problem, the method described in Section 3.1.2 was designed to create input frames that overlap as little as possible. With the minimum overlapped frame structure, more diverse information could be given to the networks during learning.

Based on the correlation between the depth of the network and the frame rate of the input in Section 4.1.2, we fed 3 frames of input into the 3D CNN with 5 layers, 5 frames input into the 3D CNN with 10 layers, and 7 frames input into the 3D CNN with 15 layers. We obtained the feature from the networks without using image normalization and LBP feature extraction. Through Table 5, it can be seen that providing a variety of information to the network improves the performance in all of the databases. In the CK+, MMI, and GEMEP-FERA datasets, better performances of about 1.97%, 1.53%, and 0.46%, respectively, were shown. Moreover, the network using a minimum overlapped frame structure showed an improvement of 0.97% in the AFEW database.

**Table 5.** The results of using minimum overlapped frame structure.


#### 4.1.4. Performance of Self-Attention Module

We again fed 3 frames into the 3D CNN with 5 layers, 5 frames into the 3D CNN with 10 layers, and 7 frames into the 3D CNN with 15 layers using the minimum overlapped frame structure, without image normalization or LBP feature extraction. We reinforced the features output by each network using self-attention. Then, we concatenated the reinforced features into one dimension and fed them into the joint fusion classifier.

We checked whether the self-attention module reinforced the features or not by comparing between the concatenated feature without the self-attention module and the concatenated feature with the self-attention module. The result is shown in Table 6. We can see that the self-attention module reinforced the feature and improved the FER performance in most of the databases. In the CK+, MMI, and GEMEP-FERA databases, the proposed scheme showed about 0.21%, 0.91%, and 0.23% better performances, respectively. Additionally, in the AFEW database, it showed a 0.42% better performance with the self-attention module.

**Table 6.** The results of the performance using self-attention.


#### 4.1.5. Effectiveness of Multi-Depth Network Structure

To show the effectiveness of the proposed multi-depth network structure, we tested a single network, taken from [26], as shown in Figure 11. We set three frames as the input sequence. To obtain the results, we used a 10-fold validation approach.

Table 7 summarizes the number of trainable parameters of the proposed three-depth network model, assuming that three frames are given as input. It shows only the layers with trainable parameters in the entire network. As shown in the table, the number of layers in the individual networks increased in multiples of five according to the number of frames given as input, and the number of parameters increased significantly accordingly. The outputs of the networks were finally combined in the last three FC layers, and the total number of parameters, including these layers, reached about 237 million. If the proposed network is extended with more depths, the complexity will increase further.

**Table 7.** Summary of trainable parameters of the proposed multi-depth network (*input shape = (3, 128, 128, 1)*).


For the CK+ dataset, the proposed multi-depth network gave slightly better accuracy than the single network, as shown in Table 8. Additionally, we observed an improvement of up to 7% in accuracy on the MMI and GEMEP-FERA datasets. From these results, we can conclude that the proposed multi-depth network was effective for the facial expression recognition task.

**Table 8.** The performance results of the proposed network (multi-depth network) and single network (%).


#### *4.2. Overall Accuracy Performance of the Proposed Scheme*

In this section, we demonstrate that the proposed scheme shows competitive performance compared with the recent existing methods. Among various techniques for facial expression recognition, we compared with spatio-temporal network approaches or hybrid network approaches. Table 9 shows input construction and model setting of the recent existing methods, which were compared with the proposed method.


**Table 9.** Analysis of the recent existing methods for performance comparison.

For the experiments, we used three datasets: the CK+, MMI, and GEMEP-FERA datasets. The number of image sequences in each dataset is listed in Table 2. We used 3 frames, 5 frames, and 7 frames as input, and the multi-depth network was composed of networks with 5 layers, 10 layers, and 15 layers. We used self-attention to reinforce the features from each network and fed them into the joint fusion classifier.

The results of 10 trials on the CK+ dataset are in Table 10. "Without Pre-processing" means that we did not use image normalization, LBP feature extraction, the minimum overlapped frame structure, or the self-attention module. In contrast, "With Pre-processing" means that we used all the proposed image pre-processing methods, including the minimum overlapped frame structure and the self-attention module. The average accuracy of each setting is shown in bold. For the CK+ dataset, "Without Pre-processing" showed better results, by about 1.11% on average. The CK+ dataset is very static, whereas the proposed scheme uses several video frames to extract more temporal information. This suggests that the proposed algorithm is better suited to more dynamic video sequences.

The accuracy comparison of each method on the CK+ database is shown in Table 11. For the CK+ database, the proposed scheme, shown in bold, did not achieve the best result compared with some existing methods [26,68–70].


**Table 10.** Overall accuracy and improvement on the CK+ dataset (%).


**Table 11.** Comparison results of accuracy in the CK+ dataset (%).

The results of the 10 trials on the MMI dataset are given in Table 12. The proposed scheme performed better than "Without Pre-processing" by about 4.79% on average (shown in bold). The comparison in Table 13 shows that the proposed scheme outperformed the existing methods on the MMI dataset. Here, bold denotes the performance of the proposed scheme.

**Table 12.** Overall accuracy and improvement on the MMI dataset (%).


**Table 13.** Comparison results of accuracy in the MMI dataset (%).


Additionally, the proposed method outperformed the existing methods on the GEMEP-FERA dataset. The results of the 10 trials on the GEMEP-FERA dataset are given in Table 14. "With Pre-processing" performed better than "Without Pre-processing" by about 0.64% on average (shown in bold). Table 15 compares the experimental results on the GEMEP-FERA dataset. The proposed scheme (average shown in bold) achieved an improvement of at least 8% over the recent methods.


**Table 14.** Overall accuracy and improvement on the GEMEP-FERA dataset (%).

**Table 15.** Comparison results of accuracy in the GEMEP-FERA dataset (%).


The proposed method showed slightly weaker performance on the CK+ dataset. However, on the MMI and GEMEP-FERA datasets, it showed the highest performance. Taken together, the results on the CK+, MMI, and GEMEP-FERA datasets indicate that the proposed model performs better on more complex datasets.

The result for the AFEW dataset is shown in Table 16. The AFEW dataset is well known for containing data captured in very unconstrained ("in the wild") environments. Using only video data, "With Pre-processing" achieved a result about 3.32% better than "Without Pre-processing." From this result, we expect that the proposed scheme can improve facial expression recognition accuracy in real environments.

**Table 16.** Overall accuracy and improvement on the AFEW dataset (%).


Regarding the processing time of the proposed scheme, we measured the inference time, which included data pre-processing, construction of the frame structures, and prediction of the final decision. When testing the proposed multi-depth network (three layers and three, five, and seven frames of input), the inference time was about 102.0 ms on our GPU server with an Intel i7 CPU and a GTX 1080 Ti GPU with 11 GB of memory. In terms of the frame processing rate, this corresponds to 9.8 frames per second (FPS). When we used a single-layer network with an input of three frames, as shown in Figure 11, the inference time was 49.3 ms owing to the very small network structure.
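As a quick check on these figures, the frame rate follows from the measured latency if one assumes that each inference yields one processed frame, so that the rate is simply the reciprocal of the inference time. The sketch below uses a hypothetical helper name, `frames_per_second`, which is not part of the released code.

```python
def frames_per_second(latency_ms: float) -> float:
    """Convert a per-inference latency in milliseconds to a frame rate.

    Assumes one processed frame per inference (an assumption of this sketch).
    """
    if latency_ms <= 0:
        raise ValueError("latency must be positive")
    return 1000.0 / latency_ms


if __name__ == "__main__":
    # Latencies reported in the text for the two configurations.
    for label, latency in [("multi-depth network", 102.0), ("single network", 49.3)]:
        print(f"{label}: {frames_per_second(latency):.1f} FPS")
```

With the 102.0 ms measurement this gives about 9.8 FPS, matching the reported rate; the same conversion applied to the 49.3 ms measurement would give roughly 20.3 FPS.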

#### **5. Conclusions**

We proposed a facial expression recognition algorithm that is robust to variation across datasets and to different facial expression acquisition conditions. The proposed scheme extracted various features by combining several networks based on external features and classified them with a joint fusion classifier. This network simultaneously extracted spatial and temporal features using a 3D CNN to overcome the problem of existing 2D CNN models trained only with spatial features. In addition, to obtain a variety of features, we designed a multi-depth network structure with multiple input frames that were minimally overlapped and composed of LBP features. The features extracted from each network were reinforced through the self-attention module. These were then combined and fed into the joint fusion network to newly learn and classify the emotions.

Through experiments, we found a correlation between the number of input frames and the depth of the network: with more input frames, a deeper network performed better, and with fewer input frames, a shallower network performed better. Through comparative analysis, we showed that the proposed multirate feature fusion scheme could achieve more accurate results than the state-of-the-art methods. The proposed model achieved average accuracies of 96.23%, 96.69%, and 99.79% on the CK+, MMI, and GEMEP-FERA datasets, respectively. Additionally, an accuracy of 31.02% was achieved on the AFEW dataset through the features enhanced by the self-attention module and the proposed multi-depth network structure.

**Author Contributions:** Conceptualization, B.-G.K.; methodology, B.-G.K. and S.-J.P.; formal analysis, B.-G.K. and S.-J.P.; investigation, B.-G.K. and S.-J.P.; writing—original draft preparation, S.-J.P.; writing—review and editing, B.-G.K. and N.C.; supervision, B.-G.K. and N.C. All authors have read and agreed to the published version of the manuscript.

**Funding:** This research received no external funding.

**Institutional Review Board Statement:** Not applicable.

**Informed Consent Statement:** Not applicable.

**Data Availability Statement:** All sources and data can be found at https://github.com/smu-ivpl/MultiRateFeatureFusion\_FER (accessed on 18 October 2021).

**Acknowledgments:** The authors thank the reviewers for their valuable suggestions and comments for improving this paper.

**Conflicts of Interest:** The authors declare no conflict of interest.

#### **References**


### *Article* **Changes in Computer-Analyzed Facial Expressions with Age**

**Hyunwoong Ko 1,2,3,†, Kisun Kim 2,†, Minju Bae 1,2, Myo-Geong Seo 2, Gieun Nam 2, Seho Park 4, Soowon Park 5, Jungjoon Ihm 1,3 and Jun-Young Lee 1,2,\***


**Abstract:** Facial expressions are well known to change with age, but the quantitative properties of facial aging remain unclear. In the present study, we investigated the differences in the intensity of facial expressions between older (*n* = 56) and younger adults (*n* = 113). In laboratory experiments, the posed facial expressions of the participants were obtained based on six basic emotions and neutral facial expression stimuli, and the intensities of their facial expressions were analyzed using a computer vision tool, OpenFace software. Our results showed that the older adults showed stronger expressions for some negative emotions and neutral faces. Furthermore, when making facial expressions, older adults used more facial muscles than younger adults across the emotions. These results may help to understand the characteristics of facial expressions in aging and can provide empirical evidence for other fields regarding facial recognition.

**Keywords:** facial action unit; facial aging; facial expression; posed emotion

#### **1. Introduction**

Expression and recognition of emotions through facial expressions are fundamental functions of basic communication. Facial expressions are critical for communicating with one's surroundings because they convey the primary meaning of social information [1,2]. People can communicate and convey their emotions in diverse ways; however, facial expressions can be used in the most flexible way [3]. Investigating how facial movements are controlled and how people recognize others' facial expressions is therefore essential for understanding the nature of humans as social beings and can also facilitate emotional functioning.

It has been well established that emotional expression and recognition skills through facial expressions change with age [4,5]. A previous study showed older and young people a variety of facial expressions and examined how they recognized them [6]. Young and old people were both aware of expressions of positive emotion, while older people were less aware of negative facial expressions. In addition, the performance of the older group declined in recognizing sad facial expressions but improved in recognizing disgusted facial expressions [7–9]. The older people were also more inclined to think that they felt happy when they were shown smiles [10]. A recent meta-analysis demonstrated that older adults showed lower performance on emotional face identification than a younger group of adults [11].

**Citation:** Ko, H.; Kim, K.; Bae, M.; Seo, M.-G.; Nam, G.; Park, S.; Park, S.; Ihm, J.; Lee, J.-Y. Changes in Computer-Analyzed Facial Expressions with Age. *Sensors* **2021**, *21*, 4858. https://doi.org/10.3390/s21144858

Academic Editor: Wataru Sato

Received: 17 April 2021 Accepted: 15 July 2021 Published: 16 July 2021

**Publisher's Note:** MDPI stays neutral with regard to jurisdictional claims in published maps and institutional affiliations.

**Copyright:** © 2021 by the authors. Licensee MDPI, Basel, Switzerland. This article is an open access article distributed under the terms and conditions of the Creative Commons Attribution (CC BY) license (https://creativecommons.org/licenses/by/4.0/).

Owing to physical aging, sarcopenia-related changes such as atrophy of the facial skeleton, malposition of fatty muscles, and loss of soft tissue occur most commonly in the areas of the maxilla, mandible, and anterior nasal spine [12]. A previous study showed that human facial aging demonstrated a common pattern of morphological, chronological, and dermatological changes across various biomedical studies [13]. In terms of neuromuscular mechanisms, voluntary facial expressions (i.e., posed facial expressions) using the lower part of the face are predominantly controlled by the left hemisphere and vice versa [14–16]. Specifically, aging of the orofacial motor cortex, which involves involuntary facial expressions, can cause a decline in cognitive control for the lower part of the face [17,18]. While facial aging is natural and inevitable for most people, multiple studies have suggested that there are several markers of facial expression and recognition in neuropathological changes, including epilepsy [19], Parkinson's disease [20], Alzheimer's disease [21], and other neurocognitive disorders [22]. Despite this, work identifying the quantitative characteristics of facial aging is still limited.

The posed facial expression, which is commonly exhibited when portraying another person's facial expression, has distinct characteristics compared to spontaneous facial expression in terms of the neuromotor system and display rules. Whereas posed facial expression is generated cognitively within the pyramidal system, spontaneous facial expression exhibits independent motor control and is driven by the extrapyramidal system [15,23]. The movements inherent to posed facial expression tend to display intended emotions in the context of social interactions (i.e., display rules), while spontaneous facial expressions correspond to a primary emotional system [15,24]. Although several studies have pointed out the limitations of posed facial expression, namely its artificiality when produced by actors and its variability across experimental conditions [25–27], research leveraging posed facial expression has clear advantages. In terms of interpretability, posed facial expression is less ambiguous than spontaneous facial expression [28] and is also universal across the basic emotions [29]. Such universality has also been identified in a recent study of an East Asian population [27]. Since a cumulative body of literature has studied posed facial expression [30], posed facial expression may be expected to be a valid indicator for investigating aging.

Quantitative measurement and analysis of facial expressions have been an active research topic in behavioral science. Among several approaches, the facial action coding system (FACS) [31,32] is the most widely used in this area. A series of facial muscle movements that represent facial expressions, termed action units (AUs), can help make facial recognition-based analysis more standardized [33]. Since AUs were originally developed from basic emotion theory and are manually rated by highly trained coders, FACS-based AUs have had limited accessibility for standardization. Recently, automated computer vision and multidisciplinary studies for facial expression analysis have emerged [34]. Although these studies make scaling facial expression analysis more feasible, facial aging research remains limited to three-dimensional (3D) morphometric [13,35] or electromyography (EMG) studies [36,37]. In that regard, little is known about quantitative facial aging.

Given that facial expressions are crucial indicators of human health status [38,39], applying machine learning techniques to facial expressions, as in computer-aided diagnosis (CAD) in the biomedical signal [40] and medical imaging fields [41], can contribute to digital health. This technique is often used in facial paralysis [42,43], face transplant [44], pain detection through facial expression [45], and neurologic studies such as those involving autism [46], Turner syndrome [47], and Parkinson's disease [48]. Since language production and discourse decrease with aging [49], identifying the characteristics of facial expressions in older adults is a promising and challenging research area in gerontology, which could support diagnosis regardless of patient communication skills. Moreover, the uniqueness of facial expressions has led to consistent studies in the area of personal identification for health records [50], to improve performances on CAD and identification using facial expressions, to develop the algorithm, and to provide interpretable results for facial expressions with aging. Although there has been much work on automatic facial expression recognition in computer vision research, the algorithms have been experimentally validated primarily on younger faces. For facial expressions to be better used as digital markers related to aging, quantitative differences in facial changes with aging should be studied.

The aim of this study was to identify the characteristics of facial expressions based on the basic emotion theory and to compare the differences in facial expressions between younger and older adults for each basic emotion and AU, respectively. Additionally, a feature-selection approach was used to identify multivariate patterns of the changes in facial expressions related to aging. Finally, the predictive ability for selected AUs was evaluated.

#### **2. Materials and Methods**

#### *2.1. Ethics Statement*

This study was approved by the Institutional Review Board of the SMG-SNU Boramae Medical Center (IRB No. 30-2017-63), and all participants submitted written consent for participating in the study.

#### *2.2. Participants*

A total of 61 older adults and 115 younger adults were recruited for this study. The older adults were between 62 and 84 years old and were recruited from the Alzheimer's disease research center of the SMG-SNU Boramae hospital. Healthy young participants were recruited from the university student participant pool and were aged between 18 and 39. None of them had a history of psychiatric disorder. Participants with major medical diseases, severe head injury, or visual impairment were excluded from all groups. In particular, all the older adults were free from Alzheimer's disease and depressive spectrum disorder according to the DSM-IV diagnostic criteria [51]. All medical judgements were made by a board-certified psychiatrist (J.-Y.L.).

To screen for potential emotion-related problems such as depression, anxiety, and alexithymia, participants were asked to complete self-report measures: the Beck Depression Inventory (BDI), Beck Anxiety Inventory (BAI), and Toronto Alexithymia Scale (TAS). The Korean version of the BDI involves 21 questions to evaluate the severity of depression, with scores ranging from 0 to 63 [52,53]. A higher score indicates more severe depressive symptoms, and the cutoff score is 18 in the Korean version [54]. The Korean version of the BAI uses 21 questions to measure the severity of anxiety, with scores ranging from 0 to 63 [55]. A higher BAI score indicates more severe anxiety symptoms, with a cutoff score of 19 [56]. The twenty-item TAS was developed and validated to measure the severity of alexithymia. Its scores range from 20 to 100 [57,58], and a cutoff score of 61 was used for the Korean version [59]. The TAS is made up of three subscales: difficulty identifying feelings, difficulty describing feelings, and externally oriented thinking. Neither group had an abnormal level of emotional problems (Table 1).


**Table 1.** Demographic characteristics across the groups.


Note. Botox, botulinum toxin; BDI, Beck Depression Inventory; BAI, Beck Anxiety Inventory; TAS, Toronto Alexithymia Scale; SD, standard deviation; bold indicates statistically significant differences.

Since data for five older adults and two younger adults failed to pass the quality check, 169 of 176 participants were included in the analysis. Table 1 summarizes the demographic and clinical characteristics of the participants. Significant differences were found in age, education, left-handedness, BDI score, and TAS score. Except for age, these variables were adjusted for in further analyses.

#### *2.3. Procedures*

A series of photos containing six basic emotions and a neutral facial expression were presented to participants; the seven stimuli had been selected by the researchers from a photography dataset used in a previous study [50]. Instructions were given in both verbal and visual form, and the participants were asked to respond verbally to the stimuli. Then, participants performed posed facial expressions for the given list of six basic emotions and the neutral emotion. For example, for the happy facial expression, a photograph of a person with a happy face was presented; participants were asked to identify the emotion conveyed and then to "make a happy face for 15 s towards the camera" while being video recorded. The facial stimuli were given once participants were fully aware of the instructions of the study. Examples of stimuli are shown in Figure 1. Each facial stimulus was presented for a maximum of 7 s; the researcher moved on to the next stimulus when the participant made a verbal response. Facial expressions were acquired for a total of 105 s (15 s for each of the seven expressions).

**Figure 1.** The facial stimuli representing the six basic emotions and the neutral emotion, adapted from [60].

#### *2.4. Data Acquisition*

The participants' posed facial expressions were video recorded with a Canon EOS 70D DSLR camera with a 50 mm prime lens at 720p resolution and a 60 fps frame rate. The camera was positioned on a fixed stand approximately 120–140 cm above the floor to correctly capture the entire face of the participants. The posed facial expressions were recorded for 15 s after a clear instruction to imitate a previously recognized emotional face.

For each frame of the recorded videos, AU presence and intensity were estimated using OpenFace 2.0, an open-source toolkit for facial behavior analysis, which consists of four pipelines: (1) facial landmark detection and tracking, (2) head pose estimation, (3) eye gaze estimation, and (4) facial expression recognition [34]. OpenFace 2.0 recognizes facial expressions by detecting AU intensity and presence according to FACS [31]. Rather than all the AUs listed in FACS, OpenFace 2.0 offers a subset of 18 AUs obtained by cross-dataset learning, specifically AU 01, 02, 04, 05, 06, 07, 09, 10, 12, 14, 15, 17, 20, 23, 25, 26, 28, and 45. The occurrences and intensities of the AUs are estimated using machine learning algorithms. The methods for AU estimation and analysis are described in more detail elsewhere [61]. In the present study, AU intensities were used to derive measures of individual emotional facial expressions, and the six basic emotions were created according to emotional FACS (EMFACS) [62]. EMFACS is based on FACS, which has been shown to have significant reliability for the assessment of human facial movements [63,64]. The highest intensity for each AU was calculated as the maximum score across all the video frames, an approach validated in prior work [65]. Examples of each AU and emotion are shown in Table 2.
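The peak-intensity step can be sketched as follows. This is a minimal illustration rather than the authors' analysis code: it assumes the per-frame output is CSV text in which AU intensity columns carry the suffix `_r` (for example `AU06_r`), which to our knowledge is the naming OpenFace 2.0 uses; the function name `peak_au_intensity` and the sample values are hypothetical.

```python
import csv
import io


def peak_au_intensity(csv_text: str) -> dict:
    """Return the maximum value of every AU intensity column across all frames.

    Assumes intensity columns are named like 'AU06_r' (an assumption about the
    per-frame output format); presence columns ('_c') are ignored.
    """
    # skipinitialspace tolerates headers written as "frame, AU01_r, ..."
    reader = csv.DictReader(io.StringIO(csv_text), skipinitialspace=True)
    peaks = {}
    for row in reader:
        for name, value in row.items():
            if name is None or value is None or not value.strip():
                continue  # ragged or empty cell
            name = name.strip()
            if name.startswith("AU") and name.endswith("_r"):
                v = float(value)
                peaks[name] = max(peaks.get(name, v), v)
    return peaks


if __name__ == "__main__":
    # Hypothetical two-frame recording with two intensity columns.
    sample = "frame, AU06_r, AU45_r, AU06_c\n1, 0.5, 1.2, 0\n2, 2.0, 0.3, 1\n"
    print(peak_au_intensity(sample))
```

Taking the maximum keeps one value per AU per recording, which is the quantity compared between groups in the analyses that follow.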

**Table 2.** Action unit descriptions and combination of each emotion.


Note. AU, action unit; FACS, facial action coding system; L, lower face; U, upper face.

#### *2.5. Statistical Analysis*

Descriptive statistics for demographic variables were calculated as mean scores and standard deviations. Differences in AUs between groups were compared, with Bonferroni correction applied for multiple comparisons. Chi-squared tests were used to compare categorical outcomes such as sex and usage of botulinum toxin (botox). The correlation between age and AU intensity was investigated. To explain multivariate profiles with respect to the input features that accurately distinguished the older group, the adaptive least absolute shrinkage and selection operator (LASSO) ML algorithm was applied to the dataset [66]. The adaptive LASSO, which is a regularized regression method with an L1-norm penalty [67], is a popular technique for simultaneous estimation and consistent variable selection [66]. It is a powerful model that performs regularization and feature selection, and it can provide model interpretability by excluding from the model irrelevant features that are not related to the class. L1 regularization, which penalizes redundant complexity, focuses on the most significant features and thus prevents overfitting of the data; it is supported by well-grounded theoretical analysis [68]. The regression coefficients of unimportant variables shrank to 0 upon implementing the adaptive LASSO. In that regard, the adaptive LASSO algorithm provided interpretable results related to the older adults. Owing to its high accessibility and low computational complexity compared with other feature selection models, this approach has recently been highly recommended in behavioral science [69].
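For reference, one common way to write the adaptive LASSO for a binary outcome (older versus younger group) is given below. The text does not state the loss function, the initial estimator, or the weight exponent, so the logistic form and the symbols $\tilde{\beta}_j$ and $\gamma$ are assumptions made for illustration.

$$
\hat{\boldsymbol{\beta}} = \underset{\beta_0,\,\boldsymbol{\beta}}{\arg\min}\left\{ -\frac{1}{n}\sum_{i=1}^{n}\left[ y_i \eta_i - \log\left(1 + e^{\eta_i}\right) \right] + \lambda \sum_{j=1}^{p} \hat{w}_j \lvert \beta_j \rvert \right\}, \qquad \eta_i = \beta_0 + \mathbf{x}_i^{\top}\boldsymbol{\beta}, \qquad \hat{w}_j = \frac{1}{\lvert \tilde{\beta}_j \rvert^{\gamma}}
$$

Here $y_i \in \{0, 1\}$ indicates membership of the older group, $\mathbf{x}_i$ holds the $p$ input features, $\tilde{\beta}_j$ is an initial coefficient estimate (for example, from a ridge fit), $\gamma > 0$ controls how strongly the weights differ, and $\lambda \ge 0$ is the tuning parameter. Features with small initial estimates receive large weights $\hat{w}_j$ and are therefore more readily shrunk to exactly 0, which yields the variable selection. Setting all $\hat{w}_j = 1$ recovers the ordinary LASSO.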

To avoid overfitting and to evaluate the generalizability of the results from the ML algorithms, 10-fold cross-validation was applied during the variable selection process [70]. First, the data were randomly split into a training set (66.7% of the data) and a test set (33.3% of the data). All the ML models were fitted using the training set, and classifications were made separately on the test and training datasets. The optimal parameter, lambda, was determined across 1000 iterations of 10-fold cross-validation to minimize the deviance of the model. Then, predictions were made on the test set based on the ML models fitted on the training set. All reported *p* values have been adjusted for multiple comparisons.
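The multiple-comparison adjustment can be illustrated with the Bonferroni correction named earlier in this section. In the sketch below, each raw *p* value is multiplied by the number of tests and capped at 1; the function name `bonferroni_adjust` and the example values are hypothetical, and the software actually used is not specified in the text.

```python
def bonferroni_adjust(p_values):
    """Return Bonferroni-adjusted p values: min(1, p * m) for m tests."""
    m = len(p_values)
    for p in p_values:
        if not 0.0 <= p <= 1.0:
            raise ValueError("p values must lie in [0, 1]")
    return [min(1.0, p * m) for p in p_values]


if __name__ == "__main__":
    raw = [0.01, 0.04, 0.5]  # hypothetical raw p values from three tests
    for p, q in zip(raw, bonferroni_adjust(raw)):
        print(f"raw p = {p:.3f} -> adjusted p = {q:.3f}")
```

An adjusted value below the chosen significance level (for example 0.05) is then treated as significant; comparing the raw values against the level divided by the number of tests is equivalent.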

#### **3. Results**

#### *3.1. The Differences in Facial Expression between the Older Adults and Younger Adults*

Figures 2 and 3 show the AU values of the older and younger adults for the neutral and emotional faces. The results adjusted for multiple comparisons are presented in Table 3. In AU 06, 07, 12, and 14, older adults showed higher intensity than younger adults. For AU 45, older adults showed lower intensity than younger adults.

**Figure 2.** Prevalence of AU values by groups for neutral face. AU, action unit.

**Figure 3.** Prevalence of emotional AU values by groups for emotional face. AU, action unit.


Note: AU, action unit; BOLD, indicates significant *p*-values; ang, angry; dis, disgust; fea, fear; hap, happy; neu, neutral; sur, surprise; L, lower face; U, upper face. Comparisons were adjusted for covariates. *p*-values were adjusted for multiple comparisons.

To explore the relationship between age and each AU, a correlation analysis was conducted. The patterns of the results were similar to the differences found in the group comparisons (Figure 4). For AU 06, 07, 10, 12, and 14, positive correlations between AU intensity and age were found, while negative correlations were found for AU 45 across the emotions.

**Figure 4.** Correlation plot for age and AUs. AU, action unit; ang, angry; dis, disgust; fea, fear; hap, happy; neu, neutral; sur, surprise. *p*-values were adjusted for multiple comparisons.

#### *3.2. Feature Selection for Predicting Age*

The adaptive LASSO model was implemented to identify, among the input variables, the features that were significant for distinguishing the older group. Demographics (education, sex, left-handedness, and botox), self-reported measures (TAS and BDI), and all AUs were assessed for their ability to classify the older adults. Figure 5 shows the multivariate profiles for distinguishing the older adults from the other participants in the current study. Demographics and self-reported measures were not significant in the adaptive LASSO result. Among the total of 119 AU features, only 11 remained significant: AU 10 in angry; AU 02, 10, 14, and 45 in sad; AU 05 and 14 in surprise; and AU 06, 10, 20, and 45 in neutral. The receiver operating characteristic (ROC) analysis yielded an area under the curve (AUC) of 0.924 for the adaptive LASSO model.
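The AUC can be read as the probability that a randomly chosen older adult receives a higher model score than a randomly chosen younger adult. The sketch below computes it by direct pairwise comparison, counting ties as one half; the function name `roc_auc` and the toy scores are hypothetical and are not the study data.

```python
def roc_auc(scores, labels):
    """Area under the ROC curve via pairwise comparison of scores.

    labels: 1 for the positive class (here, older adults), 0 otherwise.
    Equivalent to the normalized Mann-Whitney U statistic.
    """
    pos = [s for s, y in zip(scores, labels) if y == 1]
    neg = [s for s, y in zip(scores, labels) if y == 0]
    if not pos or not neg:
        raise ValueError("both classes must be present")
    wins = 0.0
    for p in pos:
        for n in neg:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5  # ties count as half a win
    return wins / (len(pos) * len(neg))


if __name__ == "__main__":
    toy_scores = [0.9, 0.8, 0.4, 0.3]  # hypothetical predicted probabilities
    toy_labels = [1, 0, 1, 0]
    print(roc_auc(toy_scores, toy_labels))
```

A value of 0.5 corresponds to chance-level ranking and 1.0 to perfect separation of the two groups.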

**Figure 5.** The adaptive LASSO results. AU, action unit; ang, angry; neu, neutral; sur, surprise.

#### **4. Discussion and Conclusions**

The purpose of the present study was to investigate the differences in facial expressions of older and younger adults and to examine how facial muscles contributed to aging through AUs for six basic emotions and a neutral facial expression. Across the emotions and AUs, the older adults generally exhibited greater intensity in facial expression than the younger adults. In some areas, the older adults showed lower facial intensity than the younger adults.

#### *4.1. Degenerative Changes in Facial Expression Differences with Age*

The main findings show that the older adults have higher AU values than young people for neutral and negative emotions (i.e., angry and sad). A growing body of literature has demonstrated that aging is associated with dramatic reductions in muscle strength (i.e., dynapenia) and motor control [71–73]. With advancing age, neuromuscular changes may result in deficits in voluntary activation for facial activities [73,74]. In that regard, the facial expressions of older adults can naturally differ from those of younger adults [75].

Given that the cortex, spinal cord, and neuromuscular junction are functionally correlated and influence voluntary activation of muscle fibers [76], voluntary facial expressions can be explained by neurological evidence [77]. For older adults to make facial expressions as intended, therefore, it is necessary to use their brain in a top-down processing format to ensure that the commands from the brain are correctly delivered to the facial muscles. In addition to facial aging due to sarcopenia, this suggests that changes in the motor cortex with aging can cause changes in facial expressions in older adults [78,79].

Regarding the strong expression of negative emotions by the older adults in our results, age differences have been reported between older and young adults in discriminating negative emotions. A previous study demonstrated that older adults had more difficulty distinguishing low-intensity negative emotions [80]. They may tend to make exaggerated facial expressions because older adults themselves may not be able to identify low-intensity negative emotions.

Previous studies support the differences in AU intensity between the two groups. For upper facial expression, namely AU 06 and 07, the older adults can show greater intensity than the younger adults. Increased activity in the orbicularis oculi muscle [81], deep-set eyes [82], and changes in the eyelid due to poor visual acuity [83] may have affected the changes in upper facial expression. For lower facial expression (AU 10, 12, and 14), the measured intensity may have been further amplified by the more pronounced facial contours caused by loss of subcutaneous fill around the nose and mouth in the older adults [84]. In AU 45, the older adults instead showed reduced AU values compared with the younger people. Changes in the duration of eye blinks may explain this finding. Duration of eye blinking decreases with aging, apparently reflecting decreased intensities in AU 45 [85], since the deterioration of the orbicularis oculi muscle can affect the complete eye closure rate [86].

As for the adaptive LASSO, the result was similar to the comparisons between the two groups, except for AU 02, 05, and 20. The increase in AU 02 in the sad condition, as previously mentioned, may have resulted from increased activity in the eyebrow and strong expression of negative emotions [80,81]. For AU 05 in the surprise condition, the reduction of muscles involved in eye activity may have contributed to the weaker construction of surprise facial expressions [85,86]. For AU 20, aging may lead to the relaxation of the lip stretcher owing to decreased muscle around the mouth [17,87].

#### *4.2. Limitations and Future Direction*

There are several limitations in the current study. First, we employed only posed emotions. Given that the mechanisms of posed emotions and spontaneous facial expressions differ [88], further studies are needed to compare the two distinct types of facial expression. Second, we did not employ physiological assessment. The OpenFace software, unlike EMG, could not measure sensitive intensities in facial muscles at a physiological level. However, since the OpenFace library is based on FACS and provides reliable results along with recent technological advances, measurement errors are not likely to be a problem. In addition, a recent study on the difference between computer vision and EMG demonstrated only a few differences between the two techniques with respect to assessing overt facial expressions, and computer vision showed better performance than humans [89]. Third, the age groups were not continuous. Thus, future studies should be designed to provide normative data for facial aging with respect to demographics such as age and sex. Lastly, the class imbalance between the younger group and the older group can be a potential limitation of the current study. This issue may not be critical if the ratios between the two classes are not too different. An experimental study showed that low class imbalance ratios do not cause significant performance loss [90]; there, the performance loss at a class ratio of 40:60, which is similar to that of our study (Table 1), seemed to converge to nearly zero. Another study used metabolomics data and showed that the false positive ratio even decreases as the class imbalance ratio rises, owing to the prevention of over-selection in identifying biomarker features with the LASSO algorithm [91]. Despite these studies, our findings should be interpreted with caution.

Despite the above limitations, our study has the following strengths. Our findings regarding posed emotions, which require conscious effort of facial muscles, can be used as evidence for detecting individuals who deliberately deceive others, especially in lie detection [92]. In situations where biophysiological assessment is limited, computer vision-based face recognition tools would be beneficial. In a clinical setting, our findings can be used for detecting frailty and other senile changes in muscle. For computer vision-based facial recognition, our findings may also provide researchers with empirical evidence for the characteristics of the aging human face, which would help develop services and/or products for recognizing the faces of older adults. Notably, there have so far been few facial expression recognition studies that compare the characteristics of younger and older adults. Our findings can provide interpretable evidence and explainable features for aging faces. This could provide an important basis for CAD studies for older people in the future.

#### *4.3. Conclusions*

Taken together, the present study is the first to investigate the differences in posed facial expressions between older adults and younger adults using a computer analysis method. Our findings provide evidence for implications in facial expression intensity based on FACS-AU-derived emotional faces. The older adults expressed more intense expressions in neutral and negative emotions than younger adults and tended to use more muscles when they were making facial expressions. In some part of the facial expression, the older adults showed weaker intensity than the younger adults. Our findings may suggest that changes in the muscles around the eyes and mouth due to aging can be indicators of the characteristics for identifying the aging face. The results of this study were obtained quantitatively from a normal population, which has several strengths as compared with previous studies of facial expression based on EMG, 3D morphometry, or subjective rating. They can be used as a basic methodology for analyzing and for identification of the characteristics of facial aging. We hope that the various features of the posed emotions of the older adults in this study can be a significant contribution to other scientific fields with respect to facial expressions, such as criminological research using lie detection, behavioral medicine, and computer vision research based on facial recognition. Future studies are needed for investigating other attributes in facial expressions regarding dynamic emotions, natural environments, and diverse groups.

**Author Contributions:** J.-Y.L. and S.P. (Soowon Park) designed the study; S.P. (Soowon Park) and J.-Y.L. recruited participants and collected facial and clinical data; M.B., M.-G.S., G.N. and J.I. wrote the protocol and performed interpretation of data; H.K. and S.P. (Seho Park) contributed to facial behavioral data analyses and wrote the methodology; K.K. and H.K. undertook statistical data analyses; K.K. and H.K. wrote the manuscript. All authors have read and agreed to the published version of the manuscript.

**Funding:** This research was funded by the Ministry of Education through the National Research Foundation of Korea (NRF), grant number (NRF-2017R1D1A1A02018479).

**Institutional Review Board Statement:** This study was conducted in accordance with the Declaration of Helsinki and the protocol was approved by the Institutional Review Board of SMG-SNU Boramae Medical Center (IRB No. 30-2017-63).

**Informed Consent Statement:** Informed consent was obtained from all subjects involved in the study.

**Data Availability Statement:** The data presented in this study are available on request from the corresponding author.

**Acknowledgments:** We would like to thank the anonymous reviewers for their time and constructive comments.

**Conflicts of Interest:** The authors declare no conflict of interest.

#### **References**


## *Article* **The Analysis of Emotion Authenticity Based on Facial Micromovements**

**Sung Park <sup>1,</sup>\*, Seong Won Lee <sup>2</sup> and Mincheol Whang <sup>2</sup>**


**Abstract:** People tend to display fake expressions to conceal their true feelings. False expressions are observable by facial micromovements that occur for less than a second. Systems designed to recognize facial expressions (e.g., social robots, recognition systems for the blind, monitoring systems for drivers) may better understand the user's intent by identifying the authenticity of the expression. The present study investigated the characteristics of real and fake facial expressions of representative emotions (happiness, contentment, anger, and sadness) in a two-dimensional emotion model. Participants viewed a series of visual stimuli designed to induce real or fake emotions and were signaled to produce a facial expression at a set time. From the participant's expression data, feature variables (i.e., the degree and variance of movement, and vibration level) involving the facial micromovements at the onset of the expression were analyzed. The results indicated significant differences in the feature variables between the real and fake expression conditions. The differences varied according to facial regions as a function of emotions. This study provides appraisal criteria for identifying the authenticity of facial expressions that are applicable to future research and the design of emotion recognition systems.

**Keywords:** facial micromovement; emotion recognition; emotion authenticity

#### **1. Introduction**

Humans utilize both verbal and nonverbal communication channels. The latter category includes facial expressions, gestures, posture, gait, gaze, distance, and tone and manner of voice [1]. Facial expressions, which account for up to 30% of nonverbal expressions, are the most rapidly processed type of expression by visual recognition [2]. Facial expressions project the communicator's intentions and emotions [3]. However, people may conceal their true feelings and produce fake expressions [4]. Such false expressions are exhibited for a very short time with only subtle changes [5], and it is extremely difficult to detect their authenticity with eyesight [6]. Identifying fake expressions is paramount to counter deception and recognize users' true intent in advanced intelligent systems (e.g., social robots and assistive systems).

Early research involving facial expressions focused on establishing a quantitative classification framework to recognize emotions. Ekman built a facial action coding system (FACS), a computation system that encodes facial features' movements to taxonomize emotions from facial expressions. Analysis of facial expressions also spurred interest in the authenticity of expressions.

Researchers have found asymmetric intensity in facial expressions. Dopson revealed that the intensity of expressions in the left face was stronger than that in the right face in the case of voluntary expressions [7]. Conversely, the intensity was weaker than that in the right face in the case of involuntary expressions. These results suggest that the comparison of both sides may identify the authenticity of expressions. The sensitivity of left-face expressions is because facial movements are connected to the right hemisphere of the brain. Patients with right-brain injuries are reported to experience significant degradation in recognizing emotions from facial expressions compared with patients with left-brain injuries [8].

**Citation:** Park, S.; Lee, S.W.; Whang, M. The Analysis of Emotion Authenticity Based on Facial Micromovements. *Sensors* **2021**, *21*, 4616. https://doi.org/10.3390/s21134616

Academic Editor: Mario Munoz-Organero

Received: 15 April 2021 Accepted: 2 July 2021 Published: 5 July 2021

**Publisher's Note:** MDPI stays neutral with regard to jurisdictional claims in published maps and institutional affiliations.

**Copyright:** © 2021 by the authors. Licensee MDPI, Basel, Switzerland. This article is an open access article distributed under the terms and conditions of the Creative Commons Attribution (CC BY) license (https://creativecommons.org/licenses/by/4.0/).

Studies have also found differential activation of facial muscles between real and fake expressions. Duchenne experimented on facial muscular contractions with electrical probes to understand how the human face produces expressions [9]. He observed that participants produced a genuine smile with a unique contraction of the Orbicularis oculi muscle [10]. This "smiling with the eyes" is called the Duchenne smile, in his honor.

Ekman analyzed human false expressions and identified minute vibrations or spontaneous changes in the facial muscles responsible for emotional expression [11]. Such micromovements are observed in false (e.g., deception) or pretended (e.g., to be polite) expressions [5]. Facial micromovement is also called microexpression. Micromovement occurs with less than a second of movement and with vibration lasting between 0.04 and 0.5 s [12–14]. Simultaneously, in a typical interaction, an emotional expression begins and ends with a macroexpression that occurs in less than 4 s [15]. The degree of movement or the vibration of the facial muscles between real and fake expressions can be significantly different [11].

Recent advances in AI technology have led to research on identifying the authenticity of facial expressions using repetitive training with paired data of facial expressions and visual content (an image and a video clip) [16,17]. Microexpression recognition (MER) researchers have put massive effort into open innovation (e.g., facial microexpressions grand challenge [18,19]) to improve the state-of-the-art algorithm. Academic challenges include all aspects of MER sequences such as data collection, preprocessing (face detection and landmark detection), feature extraction, microexpression recognition, and emotion classification within the computer vision domain (for a comprehensive review, see [20] and [21]). Similar to other AI domains, convolutional neural networks (CNNs) have been used the most for MER [22]. A generative adversarial network (GAN), with a generator and an adversarial discriminator model, has been used for feature extraction [23] and facial image synthesis [24]. Most recently, extended local binary patterns on three orthogonal planes (ELBPTOP) were introduced to counter information loss and computational burden of the previous dominant descriptors, LBPTOP [25].

While researchers continue to pursue better algorithms to improve MER accuracy and reliability, in the most recent survey of facial microexpression analysis [20], Xie observed that MER literature on facial asymmetrical phenomena is scarce and limited. While researchers have found an asymmetric intensity in facial expressions, less is known regarding where in the facial region such microexpressions are the most salient and how they interact with different emotions. Specifically, feature variables (i.e., the degree and variance of movement, and vibration level) of emotions that are primarily expressed with the relaxation of facial muscles (e.g., contentment, sadness) may have weaker intensity in the real condition. Systematic research identifying reliable indicators of authenticity per facial region as a function of emotion is imperative.

Physiological data, including electrocardiogram (ECG), are powerful signals for emotion identification [26]. ECG correlates with the contraction of the heart muscles and varies as a function of emotion [27]. In order to achieve a deeper understanding of MER, facial vision data should be fused with cardio signals [20]. To the best of our knowledge, no research has combined the two.

In summary, the study hypothesized that (1) there is a significant difference in the micromovements at the onset of expression between real and fake conditions, and (2) such differences vary by representative emotions (happiness, sadness, contentment, anger). The findings were cross-validated with neurological measurements (ECG).

#### **2. Methods**

#### *2.1. Experiment Design*

The present study used a 2 × 4 within-subject design. The authenticity factor had two levels (real and fake), and the emotion factor had four levels (happiness, sadness, anger, and contentment). The visual stimulus consisted of a still photo and a video clip. The still photo depicted a facial expression of the target emotion. The video clip, which was shown after the still photo, was a recording that was designed to induce either the target emotion shown in the still photo or a neutral emotion.

The participants were then asked to produce a facial expression of the emotion that they felt while viewing the still photo. The real condition was manipulated by presenting the two materials, the still photo and the video clip, congruently. The fake condition was manipulated by having the video clip induce a neutral emotion. In this case, participants had to produce a facial expression based on the photo that they had viewed earlier. Had the video clip induced an emotion other than neutral, the participant's emotion could have been confounded, which would have made the measurements difficult to interpret. Every 30 s during the video, participants were signaled with a visual cue to produce a facial expression.

The dependent measurements involved micromovements in the face. That is, the average movement, standard deviation, and variance of the facial muscle movements were measured. Facial vibration was analyzed with the dominant frequency obtained by the fast Fourier transform (FFT).

#### *2.2. Participants*

Fifty university students were recruited as participants. The participants' average age was 22.5 years (SD = 2.13) with an even gender ratio. We selected participants with corrected visual acuity of 0.7 or above to ensure the participants' reliable recognition of visual stimuli. The participants were not allowed to wear glasses. All participants were briefed on the purpose and procedure of the experiment and signed a consent form. Participants were compensated with participation fees.

#### *2.3. Procedure and Materials*

Figure 1 illustrates the experimental procedure. Each participant's neutral facial expression was captured for 210 s before the main task. This was considered as the individual's reference expression. Participants were then exposed to eight combinations of visual stimuli—four sets (happiness, sadness, anger, and contentment) to elicit real emotions and four sets to elicit fake emotions. The order was randomized to counter order and learning effects. A set of stimuli consisted of a still photo and a video clip. A set used to induce real emotion had congruent emotions between the two materials. Conversely, a set to induce false emotions had inconsistent emotions between the two materials. In this case, the video clip induced a neutral emotion.

**Figure 1.** Experimental procedure.
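As an illustration of the stimulus sets described above, the following minimal Python sketch enumerates the eight authenticity-by-emotion combinations and shuffles their order. The function names, the dictionary layout, and the seeded shuffle are assumptions made for illustration; the study does not describe the software used for randomization.

```python
import itertools
import random

AUTHENTICITY = ("real", "fake")
EMOTIONS = ("happiness", "sadness", "anger", "contentment")


def randomized_conditions(seed=None):
    """Return the eight authenticity x emotion combinations in random order."""
    conditions = list(itertools.product(AUTHENTICITY, EMOTIONS))
    rng = random.Random(seed)  # a seed is used only to make the order repeatable
    rng.shuffle(conditions)
    return conditions


def stimulus_pair(authenticity, emotion):
    """The photo always shows the target emotion.

    The video matches the photo in the real condition and is neutral
    in the fake condition.
    """
    video = emotion if authenticity == "real" else "neutral"
    return {"photo": emotion, "video": video}


if __name__ == "__main__":
    for authenticity, emotion in randomized_conditions(seed=1):
        print(authenticity, stimulus_pair(authenticity, emotion))
```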

After viewing the visual stimuli, the participants were given a resting period of 60 s. During this period, participants reported their current emotional state with a subjective evaluation. Participants reported their (1) emotional state (happiness, sadness, anger, disgust, fear, surprise, and contentment), (2) degree of arousal, and (3) degree of pleasantness. The latter two were rated on a five-point Likert scale. For (1), we provided a comprehensive set of seven emotions to select from so that we could exclude data from participants who felt nothing or felt an emotion different from the target emotion. The exclusion was determined for each condition, even for neutral video clips, to eliminate any confounding factors from the data.

The participant's facial data were acquired using a webcam. A Logitech c920 webcam (Logitech, Lausanne, Switzerland) was used to obtain image data with a resolution of 1280 × 980 at 30 frames per second. To analyze the activation level of the autonomic nervous system (ANS) when participants were exposed to visual stimuli, participants' heart rate variability (HRV) and electrocardiogram (ECG) data were acquired. The latter was obtained through a Biopac (Biopac, Goleta, CA, USA) system with a frequency of 500 Hz.

Figure 2 shows the experimental setup. Participants were asked to sit and view the experiment monitor at a distance of 60 cm. A webcam, which acquired facial data from the participant, was placed atop the monitor.

**Figure 2.** Experimental setting.

#### *2.4. Statistical Analysis*

The present study compared the differences in the micromovement of facial expressions between real and fake emotions. From the participant's expression data, feature variables (i.e., the degree and variance of movement, as well as vibration level) obtained at 4 s (macromovement), 1 s, and 0.5 s (micromovements) after the onset (*t*) of facial expression were analyzed. For each representative emotion (happiness, contentment, anger, and sadness), a *t*-test was used to compare the differences between the feature variables in the two conditions (real and fake) for all 11 AUs responsible for emotional expression. The following section explains how the feature variables were extracted and how the ECG data were obtained.
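The comparison described above can be sketched as follows. Because the design is within-subject, the sketch assumes a paired *t*-test; the study states only that a *t*-test was used. The function name and the sample values are hypothetical, and the sketch returns only the test statistic and the degrees of freedom, not a *p*-value.

```python
import math
import statistics


def paired_t_statistic(real, fake):
    """Return (t, degrees_of_freedom) for two paired samples of equal length."""
    if len(real) != len(fake) or len(real) < 2:
        raise ValueError("need two equal-length samples with at least two pairs")
    diffs = [r - f for r, f in zip(real, fake)]
    mean_diff = statistics.mean(diffs)
    sd_diff = statistics.stdev(diffs)  # sample SD (n - 1 in the denominator)
    if sd_diff == 0:
        raise ValueError("the paired differences have zero variance")
    t = mean_diff / (sd_diff / math.sqrt(len(diffs)))
    return t, len(diffs) - 1


if __name__ == "__main__":
    # Hypothetical average-movement values of one AU for four participants.
    real = [2.0, 3.0, 5.0, 6.0]
    fake = [1.0, 1.0, 2.0, 3.0]
    print(paired_t_statistic(real, fake))
```

In this sketch, one call would be made per AU, feature variable, and time segment, for each of the four emotions.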

#### **3. Analysis**

Figure 3 outlines the analysis process. To analyze the data, we established an operational definition of facial expression muscles and extracted facial movement data for such muscles. In total, 40 datasets were analyzed; participants who had excessive facial movements or participants who did not display emotion were excluded. That is, the experimenter screened each recorded video clip and excluded participants who had turned their faces, clearly looking at an object outside of the screen, or when the system had failed to track their faces. To minimize the exclusion, we had instructed the participants to reduce the facial movement and look straight ahead.

The expression onset segment was defined (step 2 in Figure 3) based on the threshold of facial movement. Feature variables were then extracted by comparing the rate of change in action units (AUs) between data frames. The effective feature variables were selected by comparing the feature variables of real and fake expressions for each emotion.

**Figure 3.** Data analysis process.

#### *3.1. Operational Definition of Facial Muscles*

The present study recognized the participants' emotions by identifying the activation of anatomical regions that represent a particular emotion. The AUs were extracted using facial landmarks through a Python program. Figure 4 depicts the extraction process.

**Figure 4.** Facial muscle extraction process.

Each frame obtained from the webcam was analyzed. First, the location of the face in the image was identified using a face detection model, the Haar cascade classifier [28]. Face detection models extract the target object's features from the data and compare them with the features from pretrained data to identify the object. Specifically, the present system used Haar-like features to detect the region of the face (region of interest, ROI) by identifying the location of the nose and eyes. The system then identified 68 facial landmarks by tracking the eyes, eyebrows, nose, lips, and chin line using the Dlib library [29], which was trained with a massive quantity of data. The facial muscles that differ by facial expression were predefined and used to extract 11 muscle areas (i.e., coordinates). That is, eleven facial muscle units (AUs) involving the brow, eyes, cheeks, chin, and lips responsible for facial expressions were predefined and extracted from the participant's dataset (see Table 1). Figure 5 visualizes the relative locations of the action units.


**Table 1.** Action Unit Definitions.

**Figure 5.** The relative locations of action units.

These 11 AUs are the centroid values of the three corresponding facial landmarks, computed as follows:

$$A(x_1, y_1),\; B(x_2, y_2),\; C(x_3, y_3), \qquad D\left(\frac{x_1 + x_2 + x_3}{3}, \frac{y_1 + y_2 + y_3}{3}\right)$$
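The centroid computation can be illustrated with a short Python sketch. The function name and the sample coordinates are hypothetical; which three landmarks define each AU is given in Table 1 and is not reproduced here.

```python
def au_centroid(a, b, c):
    """Centroid D of three (x, y) landmark points A, B, and C."""
    (x1, y1), (x2, y2), (x3, y3) = a, b, c
    return ((x1 + x2 + x3) / 3, (y1 + y2 + y3) / 3)


if __name__ == "__main__":
    # Hypothetical pixel coordinates of three landmarks.
    print(au_centroid((0, 0), (3, 0), (0, 6)))
```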

For further analysis, facial data from the last 30 s were extracted and analyzed. That is, we defined the first 180 s as time for the visual content to sufficiently "sink in" for the participants.

#### *3.2. Feature Variable Extraction*

To extract feature variables involving facial micromovement, we developed a micromovement extraction program in LabVIEW 2016 for massive data processing. From the last 30 s of the participant's dataset, the 11 AUs (Table 1) were calculated. To determine the onset of a facial expression, a threshold on the average movement of an AU was computed using the following min-max algorithm. The micromovement section before the onset was extracted.

$$Threshold = \frac{(Max + Min)}{2}$$
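A minimal Python sketch of this threshold follows. It assumes that the onset is the first frame whose movement value exceeds the threshold; the study does not state the exact crossing rule, and the function name and sample values are hypothetical.

```python
def onset_index(movement):
    """Index of the first frame whose movement exceeds (max + min) / 2.

    Returns None if the sequence is empty or never exceeds the threshold.
    """
    if not movement:
        return None
    threshold = (max(movement) + min(movement)) / 2
    for i, value in enumerate(movement):
        if value > threshold:  # assumed crossing rule: strictly above threshold
            return i
    return None


if __name__ == "__main__":
    # Hypothetical per-frame movement of one AU.
    print(onset_index([0.1, 0.2, 0.1, 0.9, 1.1, 0.3]))
```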

The expression section after the onset consisted of one macromovement section (4 s) and two micromovement sections (1 s, 0.5 s). These three sections may overlap. The movement data from the three sections were extracted. That is, the degree of change (delta) in the coordinates of an AU between the current and previous frames was computed as follows, in order to analyze the degree of facial vibration.

$$\begin{array}{l} x_n = \mathit{prevAU}[n].x - \mathit{curvAU}[n].x \\ y_n = \mathit{prevAU}[n].y - \mathit{curvAU}[n].y \end{array}$$
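The section boundaries and the frame-to-frame delta can be sketched in Python as follows. The sketch assumes the 30 frames per second reported for the webcam and represents one AU as a list of (x, y) centroids, one per frame. The function names are hypothetical.

```python
def section_bounds(onset, fps=30, durations=(0.5, 1.0, 4.0)):
    """Map each section length in seconds to its (start, end) frame indices.

    All sections start at the onset frame, so they overlap.
    """
    return {d: (onset, onset + int(round(d * fps))) for d in durations}


def frame_deltas(points):
    """Per-frame (dx, dy), computed as previous minus current coordinates."""
    return [(px - cx, py - cy)
            for (px, py), (cx, cy) in zip(points, points[1:])]


if __name__ == "__main__":
    print(section_bounds(3))
    print(frame_deltas([(0, 0), (1, 2), (3, 1)]))
```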

Finally, we extracted feature values by analyzing the delta value. That is, the average and standard deviation of the delta and FFT values were extracted. The former two were used to analyze the degree and variance of the change. The latter was used to analyze the degree of facial vibration through the dominant frequency obtained by the FFT.
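The three feature values can be sketched in Python as below. The sketch takes a one-dimensional sequence of delta values; how the x and y deltas were combined is not stated in the study, so that step is left to the caller. A naive discrete Fourier transform stands in for the FFT to keep the example dependency-free, and the dominant frequency is taken as the non-DC bin with the largest magnitude, which is an assumption. All function names are hypothetical.

```python
import cmath
import math
import statistics


def dominant_frequency(signal, fps=30.0):
    """Frequency (Hz) of the largest non-DC magnitude in a naive DFT.

    Returns 0.0 when the mean-removed signal is identically zero.
    """
    n = len(signal)
    if n < 2:
        raise ValueError("need at least two samples")
    mean = sum(signal) / n
    centered = [s - mean for s in signal]  # remove the DC component
    best_k, best_mag = 0, 0.0
    for k in range(1, n // 2 + 1):  # bins up to the Nyquist frequency
        coeff = sum(v * cmath.exp(-2j * math.pi * k * i / n)
                    for i, v in enumerate(centered))
        if abs(coeff) > best_mag:
            best_k, best_mag = k, abs(coeff)
    return best_k * fps / n


def feature_values(deltas, fps=30.0):
    """Average, sample standard deviation, and dominant frequency of deltas."""
    return {
        "average": statistics.mean(deltas),
        "sd": statistics.stdev(deltas),
        "dominant_frequency": dominant_frequency(deltas, fps),
    }


if __name__ == "__main__":
    # Synthetic 5 Hz oscillation sampled for 1 s at 30 frames per second.
    wave = [math.sin(2 * math.pi * 5 * i / 30) for i in range(30)]
    print(feature_values(wave))
```

With a 0.5 s section at 30 frames per second there are only 15 samples, so the frequency resolution in this sketch is 2 Hz per bin.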

#### *3.3. Heart Rate Variability Analysis*

In addition to the facial data, ECG data were measured while the visual stimuli were shown for 210 s. The participants' time-series data were transformed into the frequency domain using FFT. This enabled measurement of the ANS responses of participants exposed to emotion-inducing stimuli [30,31]. Table 2 outlines the HRV variables used in this study. To measure the change in the serial heart rate data, a 180-s sliding window was used.
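As a rough illustration of frequency-domain HRV variables, the Python sketch below sums spectral power in the very low (VLF), low (LF), and high (HF) frequency bands and expresses each as a percentage. The band edges are the conventional ones (0.0033 to 0.04 Hz, 0.04 to 0.15 Hz, and 0.15 to 0.40 Hz) and are an assumption here, as is taking the percentage relative to the sum of the three bands; the exact definitions used in the study are those in Table 2. The function name and sample spectrum are hypothetical.

```python
# Conventional HRV band edges in Hz (assumed, not taken from the study).
BANDS_HZ = {"VLF": (0.0033, 0.04), "LF": (0.04, 0.15), "HF": (0.15, 0.40)}


def band_powers(freqs, power):
    """Sum spectral power per band and add each band's share in percent."""
    absolute = {name: 0.0 for name in BANDS_HZ}
    for f, p in zip(freqs, power):
        for name, (lo, hi) in BANDS_HZ.items():
            if lo <= f < hi:
                absolute[name] += p
    total = sum(absolute.values())
    percent = {
        f"{name} (%)": (100.0 * value / total if total else 0.0)
        for name, value in absolute.items()
    }
    return {**absolute, **percent}


if __name__ == "__main__":
    # Hypothetical power spectrum sampled at four frequencies.
    print(band_powers([0.01, 0.1, 0.2, 0.3], [1.0, 2.0, 0.5, 0.5]))
```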


**Table 2.** Heart Rate Variability Variables.

#### **4. Results**

The current study analyzed changes in facial micromovements between real and fake expressions of representative emotions. A *t*-test was used to compare the differences between the participants' facial expressions in the two conditions. The feature variables obtained at 4 s (macromovement), 1 s, and 0.5 s (micromovements) after the onset (*t*) of facial expression were analyzed.

Figure 6 shows the template used to visualize the results. The blank squares on the right indicate the 11 AUs (Table 1) representing the facial muscles responsible for facial expressions. The statistical differences between the real and fake conditions are color-coded in Figures 7, 9, 11 and 13 at three levels: *p* < 0.001: \*\*\* ; *p* < 0.01: \*\* ; *p* < 0.05: \* .

**Figure 6.** (**a**) Eleven AU regions (red dots) for feature variable extraction; (**b**) visualization framework for reporting the results.

In the HRV analysis, we compared the difference in ANS activation between the real and fake conditions.

#### *4.1. Authenticity of Happiness*

The results of the analysis of micromovement involving expressions of happiness are as follows. Figure 7 depicts the differential movement of the facial regions between the two conditions through the visualization of a face. All 11 AUs had at least one significant difference in the dependent variables (dominant peak frequency, average, and standard deviation of movement).

**Figure 7.** Statistical differences between real and fake happiness expressions (AVG = Average, SD = Standard Deviation).

The average at t + 0.5 (0.5 s after the onset) showed a significant difference in all AUs, whereas only partially significant differences appeared at t + 1, mostly in the left face. This implies that expressions of happiness may be most prominent in the early stage (0.5 s) of a microexpression but persist until t + 1 in the left face. Further regression analysis on average movement showed that the time segment factor enters the regression equation (R<sup>2</sup> = 0.97), *p* < 0.001, along with the authenticity factor, *p* < 0.05.

However, for the standard deviation, the values at t + 1 significantly differed in all 11 AUs. The dominant peak frequency also showed a significant difference at t + 1 in all AUs. The dominant peak frequency at t + 0.5 showed a significant difference in the lips, left eyebrows, and brow.

Figure 8 presents a statistical comparison between dependent variables for each AU, collapsing data from the three sections (t + 0.5, t + 1, and t + 4). The measured values were higher in real expressions in almost all regions.

**Figure 8.** Comparison between feature variables of happiness expressions.

#### *4.2. Authenticity of Contentment*

The results of the analysis of micromovements involving expressions of contentment are as follows. Figure 9 depicts the differences in the movement of facial regions between the two conditions.

**Figure 9.** Statistical differences between real and fake contentment expressions (AVG = Average, SD = Standard Deviation).

At t + 1, except for the left eyelid, all 10 AUs were found to have a significant average difference. Similar results were observed for the standard deviation. At t + 0.5, nine AUs were reported to have a significant average difference. This indicates that the microexpression of contentment, compared to happiness, may persist longer. Further regression analysis on average movement showed that the time segment factor enters the regression equation (R<sup>2</sup> = 0.97), *p* < 0.001, along with the authenticity factor, *p* < 0.001 and the face side factor, *p* < 0.001.

The vibration of the macromovement (dominant peak frequency at t + 4) was significantly different in many facial regions, including the mouth tail and eyelid of the right side and the eye tail, eyelid, and mouth tail of the left side. Similar results were observed for the standard deviation in the same regions.

As shown in Figure 10, similar to the happiness condition, the average was significantly higher in the real condition, but the dominant peak frequency was significantly higher in the fake condition. That is, there was more facial movement in the real condition but more facial vibration in the fake condition.

**Figure 10.** Comparison between feature variables of contentment expressions.

#### *4.3. Authenticity of Anger*

The results of the analysis of micromovement involving expressions of anger are as follows. Figure 11 depicts the differences in the movement of facial regions between the two conditions.

Similar to the results in the happiness condition, micromovements at t + 0.5 had a statistical difference in all regions, 11 of them at *p* < 0.001. Unlike with happiness, however, the differential micromovements of anger persisted through t + 1, except in two of the facial regions. Further regression analysis on average movement showed that the time segment factor entered the regression equation (R<sup>2</sup> = 0.96), *p* < 0.001, along with the authenticity factor, *p* < 0.001 and the face side factor, *p* < 0.05.

**Figure 11.** Statistical differences between real and fake anger expressions (AVG = Average, SD = Standard Deviation).

A significant difference in dominant peak frequency was found in all regions except for the right eye tail and the left eyelid in all time segments.

As shown in Figure 12, similar to the happiness condition but unlike the contentment condition, all three measurements were higher in the real condition than in the fake condition.

**Figure 12.** Comparison between feature variables of anger expression.

#### *4.4. Authenticity of Sadness*

The results of the analysis of micromovement involving the expression of sadness are as follows. Figure 13 depicts the differential movement of the facial regions between the two conditions.

**Figure 13.** Statistical differences between real and fake sadness expression (AVG = Average, SD = Standard Deviation).

The significant differences were not dominant in all facial regions compared to other emotion conditions, but instead concentrated on the left side of the face. Specifically, similar results were found in the micromovements (t + 1 and t + 0.5) in the left eyelid and mouth tail. Further regression analysis on average movement showed that the face side (left or right) factor enters the regression equation (R<sup>2</sup> = 0.96), *p* < 0.001, along with the authenticity factor, *p* < 0.001 and the time segment factor, *p* < 0.001.

A significant difference was found in the mouth region in all segments with respect to the dominant peak frequency. However, the difference in vibration was prominent and salient at t + 4 and t + 0.5.

As shown in Figure 14, when the data are collapsed, similar to the contentment condition, the standard deviation and the dominant peak frequency were higher in the fake condition than in the real condition.

**Figure 14.** Comparison between feature variables of sadness expressions.

#### *4.5. Analysis of Heart Rate Variability*

The HRV data of the fake condition were compared to those of the real condition of the three frequency bands (very low, low, and high) (see Figure 15). This was performed to compare the ANS response, independent of emotions. Except for the LF (%) variable, a significant difference was found in all variables (*p* < 0.001). Specifically, VLF and VLF (%) were higher in the real condition than in the fake condition. Conversely, HF and HF (%) were higher in the fake condition than in the real condition. LF was significantly higher in the fake condition.

**Figure 15.** Comparison of frequency domain.

#### **5. Conclusions and Discussion**

The present study compared the differences in the micromovement of facial expressions between real and fake emotions. The study utilized 11 AUs based on anatomical muscle location responsible for emotional expression. That is, we identified the difference in the feature variables (average and standard deviation of movement, as well as dominant peak frequency) between the real and fake conditions by facial regions for each representative emotion (happiness, contentment, anger, and sadness). In conclusion, the study showed that the degree of activation is higher if the expression is authentic, implying more micromovement.

The study analyzed the feature variables in three time segments (0.5, 1, and 4 s) after the onset (*t*) of facial expression for each representative emotion. Results indicated that micromovements are more informative at an early stage (less than a second) of expression. In the case of t + 1 and t + 0.5, significant differences between the real and fake conditions were observed more in the left face than in the right in the happiness condition. The asymmetric difference in the activation of the face can be explained by activation of the right brain region [32]. Campbell found that the left face expresses more than the right in voluntary expressions. Conversely, the left face expresses less than the right in involuntary expressions [33]. In the anger condition, compared with other emotions, the brow had the highest number of feature variables that were significantly different between the real and fake conditions. This was a result of muscle movement from the participant's frowning.

At t + 4, compared to the time segments in which less than a second had elapsed, fewer statistically significant differences were observed between the two conditions for all four emotions. This suggests that measurements at t + 4 cannot reliably capture the differential micromovements between real and fake expressions. The data at t + 4 also include the macromovements of facial muscles and hence may not be sensitive enough to identify abrupt changes in facial movements (i.e., micromovements).

Collapsing the data across time segments, all three feature variables (average, standard deviation, and dominant peak frequency) of the real condition were significantly higher than those of the fake condition in the happiness and anger conditions. Conversely, in the contentment and sadness conditions, the standard deviation and dominant peak frequency of the fake condition were significantly higher. That is, emotions that are primarily expressed with the relaxation of facial muscles, such as contentment and sadness, were observed with weaker intensity in the real condition. The results support the hypothesis that the degree of expression differs between the real and fake conditions as a function of emotions.

Our findings were cross-validated with neurological measurements involving the parasympathetic nervous system (PSNS) and the sympathetic nervous system (SNS), the two branches of the ANS. In the HRV analysis, both HF and HF (%), indicators for the PSNS, were higher in the fake condition than in the real condition. Conversely, both VLF and VLF (%), indicators for the SNS, were higher in the real condition than in the fake condition. LF (%), an indicator that involves both the PSNS and SNS, did not show a significant difference. In conclusion, the stimuli in the real condition led to the activation of the SNS, which implies an increase in the participant's arousal. In addition, the stimuli in the fake condition led to the activation of the PSNS, which implies the participant's relaxation.

The study acknowledges the individual variance in participants' emotions when they were exposed to visual stimuli. To minimize this difference, a target facial expression was provided. In the fake condition, to ensure that other emotions did not interfere, visual content inducing a neutral emotion was used. That is, participants had to pretend an expression while the stimuli conveyed neutrality. We acknowledge the limitations of this experimental design, which may lower the ecological validity. However, future studies may investigate when a real emotion is replaced by another emotion and study the change in microexpressions.

Follow-up studies may introduce experimental treatments that are congruent with real-world settings. Specifically, micromovements of expressions in complex emotions merit further analysis. In addition, the study was limited to four representative emotions. Although not related to emotion authenticity, Adegun and Vadapalli analyzed microexpressions to recognize seven universal emotions with machine learning [34].

Another limitation of the study involves facial landmark detection. Proper landmark detection is necessary to secure recognition accuracy [20]. We have identified 68 facial landmarks by tracking the eye, eyebrows, nose, lips, and chin line using the Dlib library [29]. However, recent state-of-the-art methods, including tweaked convolutional neural networks (TCNN), may improve the robustness of facial landmark detection [35].

The breakdown of feature variables may be used as an appraisal criterion to authenticate facial data with emotional expressions. This study identified that data at less than one second is critical for analysis of the authenticity of an expression, which may not be reportable by the participants.

Systems capable of recognizing human emotions (e.g., social robots, recognition systems for the blind, monitoring systems for drivers) may use the authenticity of the user's facial expression to provide a useful and practical response. Recognizing fake expressions is imperative in security interfaces and systems that counter crime. For a social robot to provide effective services, identifying the user's intent is paramount. A recent human-robot interaction study applied deep neural networks to recognize a user's facial expressions in real time [36]. Further recognition of the user's false (e.g., deception) or pretended (e.g., to be polite) expressions might introduce more social, rich, and effective interactions.

**Supplementary Materials:** The following are available online at https://www.mdpi.com/article/10.3390/s21134616/s1, Python Code\_AU Extraction.

**Author Contributions:** S.P.: methodology, validation, formal analysis, investigation, writing, review, editing; S.W.L.: conceptualization, methodology, software, validation, formal analysis, investigation, resources, data curation, writing, visualization, project administration; M.W.: conceptualization, methodology, writing, review, supervision, funding acquisition. All authors have read and agreed to the published version of the manuscript.

**Funding:** This work was supported by the National Research Foundation of Korea (NRF) grant funded by the Korean government (MSIT) (NRF-2020R1A2B5B02002770).

**Institutional Review Board Statement:** The study was conducted according to the guidelines of the Declaration of Helsinki, and approved by the Institutional Review Board of Sangmyung University (protocol code BE2018-31, approved on 10 August 2018).

**Informed Consent Statement:** Informed consent was obtained from all subjects involved in the study. Written informed consent has been obtained from the subjects to publish this paper.

**Conflicts of Interest:** The authors declare no conflict of interest.

#### **References**


## *Article* **Multi-Path and Group-Loss-Based Network for Speech Emotion Recognition in Multi-Domain Datasets**

**Kyoung Ju Noh \*, Chi Yoon Jeong, Jiyoun Lim, Seungeun Chung, Gague Kim, Jeong Mook Lim and Hyuntae Jeong**

Artificial Intelligence Research Laboratory, Electronics and Telecommunications Research Institute, Daejeon 34129, Korea; iamready@etri.re.kr (C.Y.J.); kusses@etri.re.kr (J.L.); schung@etri.re.kr (S.C.); ggkim@etri.re.kr (G.K.); jmlim21@etri.re.kr (J.M.L.); htjeong@etri.re.kr (H.J.) **\*** Correspondence: kjnoh@etri.re.kr; Tel.: +82-42-860-1764

**Abstract:** Speech emotion recognition (SER) is a natural method of recognizing individual emotions in everyday life. To distribute SER models to real-world applications, some key challenges must be overcome, such as the lack of datasets tagged with emotion labels and the weak generalization of the SER model for an unseen target domain. This study proposes a multi-path and group-loss-based network (MPGLN) for SER to support multi-domain adaptation. The proposed model includes a bidirectional long short-term memory-based temporal feature generator and a transferred feature extractor from the pre-trained VGG-like audio classification model (VGGish), and it learns simultaneously based on multiple losses according to the association of emotion labels in the discrete and dimensional models. For the evaluation of the MPGLN SER as applied to multi-cultural domain datasets, the Korean Emotional Speech Database (KESD), including KESDy18 and KESDy19, is constructed, and the English-speaking Interactive Emotional Dyadic Motion Capture database (IEMOCAP) is used. The evaluation of multi-domain adaptation and domain generalization showed 3.7% and 3.5% improvements, respectively, of the F1 score when comparing the performance of MPGLN SER with a baseline SER model that uses a temporal feature generator. We show that the MPGLN SER efficiently supports multi-domain adaptation and reinforces model generalization.

**Keywords:** speech emotion recognition; domain adaptation; SER generalization; Korean Emotional Speech Database; ensemble model; multi-path; group-loss; BLSTM network

#### **1. Introduction**

Human speech is a natural communication method in human–computer interaction (HCI) and human–robot interaction (HRI). Speech emotion recognition (SER), which is based on natural human language, is a key method used to recognize individual emotions in everyday speech. SER uses the acoustic features of a speech segment, not the lexical features having the semantic information of the segment [1]. Hence, it recognizes subjects' emotions from "how" they speak rather than the content of their words. The predicted emotional context of a target speaker can then be used as an important factor for decision making in intelligent HCI and HRI services [2,3].

Prior to deploying SER models in real applications, the lack of SER databases tagged with emotion labels must be addressed, because they are not sufficient for training deep-SER models. Another challenge is the limited generality of the SER model, owing to the high variability of the acoustic signals of the emotional speech samples.

Emotions have characteristics of high subjectivity and diversity, depending on the individual or culture. Therefore, it is time-consuming and expensive to build a large-scale emotional database annotated with reliable gold-standard emotion labels via human observation. Most SER datasets having gold-standard labels contain thousands of speech samples collected from a limited number of speakers in a specific environment [4–7]. Therefore, the performance of an SER model trained on single-domain samples is inherently degraded when applied to unseen domain samples that reflect different languages, cultures, speakers, genders, microphone types, positions, and signal-to-noise ratios [8–10]. This study defines a single SER domain as a dataset collected using one collection procedure at one place with the same collection device.

**Citation:** Noh, K.J.; Jeong, C.Y.; Lim, J.; Chung, S.; Kim, G.; Lim, J.M.; Jeong, H. Multi-Path and Group-Loss-Based Network for Speech Emotion Recognition in Multi-Domain Datasets. *Sensors* **2021**, *21*, 1579. https://doi.org/10.3390/s21051579

Academic Editor: Raffaele Gravina

Received: 7 January 2021 Accepted: 21 February 2021 Published: 24 February 2021

**Publisher's Note:** MDPI stays neutral with regard to jurisdictional claims in published maps and institutional affiliations.

**Copyright:** © 2021 by the authors. Licensee MDPI, Basel, Switzerland. This article is an open access article distributed under the terms and conditions of the Creative Commons Attribution (CC BY) license (https://creativecommons.org/licenses/by/4.0/).

Many studies have effectively utilized limited emotion databases to improve the SER performance. In addition to the typical augmentation methods of speech samples [11,12], there exists a domain adaptation method that utilizes speech datasets already established in the unknown target domain [8–10,13–16]. In comparison with the results of data augmentation in a single domain, it is difficult to guarantee good performance because of the high variability of the acoustic features of the emotional speech samples in the domain [8–10,13,14]. However, domain adaptation based on multi-domain datasets can be used to construct better SER models to support such generalities without overfitting.

We propose a multi-path and group-loss-based network (MPGLN) for SER, which supports supervised domain adaptation in multi-domain datasets acquired from multiple environments. The proposed MPGLN for SER (MPGLN SER) is based on an ensemble learning structure for multi-level embedding vector learning for speech segments. It includes a temporal embedding feature generator, a transferred feature extractor, and a prediction function network that classifies the emotion labels based on the generated and extracted feature vectors. The bidirectional long short-term memory (BLSTM)-based temporal feature generator network learns an embedding vector from a 74-D input of handcrafted low-level descriptors (LLDs) of a speech segment. The transferred feature extractor creates feature vectors from the pre-trained VGG-like audio classification model (VGGish) [17], and the proposed MPGLN SER is trained based on multiple losses by the association between the discrete and continuous dimensional emotion labels [1] of the multi-domain samples.

The proposed MPGLN SER is evaluated over five multi-domain SER datasets: the benchmark English Interactive Emotional Dyadic Motion Capture (IEMOCAP) dataset [7], which was widely used in previous studies for SER model evaluation, and the four Korean Emotional Speech Database (KESD) datasets that are built for this study.

In our evaluation, we use an SER model comprising a BLSTM-based temporal feature generator and the MPGLN predicting network, excluding transferred features, as our baseline model. We then verify the reliability of the baseline SER model using the IEMOCAP dataset. Comparing it with the performance of the baseline SER model, it is confirmed that the proposed MPGLN SER is effective in supporting supervised multi-domain adaptations and reinforcing generalizations [18] of the SER model in multi-domain datasets.

This paper is organized as follows. In Section 2, we present a brief overview of related SER and domain adaptation works. Section 3 describes the proposed MPGLN, which supports multi-domain adaptation of SER in multi-domain datasets. Section 4 details the evaluation results of the MPGLN SER, and Section 5 concludes this study and suggests future works.

#### **2. Related Works**

Recent SER models based on deep-learning architectures [19–30] have demonstrated state-of-the-art performance with an attention mechanism [19,20,22,23,25,26]. The deep-learning architectures adopted in previous studies included recurrent neural networks (RNN) [19], convolutional neural networks (CNN) [24], and convolutional RNNs (CRNN) [20,26]. Liu et al. [21] presented an SER model of a decision tree for an extreme learning machine having a single hidden-layer feed-forward neural network, using a mixture of deep learning and typical classification techniques.

The input features for deep-learning-based SER models are generally extracted from the time or spectrum axis in units of speech segments or frames. There are various LLDs and high-level statistical functions of the LLD single features [19,20,31–33]. The spectrum LLD features of speech signals include logMel filter-banks and mel-frequency cepstral coefficients (MFCC). Zero-crossing rates and signal energies are representative time-domain features [27–30], whereas spectral roll-off and spectral centroid are classified as spectral parameters [33]. A set of multiple single features for acoustic signal processing, such as the extended Geneva Minimalistic Acoustic Parameter Set [34] and the INTERSPEECH 2010 Paralinguistic Challenge (IS10) dataset [35], is now accessible from open-source frameworks, such as OpenSmile [36]. Some studies have investigated the mechanism of modeling and integrating of temporal acoustic features to improve the performance of speech emotion recognition or audio classification [31,32]. Jing et al. [37] presented an evaluation of multiple acoustic feature sets that combined features generated from the pre-trained acoustic model [15,17,38,39].

A typical deep-learning model requires large-scale samples for training. Unfortunately, SER datasets annotated with emotion labels are scarce. Furthermore, collecting SER speech samples and tagging them with emotion labels is time-consuming and expensive. Thus, to overcome the limitations of volume and diversity of labeled speech samples for deep-learning SER models, studies have been performed using data augmentation [11,12,40–42], active learning [12,43] based on collected datasets, and domain adaptation [8–10,13–16] to adapt the existing SER datasets to the target domains.

Park et al. [11] presented a data augmentation experiment for speech samples using warping and masking in a frequency channel with a time step. Chatziagapi et al. [40] proposed a method that used generative adversarial networks [44] to extract artificial spectrograms of augmented data to balance each emotion class.

Active-learning methods have been used to present greedy selection methods of speech samples to construct an initial SER model suitable for a target speaker based on limited samples [12,43]. Abdelwahab et al. [43] proposed the active learning of greedy sampling to select the most informative samples to improve the performance of DNN-based SER models. In a study by Bang et al. [12], samples that were close to the target speaker's samples in the embedding space were selected; the synthetic minority oversampling technique was applied to increase the number of samples of the minority class.

Domain adaptation techniques are actively being studied in the field of visual classification [18,45]. Metric-based learning is a representative method of learning distances containing the features of inter-domain and -class samples to minimize domain mismatches between the source and target domains. Gao et al. [46] proposed an acoustic model based on ResNet [47] for acoustic scene classification; its learning process is such that it is difficult to distinguish the domain to which a sample belongs.

The domain adaptation for SER models based on multi-domain datasets has the purpose of building an SER model that is not overfitted to a specific dataset and is generalized for unknown target-domain speech data. However, the SER model based on multi-domain datasets has a different applicability from the case that applies data augmentation by oversampling a single domain dataset. It does not guarantee the SER performance improvement, even if several multi-domain speech samples are used to train the SER model, because there is high domain discrepancy in the speech signal, which depends on the collection environments [8–10,13,14].

Liang et al. [9] proposed a structure that learned emotion-salient features based on audio and video data through an adversarial learning framework, generating embedding features for the purpose of reducing domain discrepancies. Huang et al. [13] presented a network model that aligned the distribution shift in the intermediate feature space between the source and target domains. Neumann et al. [14] introduced an adaptive technique to fine-tune the weights of SER neural networks trained in the source domain using a small number of samples from the target. By using the transferred features from the pre-trained model, Li et al. [15] demonstrated improvements in the SER performance using additional embedding vectors extracted from the pretrained VGGish in AudioSet [48]. Lee et al. [16] presented the generalization effect of emotion recognition by applying dropout and normalization methods in multilingual heterogeneous datasets.

#### **3. Ensemble Learning Model for SER in Multi-Domain Datasets**

We propose an ensemble learning model to improve the performance of SER generalization in multi-domain datasets. The operational flow of the supervised multi-domain adaptation of the proposed MPGLN SER is shown in Figure 1. We denote speech-input samples and class-label spaces as $X$ and $Y$, respectively, and the domain datasets are $D = \{D_1, D_2, \ldots, D_k\}$. This study assumes a supervised learning environment wherein each domain sample has common emotion labels. In this study, each domain dataset consists of pairs $D_k = \{(X_i^k, (y_{i\_d}^k, y_{i\_v}^k))\}_{i=1}^{N_k}$, where $N_k$ is the number of speech samples of the $k$-th domain dataset, and each speech sample in the datasets has multiple $Y$ labels. The discrete emotion label is $y_{i\_d}^k$ (e.g., "happy" and "sad"), and the valence-level label in the continuous dimensional emotion model is $y_{i\_v}^k$.

**Figure 1.** Supervised multi-domain adaptation of the multi-path and group-loss-based network (MPGLN) speech emotion recognition (SER). The model generates the temporal embedding feature and the transferred embedding feature for the speech segment and learns based on multiple losses.

The source-domain dataset used for model training is domain $D_s$, and the domain to which test samples to be predicted belong is the target domain, $D_t$. There are variant shifts and domain discrepancies between the feature distributions, $d(X^S)$ and $d(X^T)$, of data samples of the different domain datasets, $D_s$ and $D_t$, respectively [45].

The goal of the SER model is to learn the classifier function, *f* : *X* → *Y*, in the target domain. Function *f* consists of the composition of two functions, *f* = *h* ◦ *g*, where *g* is an embedding feature generator from the input data space, X, to an embedding feature space, and *h* is the function used to predict the embedding feature to label-space Y.

Figure 2 shows the architecture of the proposed MPGLN SER, which generates the multi-level embedding vectors from the multi-path generators. The BLSTM-based feature generator, *gBLSTM*, generates a temporal embedding vector, and the transferred feature extractor, *gvgg*, extracts a transferred embedding vector from the pre-trained VGGish model [17].

In the prediction function, *h*, of the proposed ensemble structure, discrete emotional labels are classified based on the fusion of multi-path embedding vectors from *gBLSTM* and *gvgg*. It also includes a dimensional valence-level classification function based on the temporal embedding feature generated by *gBLSTM*.
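The data flow of the two prediction paths can be sketched with plain Python lists. This is only a shape-level illustration under stated assumptions: the fusion of the two embeddings is assumed to be a concatenation, each prediction head is reduced to a single bias-free linear layer with random weights, and the 64-D embeddings, four discrete classes, and three valence levels are taken from later parts of this article. The helper names `linear` and `random_weights` are hypothetical.

```python
import random

def linear(vec, weights):
    """Apply a bias-free linear layer: one output per row of weights."""
    return [sum(w * v for w, v in zip(row, vec)) for row in weights]

def random_weights(rows, cols, rng):
    """Hypothetical stand-in for trained layer weights."""
    return [[rng.uniform(-0.1, 0.1) for _ in range(cols)] for _ in range(rows)]

rng = random.Random(0)
temporal = [rng.random() for _ in range(64)]     # stand-in for the g_BLSTM output
transferred = [rng.random() for _ in range(64)]  # stand-in for the g_vgg output

# Assumed fusion: concatenate the two 64-D embeddings into one 128-D vector.
fused = temporal + transferred

# Discrete emotion head uses the fused vector (4 classes).
discrete_logits = linear(fused, random_weights(4, 128, rng))
# Valence head uses only the temporal embedding (3 levels).
valence_logits = linear(temporal, random_weights(3, 64, rng))
```

The point of the sketch is the asymmetry between the heads: the discrete classifier sees both paths, whereas the valence classifier sees only the temporal embedding.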

**Figure 2.** Architecture of the multi-path and group-loss-based network for SER. The MPGLN SER model comprises a bidirectional long short-term memory (BLSTM)-based temporal embedding generator and a transferred feature extractor from the VGG-like audio classification model (VGGish) and its prediction function.

#### *3.1. Multi-Path Embedding Features*

In this study, the speech segments of an utterance unit are embedded in the feature space through *gBLSTM*, a temporal feature generator of the ensemble structure, and *gvgg*, a transferred feature extractor. In Figure 2, the temporal feature generator, *gBLSTM*, of the BLSTM architecture reflects the temporal relevance of before-and-after speech features. The 74-D LLD-per-frame speech segment comprises a 13-D MFCC and 40-D Mel-spectrogram, along with 21-D time- and frequency-domain LLDs such as zero-crossing rate, energy, spectral centroid, and spectral roll-off. The 74-D LLDs are extracted per frame by applying sliding windows of 200 ms with a 50% shift to the speech segment. Each speech segment is padded with zeros to have a fixed number of 100 frames, and the sequence of 100 × 74 per segment is input to *gBLSTM*. The padded input sequence is fed into the *gBLSTM*, comprising 128 cells in each direction, and *gBLSTM* produces a 256-D feature vector.
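The framing and padding steps described above can be sketched as follows. The sketch assumes a 16 kHz sampling rate, drops any trailing partial window, and truncates segments longer than 100 frames; none of these choices is stated in the text. The function names are hypothetical, and constant placeholder vectors stand in for a real 74-D LLD extractor.

```python
def frame_bounds(num_samples, sample_rate_hz, window_ms=200, shift_ratio=0.5):
    """Return (start, end) sample indices for each full sliding window."""
    window = int(sample_rate_hz * window_ms / 1000)
    hop = int(window * shift_ratio)
    return [(s, s + window) for s in range(0, num_samples - window + 1, hop)]

def pad_or_truncate(frames, max_frames=100, feature_dim=74):
    """Zero-pad (or cut) a list of per-frame feature vectors to max_frames rows."""
    frames = [list(f) for f in frames[:max_frames]]
    frames += [[0.0] * feature_dim for _ in range(max_frames - len(frames))]
    return frames

# Hypothetical 3 s segment sampled at 16 kHz.
bounds = frame_bounds(num_samples=3 * 16000, sample_rate_hz=16000)

# Placeholder LLD vectors (all ones) stand in for real 74-D descriptors.
llds = [[1.0] * 74 for _ in bounds]
padded = pad_or_truncate(llds)  # 100 rows of 74 values each
```

With a 200 ms window and a 100 ms hop, 100 frames cover roughly 10 s of speech, so under these assumptions shorter segments are padded and longer ones would be cut.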

The feature generator, *gBLSTM*, adopts an attention mechanism and focuses on those more discriminative parts of the BLSTM output sequence before activation of the final emotion classification. The attention mechanism for SER assumes that there are certain words and salient parts that express emotions well in the speech segment. Using the attention method, it gives more weight to relevant speech frames of an utterance-level segment for emotion recognition.

The attention layer focuses on relevant parts of the output sequence of the BLSTM by assigning different weight scores and generates the high-level feature ($hf$). It computes the weight $\alpha_t$ using the softmax function via the attention layer (see Equation (1)), where the BLSTM output vector at time $t$ is $h_t = [\overrightarrow{h_t}, \overleftarrow{h_t}]$. It produces the high-level feature, $hf$, which is the weighted sum of the outputs $h_t$ with the weights $\alpha_t$ (see Equation (2)). The generated $hf$ is then transformed into an embedding feature vector in $\mathbb{R}^{64}$ through the two fully-connected (FC) layers in the MPGLN.

$$\alpha_t = \frac{\exp(W \cdot h_t)}{\sum_{\tau=1}^{T} \exp(W \cdot h_\tau)} \tag{1}$$

$$hf = \sum_{t=1}^{T} \alpha_t \cdot h_t \tag{2}$$
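Equations (1) and (2) amount to a softmax over per-frame scores followed by a weighted sum. The minimal sketch below assumes that $W$ is a single weight vector, so that $W \cdot h_t$ is a scalar score; the function name `attention_pool` and the toy inputs are hypothetical.

```python
import math

def attention_pool(outputs, w):
    """outputs: list of T vectors h_t; w: weight vector W.

    Returns (alphas, hf): the softmax weights of Equation (1) and the
    weighted sum of Equation (2).
    """
    scores = [sum(wi * hi for wi, hi in zip(w, h)) for h in outputs]
    peak = max(scores)  # subtract the maximum for numerical stability
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    alphas = [e / total for e in exps]
    dim = len(outputs[0])
    hf = [sum(a * h[d] for a, h in zip(alphas, outputs)) for d in range(dim)]
    return alphas, hf

# Toy example: T = 3 frames with 2-D outputs (the model itself uses 256-D).
outputs = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
w = [2.0, 0.0]  # this W rewards a large first component
alphas, hf = attention_pool(outputs, w)
```

In the toy example, the first and third frames receive equal weights that exceed the weight of the second frame, so the pooled vector leans toward those frames.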

The temporal feature generator, $g_{BLSTM}: X \to \mathbb{R}^{64}$, generates a 64-D embedding vector from the input of the 74-D LLD in units of speech-segment frames. The feature generator, $g_{BLSTM}$, in the MPGLN SER can operate as an SER model alone by combining the prediction function, $h_d^{baseline}: \mathbb{R}^{64} \to Y(y_{i\_d}^k)$, without using the transferred features from the VGGish. This study uses the BLSTM-based SER model as a baseline for the evaluation of the MPGLN SER.

The transferred feature extractor, $g_{vgg}: X \to \mathbb{R}^{VGGish}$, extracts the transferred feature vector of data sample $X$ using the VGGish model. The input speech segment is divided into non-overlapping 960 ms frames, and 64 mel-spaced spectrogram features, computed with a 25 ms window applied every 10 ms in each frame, are extracted using the VGGish model [17]. The transferred feature extractor, $g_{vgg}$, generates a 128-D embedding feature vector from the VGGish model for the speech segment by inputting a frame-by-frame spectrogram in units of 96 × 64. The extracted 128-D embedding vector passes through the flattening and FC layers and is transformed into a 64-D embedding vector.
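The 96 × 64 input unit follows from the framing arithmetic above: a 960 ms frame with one spectrogram column per 10 ms hop gives 96 time steps, each with 64 mel bands. The short sketch below works through that arithmetic; the function names are hypothetical, and dropping a trailing partial frame is an assumption, since the text does not say how leftover audio is handled.

```python
def vggish_example_count(duration_s, example_ms=960):
    """Number of full, non-overlapping 960 ms frames in a segment."""
    return int(duration_s * 1000) // example_ms

def example_shape(example_ms=960, hop_ms=10, num_mel_bands=64):
    """(time steps, mel bands) of one frame: one column per 10 ms hop."""
    return (example_ms // hop_ms, num_mel_bands)

count = vggish_example_count(3.0)  # a hypothetical 3 s speech segment
shape = example_shape()
```

For the hypothetical 3 s segment, this yields three full frames, each of shape 96 × 64, with 120 ms of audio left over.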

#### *3.2. Group Loss*

Equation (3) shows how classifier $f$ is trained on the classification loss, $\mathcal{L}_c(f)$, of the emotion labels $Y$ of the speech samples $X$, where $\ell$ is an appropriate loss function, such as cross-entropy, for multi-class classification [45,49].

$$\mathcal{L}_c(f) = \ell(f(X), Y) \tag{3}$$

The proposed MPGLN SER is trained to simultaneously minimize multiple losses, which are induced by the association of multi-dimensional emotion labels. Discrete emotion labels are intuitive for expressing an emotion, but they have difficulty expressing complex emotions. Dimensional emotion labels are capable of normalized expressions of complex emotions; however, they make it difficult to intuitively distinguish emotions at similar positions (e.g., "fear" and "anger") in the arousal-valence axis [1]. This study derives an association between discrete and dimensional valence-level labels based on real SER domain datasets and applies a method of simultaneously learning the loss for each emotion-label classification in the MPGLN model.

As shown in Figure 2, the MPGLN SER learns simultaneously based on two losses: $\mathcal{L}_{cv}$ for the valence-level label, using the $\mathbb{R}^{64}$ feature vector generated from $g_{BLSTM}$, and $\mathcal{L}_{cd}$ for predicting the discrete emotion label.

The primary loss, $\mathcal{L}_{cd}$, is used for the predicting function, $f_d = h_d \circ (g_{BLSTM} \oplus g_{vgg})$, where $h_d: \mathbb{R}^{64} \oplus \mathbb{R}^{VGGish} \to Y(y_{i\_d}^k)$ predicts the discrete emotion label $y_{i\_d}^k$ via the combination of two embedding vectors. The complementary loss, $\mathcal{L}_{cv}$, is that of the predicting function, $f_v = h_v \circ g_{BLSTM}$, which classifies the valence-level labels, where $h_v: \mathbb{R}^{64} \to Y(y_{i\_v}^k)$. Equation (4) shows that the proposed MPGLN SER is trained to minimize the group loss $\mathcal{L}_g$ over the prediction functions, $f_d$ and $f_v$:

$$\mathcal{L}_g = \text{Group}(\mathcal{L}_{cd}(f_d), \mathcal{L}_{cv}(f_v)). \tag{4}$$
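Equation (4) does not spell out the form of the Group operator. One plausible reading, shown here purely as an assumption, is a weighted sum of the two softmax cross-entropy losses. The names `cross_entropy`, `group_loss`, and `valence_weight` are hypothetical and are not taken from the authors' implementation.

```python
import math

def cross_entropy(logits, target_index):
    """Softmax cross-entropy for one sample, computed with log-sum-exp."""
    peak = max(logits)
    log_sum = peak + math.log(sum(math.exp(z - peak) for z in logits))
    return log_sum - logits[target_index]

def group_loss(discrete_logits, discrete_target,
               valence_logits, valence_target, valence_weight=1.0):
    """Assumed form of Group(.): primary loss plus weighted complementary loss."""
    l_cd = cross_entropy(discrete_logits, discrete_target)  # discrete emotion
    l_cv = cross_entropy(valence_logits, valence_target)    # valence level
    return l_cd + valence_weight * l_cv

# Uninformative logits over 4 discrete classes and 3 valence levels:
# the loss equals ln(4) + ln(3) = ln(12).
loss = group_loss([0.0, 0.0, 0.0, 0.0], 1, [0.0, 0.0, 0.0], 2)
```

Because both terms are differentiable, minimizing their sum updates the shared temporal generator from both label types at once, which matches the simultaneous learning described in the text.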

#### **4. Evaluation**

*4.1. Datasets*

We evaluated the proposed model using five multi-domain datasets contained in three real SER databases. For the evaluation of the MPGLN SER based on multi-cultural datasets, the two KESD databases constructed for this study (i.e., KESDy18 and KESDy19) and the IEMOCAP are used. KESDy18 and KESDy19 each comprise two domain datasets based on heterogeneous microphone devices.

In the IEMOCAP dataset, data were collected from the scenarios for inducing the five target emotions ("happy", "sad", "neutral", "angry", and "frustration"), and annotators selected one of the six basic emotions ("angry", "sad", "happy", "disgust", "fear", and "surprise") [50] along with "frustration", "excited", and "neutral" as the discrete emotion labels. Numerous data were annotated with emotion categories such as "fear" and "disgust", which do not belong to the target emotions in IEMOCAP [7]. Even in the KESD database, considering the subjectivity and diversity of human emotion perception, the categorical emotion label was tagged as one of the six basic emotion labels along with "neutral".

The KESDy18 comprises speech samples in which 30 voice actors uttered 20 sentences while expressing the four given emotions of "angry", "happy", "neutral", and "sad". Six external taggers evaluated the speech segments while listening to the recorded utterances, as shown in Figure 3a. The annotators tagged each segment with one of seven categorical emotion labels, comprising the six basic emotions [50] and "neutral"; the tagged labels are therefore more diverse than the four emotions the actors were asked to express. They also tagged arousal and valence levels on a five-point scale for each segment. The final categorical emotion label was determined by majority vote. The arousal and valence-level labels were determined from the average of the levels tagged by the evaluators. KESDy18 simultaneously collected speech data from two heterogeneous microphones (i.e., a cell-phone's built-in microphone (PM) and an external microphone (EM) connected to a computer). According to the type of microphone device, KESDy18 comprises the KESDy18\_PM dataset and the KESDy18\_EM dataset.

**Figure 3.** An external annotator tags the emotion labels for speech segments using the tagging application while watching the recorded video and listening to the Korean Emotional Speech Database (KESD) speech segments: (**a**) evaluating emotional labels of KESDy18 via the tagging application; (**b**) evaluating the KESDy19 speech segments while looking at the recorded video.

The KESDy19 includes the speech samples of 40 voice actors who speak Korean as their native language, using collection scenarios similar to those of the IEMOCAP. KESDy19 consists of 20 sessions of speech and electrocardiogram signals collected during the dyadic acting of two voice actors, and the acting process was recorded on video. Each session consists of 10 plays having lengths of 4–10 min. Six plays were based on scenarios written to induce specific emotions, and the other four were improvised during the dyadic interactions. Each speech segment per speaker was tagged using one of seven categorical emotion labels, and the average value of the five-point scale of arousal and valence-level was annotated by 10 external taggers using the same tagging application, as shown in Figure 3b. KESDy19 comprises a KESDy19\_EM dataset that used an external microphone and a KESDy19\_PM dataset that simulated the KESDy19\_EM dataset via a cell-phone's microphone.

The IEMOCAP is a widely used database for SER performance evaluation, organized into five sessions of multi-modal audio, visual, and textual data taken from dyadic interactions performed by 10 voice actors. In each session, two voice actors emotionally performed improvisations or scripted scenarios. The utterance-level speech segments were assigned discrete emotion labels of "happy," "sad," "neutral," "angry," "surprise," "frustration," "excited," "disgust," or "fear" based on the majority opinions of three external human annotators. The IEMOCAP data were also tagged with labels of arousal and valence based on a five-point dimensional emotion scale [39,51]. The IEMOCAP database provides the re-rounded average score of the evaluations of arousal and valence-levels according to the five-point scale based on evaluations by six external evaluators. Many prior studies evaluated SER performance using the IEMOCAP database to classify the four emotion categories of "happy," "sad," "neutral," and "angry."

Figure 4 shows the distribution of four discrete emotion and arousal/valence-level labels on the five-point scales of IEMOCAP, KESDy18, and KESDy19. As shown in Figure 4a–c, the speech samples of the "happy" class are distributed at the highest valence level, and the "neutral" samples are in the middle. The speech data labeled with "sad" and "angry" classes show a distribution of low-level valences across all three SER databases. The association between discrete emotion labels and those of arousal-level shows more irregularities in Figure 4d–f. The speech samples tagged with the "sad" class are distributed across all arousal levels, and the IEMOCAP samples with the "happy" label are also distributed across all arousal levels, unlike those of the two KESD databases.

**Figure 4.** Distribution between discrete and dimensional emotion labels of the five-point scale: (**a**) distribution of discrete and valence-level labels of Interactive Emotional Dyadic Motion Capture database (IEMOCAP); (**b**) distribution of discrete and valence-level of KESDy18; (**c**) distribution of discrete and valence-level of KESDy19; (**d**) distribution of discrete and arousal-level of IEMOCAP; (**e**) distribution of discrete and arousal-level of KESDy18; and (**f**) distribution of discrete and arousal-level of KESDy19.

In Figure 4, the speech samples corresponding to the discrete emotion classes constitute roughly three distribution groups across the label of valence-level. The three distribution groups are "happy," "neutral," and "sad" or "angry."

In this study, we mapped the valence-level labels of the five-point scale to a three-point scale using the induced association between discrete and dimensional emotion labels, as shown in Table 1 and Figure 4. The valence levels 1, 2, and 3 of the three-point scale represent "negative", "neutral", and "positive" emotional states, respectively. For the conversion, this study assigned samples with valence labels less than 2.5 to the first valence level, samples with labels of 4.0 or higher to the third, and the others to the second. Table 1a shows the mean and standard deviation of arousal and valence levels on a five-point scale for each discrete emotion category. Table 1b shows the confidences of association [52] of the speech samples of four discrete emotion classes included in the valence levels of the three-point scale. The confidence is defined as

$$\mathrm{Conf}(C_i \rightarrow V_j) = \frac{N_{C_i \cup V_j}}{N_{C_i}},$$

where $C_i$ is the discrete emotion label, $1 \le i \le 4$, and $V_j$ denotes the valence level, $1 \le j \le 3$.
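The mapping and the confidence measure described above can be sketched as follows. This is a minimal illustration rather than the authors' code: the function names `to_three_point` and `confidence` and the toy sample list are assumptions, and the confidence is interpreted here as the share of samples with discrete label C_i that also fall into valence level V_j.

```python
from collections import Counter

def to_three_point(valence):
    """Map a five-point valence score to the three-point scale.

    Thresholds follow the text: below 2.5 -> 1 (negative),
    4.0 or higher -> 3 (positive), otherwise 2 (neutral).
    """
    if valence < 2.5:
        return 1
    if valence >= 4.0:
        return 3
    return 2

def confidence(samples):
    """Return Conf(C_i -> V_j) for every observed (emotion, level) pair.

    `samples` is an iterable of (emotion_label, five_point_valence) tuples.
    """
    samples = list(samples)
    per_class = Counter(label for label, _ in samples)
    joint = Counter((label, to_three_point(v)) for label, v in samples)
    return {pair: n / per_class[pair[0]] for pair, n in joint.items()}

# Toy data (hypothetical, for illustration only).
toy = [("happy", 4.5), ("happy", 4.0), ("happy", 3.0),
       ("sad", 2.0), ("sad", 2.4), ("neutral", 3.0), ("angry", 1.5)]
conf = confidence(toy)
```

In this toy set, two of the three "happy" samples reach the third valence level, so the confidence of "happy" for that level is 2/3.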


**Table 1.** Association properties of discrete emotion labels and valence-levels in multi-domain SER datasets: (**a**) Mean and standard deviation of arousal and valence levels on a five-point scale for each discrete emotion category; (**b**) Confidence of discrete emotion labels and valence-level of three-point scale.

Table 2 shows the properties of the five domain datasets of the three SER databases used for the evaluation. We used speech segments of 2 s or longer that were labeled with one of four emotion categories: "angry", "happy", "neutral", and "sad."


**Table 2.** Properties of multi-domain SER datasets.

<sup>1</sup> KESDy18\_EM is available online at https://nanum.etri.re.kr/share/kjnoh/SER-DB-ETRIv18?lang=eng (accessed on 7 January 2021). <sup>2</sup> The collecting process of the KESDy19 was approved by the Institutional Review Board of Korea National Institute for Bioethics Policy (approval number P01-201907-22-010 and 22 July 2019).

#### *4.2. Evaluation of the BLSTM-Based Baseline SER*

As shown in Table 2, the five domain SER datasets used for evaluation were unbalanced in the number of samples of the discrete emotion classes. We did not apply oversampling, data augmentation [11], or weighted loss methods [46] to minority classes for objective verification of the proposed MPGLN SER.

Speech samples of each class in the multi-domain datasets were used to train the SER model in units of speech segments, each of which consisted of voiced parts produced by vocal-cord vibrations and unvoiced parts such as silent sections between voiced parts [53]. This study did not remove the unvoiced regions from any speech segment. Instead, it framed the entire segment, both voiced and unvoiced parts, as input to the model.

We present four performance metrics in consideration of the sample imbalance of each emotion class: weighted accuracy (WA), unweighted accuracy (UA), precision (PR), and F1 score. WA is the overall accuracy, calculated as the ratio of the number of samples whose predicted label matches the actual label to the total number of test samples. UA is calculated as the average of the recall values of the four classes and is an important performance indicator in the evaluation of SER models based on imbalanced datasets [19,20,26].
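The difference between WA and UA on imbalanced data can be made concrete with a short sketch. The function names and the toy labels below are assumptions chosen for illustration; only the two definitions come from the text.

```python
def weighted_accuracy(y_true, y_pred):
    """Overall accuracy: correctly predicted samples / all test samples."""
    correct = sum(t == p for t, p in zip(y_true, y_pred))
    return correct / len(y_true)

def unweighted_accuracy(y_true, y_pred):
    """Mean of the per-class recall values (each class counts equally)."""
    recalls = []
    for cls in sorted(set(y_true)):
        idx = [i for i, t in enumerate(y_true) if t == cls]
        hits = sum(y_pred[i] == cls for i in idx)
        recalls.append(hits / len(idx))
    return sum(recalls) / len(recalls)

# Imbalanced toy example (hypothetical labels): 6 "neutral", 2 "sad".
y_true = ["neutral"] * 6 + ["sad"] * 2
y_pred = ["neutral"] * 6 + ["neutral", "sad"]
wa = weighted_accuracy(y_true, y_pred)    # 7 of 8 samples correct
ua = unweighted_accuracy(y_true, y_pred)  # mean of recalls 1.0 and 0.5
```

Here WA is 0.875 while UA is 0.75, because the single error falls on the minority class and halves its recall.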

This study applied *z*-normalization [1] using the mean and standard deviation of each dataset to reduce the fluctuations of the speaker and speech signals. We evaluated the models using the speaker-independent leave-*p*-subjects-out (L*p*SO) validation technique, where *p* is the number of subjects to leave out when training the model. For training, we used the samples belonging to speakers accounting for 80% of the total number of speakers in each dataset; the samples of the remaining 20% were used as test data.

For the evaluation of IEMOCAP, we used a leave-two-subjects-out evaluation that applied speech data from two speakers participating in one session as the test data, which was the leave-one-session-out (LOSO) validation. KESDy18 was evaluated as a leave-six-subjects-out sample from the set of 30 speakers. The evaluation of KESDy19 was conducted as a leave-eight-subjects-out sample for four sessions of the 20 sessions played in pairs by 40 speakers. The training and test data separated for speaker-independent evaluation in each dataset were equally applied to the evaluation of a single domain, multi-domain, or domain generalization, as shown in Tables 3 and 4 and Tables 6–8.
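A speaker-independent split of this kind can be sketched as below. The helper names, the speaker IDs, and the choice of grouping speakers into consecutive blocks are assumptions; the text specifies only the number of held-out speakers per dataset and the 80%/20% ratio, not how the held-out speakers were chosen.

```python
def leave_p_subjects_out(speaker_ids, p):
    """Yield (train_speakers, test_speakers) folds with p held-out speakers.

    Speakers are taken in sorted order and grouped into consecutive
    blocks of size p (an assumed grouping, for illustration).
    """
    speakers = sorted(set(speaker_ids))
    for start in range(0, len(speakers), p):
        test = speakers[start:start + p]
        train = [s for s in speakers if s not in test]
        yield train, test

def split_samples(samples, train_speakers):
    """Split (speaker_id, features) samples so no speaker is in both sets."""
    train_set = set(train_speakers)
    train = [s for s in samples if s[0] in train_set]
    test = [s for s in samples if s[0] not in train_set]
    return train, test

# KESDy18-like setting from the text: 30 speakers, 6 held out per fold.
speakers = [f"spk{i:02d}" for i in range(30)]  # hypothetical speaker IDs
folds = list(leave_p_subjects_out(speakers, p=6))
```

With 30 speakers and p = 6, each fold trains on 24 speakers (80%) and tests on the remaining 6 (20%), which matches the ratio stated above.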


**Table 3.** Performance of the baseline BLSTM-based SER model according to the input low-level descriptions (LLD) feature set in SER datasets.

**Table 4.** Performance of the baseline BLSTM-based SER model.


In the evaluation of this study, a model based on the temporal embedding features and the learning loss, L*cd*, without the transferred embedding feature was assumed to be the baseline SER model. This baseline therefore operated using a single-path-single-loss (SPSL) scheme. In the evaluation, the proposed MPGLN and the baseline SPSL SER model were trained with a batch size of 200 samples for 25 epochs using an Adam optimizer, with a dropout rate of 0.6 applied to the last two FC layers. The learning rate of the optimizer was 1 × 10<sup>−3</sup>. The model was evaluated over 10 iterations of training and testing, and the final value of each performance metric was calculated as the average value.

The baseline SPSL SER model uses the 74-D LLD integration per frame of a speech segment, which comprises 13-D MFCC and 40-D Mel-spectrogram (Mel-spec), along with 21-D time- and spectral-domain (TimeSpectral) LLDs such as zero-crossing rate, energy, spectral centroid, and spectral roll-off. We evaluated the performance of each combination of LLDs with our baseline SER model based on multiple SER datasets. Table 3 summarizes the performance according to the input feature set of the LLDs used in this study, based on the IEMOCAP, KESDy18\_EM, and KESDy19\_EM datasets. The results in Table 3 indicate that MFCC is the dominant feature for SER. When the input combined MFCC and Mel-spectrogram with the TimeSpectral LLDs, the F1 score improved by 1.6% to 3.2% in comparison with the single input of MFCC.

Table 4 shows the results of the speaker-independent evaluation of the BLSTM baseline SPSL when classifying the four discrete emotion labels in each of the five domain datasets. The evaluation based on KESDy19 showed performance results similar to those of IEMOCAP. The evaluation based on KESDy18 showed higher performance than that of the other two databases.

A previous study by Zheng et al. [54] demonstrated the performance of 40% WA of the CNN-based SER model for the five emotion classes based on IEMOCAP. For a fair comparison of the SER performance, this study performed a comparison with the previous RNN-based SER models that presented the UA performance of the four emotion classes based on IEMOCAP, which was the test environment in many previous SER studies.

In Table 5, we compare the performance results of previous RNN-based SER models and the SPSL baseline model in the LOSO evaluation to classify the four emotion labels based on the IEMOCAP dataset. These studies present a UA metric of the average recall for each emotion class, considering the imbalance of the number of samples. As shown in Table 5, our baseline BLSTM SER model achieved a competitive performance of UA 59% in the LOSO validation based on IEMOCAP.

**Table 5.** Performance results reported in previous recurrent neural networks (RNN)-based studies of SER model and our baseline model based on IEMOCAP.


<sup>1</sup> This study used only the improvisation data of female speakers as test data.

#### *4.3. Evaluation of Multi-Domain Adaptation*

As shown in Tables 6–8, evaluations were performed using a single-domain evaluation, a multi-domain adaptation, and a multi-domain generalization according to the source and target domains participating in training and evaluation. The division of training and testing data separated for speaker-independent evaluation in each dataset used the same configurations as those used in Tables 3–8. In Tables 6–8, the highest F1 scores are highlighted.

**Table 6.** Evaluation results in a single domain dataset. Single-path-single-loss (SPSL) is the baseline SER model that learns from the temporal embedding features and the loss L*cd*; Multi-path-single-loss (MPSL) is the model that learns using the multi-path embedding vectors and the loss L*cd* without the loss L*cv*; MPGL is the model that learns based on multi-path embedding vectors and the group loss L*g*.



**Table 7.** Evaluation results of multi-domain adaptation.

**Table 8.** Evaluation results of multi-domain generalization.


Table 6 shows the evaluation results when classifying four discrete emotion classes based on each of the five domain datasets. The evaluation was conducted in three experimental environments according to the type of SER model: The baseline SPSL model learns from the temporal embedding features and the single-loss L*cd*. Multi-path-single-loss (MPSL) uses multi-path embedding vectors and is trained only on L*cd* without the complementary loss, L*cv*, for valence-level classification. Multi-path-group-loss (MPGL) learns from multi-path embedding vectors and the group loss, L*g*, consisting of L*cd* and L*cv*.
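The relation between the single loss and the group loss can be illustrated with a small numerical sketch. It assumes that the group loss is the unweighted sum of the two cross-entropy terms; this section does not state how the two terms are weighted, so the sum, the function names, and the toy probabilities below are all assumptions.

```python
import math

def cross_entropy(probs, target_index):
    """Cross-entropy of one sample given predicted class probabilities."""
    return -math.log(probs[target_index])

def group_loss(p_discrete, y_discrete, p_valence, y_valence):
    """L_g as an unweighted sum of L_cd and L_cv (assumed combination)."""
    l_cd = cross_entropy(p_discrete, y_discrete)  # four emotion classes
    l_cv = cross_entropy(p_valence, y_valence)    # three valence levels
    return l_cd + l_cv, l_cd, l_cv

# Toy softmax outputs (hypothetical). Class order is assumed to be
# [angry, happy, neutral, sad] and [negative, neutral, positive].
p_cd = [0.1, 0.7, 0.1, 0.1]
p_cv = [0.2, 0.2, 0.6]
l_g, l_cd, l_cv = group_loss(p_cd, 1, p_cv, 2)
```

Under this assumption, SPSL and MPSL would minimize only `l_cd`, whereas MPGL would minimize `l_g`, so an error in the predicted valence level also contributes to the gradient.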

When compared with the harmonic-mean F1 score based on the KESDy18\_PM dataset shown in Table 6b, the performance of the SER of the MPSL using a single-loss L*cd* showed an improvement of 1% over that of the baseline SPSL. The SER MPGL model trained on the loss group, L*g*, showed an F1 improvement of up to 3.7% over the SPSL's F1.

Table 7 shows the results of multi-domain adaptation evaluation when the SER model was trained with samples aggregated from multiple-domain SER datasets collected from various environments. The separated test samples for about 20% of the speakers were evaluated for speaker-independent evaluation. As shown in Table 7a, regarding KESDy18, which consisted of two datasets collected simultaneously via heterogeneous devices, the proposed SER model trained on the group-loss L*g* of MPGL achieved an F1 improvement of up to 3.7% over the baseline SPSL.

Table 8 presents the evaluation results of the proposed MPGLN SER for supporting multi-domain generalization. In the evaluation of Table 8a, the SER model was trained with the aggregated samples of KESDy18\_PM, KESDy18\_EM, and KESDy19\_EM datasets and was evaluated against the separated test samples of the KESDy19\_PM domain, which was not used for training but was collected from the same language culture. The evaluation results of Table 8a show that the F1 score of the MPGL model improved by 1.2% compared with the baseline SPSL. In the evaluation of Table 8b, the SER model was trained on the KESDy18\_EM and IEMOCAP datasets, which were from different language cultures, and was evaluated using the Korean KESDy18\_PM domain dataset. The proposed MPGLN SER showed an F1-score improvement of about 3.5% over the baseline model.

Figure 5 shows the changes in losses from Table 8b, including the loss, L*cd*, of the baseline SPSL model and losses L*cd* and L*cv* of the MPGL SER model. These losses were measured every 25 epochs during training using aggregated KESDy18\_EM and IEMOCAP samples. The loss, L*cd*, of the MPGL model, which learned two losses simultaneously, decreased faster than did the L*cd* of the baseline SER model. The figure also shows that the complementary loss, L*cv*, of the proposed MPGLN, used to predict the valence-level label, decreased similarly to the loss, L*cd*, of the baseline SPSL.

**Figure 5.** Change in losses of the baseline SER and the proposed MPGLN SER in Table 8b. The loss, L*cd*, of the baseline SPSL model and losses L*cd* and L*cv* of the SER model of MPGL.

Figure 6 shows the distribution of the 64-D embedding vectors of the test data reduced to a 2-D embedding space via *t*-distributed stochastic neighbor embedding (t-SNE). The 64-D embedding vectors were generated in the FC layer just prior to the MPSL and MPGL softmax activations of the evaluation in Table 8b.

**Figure 6.** Distribution of reduced embedding vectors (the 64-D embedding vectors of the test data in the last fully-connected (FC) layer in the ensemble network) that are reduced to 2-D via *t*-distributed stochastic neighbor embedding (t-SNE) dimension reduction: (**a**) embedding space for MPSL in Table 8b; (**b**) embedding space for MPGL in Table 8b.

Figure 6a shows the distribution of the embedding feature vectors in the MPSL trained by the loss, L*cd*, only, without the complementary loss, L*cv*. Figure 6b displays the distribution for the MPGL model based on the loss group, L*g*, of the two losses, L*cd* and L*cv*. In Figure 6b, where the MPGLN SER model learns from multi-path embedding vectors and the loss group, L*g*, the samples belonging to the "happy" class are more closely grouped, and the samples of the "angry" and "sad" classes are located closer together compared with the MPSL distribution shown in Figure 6a.

#### **5. Conclusions**

We determined that it is essential to improve the generalization of the SER model for deployment to real applications. This paper proposed the MPGLN for SER in support of supervised multi-domain adaptation and generalization based on multi-domain datasets. The proposed MPGLN SER includes a temporal feature generator for the BLSTM network using the input of handcrafted LLD features of a speech sample. Additionally, we leveraged the transferred feature extractor from the pre-trained VGGish model for the MPGLN. The proposed MPGLN SER learned simultaneous multiple losses induced by associations between discrete emotion and dimension labels.

The proposed MPGLN SER was evaluated using five real SER datasets of various speaker domains, language cultures, collecting devices, and procedural environments. These included the KESDy18 and KESDy19 databases. KESDy18 comprised speech samples delivered by voice actors who uttered short Korean sentences while expressing specific discrete emotions. The KESDy18 database consisted of the KESDy18\_PM and KESDy18\_EM datasets, collected from heterogeneous devices and environments with different device locations. The KESDy19 database comprised KESDy19\_EM and KESDy19\_PM, which contained acted speech samples collected using a procedure similar to that of IEMOCAP and a simulated dataset based on a cell phone's built-in microphone, respectively.

This study took as the baseline the SER model trained only with the BLSTM-based temporal embedding feature generator included in MPGLN, without the transferred feature. We verified the performance reliability of the baseline SER model using IEMOCAP. The BLSTM baseline SER model showed a competitive UA of 59% when classifying the four categorical emotion labels. The multi-domain adaptation and domain generalization evaluation of the proposed MPGLN SER was performed using the English-speaking IEMOCAP and the Korean KESDy18 and KESDy19 datasets by comparing its performance with that of the baseline model across various evaluation environments.

The proposed MPGLN SER model trained on multiple losses showed an F1 performance improvement of up to 3.7% over the baseline model when classifying four emotion labels in a single domain dataset. The performance evaluation of the MPGLN SER for supervised multi-domain adaptation, in which the SER model was trained and tested using the aggregated speech samples of the multi-domain datasets, also showed an improvement of up to 3.7% over the baseline F1 score. In the evaluation of the multi-domain generalization of the proposed MPGLN SER, the F1 score improved by 3.5% over the baseline SER when using samples from other language cultures not used for training. From these results, we found that our MPGLN SER, which supports supervised multi-domain adaptation, is also effective in reinforcing the generalization of the SER model based on multi-domain datasets.

In future work, we plan to derive the differences in acoustic features of emotional expressions based on multi-cultural SER datasets and study learning methods for deep-learning-based SER models that consider the domain discrepancy. Furthermore, we will continue enhancing our model's generalizability through evaluations of speech data in the wild by deploying the proposed MPGLN SER to real applications.

**Author Contributions:** Conceptualization, K.J.N., C.Y.J., J.L., S.C., G.K., J.M.L., and H.J.; methodology, K.J.N. and C.Y.J.; software, K.J.N.; validation, K.J.N.; formal analysis, K.J.N.; investigation, K.J.N., C.Y.J., J.L., S.C., G.K., J.M.L., and H.J.; resources, H.J.; data curation, K.J.N.; writing—original draft preparation, K.J.N.; writing—review and editing, K.J.N., C.Y.J., and S.C.; visualization, K.J.N.; supervision, H.J.; project administration, H.J.; funding acquisition, H.J. All authors have read and agreed to the published version of the manuscript.

**Funding:** This work was supported by Electronics and Telecommunications Research Institute (ETRI) grant funded by the Korean government. (21ZS1100, Core Technology Research for Self-Improving Integrated Artificial Intelligence System).

**Institutional Review Board Statement:** The collecting process of the KESDy19 database was conducted according to the guidelines of the Declaration of Helsinki, and was approved by the Institutional Review Board of Korea National Institute for Bioethics Policy (approval number P01-201907-22-010 and 22 July 2019). The study did not require additional ethical approval.

**Informed Consent Statement:** Not applicable.

**Data Availability Statement:** Statistical results are contained within the article. The KESDy18\_EM dataset collected in this study is available online at https://nanum.etri.re.kr/share/kjnoh/SER-DB-ETRIv18?lang=eng (accessed on 7 January 2021).

**Conflicts of Interest:** The authors declare no conflict of interest.

#### **References**


## *Article* **Recognition of Emotion by Brain Connectivity and Eye Movement**

**Jing Zhang 1,†, Sung Park 1,†, Ayoung Cho <sup>1</sup> and Mincheol Whang 2,\***


† These authors contributed equally to this work.

**Abstract:** Simultaneous activation of brain regions (i.e., brain connection features) is an essential mechanism of brain activity in emotion recognition of visual content. The occipital cortex of the brain is involved in visual processing, but the frontal lobe processes cranial nerve signals to control higher emotions. However, recognition of emotion in visual content merits the analysis of eye movement features, because the pupils, iris, and other eye structures are connected to the nerves of the brain. We hypothesized that when viewing video content, the activation features of brain connections are significantly related to eye movement characteristics. We investigated the relationship between brain connectivity (strength and directionality) and eye movement features (left and right pupils, saccades, and fixations) when 47 participants viewed an emotion-eliciting video on a two-dimensional emotion model (valence and arousal). We found that the connectivity eigenvalues of the long-distance prefrontal lobe, temporal lobe, parietal lobe, and center are related to cognitive activity involving high valence. In addition, saccade movement was correlated with long-distance occipital-frontal connectivity. Finally, short-distance connectivity results showed emotional fluctuations caused by unconscious stimulation.

**Keywords:** emotion recognition; attention; eye movement; brain connectivity

#### **1. Introduction**

Studies have shown that different brain regions participate in various perceptual and cognitive processes. For example, the frontal lobe is related to thinking and consciousness, whereas the temporal lobe is associated with processing complex stimulus information, such as faces, scenes, smells, and sounds. The parietal lobe integrates a variety of sensory inputs and the operational control of objects, while the occipital lobe is related to vision [1].

The brain is an extensive network of neurons. Brain connectivity refers to the synchronous activity of neurons in different regions and may provide useful information on neural activity [2]. Mauss and Robinson [3] suggested that emotion processing occurs in distributed circuits, rather than in specific isolated brain regions. Analysis of the simultaneous activation of brain regions is a robust pattern-based analysis method for emotional recognition [4]. Researchers have developed methods to capture asymmetric brain activity patterns that are important for emotion recognition [5].

Users search massive amounts of information until they find something useful [6]. However, although the information is presented visually, users do not recognize it, because of a lack of attention. The cortical area known as the frontal eye field (FEF) plays a vital role in the control of visual attention and eye movements [7].

Eye tracking is the process of measuring eye movements. Eye tracking signals imply the user's subconscious behaviors and provide essential clues to the context of the subject's current activity [8], which allow us to determine what elicits users' attention.

Brain activity is significantly related to eye movement features involving the pupil, saccade, and fixation. Pupil size changes [9] when a person is stimulated from a resting to an emotional state. The saccade is a decision made every time we move our eyes [10,11]. Decisions are influenced by one's expectations, goals, personalities, memories, and intentions [12].

**Citation:** Zhang, J.; Park, S.; Cho, A.; Whang, M. Recognition of Emotion by Brain Connectivity and Eye Movement. *Sensors* **2022**, *22*, 6736. https://doi.org/10.3390/s22186736

Academic Editor: Wataru Sato

Received: 19 July 2022 Accepted: 3 September 2022 Published: 6 September 2022

**Publisher's Note:** MDPI stays neutral with regard to jurisdictional claims in published maps and institutional affiliations.

**Copyright:** © 2022 by the authors. Licensee MDPI, Basel, Switzerland. This article is an open access article distributed under the terms and conditions of the Creative Commons Attribution (CC BY) license (https://creativecommons.org/licenses/by/4.0/).

A gaze is a potent social cue. For example, mutual gaze often implies threat or evasion, signaling submission or avoidance [13–16]. Eye gaze processing is one of the bases for social interactions, because the neural substrate for gaze processing is an essential step in developing neuroscience for social cognition [17,18].

By analyzing eye movement data, such as gaze position and gaze time, researchers can obtain explanations for multiple cognitive operations involving multiple behaviors [19]. For example, language researchers can use eye-tracking to analyze how people read and understand spoken language. Consumer researchers can study how shoppers make purchases. Researchers can gain a better cognitive understanding by integrating eye tracking with neuroimaging technologies (e.g., fMRI and EEG) [20].

Table 1 compares the few studies on eye movement features and EEG signals with an interest in producing a robust emotion-recognition model [21]. Wu et al. [22] integrated functional features from EEG and eye movements with deep canonical correlation analysis (DCCA). Their classification achieved 95.08% ± 6.42% accuracy on SEED public emotion EEG datasets [23]. Zheng et al. [24] used a multimodal deep neural network to incorporate eye movement and EEG signals to improve recognition performance. The results demonstrated that modality fusion with deep neural networks significantly enhances the performance compared with a single modality. Soleymani [25] found that the decision-level fusion strategy is more adaptive than feature-level fusion when incorporating EEG signals and eye movement data. They also found that user-independent emotion recognition can perform better than individual self-reports for arousal assessment. While these studies focused on improving recognition accuracy, there is currently a lack of understanding of the relationship between brainwave connectivity and eye movement features (fixation, saccade, and left and right pupils). Specifically, we do not know how the functional relationship varies according to the emotional characteristics (valence, arousal) of visual content.


**Table 1.** Comparison of previous and proposed methods.

In this study, our research question involves the functional characteristics of brainwave connectivity and eye movement eigenvalues in valence-arousal emotions in a two-dimensional emotional model. We hypothesized that when viewing video content, the activation features of brain connections are significantly related to eye movement characteristics. We divided and analyzed brainwave connectivity into three groups: (1) long-distance occipital-frontal connectivity, (2) long-distance prefrontal and temporal, parietal, and central connectivity, and (3) short-distance connectivity, including frontal-temporal, frontal-central, temporal-parietal, and parietal-central connectivity. We applied k-means clustering to distinguish emotional feature responses, and eye movement eigenvalues were further differentiated. We then analyzed the relationship between eye movements and brain wave connectivity, depicting the differential characteristics of a two-dimensional emotional model.

#### **2. Materials and Methods**

We adopted Russell's two-dimensional model [26], where emotional states can be defined at any valence or arousal level. We invited participants to view emotion-eliciting videos with varying valences (i.e., from unpleasant to pleasant) and arousal levels (i.e., from relaxed to aroused). To understand brain connectivity and causality of brain regions according to different emotions, we used supervised learning to classify emotional and nonemotional states, and extract eye movement feature values associated with such different emotional states to analyze the relationship between brain activity and eye movement.

#### *2.1. Stimuli Selection*

We edited 6-min video clips (e.g., dramas or films) to elicit emotions from the participants. The content used to induce emotional conditions (valence and arousal) was collected in a two-dimensional model. To ensure that the emotional videos were effective, we conducted a stimulus selection experiment prior to the main experiment. We selected 20 edited dramas or movies containing emotions; five video clips were used for each quadrant in the two-dimensional model. Thirty participants viewed the emotional videos and responded to a subjective questionnaire. They received USD 20 for their participation in the study. Among the five video clips, the most representative video for each of the four quadrants in the two-dimensional model was selected (see Figure 1). Four stimuli were selected for the main experiment.

**Figure 1.** Video stimulus for each quadrant on a two-dimensional model.

#### *2.2. Experiment Design*

The main experiment had a factorial design of two (valence: pleasant and unpleasant) × two (arousal: aroused and relaxed) independent variables. The dependent variables included participants' brainwaves, eye movements (fixation, saccade, and left and right pupils), and subjective responses to a questionnaire.

#### *2.3. Participants*

We conducted an a priori power analysis using the program G\*Power with the power set at 0.8, α = 0.05, and d = 0.6 (independent *t*-test, two-tailed). The analysis suggested that an N value of approximately 46 is required to achieve appropriate statistical power. Therefore, 47 university students were recruited for the study. Participants' ages ranged from 20 to 30 years (mean = 28, SD = 2.9), with 20 (44%) men and 27 (56%) women. We selected participants with a corrected vision ≥ 0.8, without any vision deficiency, to ensure reliable recognition of visual stimuli. We recommended that the participants sleep sufficiently and refrain from smoking and consuming alcohol and caffeine the day before the experiment. As the experiment required valid recognition of the participant's facial expression, we limited the use of glasses and cosmetic makeup. All participants were briefed on the purpose and procedure of the experiment, and signed consent was obtained from them. They were then compensated for their participation by payment of a fee.

#### *2.4. Experimental Protocol*

Figure 2 outlines the experimental process and the environment used in this study. The participants were asked to sit 1 m away from a 27-inch LCD monitor. A webcam was installed on the monitor. Participants' brainwaves (EEG cap 18 Ch) and eye movements (gaze tracking device) were acquired, in addition to subjective responses to a questionnaire. We set the frame rate of the gaze-tracking device to 60 frames per second. Participants viewed four emotion-eliciting videos and responded to a questionnaire after each viewing session.

**Figure 2.** Experimental protocol and configuration.

#### **3. Analysis**

Our brain connectivity analysis methods were based on Jamal et al. [27], as outlined in Figure 3. The process consisted of seven stages: (1) sampling the EEG signals at 500 Hz, (2) removing noise through pre-processing, (3) applying a fast Fourier transform (FFT) at 0–30 Hz, (4) band-pass filtering into the delta (0–4 Hz), theta (4–8 Hz), alpha (8–12 Hz), and beta (12–30 Hz) bands, (5) applying a continuous wavelet transform (CWT) with a complex Morlet wavelet, (6) computing the EEG frequency band-specific pairwise phase difference, and (7) determining the optimal number of states in the data using incremental k-means clustering.
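The band boundaries of stage (4) can be written down as a small lookup. This is only an illustrative sketch: the names `EEG_BANDS`, `band_of`, and `band_power` are hypothetical, and treating each band as a half-open interval is an assumption made so that a shared edge such as 4 Hz belongs to exactly one band.

```python
# Band edges in Hz as listed in stage (4). Treating each band as a
# half-open interval [low, high) is an assumption made here so that
# shared edges such as 4 Hz belong to exactly one band.
EEG_BANDS = {
    "delta": (0.0, 4.0),
    "theta": (4.0, 8.0),
    "alpha": (8.0, 12.0),
    "beta": (12.0, 30.0),
}

def band_of(frequency_hz):
    """Return the name of the band containing the frequency, or None."""
    for name, (low, high) in EEG_BANDS.items():
        if low <= frequency_hz < high:
            return name
    return None

def band_power(frequencies, power_spectrum):
    """Sum spectral power per band from paired frequency/power lists."""
    totals = {name: 0.0 for name in EEG_BANDS}
    for f, p in zip(frequencies, power_spectrum):
        name = band_of(f)
        if name is not None:
            totals[name] += p
    return totals
```

Given the frequency bins and power values of an FFT, `band_power` returns one summed value per band and ignores bins outside the listed bands.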

We used the CWT with a complex Morlet wavelet as the basis function to analyze the transient dynamics of phase synchronization. In contrast to the sinusoids used by the discrete Fourier transform (DFT), the wavelet is a short oscillation that decays after a limited duration. Figure 4 shows the Morlet wavelet graph. The CWT compares a signal with scaled and shifted versions of a basic wavelet.

**Figure 3.** The process of brain connectivity analysis.

**Figure 4.** The Morlet wavelet graph.

Because the transform is continuous, the wavelet can be shifted and scaled by any real-valued amount. The CWT can therefore be expressed as Equation (1), where *a* is the scale factor and *b* is the shift factor:

$$X\_w(a,b) = \frac{1}{|a|^{\frac{1}{2}}} \int\_{-\infty}^{\infty} x(t) \overline{\varphi} \left(\frac{t-b}{a}\right) dt\tag{1}$$
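Equation (1) and the pairwise phase difference of stage (6) can be illustrated numerically. The sketch below approximates the integral by a Riemann sum over the sampled signal; the wavelet centre frequency `OMEGA0 = 6.0`, the function names, and the synthetic 10 Hz test signals are assumptions and do not reproduce the authors' implementation.

```python
import cmath
import math

OMEGA0 = 6.0  # centre frequency of the Morlet wavelet (assumed value)

def morlet(u):
    """Complex Morlet wavelet: a Gaussian-windowed complex exponential."""
    return math.pi ** -0.25 * cmath.exp(1j * OMEGA0 * u) * math.exp(-u * u / 2)

def cwt_coefficient(signal, fs, a, b):
    """Approximate X_w(a, b) of Equation (1) by a Riemann sum.

    signal: samples x(t) taken at rate fs (Hz); a: scale (s); b: shift (s).
    """
    dt = 1.0 / fs
    total = 0j
    for n, x in enumerate(signal):
        u = (n * dt - b) / a
        total += x * morlet(u).conjugate() * dt
    return total / math.sqrt(abs(a))

def phase_difference(sig1, sig2, fs, freq_hz, b):
    """Phase of channel 1 minus channel 2 at one frequency and time."""
    a = OMEGA0 / (2 * math.pi * freq_hz)  # scale centring the wavelet on freq_hz
    x1 = cwt_coefficient(sig1, fs, a, b)
    x2 = cwt_coefficient(sig2, fs, a, b)
    return cmath.phase(x1 * x2.conjugate())

# Two synthetic 10 Hz "channels" sampled at 500 Hz, offset by pi/4.
fs, f = 500, 10.0
t = [n / fs for n in range(2 * fs)]
ch1 = [math.cos(2 * math.pi * f * ti + math.pi / 4) for ti in t]
ch2 = [math.cos(2 * math.pi * f * ti) for ti in t]
dphi = phase_difference(ch1, ch2, fs, f, b=1.0)
```

For these synthetic signals, `dphi` recovers the imposed offset of π/4, which shows how the complex coefficient carries the phase information used for connectivity analysis.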

#### **4. Results**

We will present the results of the participants' subjective evaluation and brain connectivity analysis, followed by the results of eye movement analysis.

#### *4.1. Subject Evaluation*

We compared the subjective arousal and valence scores between the four emotion-eliciting conditions (pleasant-aroused, pleasant-relaxed, unpleasant-relaxed, and unpleasant-aroused). We conducted a series of ANOVA tests on the arousal and valence scores. Post-hoc analyses using Tukey's HSD were conducted by adjusting the alpha level to 0.0125 per test (0.05/4).

The mean arousal scores were significantly higher in the aroused conditions (pleasant-aroused, unpleasant-aroused) than in the relaxed conditions (pleasant-relaxed, unpleasant-relaxed) (*p* < 0.001), as shown in Figure 5. The pairwise comparison of the mean arousal scores indicated that the scores were significantly different from one another, as shown in Table 2. The results indicate that participants reported congruent emotional arousal with the target emotion of the stimulus.

The results indicated that the mean valence scores were significantly higher in the pleasant conditions (pleasant-aroused, pleasant-relaxed) than in the unpleasant conditions (unpleasant-aroused, unpleasant-relaxed), *p* < 0.001, as shown in Figure 6. The pairwise comparison of the mean valence scores indicated that the scores were significantly different from one another, except for two comparisons, as shown in Table 3. The results indicate that participants reported congruent emotional valence with the target emotion of the stimulus.

**Figure 5.** Analysis of the arousal values between the four emotion-eliciting conditions.

**Table 2.** Multiple comparisons of mean arousal scores using Tukey HSD.


**Figure 6.** Analysis of the valence values between the four emotion-eliciting conditions.


**Table 3.** Multiple comparisons of mean valence scores using Tukey HSD.

#### *4.2. Brain Connectivity Features*

We computed the EEG frequency band-specific pairwise phase differences for each emotion-eliciting condition, as shown in Figures 7–10. A total of 153 pairwise features were analyzed. If the power differences between the two brain regions are lower than the mean power value, the connectivity is relatively strong. Such cases were marked as unfilled ( ).
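The count of 153 pairwise features follows from choosing 2 of the 18 EEG channels, and the thresholding rule can be sketched in a few lines. The channel names, the function `strong_pairs`, and the toy values are hypothetical; the rule itself (a below-average difference indicates relatively strong connectivity) is taken from the text.

```python
from itertools import combinations

# Hypothetical channel names; the text only states an 18-channel EEG cap.
channels = [f"ch{i:02d}" for i in range(1, 19)]
pairs = list(combinations(channels, 2))  # unordered channel pairs

def strong_pairs(pair_values):
    """Return pairs whose value is below the mean over all pairs.

    `pair_values` maps each channel pair to its difference measure;
    per the text, a below-average difference indicates relatively
    strong connectivity.
    """
    mean_value = sum(pair_values.values()) / len(pair_values)
    return {pair for pair, v in pair_values.items() if v < mean_value}

# Toy values (hypothetical): the difference grows with the pair's index.
toy_values = {pair: float(i) for i, pair in enumerate(pairs)}
strong = strong_pairs(toy_values)
```

With 18 channels, `len(pairs)` equals 18 × 17 / 2 = 153, in agreement with the number of pairwise features reported above.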

We further analyzed the long- and short-distance connectivity of the extracted features. The connectivity of the frontal and occipital lobes can predict the process of information transmission to the occipital lobe after emotion is generated (marked in green in Figure 11). The eigenvalue was the average (N = 47) of the connectivity sum of the two channels defined by the long-distance O-F connectivity.

The prefrontal cortex is involved in emotion regulation, recognition, judgment, and reasoning. The connectivity of the prefrontal lobe to the temporal lobe, parietal lobe, and center helps to understand the information processing process of visual-emotional stimuli (marked in yellow in Figure 11). The eigenvalue was the average (N = 47) of the connectivity sum of the two channels defined by the long-distance prefrontal connectivity.

Long- and short-range connectivity features have been extensively studied for their ability to process social emotions and interactions. Short-distance connectivity characteristics can determine the brain's different states during negative emotions, especially those related to the central-parietal lobe connectivity. We considered a distance of less than 10 cm as short connectivity (marked pink in Figure 11). The eigenvalue was the average (N = 47) of the connectivity sum of the two channels defined by the short-distance connectivity.
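The group eigenvalue defined in the three paragraphs above (the connectivity summed over the channel pairs of a group, then averaged over participants) can be sketched as follows. The channel pairs, the values, and the function names are hypothetical. The 10 cm threshold is taken from the text.

```python
from statistics import mean

SHORT_DISTANCE_CM = 10.0  # threshold stated in the text

def distance_group(distance_cm):
    """Label a channel pair as short- or long-distance by inter-electrode distance."""
    return "short" if distance_cm < SHORT_DISTANCE_CM else "long"

def group_eigenvalue(connectivity_by_participant, group_pairs):
    """Average, over participants, of the summed connectivity of one pair group."""
    per_participant_sums = [
        sum(conn[pair] for pair in group_pairs)
        for conn in connectivity_by_participant
    ]
    return mean(per_participant_sums)

# Hypothetical occipital-frontal pairs and two hypothetical participants
# (the study averaged over N = 47 participants).
ld_of_pairs = [("O1", "F3"), ("O2", "F4")]
participants = [
    {("O1", "F3"): 0.25, ("O2", "F4"): -0.5, ("C3", "P3"): 0.1},
    {("O1", "F3"): 0.75, ("O2", "F4"): 0.5, ("C3", "P3"): 0.2},
]

value = group_eigenvalue(participants, ld_of_pairs)
print(value, abs(value))  # signed eigenvalue and its absolute value
```

Keeping the signed value and its absolute value separate mirrors the later distinction between the sign of the eigenvalue and the |*mean*| plots.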

**Figure 7.** The brain connectivity map in the pleasant-aroused condition.

**Figure 8.** The brain connectivity map in the pleasant-relaxed condition.

**Figure 9.** The brain connectivity map in the unpleasant-relaxed condition.

**Figure 10.** The brain connectivity map in the unpleasant-aroused condition.

**Figure 11.** The three distance connectivity groups in the brain connectivity map.

4.2.1. Characteristics of Three Distance Connectivity

Figure 12 depicts the long-distance connectivity of the occipital and frontal lobes (LD\_O-F connectivity) of the beta wave in the visual comparison diagram of the two-dimensional model. O-F connectivity was strongest in the unpleasant-aroused condition. In the pleasant-relaxed condition, bi-directional connectivity was observed between the left frontal and occipital lobes. In the unpleasant-relaxed condition, bi-directional connectivity was observed from the right occipital to the frontal lobe. In the pleasant-aroused condition, cross-hemispheric connectivity was observed between the frontal and occipital lobes.

**Figure 12.** The long-distance connectivity of the occipital and frontal lobes (LD\_O-F connectivity) of the beta wave.

Figure 13 depicts the long-distance connectivity of the prefrontal and temporal lobes, parietal lobes, and central (LD\_pF connectivity) beta waves in the visual comparison diagram of the two-dimensional model. In pleasant-aroused and unpleasant-relaxed conditions, the right prefrontal lobe was strongly connected to the central, parietal, and temporal lobes of both hemispheres. In the pleasant-relaxed condition, there was strong connectivity in the left prefrontal–temporal, left prefrontal–central, and left prefrontal–parietal regions. In the unpleasant-aroused condition, the prefrontal–temporal, prefrontal–parietal, and prefrontal–central regions showed the weakest connectivity.

**Figure 13.** The long-distance connectivity of the prefrontal and temporal lobes, parietal lobes, and central (LD\_pF connectivity) of the beta wave.

Figure 14 depicts the short-distance connectivity (SD connectivity) of the beta waves in the visual comparison diagram of the two-dimensional emotional model. In the aroused conditions (pleasant-aroused, unpleasant-aroused), strong frontal–temporal–central connectivity was observed. However, in the relaxed conditions (pleasant-relaxed, unpleasant-relaxed), strong central–parietal connectivity was observed.

**Figure 14.** The short-distance connectivity of the prefrontal-temporal lobes, central-parietal lobes, and parietal-temporal lobes (SD connectivity) of the beta wave.

In summary, the analysis suggests a strong frontal activity in the unpleasant-aroused condition, indicating intense information processing and transfer involving the frontal cortex. In pleasant conditions, feedback is sent to the parietal, temporal, and central regions after the prefrontal cortex processes the information. In the unpleasant-relaxed condition, brain connectivity implies the control of the participant's eye movement.

#### 4.2.2. Power Value Analysis in Three Distance Connectivity

To further understand the strength and directionality of brainwave connectivity, statistical analysis was performed on the power value using ANOVA, followed by post hoc analyses (see Figures 15–20).

Figure 15 depicts the eigenvalues (i.e., mean power value) of the occipital and frontal lobe connectivity. The plus-minus sign of the eigenvalue determines the causality. In the unpleasant-aroused condition, more information is processed in the frontal lobe, indicating more activity in the occipital lobe than in primary visual processing.

**Figure 15.** The eigenvalues in the long-distance O-F connectivity.

Figure 16 shows the absolute values of the mean (|*mean*|). The pleasant-relaxed and unpleasant-aroused conditions exhibited high occipital-frontal connectivity, whereas the pleasant-relaxed condition exhibited left hemisphere-frontal activation (see Figure 12).

Figure 17 depicts the eigenvalues (i.e., the mean power value) of prefrontal connectivity. The plus-minus sign of the eigenvalue determines the causality. The results showed that activity in the prefrontal lobe in pleasant conditions (pleasant-aroused, pleasant-relaxed) was greater than that in other regions. Conversely, in the unpleasant conditions (unpleasant-aroused, unpleasant-relaxed), activity in the other regions was stronger than that in the prefrontal lobe.

**Figure 17.** The eigenvalues in the long-distance prefrontal connectivity.

Figure 18 shows the absolute values of the mean (|*mean*|). The unpleasant-relaxed condition exhibited the strongest connectivity.

**Figure 18.** The absolute value in the long-distance prefrontal connectivity.

Figure 19 depicts the eigenvalues (i.e., mean power value) of the short-distance connectivity in frontal–temporal, frontal–central, and temporal–parietal connections in the four emotion-eliciting conditions. Overall, connectivity in the relaxed condition was stronger than that in the aroused condition. Specifically, central–parietal connectivity showed stronger activity than frontal–temporal and frontal–central connectivity (see Figure 14).

**Figure 19.** The eigenvalues in the short-distance connectivity.

Figure 20 shows the absolute values of the mean (|*mean*|). The relaxed conditions (pleasant-relaxed and unpleasant-relaxed) showed stronger connectivity, specifically stronger P-O connectivity. Conversely, the aroused conditions (pleasant-aroused, unpleasant-aroused) showed weaker connectivity, but stronger F-T connectivity. In particular, the unpleasant-aroused, pleasant-aroused, and pleasant-relaxed conditions showed substantial premotor cortical PMDr (F7) connections associated with eye movement control. This was consistent with the saccade results.

Through statistical analysis, we found that connectivity in the pleasant-relaxed condition was the highest, while connectivity in the unpleasant-relaxed condition was higher than that in the pleasant-aroused and unpleasant-aroused conditions.

**Figure 20.** The absolute value in the short-distance connectivity.

By comparing the three extracted brainwave connectivity eigenvalues with subjective evaluations, we found that the long-distance prefrontal connectivity eigenvalues have similar characteristics to the valence score measures of subjective evaluations. The prefrontal cortex (PFC) makes decisions and is responsible for cognitive control. Positive valence increases the neurotransmitter dopamine, enhancing cognitive control [28–30]. This may explain prefrontal activation in pleasant conditions (see Figure 15).

In summary, in the unpleasant-aroused condition, the frontal lobe showed a stronger activation than the occipital lobe. Overall, in pleasant conditions, the prefrontal lobe showed a stronger activation than other regions. Conversely, in unpleasant conditions, the prefrontal lobe showed a weaker activation than other regions.

#### *4.3. Clustering Eye Movement Features*

The statistical results showed that the short-distance connectivity eigenvalue and subjective evaluation arousal score had similar characteristics. Connectivity in the unpleasant-relaxed condition was the strongest (Figure 16). Specifically, central–parietal connectivity showed stronger connectivity than frontal–temporal and frontal–central connectivity. Unpleasant emotions are known to activate central–parietal connectivity [31].

The three eigenvalues of the extracted EEG can be used to distinguish the four emotions in the two-dimensional emotional model. We conducted an unsupervised K-means analysis in chronological order using these three eigenvalues. We distinguished the emotional and non-emotional states of each participant while viewing the emotional video. The emotional and non-emotional states of the eye movement data were then distinguished. Figure 21 shows an instance of a participant's K-means results. Group 1 indicates the non-emotional states, whereas Group 2 indicates the emotional states. The figure implies that the participant's state changes from a non-emotional state (i.e., 0.0) to an emotional state (i.e., 1.0) as a function of time.
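The two-state clustering described above can be sketched with a plain implementation of Lloyd's K-means algorithm (k = 2) applied to time-ordered windows of the three eigenvalues. The window values, the initialization from the first and last windows, and the mapping of cluster 1 to the emotional state are assumptions for illustration. The study does not report these details here.

```python
import math

def kmeans(points, initial_centroids, n_iter=100):
    """Plain Lloyd's algorithm. Returns (labels, centroids)."""
    centroids = [list(c) for c in initial_centroids]
    labels = []
    for _ in range(n_iter):
        # Assignment step: nearest centroid by Euclidean distance.
        new_labels = [
            min(range(len(centroids)), key=lambda j: math.dist(p, centroids[j]))
            for p in points
        ]
        if new_labels == labels:
            break  # converged: assignments no longer change
        labels = new_labels
        # Update step: move each centroid to the mean of its members.
        for j in range(len(centroids)):
            members = [p for p, lab in zip(points, labels) if lab == j]
            if members:  # keep the old centroid if a cluster is empty
                centroids[j] = [sum(dim) / len(members) for dim in zip(*members)]
    return labels, centroids

# Hypothetical windows in chronological order:
# (LD_O-F eigenvalue, LD_pF eigenvalue, SD eigenvalue).
windows = [
    (0.1, 0.2, 0.1), (0.2, 0.1, 0.2), (0.1, 0.1, 0.1),
    (0.9, 0.8, 1.0), (1.0, 0.9, 0.9), (0.8, 1.0, 0.9),
]

# Assumed initialization: first and last windows as starting centroids.
labels, centroids = kmeans(windows, [windows[0], windows[-1]])
print(labels)  # [0, 0, 0, 1, 1, 1]
```

Read in time order, the label sequence plays the role of the state trace in Figure 21, with 0 standing for the non-emotional state and 1 for the emotional state under the assumed mapping.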

**Figure 21.** An instance of a participant's k-Means results.

Figures 22 and 23 depict the post-hoc analysis of the left and right pupils between the two-dimensional emotional model conditions. From the statistical results of the eye movement eigenvalues, the characteristics of the right pupil and left pupil did not change much between the four conditions; the pupil of the pleasant-aroused condition had the largest change, followed by the pleasant-relaxed and unpleasant-relaxed conditions. The least difference was observed in the unpleasant-aroused condition.

However, in relaxed conditions (pleasant-relaxed and unpleasant-relaxed), the right pupil of the unpleasant-relaxed condition was larger than the left pupil. From the first eigenvalue long-distance O-F connectivity of brain wave connectivity, we found two locations with high connectivity: the right occipital lobe and the left and right prefrontal lobes.

Figure 24 shows the results of the post hoc analysis of the fixation between the two-dimensional emotional model conditions. The fixation feature in the unpleasant-relaxed condition was larger than that in the other three conditions.

**Figure 23.** The post hoc analysis of the right pupil. \*\* *p* < 0.05. \*\*\* *p* < 0.001.

Figure 25 shows the results of the post hoc analysis of the saccade between the two-dimensional emotional model conditions. The results showed the lowest change in the unpleasant-relaxed condition, and the greatest change in the pleasant-relaxed condition. The characteristics of the saccades were similar to those of the short-distance connectivity eigenvalues. Short-distance connectivity also showed weak brain connections in the unpleasant-relaxed condition (see Figure 14). After the frontal lobe makes a cognitive judgment, it gives instructions to the occipital lobe, causing saccadic eye movements.

**Figure 25.** The post hoc analysis on the saccade. \*\*\* *p* < 0.001.

#### **5. Conclusions and Discussion**

This study aimed to understand the relationship between brain wave connectivity and eye movement characteristic values using a two-dimensional emotional model. We divided brainwave connectivity into three distinct groups: long-distance occipital–frontal connectivity, long-distance prefrontal connectivity between the prefrontal lobe and temporal lobe, parietal lobe, and central lobe, and short-distance connectivity including the characteristic relationships between the frontal lobe–temporal lobe, frontal lobe–central lobe, temporal–parietal lobe, and parietal lobe–central. Then, through unsupervised learning of these three eigenvalues, the emotional response was divided into emotional and non-emotional states in real time using K-means analysis. The two states were used to extract the feature values of the eye movements. We analyzed the relationship between eye movements and brain wave connectivity using statistical analyses.

The results revealed that the connectivity eigenvalues of the long-distance prefrontal lobe, temporal lobe, parietal lobe, and center are related to cognitive activity involving high valence. The prefrontal lobe occupies two-thirds of the human frontal cortex [32] and is responsible for recognition and decision-making, reflecting cognitive judgment from valence responses [33,34]. Specifically, the dorsolateral prefrontal cortex (dlPFC) is involved with working memory [35], decision making [36], and executive attention [37]. However, most recently, Nejati et al. [32] found that the role of dlPFC extends to the regulation of the valence of emotional experiences. Second, the saccade correlated with long-distance occipital-frontal connectivity. After making a judgment, the frontal lobe provides instructions to the occipital lobe, which moves the eye. Electrical stimulation of several areas of the cortex evokes saccadic eye movements. The prefrontal top-down control of visual appraisal and emotion-generation processes constitutes a mechanism of cognitive reappraisal in emotion regulation [38]. The short-distance connectivity results showed emotional fluctuations caused by the unconscious stimulation of audio-visual perception.

We acknowledge some limitations of the research. First, the results of our study are from one stimulus for each of the four quadrants in the two-dimensional model. Future studies may use multiple stimuli, possibly controlling the type of stimuli. Second, although pupillometry is an effective measurement for understanding brain activity changes related to arousal, attention, and salience [39], we did not find consistent and conclusive results between pupil size and brain connectivity. The size of pupils changes according to ambient light (i.e., pupillary light reflex) [40,41], which may have confounded the results. Future studies should control extraneous variables more thoroughly to find the main effect of pupil characteristics. Third, our participants were local university students, which limited the age range (i.e., 20 to 30 years). Age and culture may influence the results, so future studies may consider a broader range of demographic populations and conduct a cross-cultural investigation.

The study purposely analyzed brain connectivity and changes in eye movement in tandem to establish a relational basis between neural activity and eye movement features. We took the first step in unraveling such a relationship, although we fell short of a full understanding, for example of the pupil size characteristics. Because the eyes' structures are connected to the brain's nerves, an exclusive analysis of eye features may lead to a comprehensive understanding of the participant's emotions. A non-contact appraisal of emotion based on eye feature analysis may be a promising method applicable to metaverse or media art.

**Author Contributions:** J.Z.: conceptualization, methodology, software, validation, formal analysis, investigation, resources, data curation, writing, visualization, project administration; S.P.: methodology, validation, formal analysis, investigation, writing, review, editing; A.C.: conceptualization, investigation, review, editing; M.W.: conceptualization, methodology, writing, review, supervision, funding acquisition. All authors have read and agreed to the published version of the manuscript.

**Funding:** This work was supported by the Electronics and Telecommunications Research Institute (ETRI) grant funded by the Korean government (22ZS1100, Core Technology Research for Self-Improving Integrated Artificial Intelligence System).

**Institutional Review Board Statement:** The study was conducted according to the guidelines of the Declaration of Helsinki, and approved by the Institutional Review Board of Sangmyung University (protocol code C-2021-002, approved 9 July 2021).

**Informed Consent Statement:** Informed consent was obtained from all subjects involved in the study. Written informed consent has been obtained from the subjects to publish this paper.

**Conflicts of Interest:** The authors declare no conflict of interest.

#### **References**


## *Article* **Subject-Specific Cognitive Workload Classification Using EEG-Based Functional Connectivity and Deep Learning**

**Anmol Gupta <sup>1</sup>, Gourav Siddhad <sup>1</sup>, Vishal Pandey <sup>2</sup>, Partha Pratim Roy <sup>1</sup> and Byung-Gyu Kim <sup>3,</sup>\***


**\*** Correspondence: bg.kim@sookmyung.ac.kr

**Abstract:** Cognitive workload is a crucial factor in tasks involving dynamic decision-making and other real-time and high-risk situations. Neuroimaging techniques have long been used for estimating cognitive workload. Given the portability, cost-effectiveness and high time-resolution of EEG as compared to fMRI and other neuroimaging modalities, an efficient method of estimating an individual's workload using EEG is of paramount importance. Multiple cognitive, psychiatric and behavioral phenotypes have already been known to be linked with "functional connectivity", i.e., correlations between different brain regions. In this work, we explored the possibility of using different model-free functional connectivity metrics along with deep learning in order to efficiently classify the cognitive workload of the participants. To this end, 64-channel EEG data of 19 participants were collected while they were doing the traditional n-back task. These data (after pre-processing) were used to extract the functional connectivity features, namely Phase Transfer Entropy (PTE), Mutual Information (MI) and Phase Locking Value (PLV). These three were chosen to do a comprehensive comparison of directed and non-directed model-free functional connectivity metrics (allows faster computations). Using these features, three deep learning classifiers, namely CNN, LSTM and Conv-LSTM were used for classifying the cognitive workload as low (1-back), medium (2-back) or high (3-back). With the high inter-subject variability in EEG and cognitive workload and recent research highlighting that EEG-based functional connectivity metrics are subject-specific, subject-specific classifiers were used. Results show the state-of-the-art multi-class classification accuracy with the combination of MI with CNN at 80.87%, followed by the combination of PLV with CNN (at 75.88%) and MI with LSTM (at 71.87%). 
The highest subject-specific performance was achieved by the combinations of PLV with Conv-LSTM and PLV with CNN, each with an accuracy of 97.92%, followed by the combination of MI with CNN (at 95.83%) and MI with Conv-LSTM (at 93.75%). The results highlight the efficacy of combining EEG-based model-free functional connectivity metrics with deep learning to classify cognitive workload. The work can further be extended to explore the possibility of classifying cognitive workload in real-time, dynamic and complex real-world scenarios.

**Keywords:** CNN; cognitive workload; functional connectivity analysis; LSTM; mental workload; mutual information; phase locking value; phase transfer entropy

**Citation:** Gupta, A.; Siddhad, G.; Pandey, V.; Roy, P.P.; Kim, B.-G. Subject-Specific Cognitive Workload Classification Using EEG-Based Functional Connectivity and Deep Learning. *Sensors* **2021**, *21*, 6710. https://doi.org/10.3390/s21206710

Academic Editor: Giovanni Sparacino

Received: 19 August 2021 Accepted: 2 October 2021 Published: 9 October 2021

**Publisher's Note:** MDPI stays neutral with regard to jurisdictional claims in published maps and institutional affiliations.

**Copyright:** © 2021 by the authors. Licensee MDPI, Basel, Switzerland. This article is an open access article distributed under the terms and conditions of the Creative Commons Attribution (CC BY) license (https://creativecommons.org/licenses/by/4.0/).

#### **1. Introduction**

Cognitive workload is the measure of the amount of mental effort required to complete any task [1]. Working memory is required to process information for short periods of time, while long-term memory is associated with storing information for long periods of time [2]. Tasks such as arithmetic operations, reading and learning require efficient use of working memory. Cognitive workload can be defined as the amount of mental activity utilized by working memory to complete any task. Assessment of an individual's cognitive workload is an essential component in most human-machine collaboration tasks. A major application of this lies in the defense domain. Operations like driving under high-stress environmental conditions, monitoring air traffic control, piloting an aircraft or operating an unmanned vehicle are excellent examples. The optimal level of cognitive workload is pivotal in high-risk scenarios where important decisions are supposed to be made in real-time. The rate at which the information is processed determines the workload induced in any individual while performing any task. A high workload can lead to unplanned and disproportionate hazards, and too little workload can lead to being disengaged from the task. This points to the importance of maintaining optimal cognitive workload in high-risk scenarios to perform the task satisfactorily. With respect to cognitive workload, emotional intelligence and stability are regarded as essential components. An individual's cognitive load will be affected by emotional valence as it will interfere with parallel cognitive processing. Studies show a positive relation between emotional intelligence and some cognitive tasks [3,4]. Therefore, classification of cognitive workload can be an essential indicator of emotional intelligence and stability.

Although the assessment of cognitive workload is important, it is not a trivial task. Traditional methods of evaluating cognitive workload included subjective measures such as interviews or questionnaire-based approaches where the participants self-reported the amount of workload induced during the task. Various research groups such as Hart et al. [5] and Malekpour et al. [6] contribute towards the assessment of cognitive workload with the use of subjective methods, primarily in the form of self-assessment questionnaires, like NASA-TLX (National Aeronautics and Space Administration Task Load Index), MCH (Modified Cooper-Harper Scale) and SWAT (Subjective Workload Assessment Test). Such questionnaires generally record the various metrics involved in performing the task, such as demand (mental, physical and temporal), effort, pressure, concentration, frustration, etc., to evaluate their connection with performance during the task. However, these methods are subjective to the individual participant and can be biased. Because they depend on the participant recalling past engagement, they are unreliable as a distinct and coherent metric for the evaluation and estimation of cognitive workload in general. Another drawback of post-task questionnaires is that they do not allow for real-time evaluation of cognitive workload.

In contrast to the subjective questionnaire-based methods, evaluation based on neuro-physiological signals presents an opportunity for an objective and real-time assessment of cognitive workload. However, this method of evaluation is constrained by the limited availability of equipment and trained operators and by high costs. To obtain better efficacy and efficiency, physiological measures such as Electroencephalography (EEG), Event-Related Potential (ERP), Eye Tracking (gaze entropy), and Heart Rate Variability (HRV) can be utilized [7–9]. EEG is highly accepted as a measure to assess cognitive workload in real-time [10–12]. Various EEG features including time, frequency, time-frequency, and spatial domain features extracted from raw EEG data are effective ways to gain information from EEG signals. Time domain features mainly include Event Related Potentials (ERP) [13], statistical features (mean, standard deviation, variance, etc.), higher-order crossing analysis [14], and Hjorth parameters. Frequency domain features include decomposing the frequency in multiple sub-bands such as delta, theta, alpha, beta, and gamma bands which are mainly associated with deep sleep, drowsy, relaxed, engaged, conscious, and active states, respectively [15]. Such features are commonly used for classification of workload in various machine learning experiments. Recent advancements in the application of deep learning in various domains such as emotion recognition, pattern recognition and prediction make it an excellent choice to be used with EEG signals for classification [16–19]. EEG signals can be used to decode and classify the human cognitive state. Various studies have carried out research in the area with different combinations of EEG features and machine learning models. Bashivan et al.
[20] demonstrate the use of the fast Fourier transform to convert EEG data into the frequency domain and map the 3D spatial positions of electrodes to 2D, according to the distribution of the electrodes. Using theta, alpha and beta frequency bands, 3-channel spectral maps are generated and sent to a CNN model for classification of mental load. Kwak et al. [21] propose a multi-level feature fusion method based on CNN to learn the spectral, spatial, as well as local and global information. Li et al. [22] review some deep learning models (e.g., RNN and CNN) and their applications for EEG data to decode brain activities and diagnose brain diseases.

Research on estimating cognitive workload from EEG using machine learning and deep learning remains limited. Most of the studies perform binary classification of workload into high and low by extracting computationally expensive EEG features from the raw data, which makes them unsuitable for real-life or real-time use. Das et al. [23] report an accuracy of 86.33% and 82.57% for binary and three-class classification, respectively, using a BLSTM-LSTM based architecture in a subject-independent study. Appriou et al. [24] perform subject-specific and subject-independent studies for binary classification of workload, achieving the highest mean accuracy of 72.7% and 63.7% using CNN for subject-specific and subject-independent cases, respectively. In the study by Zhang et al. [25], the authors achieved an accuracy of 88.9% in binary classification using a combination of RNN and 3D CNN models with EEG topographic maps as features for classification. Using a similar technique of topographic maps in combination with a modified CNN model, the highest accuracy of 91.9% in subject-specific three-class classification is reported [26]. However, more informative features regarding an individual's brain can be obtained from EEG data. Information acquired from signals originating from a specific brain region can be regarded to represent the brain activity of that region. This allows the study of separate brain regions in isolation when evaluating characteristics relevant to a specific cognitive state, and this methodology has been adopted by various researchers. However, neuronal activity is not this straightforward: different regions of the brain contribute to the completion of a task, while specific regions are still dominantly responsible for specific functions required for the completion of the task. This makes it necessary to examine the inter-regional interactions to understand the collaboration of the different brain regions. More formally, this analysis is termed brain connectivity.

Brain connectivity has been used to study the nature of the cerebrum in the past. Based on the attributes of connections, it can be classified into three types: structural connectivity (biophysical connections between neurons or neural elements), functional connectivity (statistical relations between anatomically unconnected cerebral regions) and effective connectivity (directional causal effects from one neural element to another) [27]. This study focuses on the exploration of functional brain connectivity as a measure to assess different levels of workload. Brain functional connectivity has been linked with psycho-physiological diseases involving cognitive deficits. Strong patterns of connectivity in resting state EEG are evident in autism spectrum disorders as reported by [28]. Slower and less efficient connectivity is found in schizophrenia patients as reported by [29]. Another study suggested a relation between high frequency connectivity neural pattern and recurrent illness course of major depressive disorder [30]. However, few studies have investigated the links between cognitive workload and brain functional connectivity networks. Dimitrakopoulos et al. [31] is one such study that has used a brain connectivity measure as a feature for classification of workload. This study used correlation as a method of brain connectivity and achieved an accuracy of 88% for binary classification using an SVM classifier. Another study by Islam et al. [32] explored the use of Mutual Information based functional connectivity for binary classification of drivers' mental workload using the SVM classifier and obtained an accuracy of 82%. There are only a limited number of studies that explore functional connectivity as a feature for classification of workload. Therefore, in this study we explore different functional brain connectivity methods as features to be used for classification of levels of cognitive workload. EEG data is known to have high inter-subject variability [33,34].
Various researchers such as Byrne et al. [35] and Pang et al. [36] study the inter-subject variability. Nentwich et al. [37] report the subject-specific nature of EEG-based functional connectivity. Given this evidence, subject-specific classification of workload is the aim of this study. In Zhang et al. [38], the authors compared the subject-dependent and subject-independent approaches and highlighted that variations in the feature distribution of EEG across subjects reduce the generalization ability of a classifier, while the subject-dependent approach provides a promising way to solve the problem of personalized classification. In Neto et al. [39], the authors discussed various subject-specific characteristics and data splitting techniques for EEG data. A possible advantage of subject-specific classification is that the classifier can learn subject-dependent features, which can be useful in building robust and effective BCI systems [40,41].

The contributions of this paper can be summarized as follows:


The rest of the paper is organized as follows. Section 2 presents the materials and methods used in the experiment. Section 3 discusses the results obtained in the various experiments, and Section 4 presents the implications of the reported results and possible future directions and extensions of the current work.

#### **2. Materials and Methods**

#### *2.1. Participants*

A total of 19 participants (11 males and 8 females, mean age = 20.1 years, standard deviation = 1.2 years, minimum age = 19 years, maximum age = 23 years) at the Department of Biomedical Engineering, Institute of Nuclear Medicine and Allied Sciences, Delhi, India participated in this study. An institutional ethical committee at the Institute of Nuclear Medicine and Allied Sciences approved the study. Participation in the study was voluntary, and the subjects gave written consent before participating in the study. Out of 19 participants, 18 participants were right-handed, and one was left-handed. None of the participants reported neurological/psychological/mental history of any kind. All the participants hailed from a Science/Engineering/Technology/Mathematics (STEM) background. All the participants received a flat payment of INR 50, irrespective of their performance in the study.

#### *2.2. The N-Back Task*

The modern version of the n-back task [42] was designed using OpenSesame v 3.3.6 [43]. The n-back task is one of the most widely used psychological tests for inducing cognitive workload. In the task, the participants were required to observe a sequence of single digits separated by a small interval of time, and for each digit they were required to identify whether the stimulus was a target (identical to the digit that had appeared 'n' digits back in the sequence) (see Figure 1). During a session/block, the value of 'n' was kept constant. An increase in the value of 'n' induced cognitive workload according to [43]. The participants were required to respond to the presented stimuli depending on the value of 'n'.

**Figure 1.** Schematic of the n-back task used for the cognitive workload classification. The participants were required to observe a sequence of single digits and determine whether each stimulus was a target. A target is a digit that is identical to the digit that appeared 'n' digits back in the sequence. For example, in the 2-back scenario, 5 is the target because the sequence of digits was 9,**5**,2,**5**.

A total of 339 sessions were presented to each participant in a randomized manner, with 113 sessions each for 1-, 2- and 3-back. Each session began with an instruction set that was displayed for 5 s, in which the participants were informed about the nature of the session (the value of 'n'). After the instruction block, digits from the set 1–9 appeared on the screen in sequence. Each digit stayed on the screen for 500 ms, and the participants were given 1500 ms to respond. The participants had to press the space bar if the displayed digit was a target for that session. The inter-stimulus interval was therefore 2000 ms (500 ms during which the stimulus was displayed and 1500 ms for the response). The task was designed in accordance with the standard n-back format. The n-back stimuli occurred within a visual angle of about 40° horizontally and about 4.50° vertically, so that the stimuli fell within the participants' visual field and required minimal eye movement. The stimuli were presented using OpenSesame [43], an open-source experiment builder. A missed target was also counted as an incorrect response. The first three sessions of each condition (n-back level) were removed from further data analysis.
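The target rule described above can be sketched in a few lines of Python. This is a minimal illustration, not the OpenSesame implementation used in the study: the function names `nback_targets` and `score_responses` are hypothetical, and the scoring assumes that both missed targets and key presses on non-targets count as incorrect.

```python
def nback_targets(digits, n):
    """Return a list of booleans: True where digits[i] equals digits[i - n]."""
    return [i >= n and digits[i] == digits[i - n] for i in range(len(digits))]


def score_responses(digits, n, pressed):
    """Fraction of correct trials (assumption: misses and false alarms are both incorrect)."""
    targets = nback_targets(digits, n)
    correct = sum(t == p for t, p in zip(targets, pressed))
    return correct / len(digits)


sequence = [9, 5, 2, 5]  # the 2-back example from Figure 1
targets = nback_targets(sequence, n=2)
# Hypothetical responses: the space bar is pressed only on the last digit.
accuracy = score_responses(sequence, 2, [False, False, False, True])
```

In the Figure 1 sequence only the final 5 is flagged, because it matches the digit two positions earlier.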

#### *2.3. Physiological Data Acquisition and Pre-Processing*

Sixty-four-channel EEG was recorded through Ag/AgCl electrodes conforming to the extended 10–20 system of electrode placement. An eego™ mylab amplifier (ANT Neuro, Enschede, The Netherlands) was used for the data acquisition. Electrooculogram (EOG) data were acquired from a single electrode placed below the right eye. All channels were grounded to channel CPz. Impedances were kept below 20 kΩ. The EEG data were sampled at 2048 Hz and later downsampled to 256 Hz. During the recording, the participants were requested to sit in a relaxed posture to avoid contaminating the data with movement artifacts. The data were re-referenced to linked mastoids for the subsequent analyses. For pre-processing, DC offset correction was applied, followed by band-pass filtering at 0.1–45 Hz, and finally ICA was used to remove ocular and other artifacts. The data were then segmented according to the three conditions (1-, 2- and 3-back) for all 19 subjects.

#### *2.4. Feature Extraction*

Different cognitive tasks activate different specialized brain areas where the brain could dynamically coordinate the information flow to achieve the task [44]. Functional Connectivity is a method of quantifying these neuronal interactions. There exist many different algorithms for calculating these interactions using electrophysiological data. These algorithms can be divided into different domains based on the direction of the interaction among brain regions and interdependence of the signals [45]. In this study, we chose three connectivity metrics namely Mutual Information (MI), Phase Locking Value (PLV) and Phase Transfer Entropy (PTE). The reason for choosing these three metrics was to compare directed and non-directed model-free measures. One goal of the study was to build a near real-time framework for workload estimation using EEG, which is why only model-free connectivity measures were chosen. Therefore, we used only the raw (cleaned) EEG data to calculate the metrics.

Another important aspect of making the system fast was the choice of dimensions for the connectivity matrix. To that end, 16 electrodes were chosen from the available 64. The 16 electrodes were chosen with Brodmann areas in mind, as functional connectivity implies interaction between different brain regions. Kaiser [46] defined a mapping between the EEG electrodes and different Brodmann areas; therefore, we selected the same 16 EEG electrodes. The electrodes were Fp1, Fp2, F7, F3, F4, F8, T7, C3, C4, T8, P7, P3, P4, P8, O1 and O2. The Brodmann areas most closely associated with these electrodes are 10, 10, 47, 8, 8, 45, 42, 2, 1, 21, 37, 39, 39, 37, 18 and 18, respectively. This electrode placement is also reported to be optimal for source localization [46]. We used the pre-processed EEG data to calculate the 16 × 16 functional connectivity matrices. Next, the different connectivity measures are discussed.
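The electrode selection above can be written down as a small lookup table, together with a generic helper that fills a 16 × 16 matrix from any pairwise measure. This is only a sketch: the electrode and area lists are copied from the text, while `connectivity_matrix` and its `signals`/`measure` arguments are hypothetical names introduced for illustration.

```python
ELECTRODES = ["Fp1", "Fp2", "F7", "F3", "F4", "F8", "T7", "C3",
              "C4", "T8", "P7", "P3", "P4", "P8", "O1", "O2"]
BRODMANN_AREAS = [10, 10, 47, 8, 8, 45, 42, 2, 1, 21, 37, 39, 39, 37, 18, 18]

# Electrode name -> closest associated Brodmann area, in the order given in the text.
ELECTRODE_TO_AREA = dict(zip(ELECTRODES, BRODMANN_AREAS))


def connectivity_matrix(signals, measure):
    """Build a 16 x 16 nested list of measure(signal_a, signal_b).

    `signals` maps electrode name -> sequence of samples (hypothetical layout);
    `measure` is any pairwise function such as MI, PLV or PTE.
    """
    return [[measure(signals[a], signals[b]) for b in ELECTRODES] for a in ELECTRODES]
```

For a symmetric measure such as MI or PLV the resulting matrix is symmetric, whereas a directed measure such as PTE generally gives different values above and below the diagonal.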

#### 2.4.1. Mutual Information (MI)

In information theory, MI is used to quantify the interdependence between two time series [47]. For a pair of discretized random variables *x* and *y* that are recorded from time series with their respective probability distribution functions *P*(*x*) and *P*(*y*), and joint probability function *P*(*x*, *y*), the MI between *x* and *y* can be defined as:

$$MI\_{xy} = \sum\_{x \in X, y \in Y} P(x, y) \log \frac{P(x, y)}{P(x)P(y)}.\tag{1}$$

MI was proposed as a measure to quantify the strength of functional connectivity between a pair of time series data.
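Equation (1) can be estimated from two discretized time series by counting joint occurrences. The sketch below is a plug-in estimate under stated assumptions: equal-width binning and the natural logarithm are choices made here for illustration, and the helper names `discretize` and `mutual_information` are hypothetical, not taken from the study.

```python
import math
from collections import Counter


def discretize(samples, n_bins):
    """Map each sample to an equal-width bin index in [0, n_bins - 1]."""
    lo, hi = min(samples), max(samples)
    width = (hi - lo) / n_bins or 1.0  # guard against a constant signal
    return [min(int((s - lo) / width), n_bins - 1) for s in samples]


def mutual_information(x_bins, y_bins):
    """Equation (1), with probabilities estimated from bin counts (natural log)."""
    n = len(x_bins)
    count_x = Counter(x_bins)
    count_y = Counter(y_bins)
    count_xy = Counter(zip(x_bins, y_bins))
    return sum(
        (c / n) * math.log((c / n) / ((count_x[a] / n) * (count_y[b] / n)))
        for (a, b), c in count_xy.items()
    )
```

For two identical binary series with equal class frequencies the estimate equals ln 2, and for independent series it is zero, which matches the interpretation of MI as a strength of interdependence.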

#### 2.4.2. Phase Locking Value (PLV)

Phase locking value (PLV) is a measure that quantifies the phase synchronization of signals acquired from separate brain areas. The analytic representations of two signals originating from brain regions *k* and *l*, *s*<sub>*k*</sub>(*t*) and *s*<sub>*l*</sub>(*t*), are obtained by the Hilbert transform and expressed as [48,49]:

$$z\_k(t) = A\_k(t)e^{j\varphi\_k(t)},\tag{2}$$

$$z\_l(t) = A\_l(t)e^{j\varphi\_l(t)}.\tag{3}$$

The differences in phase are then calculated at each time point by

$$
\Delta \varphi\_{k,l}(t) = \varphi\_k(t) - \varphi\_l(t). \tag{4}
$$

Thereafter, by averaging over all time points (*n*<sub>*t*</sub> being the number of time points), the PLV between the brain regions *k* and *l* is obtained as:

$$PLV(k,l) = \frac{1}{n\_t} \left| \sum\_{t=1}^{n\_t} e^{j\Delta\varphi\_{k,l}(t)} \right|.\tag{5}$$

The PLV ranges between 0 (which reflects no phase synchronization) and 1 (which reflects perfect phase synchronization). After the PLV calculation is repeated for all brain regions, it is assembled to form a connectivity matrix.
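The averaging in Equations (4) and (5) can be illustrated as follows. This is a minimal sketch that assumes the instantaneous phases have already been extracted (the Hilbert-transform step is not shown), and the function name `phase_locking_value` is hypothetical.

```python
import cmath
import math


def phase_locking_value(phase_k, phase_l):
    """Equations (4)-(5): magnitude of the mean unit phasor of the phase difference."""
    n_t = len(phase_k)
    total = sum(cmath.exp(1j * (pk - pl)) for pk, pl in zip(phase_k, phase_l))
    return abs(total) / n_t


# Synthetic phases covering one full cycle in 100 steps.
phases = [2 * math.pi * i / 100 for i in range(100)]
locked = phase_locking_value(phases, [p - 0.3 for p in phases])  # constant phase lag
unlocked = phase_locking_value(phases, [0.0] * 100)  # difference sweeps the whole circle
```

A constant phase lag gives a PLV of 1 regardless of the size of the lag, while a phase difference spread evenly around the circle gives a PLV close to 0.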

#### 2.4.3. Phase Transfer Entropy (PTE)

The flow of information between neuronal regions is quantified by estimating the causal influence that one region exerts on another. Among the many methods available for quantifying neuronal interactions, PTE is the only measure that is both phase-specific and directed in nature. To quantify the interactions adequately, a connectivity metric should:


PTE [52] is a method of quantifying directed phase interaction across trials as well as in continuous data, using binning methods for state-space reconstruction and based on the same principle as Wiener-Granger causality [53]. In the framework of information theory, Wiener-Granger causality can be restated as: "a source signal has causal influence on the target signal, if the uncertainty of the target signal conditioned by the source signal and its own past is smaller than the uncertainty of the target signal conditioned by its own past" [54]. The instantaneous phase and amplitude of a signal *x*(*t*) can be obtained from its analytic representation, as expressed in Equation (2). The PTE for an analysis lag *θ* can be defined as:

$$PTE\_{XY} = H(\varphi\_y(t), \varphi\_y(t')) + H(\varphi\_y(t'), \varphi\_x(t')) - H(\varphi\_y(t')) - H(\varphi\_y(t), \varphi\_y(t'), \varphi\_x(t')), \tag{6}$$

where *ϕ*<sub>*x*</sub>(*t*′) and *ϕ*<sub>*y*</sub>(*t*′) are the past states at lag *θ*, i.e., *ϕ*<sub>*x*</sub>(*t*′) = *ϕ*<sub>*x*</sub>(*t* − *θ*) and *ϕ*<sub>*y*</sub>(*t*′) = *ϕ*<sub>*y*</sub>(*t* − *θ*). The marginal and the joint entropies can then be defined as [55]:

$$H(\varphi\_y(t), \varphi\_y(t')) = -\sum p(\varphi\_y(t), \varphi\_y(t')) \log p(\varphi\_y(t), \varphi\_y(t')),\tag{7}$$

$$H(\varphi\_y(t'), \varphi\_x(t')) = -\sum p(\varphi\_y(t'), \varphi\_x(t')) \log p(\varphi\_y(t'), \varphi\_x(t')),\tag{8}$$

$$H(\varphi\_y(t')) = -\sum p(\varphi\_y(t')) \log p(\varphi\_y(t')),\tag{9}$$

$$H(\varphi\_y(t), \varphi\_y(t'), \varphi\_x(t')) = -\sum p(\varphi\_y(t), \varphi\_y(t'), \varphi\_x(t')) \log p(\varphi\_y(t), \varphi\_y(t'), \varphi\_x(t')), \tag{10}$$

where the probabilities are computed from histograms of occurrences of single, pairs or triplets of phase estimates in an epoch. The prediction delay *θ* and the number of bins in the histogram were set to (*L* × *CH*)/*N*<sub>±</sub> and *e*<sup>0.626+0.4 ln(*L*−*θ*−1)</sup>, respectively, where *L* is the length of the epoch in samples, *CH* is the number of channels and *N*<sub>±</sub> is the number of times the phase changed its sign across time and channels. The PTE values were normalized between 0 and 1, with 0.5 < *PTE*<sub>*xy*</sub> ≤ 1 implying an information flow of *x* → *y*, 0 ≤ *PTE*<sub>*xy*</sub> < 0.5 implying information flow preferentially of *x* ← *y*, and *PTE*<sub>*xy*</sub> = 0.5 implying no preferential flow of information.
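Equations (6)–(10) translate into a short histogram-based estimate. The following is a sketch under stated assumptions rather than the implementation used in the study: phases are assumed to lie in [−π, π] and to be binned with equal widths, the function names are hypothetical, and the final normalization `pte_xy / (pte_xy + pte_yx)` is one common convention that is assumed here because the text does not spell out how the values were scaled to the 0–1 range.

```python
import math
import random
from collections import Counter


def entropy(states):
    """Shannon entropy (natural log) of a sequence of hashable states."""
    n = len(states)
    return -sum((c / n) * math.log(c / n) for c in Counter(states).values())


def bin_phase(phases, n_bins):
    """Assign each phase in [-pi, pi] to one of n_bins equal-width bins."""
    return [min(int((p + math.pi) / (2 * math.pi) * n_bins), n_bins - 1) for p in phases]


def phase_transfer_entropy(phase_x, phase_y, lag, n_bins):
    """Equation (6): information flow x -> y at the given lag (lag must be >= 1)."""
    x = bin_phase(phase_x, n_bins)
    y = bin_phase(phase_y, n_bins)
    y_now, y_past, x_past = y[lag:], y[:-lag], x[:-lag]
    return (
        entropy(list(zip(y_now, y_past)))
        + entropy(list(zip(y_past, x_past)))
        - entropy(y_past)
        - entropy(list(zip(y_now, y_past, x_past)))
    )


def bins_rule(epoch_length, lag):
    """Bin count quoted in the text: exp(0.626 + 0.4 ln(L - lag - 1))."""
    return math.exp(0.626 + 0.4 * math.log(epoch_length - lag - 1))


# Synthetic check: y copies x with a one-sample delay, so x should drive y.
rng = random.Random(0)
phase_x = [rng.uniform(-math.pi, math.pi) for _ in range(2000)]
phase_y = [0.0] + phase_x[:-1]
pte_xy = phase_transfer_entropy(phase_x, phase_y, lag=1, n_bins=4)
pte_yx = phase_transfer_entropy(phase_y, phase_x, lag=1, n_bins=4)
normalized = pte_xy / (pte_xy + pte_yx)  # assumed normalization, not stated in the text
```

In this synthetic case the estimate for *x* → *y* is large and the reverse direction is close to zero, so the normalized value lies above 0.5, in line with the interpretation given above. A small bin count is used in the demonstration because the triple histogram in Equation (10) needs many samples per cell.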

#### *2.5. Classification*

The classification of workload was implemented using three different variants of convolutional and recurrent neural networks that provide different feature extraction and learning capabilities, and a comparison of their performance is presented. The inputs to all three networks were the MI, PLV and PTE connectivity matrices described above, each of shape 16 × 16. The networks were trained using Python 3.9 and TensorFlow 2.4 on an Nvidia DGX server at the Indian Institute of Technology, Roorkee. For processing the input and feeding it to the model, we used the TensorFlow Datasets API with a 70/15/15 split for training, validation and testing data. As mentioned earlier, the n-back task was composed of 339 sessions; we calculated one matrix per session, giving 339 matrices for each participant. With the 70/15/15 split, there were 237, 51 and 51 matrices for training, validation and testing, respectively, for each of the 19 subjects. We used a batch size of 64 and trained each model for 1000 epochs. During training, early stopping [56] and a learning rate scheduler [57] were used to improve the convergence time. The motivation and details of the networks used are as follows. The CNN classifier [58] was chosen because the input (a weighted square adjacency matrix) resembles an image, and the ability of a CNN to extract spatial features is superior to that of basic ANNs. We used a regular CNN (Table 1), consisting of the usual 2D convolution, pooling and batch normalization layers. For all the convolution layers of the models, a stride of 1, 'same' padding and ReLU activation [59] were used. The last dense layer consisted of 3 units with softmax activation [60] for classifying the three levels of workload. Similarly, in the LSTM (Table 2), the input was flattened and all LSTM layers used ReLU activation. In the Conv-LSTM (Table 3), all Conv2D layers used ReLU activation. After reshaping, their output was passed to LSTM layers, followed by 2 dense layers and a softmax layer, as in the other models. An overview of the classification framework is shown in Figure 2. Additionally, Figure 3 shows the architecture of the CNN, LSTM and Conv-LSTM models used.
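The split sizes quoted above follow from simple arithmetic on the 339 sessions, as the sketch below shows. It is an illustration only: the study used the TensorFlow Datasets API, whereas this version uses the standard library, and the shuffling, the seed and the function name `split_sessions` are assumptions made here.

```python
import random


def split_sessions(n_sessions, train_frac=0.70, seed=0):
    """Shuffle session indices and split them into train/validation/test lists.

    The remainder after the training share is divided between validation and test.
    """
    indices = list(range(n_sessions))
    random.Random(seed).shuffle(indices)
    n_train = int(train_frac * n_sessions)
    n_val = (n_sessions - n_train) // 2
    return indices[:n_train], indices[n_train:n_train + n_val], indices[n_train + n_val:]


train, val, test = split_sessions(339)
sizes = (len(train), len(val), len(test))
```

With 339 matrices per participant this reproduces the 237/51/51 partition; whether the original split was stratified by workload level is not stated in the text.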

**Figure 2.** Overview of the classification workflow using EEG signals.


**Table 1.** Configuration of the CNN architectures used for the ablation study. C-A, C-B and C-C refer to the three variations of the CNN networks. The bottom half of the table is common to all three variations.

**Figure 3.** Model architectures for (**a**) CNN C-A (**b**) LSTM L-A (**c**) Conv-LSTM CL-A.


**Table 2.** Configuration of the LSTM architectures used for the ablation study. L-A, L-B and L-C refer to the three variations of the LSTM networks. The bottom half of the table is common to all three variations.

**Table 3.** Configuration of the Conv-LSTM architectures used for the ablation study. CL-A, CL-B and CL-C refer to the three variations of the Conv-LSTM networks. The bottom half of the table is common to all three variations.


#### **3. Results and Discussion**

In this research, we investigated the efficacy of three different functional brain connectivity analysis methods (MI, PLV and PTE) for classifying cognitive workload into high, medium and low levels using three different deep learning architectures (CNN, LSTM and Conv-LSTM). Nineteen participants performed the modern version of the n-back task on a computer screen with three levels of cognitive workload: high, medium and low.

The input to the deep learning networks was a 16 × 16 connectivity matrix. Sixteen brain regions were chosen from the Brodmann atlas [61] to cover different parts of the brain while keeping the computations as fast as possible. Figure 4 shows the MI, PTE and PLV matrices for low, medium and high workloads for a randomly chosen participant. Although the differences among the three connectivity metrics are visible, there are no clearly visible differences among the three workload conditions, i.e., low, medium and high.

However, the statistical analysis of the behavioral data revealed significant differences among the three conditions. The mean accuracy (in percentage) for the three n-back conditions was 75.42 (SD = 16.10), 62.27 (SD = 15.64) and 37.84 (SD = 14.18) for 1-back, 2-back and 3-back, respectively. There were significant differences among the groups (*F*(2, 75) = 40.22, *p* < 0.01, *η*<sup>2</sup> = 0.56). Similarly, we found significant differences in the reaction time (1-back = 492.58 (SD = 91.1), 2-back = 673.58 (SD = 150.57), 3-back = 824.84 (SD = 147.32); ANOVA: *F*(2, 75) = 40.98, *p* < 0.01, *η*<sup>2</sup> = 0.48). Differences between all possible pairs (1 vs. 2, 1 vs. 3, 2 vs. 3) for both mean accuracy (in percentage) and mean reaction time (in ms) were also found to be significant (*p* < 0.01).

Based on the statistical results, we hypothesized that there will be differences in the brain connectivity matrices (although not visible to the naked eye) in the three workload settings and the deep learning classifiers will be able to utilize these differences for successful classification. It was expected that PTE would perform best in terms of connectivity metric, with it being directed and phase-specific.

Several experiments (an ablation study) were performed to find the best hyperparameter settings for the three deep learning architectures. The results of the ablation study are compiled in Table 4. As shown in Table 4, for MI, a mean accuracy of 80.87% was achieved with CNN, 71.87% with LSTM and 71.16% with Conv-LSTM. Similarly, for PLV, a mean accuracy of 75.88% was achieved with CNN, 71.82% with LSTM and 69.68% with Conv-LSTM. Lastly, for PTE, a mean accuracy of 71.16% was achieved with CNN, 69.63% with LSTM and 69.74% with Conv-LSTM. The highest accuracy (among all subjects), 97.92%, was achieved by PLV in combination with both Conv-LSTM and CNN. This is followed by MI with CNN at 95.83%. Besides the accuracy, the precision, recall and F1-score of the classifiers are reported in Table 5. Figure 5 shows box plots of the accuracy and statistical results (standard error, quartiles, and outliers) of all the classifiers in combination with the different functional connectivity methods. The combination of CNN and MI yields the best classification performance. The achieved accuracy outperforms the state of the art in multi-class workload classification in the n-back task with various EEG features and machine-learning algorithms. The comparison of the proposed method with others is given in Table 6. Since the number of trials for the three workload settings was balanced, accuracy was indicative of the performance of the classifiers. Nevertheless, we reinforced the results with an analysis of the confusion matrices and ROC curves. Figure 6 shows the confusion matrices and Figure 7 shows the ROC curves for all combinations of the classifiers and the connectivity metrics for the best subject.
These figures substantiate that the classification performance of the models is high for the multi-class classification problem, as the true positive rate is high. The high class-wise area under the curve shows that the classifier is able to learn and classify each class separately with high accuracy.
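The relationship between a confusion matrix and the per-class precision, recall and F1-score reported above can be sketched as follows. The counts in this example are hypothetical and chosen only so that they sum to the 51 test matrices per subject; they are not results from the study, and `per_class_metrics` is an illustrative helper name.

```python
def per_class_metrics(confusion):
    """confusion[i][j] = number of samples of true class i predicted as class j.

    Returns a list of (precision, recall, f1) tuples, one per class.
    """
    n = len(confusion)
    metrics = []
    for c in range(n):
        tp = confusion[c][c]
        predicted = sum(confusion[r][c] for r in range(n))  # column total
        actual = sum(confusion[c])                          # row total
        precision = tp / predicted if predicted else 0.0
        recall = tp / actual if actual else 0.0
        f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
        metrics.append((precision, recall, f1))
    return metrics


# Hypothetical counts (rows and columns: low, medium, high workload).
confusion = [[15, 1, 1],
             [2, 14, 1],
             [0, 2, 15]]
accuracy = sum(confusion[i][i] for i in range(3)) / sum(map(sum, confusion))
metrics = per_class_metrics(confusion)
```

Because the three classes are balanced here, as in the study, the overall accuracy is a fair summary, and the per-class values show which workload level contributes most of the errors.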

Figure 8 shows the features learned by the CNN when MI was given as an input. MI was chosen because it gave the highest accuracy, and an input image of medium workload was chosen because the recall for medium workload was the highest. The filters visibly learn activation patterns similar to those in the input image, which indicates that the classifier learned meaningful features. Overall, given the consistent performance of the classifiers across all the metrics and the significant differences found in the statistical tests, it can be concluded that the classification was successful.

Although state-of-the-art results were obtained, the study had some limitations. One important limitation is the hypothesis itself. We hypothesized that there would be differences in the connectivity matrices across the three workload conditions. However, the study was limited to calculating the connectivity using raw (cleaned) EEG data. This was done to test whether all-inclusive connectivity (not band-limited) would yield a usable differentiation in workload. This has implications for making the entire framework close to real-time, since band-limiting the signals would have increased the computational complexity. In the future, we will consider comparing our approach with connectivity computed in different frequency bands in order to formulate a more comprehensive hypothesis. Another limitation was the subject-dependent classification. Subject-dependent classifiers can extract subject-dependent features and can effectively tackle the issues of accuracy and generalization encountered in subject-independent EEG classifiers. However, they also give rise to the issues of long collaboration sessions and the collection of large quantities of data [38,39]. A final limitation is the choice of 16 brain regions for computing the connectivity matrices. The choice of the brain regions could have been empirical instead of hypothesis- and use-case-driven. Exhaustive search and feature selection algorithms could be used in the future to validate the selection of brain regions empirically.

**Figure 4.** Brain connectivity maps of a random subject obtained through MI, PTE, and PLV for different workload states (low, medium, and high) using the Brodmann atlas [61].


**Table 4.** Ablation study of the different hyper-parameter combinations for the classifiers described in Tables 1–3.

**Table 5.** Precision, recall and F1-score for the different architectures used in the ablation study as described in Tables 1–3.


**Figure 5.** Box Plots representing the range of accuracy (with standard error) achieved by different subjects with deep learning architectures used (**a**) CNN (**b**) LSTM and (**c**) Conv-LSTM.

**Table 6.** Comparison of the proposed work with state-of-the-art results. The comparison includes different features and classifiers used for EEG-based cognitive workload classification in the n-back task. The proposed work achieves the highest accuracy in multi-class classification.


**Figure 6.** Confusion Matrix for the best performing subject for different combinations of the deep learning architectures (CNN, LSTM, and Conv-LSTM) and the functional connectivity metrics (MI, PLV and PTE).

**Figure 7.** ROC (Receiver Operating Characteristics) curves for the best performing subject for different combinations of the deep learning architectures (CNN, LSTM, and Conv-LSTM) and functional connectivity metrics (MI, PLV and PTE).


(**a**) Medium Workload MI matrix (**b**) 64 Filters of the 2nd Conv2D layer.

**Figure 8.** (**a**) Input given to the CNN network (**b**) Visualization of feature maps of the convolution layer in the CNN network.

#### **4. Conclusions**

Workload classification can be used as an indicator of emotional intelligence and stability. The aim of the study was to build a fast and accurate workload classifier that can be extended to real-time workload classification. Real-time workload classification is an important and very useful cognitive construct for the development of robust BCI systems [62] and is useful in several other domains, such as Virtual Reality [63] and Human-Machine Teaming [64]. In this research, EEG was chosen as the neuroimaging modality because it is cheap and portable and has high time resolution [65]. Model-free functional connectivity was chosen for feature extraction because it is fast and is associated with cognitive control in the context of mental workload [66]. It has also been shown that there are subject-specific differences in EEG-based functional connectivity measures [37].

Therefore, a combination of directed and non-directed model-free functional brain connectivity algorithms and state-of-the-art deep learning algorithms was utilized for efficient subject-specific classification of cognitive workload into three levels: high, medium and low. Three functional brain connectivity algorithms (Mutual Information, Phase Transfer Entropy and Phase Locking Value) were used to generate the functional connectivity networks, which represent the neuronal interactions between the different regions of the brain. These connectivity networks were used as inputs to the classification models to classify the different levels of workload. We employed three different deep learning architectures (CNN, LSTM and Conv-LSTM) for the classification of cognitive workload. An intra-subject method of classification was applied to the data of 19 participants. For each of the three connectivity networks, the best classification performance was obtained with CNN rather than with LSTM or Conv-LSTM. CNN outperforms the other two deep learning architectures because of the spatial information that the connectivity analysis provides in the input data on which the classification is performed. With CNN, MI produces the best classification results with an accuracy of 80.87%, followed by CNN with PLV with an accuracy of 75.88% and LSTM with MI with an accuracy of 71.87%.

We achieved state-of-the-art accuracy for multi-class workload classification using EEG and functional connectivity. From the results, it can be concluded that EEG-based model-free functional connectivity metrics, when combined with deep learning, provide an accurate, reliable and fast method of classifying cognitive workload. Although there is not much literature available on this, it was hypothesized that PTE would outperform MI and PLV, as PTE is the only connectivity measure that is phase-specific and directed in nature. However, in our experiments MI outperformed PTE in classification performance. This may be due to the small number of participants in this study and the choice of brain regions. Therefore, no firm conclusions can be drawn about which model-free connectivity measure is the best. A future study could be performed with a larger number of participants and different combinations of brain regions to draw clearer conclusions regarding the comparative analysis of the different connectivity measures.

Since these brain connectivity methods enable extremely rapid (especially MI) and accurate connectivity matrix generation from raw EEG data, the proposed architecture (a combination of MI/PLV/PTE and a state-of-the-art CNN) can be used for effective and efficient cognitive state monitoring and other BCI applications. In addition, brain connectivity coupled with hybrid deep learning architectures could be used in the future to classify higher-order cognitive processes such as executive functioning and complex decision-making. The subject-specific classification also permits the analysis and extraction of subject-specific features. Together, this could enable BCIs to become more reliable and efficient tools for effective state monitoring in complex real-world scenarios.

**Author Contributions:** Conceptualization, A.G., G.S. and V.P.; methodology, G.S.; software, A.G.; validation, G.S. and V.P.; formal analysis, A.G.; investigation, G.S. and V.P.; resources, P.P.R.; writing original draft preparation, G.S. and V.P.; writing—review and editing, A.G., P.P.R. and B.-G.K.; visualization, G.S. and V.P.; supervision, P.P.R. and B.-G.K.; project administration, A.G. All authors have read and agreed to the published version of the manuscript.

**Funding:** This research received no specific grant from any funding agency in the public, commercial, or not-for-profit sectors.

**Institutional Review Board Statement:** The study was conducted according to the guidelines of the Declaration of Helsinki, and approved by the Institutional Review Board (IRB) at the Institute of Nuclear Medicine and Allied Sciences (INMAS), Defence R & D Organization, Delhi.

**Informed Consent Statement:** All the participants had provided written informed consent for taking part in the study.

**Data Availability Statement:** The raw data will be made available on request by the authors, without undue reservation.

**Acknowledgments:** Our sincere thanks to Sushil Chandra, Head of Department of Biomedical Engineering, Institute of Nuclear Medicine and Allied Sciences, Defence Research and Development Organization for his invaluable guidance and support. We extend our gratitude to him for sharing the data so this research study could be conducted.

**Conflicts of Interest:** The authors declare no conflict of interest.

#### **References**


## *Article* **EEG Emotion Classification Network Based on Attention Fusion of Multi-Channel Band Features**

**Xiaoliang Zhu, Wenting Rong, Liang Zhao \*, Zili He, Qiaolai Yang, Junyi Sun and Gendong Liu**

National Engineering Research Center of Educational Big Data, Central China Normal University, Wuhan 430079, China; zhuxl@ccnu.edu.cn (X.Z.); rwt\_0706@mails.ccnu.edu.cn (W.R.); hzlzero@mails.ccnu.edu.cn (Z.H.); yql2020113547@mails.ccnu.edu.cn (Q.Y.); sunjunyi@mails.ccnu.edu.cn (J.S.); gendong@mails.ccnu.edu.cn (G.L.) **\*** Correspondence: liang.zhao@ccnu.edu.cn

**Abstract:** Understanding learners' emotions can help optimize instruction and further conduct effective learning interventions. Most existing studies on student emotion recognition are based on multiple manifestations of external behavior, which do not fully use physiological signals. In this context, on the one hand, a learning emotion EEG dataset (LE-EEG) is constructed, which captures physiological signals reflecting the emotions of boredom, neutrality, and engagement during learning; on the other hand, an EEG emotion classification network based on attention fusion (ECN-AF) is proposed. To be specific, on the basis of key frequency bands and channels selection, multi-channel band features are first extracted (using a multi-channel backbone network) and then fused (using attention units). In order to verify the performance, the proposed model is tested on an open-access dataset SEED (*N* = 15) and the self-collected dataset LE-EEG (*N* = 45), respectively. The experimental results using five-fold cross validation show the following: (i) on the SEED dataset, the highest accuracy of 96.45% is achieved by the proposed model, demonstrating a slight increase of 1.37% compared to the baseline models; and (ii) on the LE-EEG dataset, the highest accuracy of 95.87% is achieved, demonstrating a 21.49% increase compared to the baseline models.

**Keywords:** EEG; learning emotions; emotion recognition; attention; convolutional neural network; multi-channel band features

#### **1. Introduction**

As a high-level psychological state, emotion is composed of many kinds of feelings, thoughts, and other factors, and has been broadly used in the medical, educational, and other related fields because of its capability to reflect people's real psychological reactions to different things. With the rapid development of artificial intelligence, emotion recognition research has become a hotspot. Generally speaking, the existing research in the field of emotion recognition is carried out from one of the two following aspects. The first type of research is based on a variety of manifestations of external behavior (e.g., voice, text, and images), which are acquired through non-contact methods. For example, in 2005, Burkhardt et al. established a speech dataset, called the Berlin database, which contained seven emotions [1]. In 2016, Lim et al. converted the original speech signal in this dataset into a spectrogram by time–frequency analysis and proposed a shallow convolutional neural network (CNN) and long short-term memory (LSTM) fusion network to identify the seven emotions [2]. Socher et al. built a text dataset containing the five emotions of very positive, positive, neutral, negative, and very negative [3], while Kim et al. used CNN to learn sentence feature vectors from this dataset and identify the emotions [4]. Anderson et al. proposed that facial muscle movements can represent emotional states, in which the support vector machine (SVM) was used to identify six basic emotions commonly associated with facial expressions [5]. The second type of research is based on the neurophysiological state, that is, the acquisition of various physiological signals [6–10], such as electrocardiogram (ECG), photoplethysmography (PPG), and electroencephalogram (EEG), among many others. Although this type of research requires subjects to wear appropriate physiological signal acquisition equipment, focusing on neurophysiological states is a more objective method of representing emotions than the former external behavioral research. The collected physiological signals better address the problems associated with facial expression deception, and among them, the EEG signal is a focus of great concern [11]. A number of researchers previously constructed their own EEG signal datasets to study the basic emotions (i.e., anger, disgust, fear, happiness, sadness, and surprise) proposed by Ekman et al. [12]. For example, Petrantonakis et al. developed an EEG dataset in an attempt to distinguish the six basic emotional states proposed by Ekman et al. [13]. Schaaff et al. developed an EEG dataset in an attempt to distinguish three emotions (pleasant, neutral, and unpleasant) [14]. Duan et al. created the SEED dataset to distinguish between negative, neutral, and positive emotions in subjects [15]. Koelstra et al. created the DEAP dataset, which measures two types of emotional states obtained from valence and arousal [16]. D'Mello et al. pointed out that, although the six basic emotions proposed by Ekman et al. [12] are common in our daily life, most of them do not occur during study sessions of 30 min to 2 h; hence, six learning emotions (i.e., boredom, engagement, confusion, frustration, delight, and surprise) are defined and further ranked in ascending order of persistence on a time scale: (delight = surprise) < (confusion = frustration) < (boredom = engagement) [17]. Meanwhile, Graesser et al. proposed that, for college students, the main emotions centered on learning include confusion, frustration, boredom, engagement, curiosity, anxiety, delight, and surprise [18].

**Citation:** Zhu, X.; Rong, W.; Zhao, L.; He, Z.; Yang, Q.; Sun, J.; Liu, G. EEG Emotion Classification Network Based on Attention Fusion of Multi-Channel Band Features. *Sensors* **2022**, *22*, 5252. https://doi.org/10.3390/s22145252

Academic Editors: Mincheol Whang and Sung Park

Received: 30 April 2022 Accepted: 11 July 2022 Published: 13 July 2022

**Publisher's Note:** MDPI stays neutral with regard to jurisdictional claims in published maps and institutional affiliations.

**Copyright:** © 2022 by the authors. Licensee MDPI, Basel, Switzerland. This article is an open access article distributed under the terms and conditions of the Creative Commons Attribution (CC BY) license (https://creativecommons.org/licenses/by/4.0/).

Distinguishing learners' emotions in an intelligent educational environment is very important; thus, in recent years, research on learning emotions has gradually attracted the attention of scientists. For instance, Tonguc et al. recorded the facial expressions of students during their speech process and recognized seven different types of learning emotions [19]. Sharma et al. studied students' engagement states in conjunction with their eye, head, and facial muscle movements in an online learning scenario [20]. In a real learning scenario, however, students mostly show their normal emotions; that is, it is quite difficult to capture their facial expressions at a given moment because the facial muscle movements have small amplitudes and short durations. In addition, facial expressions have defects (such as falsifiability) and may not truly reflect emotions, which poses challenges for learning emotion recognition. Therefore, the present study explores a learning emotion classification algorithm based on EEG signals. Although EEG is inconvenient because it requires contact measurement, its ability to capture and represent students' real learning emotions is quite helpful. In our preliminary research, the six learning emotions proposed in [17] were taken into account initially; however, considering the time scale and the probability of emotion occurrence, it was found that the chances of recognizing confusion, delight, and curiosity are small. Therefore, in this study, a learning emotion EEG dataset (LE-EEG) is constructed, which focuses only on three emotions (i.e., boredom, neutrality, and engagement) that can last for a longer time. The main contributions of this study are as follows:


The remainder of this paper is organized as follows: Section 2 introduces the commonly used emotion classification algorithms; Section 3 presents the framework of the proposed ECN-AF model; Section 4 discusses the experimental design; Section 5 analyzes the experimental results; and Section 6 summarizes the work and lists future research directions.

#### **2. Related Works**

To realize emotion classification, the key methods of feature extraction based on EEG signals tend to be developed around three aspects: the time, frequency, and time–frequency domains [21]. First, the time domain methods focus on the EEG signals' temporal information, including the typical features of Hjorth parameters, fractal dimensional features, and higher-order crossover features. Second, the frequency domain methods often convert the collected EEG signals (0–50 Hz) into five sub-bands (i.e., delta (1–4 Hz), theta (4–7 Hz), alpha (8–13 Hz), beta (13–30 Hz), and gamma (31–50 Hz)) [22] and extract features such as power spectral density, differential entropy, differential asymmetry, and rational asymmetry in different frequency bands [15]. Meanwhile, the time–frequency domain method combines the characteristics of both the time and frequency domains, converting the EEG signals into sub-bands and using the windowing method for emotion classification.

Typical EEG emotion recognition methods tend to extract features and adopt machine learning algorithms, such as support vector machines (SVM) and k-nearest neighbor (KNN), for classification and recognition [23–25]. For example, Arnau-Gonzalez et al. conducted emotion classification experiments on the DEAP dataset, where frequency domain features (e.g., PSD) and mutual information in each frequency band of the channel were extracted, and a final classification accuracy of 66.7% for valence and 69.6% for arousal was obtained using the SVM [23]. Li et al. conducted experiments on the SEED dataset by extracting features (such as peak-to-peak average, alignment entropy, and Hjorth parameters), and their average classification accuracy using the SVM reached 83.3% [24]. Algumaei et al. used linear discriminant analysis (LDA), achieving an average accuracy of 90.93% on the SEED dataset [25].

Compared with traditional machine learning models, deep neural networks show a more efficient performance [26–29]. They can not only automatically extract effective features, but also mark key frequency bands and brain regions. Therefore, more and more researchers use deep learning models to study EEG-based emotion classification. For example, on the SEED dataset, Zheng et al. proposed an emotion classification model using SVM and deep belief networks (DBN), and investigated the effect of the combinations of different frequency bands on emotion classification accuracy. Their final experimental results showed that the accuracy under the 12-channel combination could surpass that under the 62-channel combination. In addition, the direct concatenation of the DE features of five frequency bands under the DBN network led to an average classification accuracy of 86.08% [30]. Many researchers have improved the emotion recognition accuracy by developing advanced convolutional networks, such as the self-organizing graph neural network (SOGNN) [31] and dynamic graph convolutional neural network (DGCNN) [32], which respectively achieved 86.81% and 90.4% classification accuracy. To be specific, Li et al. proposed SOGNN, which constructs inter-channel correlations from self-organizing graphs, and explores the aggregation of these inter-channel connections and time–frequency features in frequency bands. The final experimental average accuracy (ACC) and the standard deviation (STD) were 86.81% and 5.79%, respectively [31]. Song et al. proposed DGCNN, which uses a graph to model the multi-channel EEG features and dynamically learn the intrinsic relationship between different EEG channels. As a result, they achieved 90.4% highest accuracy and 8.49% STD [32].

By contrast, studying emotion classification by exploring frequency bands and their correlation has produced fruitful results. Yang et al. did not distinguish between the sub-bands on the SEED dataset to study the channel combination, but proposed the use of directional RNNs to extract independent features of the left and right brain regions. Consequently, they achieved 93.12% ACC and 6.06% STD [33]. Wang et al. improved the bidirectional long short-term memory network by proposing a similarity-learning network, achieving a classification accuracy of 94.62% on the SEED dataset [34]. Shen et al. proposed a four-dimensional convolutional recurrent neural network (4D\_CRNN) that converted full EEG channels into a two-dimensional picture. They superimposed all sub-bands to convert the features into three dimensions and finally extracted the channel and band features using 2DCNN, as well as the temporal features using LSTM. They achieved 94.08% ACC and 2.55% STD [35].

The attention mechanism [36,37] was successfully introduced into neural networks and greatly improved the performance of classification models. Researchers in the field of EEG emotion recognition found that the attention mechanism resembles the idea of focusing on emotion-related brain regions and started to apply it to improve model performance. For instance, Li et al. proposed the transferable attention neural network (TANN) with 93.34% ACC and 6.64% STD, which used two directed RNN modules to extract features from whole brain regions and a global attention layer to fuse the features and highlight the key brain regions for emotion classification [38].

In summary, existing research faces the following problems: (1) the exploration of multiple channel combinations for emotion classification fails to combine the five sub-band features well; and (2) exploring band correlations over all channels is a mainstream method; however, not all brain regions of EEG signals contain valid emotion information, and thus this approach fails to focus on capturing the important emotion channels. To address these problems, this study proposes ECN-AF, which focuses on specific channels and a subset of frequency bands and fuses them with attention units.

#### **3. Methodology**

#### *3.1. Model Framework*

Figure 1 depicts the framework of the proposed ECN-AF model consisting of the following three main modules:

**Figure 1.** ECN-AF framework diagram.

(1) Module 1: frequency band division and channel selection module. In this module, first, the acquired EEG signals were divided into raw segments by a sliding window with a window size of 10 s and a step size of 2 s; second, five different frequency bands were extracted by passing the raw segments through bandpass filters; third, the final segments were generated from the optimal combinations of EEG channels obtained by a multi-channel filtering operation.


#### *3.2. Module 1: Frequency Band Division and Channel Selection Module*

After data cleaning, the SEED dataset contained 62 channels of EEG signals from 15 subjects with a sampling rate of 200 Hz [15]. The LE-EEG dataset contained 32 channels of EEG signals from 45 subjects with a sampling rate of 128 Hz. Both the SEED and LE-EEG datasets were divided using a window

$$W = T \times C \tag{1}$$

In Equation (1), *W* is the segment size, *T* is the time duration after splitting, and *C* is the number of channels. The datasets were all segmented using a sliding window with a window length of 10 s and a step size of 2 s. In the SEED and LE-EEG datasets, the *W* values are 2000 × 62 and 1280 × 32, respectively.

$$S = \{W_1, W_2, W_3, \dots, W_i, \dots, W_{n-1}, W_n\} \tag{2}$$

$$Y = \{Y_1, Y_2, Y_3, \dots, Y_i, \dots, Y_{n-1}, Y_n\}, \quad Y_i \in \{-1, 0, 1\} \tag{3}$$

In Equations (2) and (3), *S* denotes a subject's dataset, *W<sub>i</sub>* denotes the *i*th sequential segment, *n* denotes the total number of samples, *Y* denotes a subject's emotion label set, and *Y<sub>i</sub>* denotes the label of the *i*th segment.

Finally, the SEED dataset yielded 4896 samples per subject and 73,440 samples in total for the 15 subjects. The LE-EEG dataset yielded between 1082 and 1650 samples per subject and 60,376 samples in total for the 45 subjects.
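The segmentation described above can be sketched in a few lines. This is a minimal illustration rather than the authors' code: the function name `sliding_windows` and the 60 s recording length are assumptions chosen only to make the example concrete, while the sampling rates, channel counts, window length, and step size are taken from the text.

```python
def sliding_windows(n_samples, fs, win_s=10, step_s=2):
    """Return (start, end) sample indices of every full window in a recording."""
    win = int(win_s * fs)    # window length in samples
    step = int(step_s * fs)  # hop between consecutive windows in samples
    return [(s, s + win) for s in range(0, n_samples - win + 1, step)]


# SEED-like settings: 200 Hz sampling, 62 channels (hypothetical 60 s recording)
seed_windows = sliding_windows(n_samples=60 * 200, fs=200)
seed_shape = (seed_windows[0][1] - seed_windows[0][0], 62)  # (T, C) of one segment W

# LE-EEG-like settings: 128 Hz sampling, 32 channels (hypothetical 60 s recording)
le_windows = sliding_windows(n_samples=60 * 128, fs=128)
le_shape = (le_windows[0][1] - le_windows[0][0], 32)

print(seed_shape, le_shape, len(seed_windows))
```

With a 10 s window and a 2 s step, consecutive segments share 8 s of signal, which is why a short recording still produces many samples.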

$$\left|H(w)\right|^2 = \frac{1}{1 + \left(\frac{W}{W_{f_1 \sim f_2}}\right)^{2N_f}} \tag{4}$$

$$H(S) = \begin{cases} S_\delta, & w \in (1, 4) \\ S_\theta, & w \in (4, 7) \\ S_\alpha, & w \in (8, 13) \\ S_\beta, & w \in (13, 30) \\ S_\gamma, & w \in (31, 50) \end{cases} \tag{5}$$

In Equations (4) and (5), a fourth-order Butterworth bandpass filter was used to filter the EEG signal into five sub-bands [39–42]. *N<sub>f</sub>* is the order of the filter, i.e., *N<sub>f</sub>* = 4. *W* is the frequency; *W<sub>f1∼f2</sub>* is the normalized frequency band; and the range of frequencies *f*<sub>1</sub> to *f*<sub>2</sub> is the passband interval of the bandpass filter. *H*(*S*) is the EEG signal filtered by the fourth-order Butterworth bandpass filter, *w* is the frequency band, and *δ*, *θ*, *α*, *β*, and *γ* denote the data of the five different frequency bands.
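To make the behavior of Equation (4) concrete, the sketch below simply evaluates the squared-magnitude expression as written and lists the band edges of Equation (5). It does not design or apply a bandpass filter; in practice that step would be done with a signal-processing library. The function names and the Nyquist-based normalization are assumptions for illustration.

```python
import math

# Sub-band edges in Hz, copied from Equation (5)
BANDS = {"delta": (1, 4), "theta": (4, 7), "alpha": (8, 13),
         "beta": (13, 30), "gamma": (31, 50)}


def butterworth_gain_sq(w, w_c, order=4):
    """Squared magnitude 1 / (1 + (w / w_c)^(2 * order)), as in Equation (4)."""
    return 1.0 / (1.0 + (w / w_c) ** (2 * order))


def gain_db(w, w_c, order=4):
    """The same squared magnitude expressed in decibels."""
    return 10.0 * math.log10(butterworth_gain_sq(w, w_c, order))


def normalized_band(band, fs):
    """Band edges divided by the Nyquist frequency fs / 2 (one common normalization)."""
    low, high = BANDS[band]
    return low / (fs / 2), high / (fs / 2)


at_cutoff = butterworth_gain_sq(1.0, 1.0)    # half power at the cutoff frequency
one_octave_above = gain_db(2.0, 1.0)         # about -24 dB for order 4
beta_seed = normalized_band("beta", fs=200)  # SEED sampling rate from the text
```

The half-power point at the cutoff and the roll-off of roughly 24 dB per octave follow directly from setting *N<sub>f</sub>* = 4 in the formula.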

$$S_f = \frac{H(S) - \mathrm{AVG}(H(S))}{\mathrm{STD}(H(S))}, \quad f \in \{\delta, \theta, \alpha, \beta, \gamma\} \tag{6}$$

In Equation (6), *S<sub>f</sub>* is the normalized EEG segment data; *f* is one of the five sub-bands; *H*(*S*) denotes the EEG signals of one subject in the five different frequency bands; *AVG* is the average value; and *STD* is the standard deviation.
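Equation (6) is a standard z-score normalization, which can be sketched as follows. The text does not say whether the population or sample standard deviation is used, nor whether the statistics are computed per channel or per segment; the sketch assumes the population variant on a single 1-D sequence, and the input values are toy numbers rather than EEG data.

```python
import statistics


def zscore(samples):
    """Standardize a 1-D sequence: subtract the mean, divide by the standard deviation."""
    avg = statistics.fmean(samples)
    std = statistics.pstdev(samples)  # population STD; an assumption, see lead-in
    if std == 0:
        raise ValueError("constant signal cannot be standardized")
    return [(x - avg) / std for x in samples]


band_signal = [2.0, 4.0, 4.0, 4.0, 5.0, 5.0, 7.0, 9.0]  # toy values, not real EEG
normalized = zscore(band_signal)
```

After this step each band has zero mean and unit variance, so bands with very different raw amplitudes contribute on a comparable scale.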

Previous studies have found that a combination of channels can improve the recognition performance. For example, Zheng et al. used a six-channel combination of "FT7," "FT8," "T7," "T8," "TP7," and "TP8" for emotion classification [43]. Zheng et al. also designed four different electrode placement patterns based on the peak characteristics of the weight distribution and the asymmetry of emotion processing. In the end, "FT7," "T7," "TP7," "P7," "C5," "CP5," "FT8," "T8," "TP8," "P8," "C6," and "CP6" were used, achieving the best result of 86.65% classification accuracy. This confirmed that fewer channels can achieve better experimental results than full-channel recognition [30]. Combining the abovementioned studies, we obtain the following setting:

$$X_f^{C} = \begin{cases} S_f^{C1} \\ S_f^{C2} \end{cases}, \quad f \in \{\delta, \theta, \alpha, \beta, \gamma\} \tag{7}$$

In Equation (7), *X<sub>f</sub><sup>C</sup>* is the EEG signal in frequency band *f* under channel combination C; C is the channel combination method; and in our study, C1 and C2 are taken as C1 = {"FT7," "FT8," "T7," "T8," "TP7," "TP8"} and C2 = {"FT7," "T7," "TP7," "P7," "C5," "CP5," "FT8," "T8," "TP8," "P8," "C6," "CP6"}, respectively.
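The channel selection in Equation (7) amounts to picking rows of a segment by electrode name. The following sketch assumes a segment stored as one list of samples per channel; the helper `select_channels`, the toy montage, and the two extra electrodes ("FZ", "CZ") are hypothetical and only serve to show the indexing.

```python
C1 = ["FT7", "FT8", "T7", "T8", "TP7", "TP8"]
C2 = ["FT7", "T7", "TP7", "P7", "C5", "CP5",
      "FT8", "T8", "TP8", "P8", "C6", "CP6"]


def select_channels(segment, channel_names, wanted):
    """Keep only the rows of `segment` whose channel name is in `wanted`, in that order."""
    index = {name: i for i, name in enumerate(channel_names)}
    missing = [name for name in wanted if name not in index]
    if missing:
        raise KeyError(f"channels not in montage: {missing}")
    return [segment[index[name]] for name in wanted]


# Toy montage: the C2 electrodes plus two extras; each row holds one channel's samples.
montage = C2 + ["FZ", "CZ"]
segment = [[float(i)] * 4 for i in range(len(montage))]

x_c1 = select_channels(segment, montage, C1)
x_c2 = select_channels(segment, montage, C2)
```

Note that every electrode in C1 also appears in C2, so C2 can be read as C1 extended by six further lateral electrodes.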

#### *3.3. Module 2: Frequency Band Attention Feature Extraction Module*

This section presents the combination of two sub-modules, a multi-channel convolutional backbone network and a band attention fusion unit.

#### 3.3.1. Multi-Channel Convolutional Backbone Network

The backbone network was built using two layers of CNN, AvgPool1D, BatchNormalization, and SpatialDropout1D, with the parameters shown in Table 1. We fed *X<sub>f</sub><sup>C</sup>* from Module 1 into the multi-channel convolutional backbone network to extract channel and time features.

$$F_f^{C} = \mathrm{ReLU}\left( (f \ast g)_{\times 2} \left( X_f^{C} \right) \right), \quad f \in \{\delta, \theta, \alpha, \beta, \gamma\} \tag{8}$$

$$F^{C} = \left\{ F_f^{C} \right\}, \quad f \in \{\delta, \theta, \alpha, \beta, \gamma\} \tag{9}$$

**Table 1.** Multi-channel convolutional backbone network construction.


In Equations (8) and (9), *F<sub>f</sub><sup>C</sup>* is the output feature of the convolutional network in band *f* under channel combination C, and *F<sup>C</sup>* is the set of band features extracted by the convolutional backbone network under channel combination C.

#### 3.3.2. Frequency Band Attention Fusion Unit

The feature *F<sup>C</sup>* was used as the input of the band attention fusion unit. First, the bands were selected from the feature *F<sup>C</sup>* for combination. Next, the attention weights were generated by the sigmoid function using the feature vector. Finally, the weights were applied to the corresponding features to obtain the channel, time, and band fusion features. This three-step process is expressed as follows (see also Figure 2):

$$\mathrm{Weight}_k = \mathrm{Sigmoid}\left(q^T \, \mathrm{Multi}\left(\mathrm{Select}\left(F^C\right)_{\times n}\right)\right) \tag{10}$$

$$F' = \mathrm{Multi}\left(\mathrm{Select}\left(F^C\right)_{\times n}\right) \times \mathrm{Weight}_k \tag{11}$$

**Figure 2.** Band attention fusion unit.
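One plausible reading of Equations (10) and (11) is sketched below on toy vectors. Several details are assumptions, because the text does not fix them: *Multi* is taken to be an element-wise product of the selected band features, *q* is treated as a vector of the same length as a feature vector (it would be learned during training), and the attention weight is a single scalar. The real features would come from the convolutional backbone.

```python
import math


def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))


def multiply_bands(features):
    """Element-wise product of equally long band feature vectors (the Multi step)."""
    fused = list(features[0])
    for vec in features[1:]:
        fused = [a * b for a, b in zip(fused, vec)]
    return fused


def attention_fuse(band_features, selected, q):
    """Select bands, multiply them, and scale by a sigmoid attention weight."""
    chosen = [band_features[name] for name in selected]          # Select(F^C)
    fused = multiply_bands(chosen)                               # Multi(...)
    weight = sigmoid(sum(qi * fi for qi, fi in zip(q, fused)))   # Equation (10)
    return [weight * f for f in fused], weight                   # Equation (11)


# Toy 3-dimensional features per band (hypothetical values).
F_C = {"beta": [1.0, 2.0, 0.5], "gamma": [2.0, 0.5, 2.0]}
q = [0.0, 0.0, 0.0]  # hypothetical query vector; all zeros gives a weight of 0.5
fused, weight = attention_fuse(F_C, ["beta", "gamma"], q)
```

Because the sigmoid output lies between 0 and 1, the weight can only attenuate the fused feature; a trained *q* decides how strongly each band combination is passed on to the classifier.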

#### *3.4. Module 3: Feature Fusion and Classification Module*

After the band attention feature extraction module, we fed the fused features *F*′ into the classification network built from CNN, AvgPool1D, BatchNormalization, SpatialDropout1D, GlobalAvgPool1D, Dropout, and Dense layers. Table 2 lists the specific parameters. We used convolution to extract the depth features in the upper layers of the classification network. The fully connected layer output the three-class classification results. We placed BatchNormalization after the convolutional layers to normalize the segment data and transform the features to zero mean and unit variance. This not only sped up convergence but also effectively prevented exploding and vanishing gradients.

**Table 2.** Classification network construction.


#### **4. Experiments**

#### *4.1. Experimental Materials*

We controlled the following variables: the subjects' educational background (graduate students majoring in big data and artificial intelligence); the video duration, which was kept similar across clips; and the course selection, for which we chose popular courses whose knowledge points cover multiple disciplines.

#### 4.1.1. Sources of Emotional Materials

At this stage, no standardized course videos for inducing learning emotions are available in China. Hence, we used the well-known domestic learning websites https://www.icourse163.org/ (accessed on 21 March 2021) (Chinese University MOOC Network) and https://www.bilibili.com/ (accessed on 21 March 2021) (Learning section in Bilibili). The lessons were selected from these two sites according to the engagement- and boredom-related vocabulary in the learners' comments. With computer-related courses as the academic background, 50 learning videos from computer science as well as science-, literature-, history-, and philosophy-related courses were finally selected as learning clips with engaging and boring emotional labels. Note that the China University MOOC is the largest online classroom in China. Its course categories are classified according to the students' professional background (e.g., computer, foreign language, and science). Bilibili.com is a popular video platform used by young people in China to learn, exchange ideas, and spread culture. The website contains many excellent user-uploaded learning resources.

#### 4.1.2. Emotional Material Clipping

Fifty videos were collected through the abovementioned means, among which 18 videos were marked as engaging, 17 videos were marked as boring, and 15 videos were marked as neutral. To clip one knowledge point from each video, all acquired course videos were edited using Cut Screening for Windows Professional, ensuring that the content of each clip was complete and that the clip was not excessively long. The clips were saved as MP4 video files with a resolution of 1920 × 1080 px (30 fps). The clip durations ranged from 76 to 293 s, with an average of 166 s. The emotion-inducing materials mainly consisted of Chinese materials and explanations. A few of them were English clips with Chinese subtitles.

#### 4.1.3. Evaluation of Emotional Materials

In this study, 49 graduate students were recruited as subjects for the emotional material assessment experiment. The participants were 23 male students and 26 female students aged 20–25 years, with an average of (22 ± 1.19) years. All subjects were physically healthy, right-handed, and free of significant emotional problems and mental illness. The 49 subjects were majoring in computer science and technology, electronic information, educational information technology, and educational technology. To prevent the subjects' prior knowledge from interfering with the emotion induction results, those who participated in rating the emotion material did not participate in the subsequent data collection experiment.

For the experiment, all subjects were given a "Self-assessment of Learning Status" questionnaire. After each video clip was shown, the subjects were asked to report their actual feelings and score the questionnaire. Each question was scored using a 5-point scale:


After careless/insufficient effort (C/IE) detection (see Appendix A), 44 valid questionnaires were retained in this study. All data were imported into the SPSS 27.0 statistical software in the required SPSS format. The data were analyzed using descriptive statistics, correlation analysis, reliability analysis, group analysis, and analysis of variance.

Figure 3 shows the 5-point scoring of 22 video clips marked as boredom and engagement by 44 subjects. The X-axis depicts the 22 target videos. The Y-axis represents the ratings of the 44 subjects for each target video. The red dots indicate the ratings of the 14 engaging clips, while the green dots indicate the ratings of the eight boring clips. Lighter dots mean that fewer subjects gave the score at that Y-axis value, and darker dots mean that more subjects gave that score. Figure 4 shows the mean scores of the 44 subjects after the 5-point scoring for the 28 selected target video clips. The X-axis shows the 28 target videos. The Y-axis is the mean score of the 44 subjects for each target video. The blue bars indicate the mean scores of the 14 engaging clips, the red bars indicate the mean scores of the six neutral clips, and the orange bars show the mean scores of the eight boring clips.

**Figure 3.** 5-point scale score of the subjects.

**Figure 4.** Description statistics of the 28 target videos, with 0–4 ratings.

Gross et al. pointed out that the indicators for judging the success of emotion induction include the intensity and discreteness of the induced emotion [44]. Intensity refers to the average score of the different emotional segments: the greater the intensity of the emotional response, the higher the average score. Discreteness was judged by the hit rate (hit rate = the type of video discriminated by the subjects/the number of all emotions discriminated). The higher the hit rate, the more exclusively a video clip induced its target emotion. Figures 3 and 4 depict the dispersion and the intensity of the subjects' responses induced by the target video clips. According to the discrete scoring points in Figure 3, the hit rate of the engagement emotion was 79.48 ± 4.54%, while that of the boredom emotion was 81.73 ± 16.03%, indicating that both emotions were induced with good discreteness. In Figure 4, the average score of the engagement emotion was 2.873, while those of the boredom emotion and the neutral segments were 1.256 and 2.036, respectively. These results indicated that the intensity of the induced emotional response was high. Finally, according to the 44 valid questionnaires, 28 videos were effectively assigned to the three emotions: 14 engaging segments, 8 boring segments, and 6 neutral segments.
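The two indicators can be illustrated with a small sketch. It assumes one common interpretation of the hit rate, namely the fraction of subjects whose reported emotion matches the clip's target emotion; the paper's exact scoring rule and the mapping from questionnaire scores to labels are not reproduced here, and the responses below are invented for illustration.

```python
import statistics


def hit_rate(reported, target):
    """Fraction of subjects whose reported emotion equals the clip's target emotion."""
    return sum(1 for label in reported if label == target) / len(reported)


def intensity(scores):
    """Mean questionnaire score of a clip (0-4 scale in the paper)."""
    return statistics.fmean(scores)


# Toy responses from five hypothetical subjects for one "engagement" clip.
reported_labels = ["engagement", "engagement", "neutral", "engagement", "engagement"]
scores = [3, 4, 2, 3, 3]

clip_hit_rate = hit_rate(reported_labels, "engagement")
clip_intensity = intensity(scores)
```

A clip is a good stimulus when both numbers point the same way: a high hit rate shows that subjects agree on which emotion was induced, and the mean score shows how strongly it was felt.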

#### *4.2. Experimental Procedure and Signal Pre-Processing*

#### 4.2.1. Experimental Procedure

In the experiment, we selected seven engagement clips, seven boredom clips, and six neutral clips as the target emotions from the 28 induced emotion materials. After each video clip was shown, all subjects were asked to answer the questionnaire, report their actual feelings, and rate the questionnaire. The questionnaire consisted of nine questions, each of which, except for the first two, was scored on a 5-point (0–4) scale. The more intense the subject's concentration, the closer the question score was to 4. The more intense the boredom, the closer the question score was to 0.

We played the induction videos in a pseudo-randomized order to prevent the boredom caused by watching videos of the same emotion for a long time. After the researcher played a video clip, the subjects were given 1 min to fill out the questionnaire and take a short break. The process was repeated 20 times, with a 10 min break, until all video clips had been watched.

The hardware used to collect the data in this experiment was the EPOC Flex Saline Sensor Kit, and the software was EmotivPRO v2.0. During acquisition, we asked the subjects to keep their limbs still and to avoid continuous blinking to minimize artifacts. The final experiment collected 940 segments of EEG data and 940 assessment questionnaires, of which 777 questionnaires were identified as valid data based on the subjects' completion and the researcher's screening. All valid questionnaires were labeled as boredom, neutrality, or engagement. Because of equipment acquisition failures and other reasons, the EEG data collected for emotion classification contained 745 segments.

#### 4.2.2. Signal Pre-Processing

The pre-processing and removal of artifacts from the EEG signals is a demanding step in EEG processing. As shown in Figure 5, the LE-EEG dataset was preprocessed using MATLAB R2020b, the EEGLAB toolbox [45], ICLab [46–49], and ADJUST [50] for bandpass filtering and automatic artifact processing of the EEG signals. After the artifacts were processed using the automatic toolkit, some of the bad data were manually removed by visual inspection to finally obtain relatively clean EEG data.

**Figure 5.** Experimental flow of the LE-EEG dataset.

#### **5. Results and Analysis**

We trained the model on an NVIDIA GTX 1080 GPU. The learning rate was set to 0.001 and the learning rate decay to 0.00001. Adam was used as the optimizer, and categorical\_crossentropy was used as the loss function. The number of multi-channel convolutional backbone networks depended on the number of combined bands. We conducted experiments on the SEED and LE-EEG datasets separately. The ACC and the STD over all subjects in a dataset were used as the evaluation criteria, and the data were divided into training and test sets at a ratio of 8:2 in each fold of cross-validation. On the SEED dataset, we performed subject-dependent experiments and compared our model with several baseline models using cross-validation to assess the model performance. On the LE-EEG dataset, we compared our model with baseline models whose papers provide code. In contrast to the approach used for the SEED dataset, we fused the data of all subjects before partitioning.
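The 8:2 split per fold and the ACC/STD summary can be sketched as follows. This is an illustrative sketch under assumptions: the shuffling scheme, the random seed, and the fold accuracies are invented, and the standard deviation here is taken over folds, whereas the paper reports it over subjects.

```python
import random
import statistics


def five_fold_indices(n_samples, k=5, seed=0):
    """Yield (train, test) index lists; each fold holds out roughly 1/k of the data."""
    order = list(range(n_samples))
    random.Random(seed).shuffle(order)
    for fold in range(k):
        test = order[fold::k]  # every k-th shuffled index forms the held-out 20%
        held_out = set(test)
        train = [i for i in order if i not in held_out]
        yield train, test


def summarize(accuracies):
    """Mean accuracy (ACC) and population standard deviation (STD)."""
    return statistics.fmean(accuracies), statistics.pstdev(accuracies)


folds = list(five_fold_indices(4896))  # per-subject SEED sample count from the text
acc, std = summarize([0.95, 0.96, 0.97, 0.96, 0.96])  # made-up fold accuracies
```

With five folds, every sample is held out exactly once, and each training set contains about 80% of the data, which matches the 8:2 ratio stated above.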

#### *5.1. Ablation Study*

We conducted two sets of ablation experiments on the SEED dataset to validate the effectiveness of the combined bands and the attention fusion unit for emotion classification. The first experiment compared single-band prediction with combined-band prediction to validate the importance of integrating the band features. The second experiment compared multiple fusion approaches to validate the need for the attention fusion unit.

#### 5.1.1. Sub-Band Prediction and Combined Band Prediction

In our experiments, we compared the emotion classification accuracy in two cases: one uses a single-channel backbone network to extract the sub-band features, while the other uses a multi-channel backbone network combination to extract the sub-band features. Table 3 shows the experimental results on the two datasets. First, on the SEED dataset, C1 and C2 are different channel combination methods, as described in Section 3.2. Recall that C1 represents the combination of "FT7," "FT8," "T7," "T8," "TP7," and "TP8," and C2 represents the combination of "FT7," "T7," "TP7," "P7," "C5," "CP5," "FT8," "T8," "TP8," "P8," "C6," and "CP6." Second, on the LE-EEG dataset, All\_band indicates that all available EEG channels are used instead of C1 and C2. This is because the numbers of available EEG channels in the two datasets differ: 62 for the SEED dataset and 32 for the LE-EEG dataset. Furthermore, to keep the benchmark consistent when transferring the algorithm and to make a fair comparison, C3 was defined in Table 3 as the combination of "T7," "P7," "CP5," "T8," "P8," and "CP6," as shown in Figure 6. In Figure 6a, the scatter points are all 62 electrodes used in the SEED dataset, of which the blue points are the C2 electrodes; in Figure 6b, the scatter points are the electrodes used in the LE-EEG dataset, and the blue points are the C3 electrodes. Notably, the channels involved in C3 (see the blue points in Figure 6b) were chosen to match the locations of the channels involved in C2 (see the blue points in Figure 6a) as closely as possible.

**Table 3.** Accuracy comparison (i.e., ACC/STD) of different frequency bands (average 5-fold cross validation results).


**Figure 6.** Channel selection maps: (**a**) C2 on the SEED dataset; (**b**) C3 on the LE-EEG dataset.

Table 3 shows the classification accuracy of the five sub-bands (i.e., δ, θ, α, β, and γ) in the SEED dataset. β + γ denotes the add fusion method, and β × γ denotes the multiply fusion method. These two operations have been widely used in deep learning network design. Specifically, in the add fusion method, the corresponding elements of the feature matrices (output by the multi-channel convolutional network) of the sub-bands are added together. Similarly, in the multiply fusion method, the corresponding elements of the feature matrices of the sub-bands are multiplied. Attention (β, γ) indicates that the attention fusion unit is used for the feature-level fusion. Take C2 (see the third column of Table 3) as an example. Based on the experimental results of the single-channel network on the SEED dataset, we found that the β and γ bands yielded better predictions than the other bands; the accuracies of these two bands were 87.09% and 90.90%, respectively. Therefore, we combined the β and γ frequency bands, fed them to the multi-channel backbone network to extract features, and adopted three feature-level fusion methods for emotion prediction. The final experimental results showed that fusing the frequency band information (i.e., Attention (β, γ)) could improve the model accuracy; the resulting accuracy was 94.20%.
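The add and multiply fusion operations described above are plain element-wise operations on equally shaped feature matrices. The sketch below uses small nested lists as stand-ins for the β and γ branch outputs; the function names and the toy values are assumptions for illustration.

```python
import math


def add_fusion(*feature_maps):
    """Element-wise sum of equally shaped 2-D feature matrices."""
    return [[sum(vals) for vals in zip(*rows)] for rows in zip(*feature_maps)]


def mult_fusion(*feature_maps):
    """Element-wise product of equally shaped 2-D feature matrices."""
    return [[math.prod(vals) for vals in zip(*rows)] for rows in zip(*feature_maps)]


# Toy 2 x 3 feature matrices standing in for the beta and gamma branch outputs.
f_beta = [[1.0, 2.0, 3.0], [0.5, 0.0, 1.0]]
f_gamma = [[2.0, 2.0, 0.5], [4.0, 1.0, 1.0]]

added = add_fusion(f_beta, f_gamma)
multiplied = mult_fusion(f_beta, f_gamma)
```

Both operations are parameter-free, which distinguishes them from the attention fusion unit: the latter additionally scales the fused feature by a learned weight.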

Furthermore, on the LE-EEG dataset, the emotion classification accuracy in each sub-band was high. We believe that the possible reasons for this include the following: (i) compared with the SEED dataset (*N* = 15), the LE-EEG dataset had a larger number of subjects (*N* = 45); and (ii) after data fusion, the training set of the LE-EEG dataset became even larger, which results in better model performance after training. In addition, a comparison of the last two columns in Table 3 shows that All\_band achieved a higher classification accuracy than the C3 channel combination in each sub-band, so the channel selection did not yield better classification results here. We believe that this is because the types of emotions in the two datasets were different. To be specific, the SEED dataset was designed to explore three basic emotions (negative, neutral, and positive), while the LE-EEG dataset explored three learning emotions (engagement, neutrality, and boredom). Therefore, the channels relevant to basic emotions may not be applicable to the study of learning emotions. At this stage, there is no prior literature on channel selection for learning emotions, so exploring learning-emotion-related channels should be a focus of future work. The optimal combination of channels for learning emotions is not discussed further in this paper.

#### 5.1.2. Comparison of the Results of Fusion Methods

In the previous subsection, we verified the effectiveness of combining frequency band features to improve the model performance. This subsection focuses on analyzing the impact of multiple fusion methods on the model accuracy and verifying the necessity of the attention fusion unit. We compared three fusion methods, namely feature summation fusion, feature multiplication fusion, and attention weight fusion, which are denoted as *Add*, *Mult*, and *Attention* in Table 4, respectively. Table 4 shows the classification accuracy on the SEED dataset after feeding different combinations of the five sub-bands (i.e., δ, θ, α, β, and γ) into the multi-channel backbone network to extract features.


**Table 4.** Accuracy comparison (i.e., ACC/STD) of various fusion methods validated on SEED dataset (average 5-fold cross validation results).

Notably, Add means that the features are directly added and fused; Mult means that the features are multiplied and fused; Attention means that the attention fusion unit is used for feature-level fusion; and bold indicates the best accuracy achieved using different fusion methods (for a given channel combination, C1 or C2).

Our experiments revealed, first, that the model with the proposed attention fusion unit generally performs better as more frequency bands are combined; however, combining more frequency bands does not always guarantee a higher emotion classification performance. For example, compared with the sub-band combinations shown in the other rows of Table 4, in the case of the sub-band combination (δ, α, β, γ) shown in the last row of Table 4, (i) the model performance using the *Add* fusion mode decreased (see the 2nd and 5th columns of the last row in Table 4), but remained relatively stable; (ii) the model performance using either the *Mult* or the *Attention* fusion mode (see the 3rd and 6th columns or the 4th and 7th columns of the last row in Table 4) was seriously degraded. A possible reason is that, when the model was trained, the *Mult* and *Attention* fusion methods made the number of training parameters increase exponentially, resulting in severe overfitting caused by model overtraining.

Second, the best performance obtained by C2 (see the 5th–7th columns of Table 4) was always higher than that of C1 (see the 2nd–4th columns of Table 4). For clarification, let us take the sub-band combination (δ, γ) as an example. From the 4th row in Table 4, we can see that (i) regarding C1, the best performance of 95.63% was achieved using the *Attention* fusion method; (ii) regarding C2, the best performance of 95.70% was again achieved using the *Attention* fusion method, i.e., compared with C1, an accuracy improvement of 0.07% was achieved by C2.

Third, regarding C2, the top two performances were achieved by the sub-band combinations (α, β, γ) and (δ, β, γ) using the *Attention* fusion method, which were 96.02% and 96.45%, respectively (see the 2nd and 3rd last rows of the last column in Table 4). Take the sub-band combination (δ, β, γ) as an example. Compared with *Add* and *Mult*, the *Attention* fusion method improved the accuracy by 0.67% and 0.30%, respectively. This demonstrates that the *Attention* fusion method can improve the classification performance, because the attention weights emphasize the more important features.

#### *5.2. Comparison*

Based on the above experiments, we used the δ, β, and γ bands with attention fusion for the comparison. On the SEED dataset, the proposed model was compared with the baseline models. Table 5 presents the results. Compared with the best baseline model (see the row of "DCCA [39]" in Table 5), the performance of our model was improved by 1.37%.


**Table 5.** Accuracy comparison (i.e., ACC/STD) versus baseline models (average 5-fold cross validation results).

A dash (i.e., "—") indicates that data were not provided, and bold indicates the best accuracy achieved for a given dataset.
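The ACC/STD entries reported in the tables are averages over 5-fold cross validation. The following minimal sketch shows one way such a summary could be computed; the contiguous fold split, the use of the population standard deviation, and the per-fold accuracies are assumptions made for illustration and are not values or settings taken from this work.

```python
import statistics

def k_fold_indices(n_samples, k=5):
    """Split range(n_samples) into k contiguous, near-equal test folds."""
    base, extra = divmod(n_samples, k)
    folds, start = [], 0
    for i in range(k):
        size = base + (1 if i < extra else 0)
        folds.append(list(range(start, start + size)))
        start += size
    return folds

def summarize(fold_accuracies):
    """Return (mean accuracy, standard deviation) over the folds."""
    mean = statistics.mean(fold_accuracies)
    std = statistics.pstdev(fold_accuracies)  # population STD (an assumption)
    return mean, std

# Each fold serves once as the test set; the rest is used for training.
folds = k_fold_indices(12, k=5)

# Hypothetical per-fold accuracies (%), not results from the paper.
acc, std = summarize([95.0, 96.0, 97.0, 96.0, 96.0])
```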

Among the baseline models used on the SEED dataset, two models that can be reproduced with the shared code, 4D\_CRNN [35] and SOGNN [31], were selected for comparison when validating on the LE-EEG dataset. Table 5 presents the comparison with these baseline models. Compared with these two baseline models, the performance of our model was improved by 28.39% and 21.49%, respectively (see the 3rd column of the rows of "4D\_CRNN [35]," "SOGNN [31]," and "ECN-AF(All\_band)" in Table 5), confirming that the network was robust across datasets. Figure 7 shows the validation set accuracy of the three models during the training process. Again, the ECN-AF model yields a better performance.

**Figure 7.** Accuracy of the model's validation set.

#### **6. Conclusions**

In this study, we collected the EEG signals of 45 subjects while they were watching learning materials. We established the LE-EEG dataset and tried to use the EEG signals to recognize learning emotions. The proposed ECN-AF first extracted the frequency band features through a multi-channel backbone network, and then fused the frequency band features with attention, which could effectively improve the model performance. Using the complementarity of the frequency band combination effectively improved the model's accuracy and robustness and yielded better results compared to a single sub-band. This is a conclusion similar to that of previous studies [30,31]. The ablation experiments performed herein also demonstrated the necessity of multi-channel backbone blocks and attention blocks. The experiments on the SEED and LE-EEG datasets showed that the proposed model outperforms baseline models with a better cross-dataset performance.

Our future work will focus on the expansion of the LE-EEG dataset and on the construction of a physiological signal dataset for multimodal learning emotion recognition. At the same time, the learning of emotion-related frequency bands and related brain regions and channels must be continuously explored and optimized, e.g., to further improve the performance by exploring the optimal combination of EEG channels on the LE-EEG dataset. The accuracy of the proposed model still needs improvement in across-participant research. The generalization ability and robustness of the algorithm must also be further improved.

**Author Contributions:** Conceptualization, X.Z.; methodology, X.Z., Z.H. and L.Z.; software, W.R.; Data collection, W.R., Q.Y., J.S. and G.L.; validation, W.R.; investigation, Z.H.; writing—original draft preparation, W.R.; writing—review and editing, X.Z. and L.Z.; supervision, X.Z. and L.Z.; project administration, X.Z. All authors have read and agreed to the published version of the manuscript.

**Funding:** The authors would like to thank support from the National Key R&D Program of China (2020AAA0108804), National Natural Science Foundation of China (61937001) and the National Natural Science Foundation of Hubei Province (2021CFB157).

**Institutional Review Board Statement:** Our Institutional Review Board approved the study.

**Informed Consent Statement:** Informed consent was obtained from all subjects involved in the study.

**Data Availability Statement:** The open access dataset SEED is used in our study. Its link is as follows: https://bcmi.sjtu.edu.cn/~seed/seed.html (granted on 7 May 2020; accessed on 25 April 2022).

**Conflicts of Interest:** The authors declare that they have no conflict of interest to report regarding the present study.

#### **Appendix A**

Following [52–57], a questionnaire is considered invalid if it exhibits one or more of the six factors listed in Table A1.


**Table A1.** Summary of methods of careless/insufficient effort (C/IE) detection.

#### **References**


## *Article* **Fear Detection in Multimodal Affective Computing: Physiological Signals versus Catecholamine Concentration**

**Laura Gutiérrez-Martín 1,2, Elena Romero-Perales 1,2, Clara Sainz de Baranda Andújar 1,3, Manuel F. Canabal-Benito 1,2, Gema Esther Rodríguez-Ramos 1, Rafael Toro-Flores 4, Susana López-Ongil 4 and Celia López-Ongil 1,2,\***


**Abstract:** Affective computing through physiological signals monitoring is currently a hot topic in the scientific literature, but also in the industry. Many wearable devices are being developed for health or wellness tracking during daily life or sports activity. Likewise, other applications are being proposed for the early detection of risk situations involving sexual or violent aggressions, with the identification of panic or fear emotions. The use of other sources of information, such as video or audio signals, will make multimodal affective computing a more powerful tool for emotion classification, improving the detection capability. There are other biological elements that have not been explored yet and that could provide additional information to better disentangle negative emotions, such as fear or panic. Catecholamines are hormones produced by the adrenal glands, two small glands located above the kidneys. These hormones are released in the body in response to physical or emotional stress. The main catecholamines, namely adrenaline, noradrenaline and dopamine, have been analysed, as well as four physiological variables: skin temperature, electrodermal activity, blood volume pulse (to calculate heart rate activity, i.e., beats per minute) and respiration rate. This work presents a comparison of the results provided by the analysis of physiological signals in reference to catecholamine, from an experimental task with 21 female volunteers receiving audiovisual stimuli through an immersive environment in virtual reality. Artificial intelligence algorithms for fear classification with physiological variables and plasma catecholamine concentration levels have been proposed and tested. The best results have been obtained with the features extracted from the physiological variables. Adding the maximum variation of the catecholamine levels during the five minutes after the video clip visualization, or adding the five measurements (at 1-min intervals) of these levels, does not improve the performance of the classifiers.

**Keywords:** multimodal affective computing; catecholamines; emotion classification; wearable devices

#### **1. Introduction**

Affective computing, the study, analysis, and interpretation of human emotional reactions by means of artificial intelligence [1], has become a hot topic in the scientific community. Possible applications include accurate neuromarketing techniques, more efficient human-machine interfaces and new wellness and/or healthcare practices, with innovative therapies for phobias and mental illnesses [2–6]. Recently, the prevention of violent attacks on vulnerable people by means of the early detection of fear or panic emotional reactions is under research in this area [7].

**Citation:** Gutiérrez-Martín, L.; Romero-Perales, E.; de Baranda Andújar, C.S.; F. Canabal-Benito, M.; Rodríguez-Ramos, G.E.; Toro-Flores, R.; López-Ongil, S.; López-Ongil, C. Fear Detection in Multimodal Affective Computing: Physiological Signals versus Catecholamine Concentration. *Sensors* **2022**, *22*, 4023. https://doi.org/10.3390/s22114023

Academic Editor: Mincheol Whang

Received: 30 April 2022 Accepted: 18 May 2022 Published: 26 May 2022

**Publisher's Note:** MDPI stays neutral with regard to jurisdictional claims in published maps and institutional affiliations.

**Copyright:** © 2022 by the authors. Licensee MDPI, Basel, Switzerland. This article is an open access article distributed under the terms and conditions of the Creative Commons Attribution (CC BY) license (https://creativecommons.org/licenses/by/4.0/).

In affective computing, many research areas merge to provide efficient and accurate systems capable of classifying the emotion felt by a person. Apart from psychology, neuroscience and physiology, other disciplines are required to automate the emotion detection process as well as to allow in-depth data analysis and useful feedback.

Human emotions are the consequence of biochemical reactions in the brain. External stimuli are processed in certain brain regions such as the amygdala, insula and prefrontal cortex [8–10]. These areas activate the autonomic nervous system, which triggers physiological changes as an emotional response. Within the global emotional response, we can distinguish conscious and unconscious processes. The cognitive component of the emotion reaches a high degree of consciousness and can feed back into the chain of physiological reactions.

Measuring and processing these physiological reactions makes it possible to automate the emotion detection and classification process, which is known as affective computing. If this detection involves several sources of information, it is known as multimodal affective computing. Validity and corroboration issues have made physiological variables the most attractive to researchers. Commonly used multimodal recordings are Galvanic Skin Response (GSR), ElectroMyoGraphy (EMG) (frequency of muscle tension), Heart Rate (HR), Respiration Rate (RR), ElectroEncephaloGraphy (EEG), functional Magnetic Resonance Imaging (fMRI), and Positron Emission Tomography (PET) [11], even though behavioural measurements such as facial expressions, voice, movement, and subjective self-reporting can also be useful for experimental purposes.

In this sense, some authors have related non-external physiological variables with emotional reactions [12]. For example, the levels of neurotransmitters in the brain or circulating catecholamines vary depending on a person's emotional state, affecting activity of physiological variables. Although their measures are very invasive, the relation between physiological variable changes and the concentration of these molecules makes them interesting in some applications of affective computing. For example, in risk situations, this early detection of fear or panic emotions would trigger a protection response for the person in danger. To date, there is no study using catecholamine concentration in blood plasma for emotion detection that includes an experimental sample in humans, just theoretical studies.

The concentration of catecholamines is usually measured in urine to diagnose or rule out the presence of certain tumours such as pheochromocytoma or neuroblastoma because these tumours raise the levels significantly. However, in basal conditions, the levels are low and can be detected in blood by high-performance liquid chromatography (HPLC) techniques.

Continuous and autonomous measurement of these molecules is not available currently, but if they prove useful, wearable analysis devices could be designed and developed, similar to insulin micropumps [13].

In this work, a methodology and protocol are proposed to connect the elicitation of human emotions with the variation of plasma catecholamine concentration. For this first test, fear is chosen as the target emotion for two main reasons. On the one hand, the relationship between neurotransmitters and stress or fear is well documented in the literature, as they are responsible for the activation of the body's fight or flight mechanisms. On the other hand, the protection of women against gender-based violence has been chosen as a target application. For this purpose, the objective is to be able to detect fear automatically so that an alarm is triggered to protect women in danger. Although there is already work in this area, so far only physiological variables have been used. In order to validate whether the inclusion of catecholamine plasma concentration improves the results, an immersive virtual reality environment has been arranged to provoke realistic situations where the volunteer could have intense emotional reactions. Continuous monitoring of physiological variables, with a research toolkit system (for the sake of comparison with other affective computing research works), is connected with the virtual environment, as well as to an interface for the classification of the emotions elicited. The detection of emotions in humans through the plasma concentration of catecholamines has been analysed and compared with externally measured physiological variables, such as SKT, HR and EDA. The main results obtained are very positive with regard to physiological variables, while they are not conclusive for the levels of catecholamine concentration in blood plasma. The main contributions of this work can be summarized as:


The rest of this paper is organized as follows: Section 2 provides a review of the state of the art regarding emotion theory, automatic emotion detection, and physiological response related to catecholamines and emotion. As a result, we can formulate the hypothesis of this work. Section 3 describes the methodology used in this work for the experimental setup, including the sample description, the design of the study, the stimuli used, the labelling method, and the collected measurements. Section 4 presents the experimental results (for labelling, physiological variables and catecholamine concentration). Additionally, we present an artificial intelligence algorithm analysis in order to validate the hypothesis formulated previously. The discussion is presented in Section 5, and finally, Section 6 concludes the work.

#### **2. State of the Art: Emotions, Physiological Response and Affective Computing**

#### *2.1. Emotions*

Emotions are fundamental for human beings since they play an important role in individual and social behaviour and mental processes, such as decision making, perception, memory, attention, etc. [14]. However, they have been partially ignored in the past, generally due to the difficulties they pose for experimental methodology.

The identification and classification of emotions for improving people's lives have gained interest in recent years, as several fields can take advantage of the results in this area [15–17], such as mental health, human-machine interfaces, learning and teaching methods, video games or neuromarketing. In psychology, emotions are described as "psychological states that include three components: subjective personal experience, associated physiological response, and behaviours" [18,19].

Within the literature and the state of the art in emotion identification and classification, there are two trends: (1) the classification of emotions as discrete elements, and (2) their inclusion in a continuous vector space. Within the first option, different classifications have been proposed. The first classification was presented by Ekman [20] using six basic emotions (happiness, sadness, disgust, fear, surprise, and anger). Since then, other classifications have been presented, adding emotions or changing some of them [21,22]. Within the second option, we find the representation in the affective space. This consists of a multidimensional representation of the emotion (usually along two or three axes), so that the affective space becomes a continuous space in which every emotional state is represented by two or three coordinates. The most commonly used space in recent work [23] proposes three dimensions (valence, arousal, and dominance). In this space, valence-pleasure (P) indicates positive or negative emotions; arousal (A) ranges from calm to high excitement levels; and finally, dominance (D) denotes the ability to control the emotion [24]. Several studies [25] of emotion classification use only a 2-dimensional space (PA space) with the valence and arousal axes described above. This generates four quadrants in the space for locating emotions (Q1, Q2, Q3, and Q4). Some authors [26,27] have tried to place the discrete emotions in the quadrants according to the valence and arousal presumably experienced with each of them (see Figure 1a). Adding the third dimension (D) allows for differentiating discrete emotions sharing similar values in the PA space, such as fear and anger in Q2.

**Figure 1.** (**a**) Discrete emotion mapping in PA space in the literature. (**b**) Results extracted from Spanish study [28].
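The quadrant assignment in the PA space, and the role of dominance in separating emotions that share a quadrant, can be illustrated with a short sketch. It assumes that valence and arousal are scaled to [-1, 1] and that quadrants are numbered counter-clockwise with valence on the horizontal axis, which is consistent with fear and anger lying in Q2; the coordinates themselves are hypothetical and are not taken from the cited studies.

```python
def pa_quadrant(valence, arousal):
    """Map a (valence, arousal) point, each in [-1, 1], to Q1..Q4.

    Assumed convention: quadrants are numbered counter-clockwise with
    valence on the horizontal axis, so Q2 is negative valence combined
    with high arousal.
    """
    if arousal >= 0:
        return "Q1" if valence >= 0 else "Q2"
    return "Q4" if valence >= 0 else "Q3"

# Hypothetical (valence, arousal, dominance) coordinates for illustration.
fear = (-0.6, 0.7, -0.5)
anger = (-0.5, 0.6, 0.4)

# Both emotions fall in the same PA quadrant ...
same_quadrant = pa_quadrant(*fear[:2]) == pa_quadrant(*anger[:2])
# ... but the sign of the dominance coordinate tells them apart.
separated_by_dominance = (fear[2] < 0) != (anger[2] < 0)
```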

Both emotion classification systems present difficulties when applied to the automatic identification of emotions and their experimental validation. On the one hand, the use of discrete emotions is considerably biased by the sociocultural environment of the person [28], especially the background and the country of origin. In addition, identifying an emotion depends to a considerable extent on correctly understanding its description and its nuances [29]. In an attempt to address this, several emotions have been added to the list, making it longer, but this also leads to problems for automatic emotion classification methods (as they add subtle differences in the responses). On the other hand, PAD affective space systems are often hampered by the difficulty of understanding the three classification axes.

#### *2.2. Emotion Detection*

Affective computing has emerged to shed light on the gap where technology and emotions converge. One of the goals of this field is to model the emotional response to a wide variety of stimuli by evaluating emotional states. These states become measurable through subjective self-reports, physiological variables and behaviour.

The main elements involved in affective computing systems are the emotions theory [30] which connects human affective reactions to external stimuli, attending to intrinsic and extrinsic factors, with externally measurable physical and physiological changes; collecting data with smart sensors, first through emotion elicitation experiments in the lab and secondly through live in-the-wild monitoring; and the generation, training and integration of artificial intelligence algorithms in autonomous systems [3].

In affective computing, those changes are objectively measured in the person to determine the emotion felt. External (behavioural) aspects, such as facial expression, voice, movement, etc., are voluntary and biased through culture and society, making them difficult to apply to user-independent emotion detection. On the other hand, physiological changes (involuntary reactions) with an external effect (i.e., that can be measured in a non-invasive way) have been preferred [31]. Typical variables used in affective computing include galvanic skin response, which increases linearly with a person's level of arousal [32,33]; electromyography (frequency of muscle tension), which is correlated with emotions of negative valence [34]; heart rate, which increases with negative valence emotions like fear [35,36]; respiration rate (how deep and fast the breath is), which becomes irregular with more aroused emotions like anger [37]; electroencephalography [38,39]; and functional magnetic resonance imaging [40].

All these variables differ in many aspects, including ease of measurement, which is related to how internal or external the target signal is; consciousness, because some variables can be consciously controlled and altered by the individual; and invasiveness, because some variables can be measured with less invasiveness for the individual than others. Many affective computing systems combine several variables in order to increase the performance of the application, integrating solutions known as multimodal affective computing [41–43]. This allows combining several features from different sources, which usually makes the automatic detection more complex but also more accurate.

Intelligent algorithms should be trained with these measured physiological variables together with the subjectively perceived emotion reported during stimulus presentation. Among the available options, the literature [44] highlights those used in constrained devices: Support Vector Machine (SVM) [45], K-Nearest Neighbours (KNN) [46] and Ensemble Methods (ENS) [47]. For training and research purposes, there are different databases compiling all these data to help in the generation of affective computing systems [48,49].
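To make the training setting concrete, the following minimal sketch implements a K-Nearest Neighbours classifier that assigns a label by majority vote among the closest labelled trials. The feature choice, the normalized toy values, and the label names are hypothetical and only illustrate the idea of pairing physiological features with self-reported emotion labels; they are not the features or data of this study.

```python
import math
from collections import Counter

def knn_predict(train_x, train_y, query, k=3):
    """Classify `query` by majority vote among its k nearest training points."""
    order = sorted(range(len(train_x)), key=lambda i: math.dist(train_x[i], query))
    votes = Counter(train_y[i] for i in order[:k])
    return votes.most_common(1)[0][0]

# Hypothetical, already-normalized features per trial:
# (mean heart rate, mean skin conductance, skin-temperature slope).
train_x = [
    (0.2, 0.1, 0.0), (0.3, 0.2, 0.1), (0.1, 0.2, 0.0),     # labelled "no_fear"
    (0.8, 0.9, -0.3), (0.9, 0.7, -0.2), (0.7, 0.8, -0.4),  # labelled "fear"
]
train_y = ["no_fear"] * 3 + ["fear"] * 3

# A new trial close to the "fear" cluster.
prediction = knn_predict(train_x, train_y, query=(0.85, 0.8, -0.3), k=3)
```

KNN is shown here only because it needs no training phase; an SVM or an ensemble method would be fitted on the same kind of feature and label pairs.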

The measurement of these physiological variables with wearable devices during daily life is associated with a high amount of noise due to interferences and users' movements [50]. There are several works proposing solutions to eliminate or reduce this noise, through filters, algorithms, and even, fuzzy logic [51], but these techniques are expensive in terms of power consumption, the time required, and computation effort.
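As a simple illustration of filter-based noise reduction, the sketch below applies a centered moving average to a signal containing an isolated motion spike. This is a generic smoothing technique chosen for brevity and is not one of the specific methods proposed in the cited works; the window size and the toy trace are assumptions.

```python
def moving_average(signal, window=3):
    """Centered moving average; edges use only the samples that are available.

    `window` is assumed to be an odd number of samples.
    """
    half = window // 2
    smoothed = []
    for i in range(len(signal)):
        lo = max(0, i - half)
        hi = min(len(signal), i + half + 1)
        smoothed.append(sum(signal[lo:hi]) / (hi - lo))
    return smoothed

# Hypothetical skin-conductance trace with one motion spike at index 3.
raw = [1.0, 1.0, 1.0, 4.0, 1.0, 1.0, 1.0]
clean = moving_average(raw, window=3)
```

The spike is attenuated but also spread over its neighbours, which hints at the trade-off between noise suppression and signal distortion that more elaborate filters try to manage.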

In order to overcome this problem, other variables could be tested to assess whether their inclusion is pertinent. Among them, the presence of catecholamines in blood plasma, saliva or sweat could be an interesting option, even if its measurement is more invasive, as it could be more robust against artifacts.

#### *2.3. Catecholamines in Emotion Detection*

Since the first half of the 20th century, theories have emerged to explain the physiological changes caused by stressful stimuli that alter the body's homeostasis. These theories evolved from the "stress non-specificity" approach to the "stress specificity" approach [52]. This means that the first theories of stress regarded this response as relatively independent of the type of threat. Whether it was exposure to cold, haemorrhage or distressing emotional encounters, the stress response would be essentially the same [53]. However, recent data and observations indicate the probable existence of a variety of stressors with different targets and different effects on homeostasis [54]. These theories tend to explain the stress response by considering that it has a primitive type of specificity, with differential responses of the sympathetic nervous and adrenomedullary hormonal systems, depending on the type and intensity of the stressor perceived by the organism and interpreted in the light of experience [55]. The activation of the adrenomedullary hormonal system has been linked to glucoprivation and emotional distress such as fear. There is some accumulated evidence of an association between noradrenaline and active escape, avoidance or attack, and of a link between adrenaline and passive, immobile fear [56].

Catecholamines are hormones made in nerve tissue, the brain, and the adrenal glands. If they are found in the synapses of the nervous system, they are classified as neurotransmitters, and if they are found in the bloodstream, they are classified as hormones. The adrenal glands produce large amounts of catecholamines in response to acute stress or elevated arousal [57]. The main catecholamines are adrenaline (epinephrine), noradrenaline (norepinephrine) and dopamine. Catecholamines help the body to respond to stress or fear and prepare the body for "fight or flight" reactions [58]. This reaction to states of threat or high arousal results in a general discharge of catecholamines from three peripheral systems: the sympathetic branch of the autonomic nervous system, the adrenomedullary hormonal system and the autocrine/paracrine dopaminergic system. The activation of these systems favours the secretion of catecholamines into the bloodstream, where they trigger a cascade of physiological changes in peripheral tissues after binding to their receptors. Catecholamines increase heart rate, blood pressure, respiratory rate, muscle strength, and alertness. They also reduce the amount of blood going to the skin and intestines and increase blood going to major organs, such as the brain, heart, and kidneys [59].

Theoretical studies such as [12] propose that there is a direct relationship between neurotransmitter levels (dopamine, noradrenaline, and serotonin) and emotions. In this model, for example, fear is related to a combination of a low level of serotonin, a low level of noradrenaline and a high level of dopamine (see Figure 2).

**Figure 2.** Lövheim cube showing correspondence among catecholamines and emotions (based on [12]).
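The cube in Figure 2 can be read as a lookup from three binarized neurotransmitter levels to an emotion. The sketch below encodes only the corner stated in the text (low serotonin, low noradrenaline, high dopamine for fear) and deliberately leaves the other corners out; the thresholds, the unit-free values, and the function names are hypothetical.

```python
# Key order: (serotonin, noradrenaline, dopamine), each "low" or "high".
# Only the corner named in the text is filled in; the remaining seven
# corners of the cube are deliberately left out of this sketch.
CUBE_CORNERS = {
    ("low", "low", "high"): "fear",
}

def to_level(value, threshold):
    """Binarize a concentration against a (hypothetical) threshold."""
    return "high" if value >= threshold else "low"

def emotion_from_levels(serotonin, noradrenaline, dopamine, thresholds):
    """Look up the emotion for the corner defined by the three levels."""
    key = (
        to_level(serotonin, thresholds["serotonin"]),
        to_level(noradrenaline, thresholds["noradrenaline"]),
        to_level(dopamine, thresholds["dopamine"]),
    )
    return CUBE_CORNERS.get(key)  # None when the corner is not listed here

# Arbitrary unit-free values and thresholds, for illustration only.
thresholds = {"serotonin": 0.5, "noradrenaline": 0.5, "dopamine": 0.5}
label = emotion_from_levels(0.2, 0.3, 0.9, thresholds)
```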

Lövheim's study describes a theoretical framework that, if measurable, could improve multimodal affective computing systems for the automatic identification and classification of emotions. In fact, the study proposes to continue this research with a further experimental test to validate the proposal. Walker also proposes a theoretical framework that includes cortisol (a hormone produced in the adrenal gland) as an indicator related to fear and stress [60]. Again, this work suggested validating the framework with experimental tests. There are no experimental results relating catecholamines to human emotions, although some previous tests have been performed in cats [61]. Directly measuring the presence of neurotransmitters is very invasive and nearly impossible on a day-to-day basis, so measuring the presence of catecholamines in blood plasma in an experimental setup, in order to confirm whether this presence is related to different emotional states, is a good starting point for future developments in affective computing research.

#### *2.4. Hypotheses*

Once the state of the art is reviewed, it can be stated that there is a lack of experimental studies that validate the relationship and convenience of using the concentration of plasma catecholamine in affective computing. So, in this work, the authors propose that:


If this hypothesis is proved correct, an automatic system for early detection of emotional states of fear can be implemented, reducing the effect of interferences and noise in the measured signals. Better protection for people in dangerous situations will be provided through the activation of early protective responses.

#### **3. Material and Methods**

In this section, we present the proposed methodology for data collection of both physiological variables and catecholamines in an immersive environment for emotion elicitation. Since the design of this experiment involves the extraction of blood samples for the analysis of catecholamines in blood plasma, and the number of samples cannot be high, fear has been chosen as the target emotion, since, as discussed in Section 2, it is highly related to the release of catecholamines.

In addition, some considerations have to be taken into account. As stated before, one of the objectives of the authors is to apply multimodal affective computing to the protection of women victims of gender-based violence. For this reason, the sample of this study is entirely composed of women, and the proposed final application also influences the choice of one of the audio-visual stimuli, which is directly related to gender violence.

#### *3.1. Sample of the Study*

The study population consisted of 21 apparently healthy female volunteers, all of them Spanish healthcare workers. Study subjects were not allowed to perform strenuous exercise, smoke, eat certain foods, or take drugs or certain medicines (Table 1) for at least 24 h before the analysis, to avoid interference with the catecholamine measurement.


**Table 1.** Foods, drinks, and drugs that can interfere with the analysis of catecholamines.

The main data of the female volunteers are listed in Table 2. The mean age of the volunteers was 36 years. Only 5 of them had one child, and 13 volunteers were single. With regard to Body Mass Index (BMI), only 4 volunteers presented values between 25 and 30, which is indicative of overweight. Finally, 4 volunteers were in menopause. Some volunteers (6) were taking treatments for chronic illnesses (hypertension, chronic pain, heart failure, ulcerative colitis, anaemia, and diabetes).


**Table 2.** Characteristics of women volunteers.

The study conforms to the ethical principles outlined in the Declaration of Helsinki. Design of the study was approved by the Research Ethics Committee (REC) of Principe de Asturias Hospital with protocol number: CLO (LIB 10/2019). All participants received a detailed description of the purpose and design of the study and signed informed consent approved by the REC.

#### *3.2. Design of the Study*

The study consisted of measuring the physiological variables of a set of volunteers while they were watching a set of 4 emotion-related videos in an immersive virtual reality environment. Additionally, several blood samples were drawn after the visualization of three of these videos to analyse the plasma catecholamine levels (dopamine, adrenaline, and noradrenaline). In addition, after watching each video, the volunteer labelled the emotions elicited during the visualization.

Each participant fasted for at least twelve hours before the experiment. Prior to the experiment, the participant filled in a form providing information such as personality traits, sex, age group, recent physical activity, or medication (which could alter the participant's physiological response), self-identified emotional loads, and mood bias (fears, phobias, or traumatic experiences), summarized in Table 2. This information could be relevant and informative with regard to the emotional reactions of the participants during the experiment, affecting their cognition, appraisal, and attention.

The experiment was designed to last 2 h in total. Figure 3 shows the schedule of the experiment. After the interview, the completion of the questionnaire, and the signing of the informed consent, the test schedule and protocol were explained to each volunteer and a short demonstration of the virtual reality environment was given. Then, the sensors for measuring the physiological variables were placed. The BioSignalPlux® research toolkit system was used to record the evolution of the physiological variables throughout the study, such as forearm skin temperature, galvanic skin response, finger blood volume pulse (BVP), trapezoidal electromyogram, and chest respiration. The sensors were placed at different locations on the volunteer's body (arm, hand, chest, and finger; see Figure 4). These physiological signals were selected because they could be easily implemented in an inconspicuous and comfortable wearable device, without inconveniencing the user. There are smartwatches that already integrate BVP, GSR, and SKT sensors. Respiration and EMG could be integrated into a patch or band. This characteristic is mandatory for this type of application.

**Figure 3.** Schedule of the experiment for each volunteer.

**Figure 4.** Electrodes and sensors position for experiment.

Once the volunteer had been shown how to handle the equipment to label each video, the nurse placed an intravenous line (via) in the antecubital vein to extract blood samples at different time points of the study: at the beginning (basal point) and after each video (5 samples). Each subject watched four unexpected videos related to different emotions, which had to be labelled according to what she was feeling at that moment. A blood sample was taken just after each video finished. After videos 2, 3 and 4, five samples were collected at 1 min intervals to monitor the changes in catecholamine levels (Figure 5).

**Figure 5.** Volunteer ready to start the experiment.

#### *3.3. Audiovisual Stimulus*

Every subject watched four videos: two related to the emotion of fear, one related to calm, and one related to joy. The schedule is Calm, Fear, Joy, Fear. The order of the two fear-related videos is set randomly for each volunteer.

The video clips used for the experiment were selected from the UC3M4Safety Database of audiovisual stimuli aimed to elicit different emotional reactions through an immersive virtual reality environment [62] (see Figure 6). Most of the clips were 360-degree scenes providing more realistic experiences.

**Figure 6.** Screenshots for fear and calm video visualization.

The Oculus™ Rift S Headset was used under an application built on Unity™ that connects the video clips projection to the physiological monitoring system and records the emotion labelling. The whole data recording system was initiated by the virtual reality environment that manages both video stimuli and sensor measurement. A TCP/IP port connection was created at the beginning of the trial to communicate with the OpenSignals application. The information storage was divided by scenes, meaning each file contained the information collected between two timestamps (start and end of each screen) set by the environment, thus enabling synchronization.

The four video clips, V1, V2, V3, and V4, were aimed at provoking calm, fear (gender-based violence related), joy and fear, respectively.


These videos obtained very high unanimity in discrete emotion labelling, higher among women for the fear and joy clips, while the means and standard deviations in the PAD affective space dimensions are also closer than expected for the fear clips and for women (Table 3). This table shows the discrete emotion labelled for every video in the experiment detailed in [28], as well as the three dimensions of the PAD affective space. As can be seen, V2 and V4 both show very high unanimity among women for the discrete emotion of fear. Regarding the PAD variables, the dispersion and the mean comply with the expected ranges.


**Table 3.** Emotional Labelling of the video clips used in the experiment [28].

#### *3.4. Labelling*

To overcome the problems related to the labelling methods mentioned above, in this work we include a discrete classification of emotions (joy, hope, surprise, attraction, tenderness, calm, tedium, contempt, sadness, fear, disgust, and anger), an indicator of emotional intensity to detect more nuances, and a classification in the PAD affective space using the SAM methodology [63] (see Figure 7). As depicted in Figure 3, the labelling is carried out just after the blood sample collection.

**Figure 7.** Labelling screen used in the experiment.

#### *3.5. Measurement of Dopamine, Adrenaline and Noradrenaline*

We determined catecholamines in 3 mL of plasma by high-performance liquid chromatography (HPLC). Blood samples were collected in pre-chilled EDTA-treated tubes, in the morning after a 12 h overnight fast and resting period. As several samples had to be taken at short intervals after watching each video, an intravenous line (via) was placed to assist sample collection at each point of the study. To prevent catecholamine degradation, plasma was immediately separated by centrifugation at 2000× *g* for 15 min at 4 °C. After that, the plasma was collected in clean, pre-chilled tubes and stored at −80 °C until measured. All plasma samples were submitted to Reference Laboratory S.A. (L'Hospitalet de Llobregat, Barcelona, Spain) for measurement of adrenaline, noradrenaline and dopamine in each sample by HPLC.

Measuring serotonin requires serum instead of plasma, which means extracting additional 5 mL blood samples from each volunteer. Apart from the extra cost, equivalent to that of measuring the three catecholamines, the large number of samples required prevented the authors from analysing the evolution of serotonin concentrations during the study.

#### **4. Experimental Results**

The experiments were performed in December 2020 and January 2021, on 12 and 9 volunteers, respectively.

#### *4.1. Emotion Labeling*

As already mentioned, emotional labelling is a complex task, not only because the target emotions are sometimes not the ones elicited in the volunteers, but also because of the terminology.

For that reason, at first, it is important to analyse the distribution of the labels reported during the experiment and study how well the clips have been eliciting their target emotions.

Considering the discrete classification (Figure 8), the clip targeting the general fear emotion (V4) shows the highest agreement among the volunteers: 95% of them labelled it as fear. For the calm (V1) and joy (V3) clips, no single emotion obtains a clear majority; however, if the quadrants of the PAD space are analysed, these videos show 76% and 90% agreement, respectively.

**Figure 8.** Emotion labelling distribution (0.00–1.00) between emotions reported by the volunteers w.r.t. each video clip visualized.

On the other hand, V2 shows the highest dispersion: although fear is the most used label (48%), anger (19%) and sadness (19%) together represent approximately 40% of the reported classifications. This scattering is mainly due to the scenes presented in the clip. As we have already found in previous works [28], gender-based violence videos elicit this variety of emotions depending on the volunteer's perspective (first person or external).

As regards continuous labelling, independently of the dispersion found in discrete labelling, both fear clips are located at their theoretical ideal position in the PAD space: the low-valence, low-dominance, high-arousal corner.

The same occurs with the calm and joy clips which are placed at spots of high-valence, medium-high dominance, and medium-low arousal, with the joy clip being slightly above in terms of arousal.

In view of the previous results, and to observe the correlation between volunteers when classifying all the clips, the correlation coefficient was computed considering all the continuous labels reported. As a result, a high positive relationship is obtained between all the volunteers, except for V002 and V005, who barely correlate with the rest (Figure 9). These results allow us to check that the emotions elicited are not only close to the original target (at least in the quadrant) but also consistent across volunteers.
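The inter-volunteer comparison can be illustrated with a minimal sketch. It assumes a Pearson coefficient (the text does not state which correlation coefficient was used) and that each volunteer is represented by one vector of continuous labels, here valence, arousal and dominance for the four clips. The volunteer identifiers and label values are hypothetical and are not data from the study.

```python
import math

def pearson(x, y):
    """Pearson correlation between two equally long label vectors."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    norm_x = math.sqrt(sum((a - mean_x) ** 2 for a in x))
    norm_y = math.sqrt(sum((b - mean_y) ** 2 for b in y))
    return cov / (norm_x * norm_y)

def correlation_matrix(labels):
    """Pairwise correlation between all volunteers (dict of dicts)."""
    ids = sorted(labels)
    return {a: {b: pearson(labels[a], labels[b]) for b in ids} for a in ids}

# Hypothetical continuous labels: (valence, arousal, dominance) for V1..V4.
labels = {
    "vA": [8, 3, 7, 2, 8, 2, 8, 5, 7, 1, 9, 2],
    "vB": [7, 2, 6, 3, 7, 3, 9, 6, 6, 2, 8, 1],
    "vC": [5, 5, 5, 5, 6, 4, 4, 5, 5, 5, 4, 6],
}
matrix = correlation_matrix(labels)
print(round(matrix["vA"]["vB"], 2))
```

A volunteer whose row in such a matrix stays close to zero against all others would stand out in the same way as the two volunteers mentioned above.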

**Figure 9.** Correlation matrix between volunteers considering continuous reporting labelling.

#### *4.2. Physiological Variables*

From the physiological variables measured, the authors extracted features from the forearm skin temperature, skin conductance (GSR), finger blood volume pulse (BVP), and respiration. These variables were measured throughout the whole experiment for every volunteer. First, a global analysis of the whole group of volunteers was carried out for every video clip watched and, consequently, for every emotion. Later, the temporal evolution of every physiological variable was also analysed to find patterns of evolution during the visualization of the different emotion-related video clips.

4.2.1. Median and Quartile Distribution of Extracted Features per Video Clip

This analysis has been performed on the measurements from all the volunteers, considering the target labels of emotion, normalizing every volunteer with respect to their own values.

Although Clip 2 (V2) and Clip 4 (V4) have the same fear label, V2 includes gender-based violence and the emotional reactions to it are very different from the reactions to V4, as detailed in the previous section.

The features extracted from the physiological variables are the Inter-Beat Interval (IBI) and Heart Rate Variability (HRV), both extracted from BVP and closely related to the degree of arousal, and the number of phasic peaks and the mean of GSR, which have been identified as the variables that work best for artificial intelligence algorithms in affective computing. These features are computed in 60 s windows.

As can be observed in the box plots (median and quartile distribution) of Figure 10, IBI (a) and HRV (d) are the physiological features that best differentiate fear-related emotions, while the mean (c) and number of peaks (b) of GSR are clearly different for the fear emotion (V4). Gender-based violence (V2) reactions are not even distinguishable from calm or joy in terms of median values.

**Figure 10.** Normalized physiological features per video. (**a**) IBI. (**b**) number of phasic GSR peaks. (**c**) mean of GSR. (**d**) HRV rmssd.

The Kruskal-Wallis one-way ANOVA on the features extracted from the physiological variables revealed some differences in the effect of the different emotions elicited. Table 4 shows the *p*-values for the comparison between videos. Significant values were observed for the comparison between video clip V1 (calm) and video clips V2 and V4, for the mean of GSR. Additionally, there are significant differences between V1 and V4 for the IBI, and between V3 and V4 for the number of GSR peaks.

**Table 4.** *p*-values results from Kruskal-Wallis one-way ANOVA test for physiological data grouped by video clip.


NOTE: Significant codes: '\*\*' 0.001 '\*' 0.01 ' ' 0.05.
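As an illustration of the pairwise comparison between two video clips, the following sketch computes the Kruskal-Wallis H statistic with tie correction for two groups. With two groups the statistic has one degree of freedom, so the chi-square p-value can be written with the complementary error function. The function names and the feature values are hypothetical; the software used by the authors is not stated in the text.

```python
import math

def average_ranks(values):
    """Rank values from 1 to n, giving tied values the mean of their ranks."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        shared_rank = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[order[k]] = shared_rank
        i = j + 1
    return ranks

def kruskal_wallis_pair(group_a, group_b):
    """Return (H, p) for two groups; p uses the chi-square law with df = 1."""
    pooled = list(group_a) + list(group_b)
    n = len(pooled)
    ranks = average_ranks(pooled)
    sum_a = sum(ranks[:len(group_a)])
    sum_b = sum(ranks[len(group_a):])
    h = 12 / (n * (n + 1)) * (sum_a ** 2 / len(group_a)
                              + sum_b ** 2 / len(group_b)) - 3 * (n + 1)
    # Tie correction: divide by 1 - sum(t^3 - t) / (n^3 - n).
    counts = {}
    for value in pooled:
        counts[value] = counts.get(value, 0) + 1
    correction = 1 - sum(c ** 3 - c for c in counts.values()) / (n ** 3 - n)
    if correction == 0:  # all values identical: no evidence of a difference
        return 0.0, 1.0
    h = max(h / correction, 0.0)
    return h, math.erfc(math.sqrt(h / 2))

# Hypothetical normalized GSR means for two clips (not data from the study).
calm = [0.1, 0.2, 0.3, 0.4, 0.5]
fear = [0.6, 0.7, 0.8, 0.9, 1.0]
h_stat, p_value = kruskal_wallis_pair(calm, fear)
print(round(h_stat, 3), p_value < 0.01)
```

Comparing more than two groups at once would require the chi-square distribution with more degrees of freedom, which this sketch deliberately leaves out.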

#### 4.2.2. Temporal Evolution of Physiological Variables

Temporal evolution analysis provides information about the evolution of the emotional state during the video. It should be noted that videos are labelled according to the prevailing emotion, but the same video could elicit more than one emotion, and the intensity could be non-homogeneous. This is a limitation of this type of experiment where continuous labelling is not possible. The result is dispersion/noise in the data, hindering their classification and modelling. Figure 11 shows the mean evolution of the four features used in the previous section.

**Figure 11.** Temporal evolution of normalized features. (**a**) IBI. (**b**) Number of phasic GSR peaks. (**c**) Mean GSR. (**d**) HRV.

The four videos present a high variation of the selected features, especially V4. These variations correlate with scenes in the videos. Figures 12 and 13 provide details on the scenes of the two fear-related videos, V2 and V4. As can be seen, the most intense period of stress-fear in V2 is between seconds 32 and 58, when the boy is trying to open the bathroom door. In Figure 11, the features extracted from the physiological variables behave very differently in this period and, in some cases, this behaviour is maintained until the end of the video due to the empathizing effect with the escaping mother and boy. The climax is maintained until they discover that the aggressor is not in the lift, at second 90.

**Figure 12.** V02 main stressful events. "Refugiado" Diego Lerma 2014. Available at [62].

**Figure 13.** V04 main stressful events. "Chamber of horrors" Inside 360 VR Prod 2018. Available at [62].

With regard to V4, all the scenes are stressful, but the peak instants are when the lights go off (seconds 38 and 88) and when there are screams or sudden hits/blows (seconds 12, 22, 63, and 105). The worst moment is when two faceless people appear suddenly in front of the viewer, with loud music and screams (second 105); all features show a change of behaviour around this final scare, which has been building up since second 63.

#### *4.3. Catecholamine Concentration*

The concentrations of the catecholamines adrenaline, dopamine and noradrenaline were measured with the HPLC technique, as detailed in Section 3. Table 5 details the concentration values of these catecholamines per volunteer. A global analysis of these values was performed to determine the relationship between the emotional reaction and these concentrations. First, the box plots of mean and quartile for every video clip were obtained (Figure 14). Second, to analyse the temporal evolution of these concentrations, temporal graphs were plotted (Figures 15 and 16).

**Figure 14.** Normalized concentrations for dopamine, adrenaline and nor-adrenaline (**a**–**c**) for every video clip.


**Table 5.** Plasma catecholamine concentration levels for every volunteer for every sample (pg/mL), for adrenaline (A), dopamine (DA) and noradrenaline (NA).

**Figure 15.** Temporal evolution of normalized concentrations for dopamine, adrenaline and noradrenaline (**a**–**c**) for every video clip, mean for all volunteers.

4.3.1. Catecholamine Concentration and Quartile Distribution

Data was collected per video clip, normalized per volunteer, and mean values were calculated for all the volunteers.

The obtained values do not show differences in catecholamine concentrations for different emotion-related video clips, especially for adrenaline and dopamine. Furthermore, for these catecholamines (A and DA), the gender-based violence fear video clip (V2) presents very dispersed values, while the fear video clip (V4) provides higher dispersion just for dopamine, Figure 14.

The Kruskal-Wallis one-way ANOVA of the plasma concentration levels did not reveal a clear difference between the effects of the different emotions elicited for the three catecholamines measured. Table 6 shows the *p*-values for the comparison between the videos. No significant values were obtained for any pair compared.

**Table 6.** *p*-values results from Kruskal-Wallis one-way ANOVA test for catecholamine concentration data grouped by video clip.


**Figure 16.** Temporal evolution of normalized catecholamine concentration for video clips 2, 3 and 4 for DA (**a**), A (**b**) and NA (**c**).

#### 4.3.2. Temporal Evolution of Catecholamines after Video Clip Watching

Figure 15 shows the temporal evolution of dopamine (a), adrenaline (b) and noradrenaline (c) for video clips V2, V3 and V4, related to fear (gender-based violence related), joy and fear, respectively. The graphs represent the concentration of catecholamines, per sample (five per video per volunteer), as well as the mean value (continuous line) and the mean plus/minus standard deviation (dashed lines) for all the volunteers. Catecholamine concentration values have been normalized with respect to the mean value of every volunteer. For the sake of clarity, and for comparison with respect to the behaviour of physiological variables, in Figure 15 the temporal evolution of the mean value (for all volunteers) has been plotted for the three catecholamines. Dopamine concentrations show a slightly different evolution after watching the video clips related to fear with gender-based violence than in those related to joy or fear, where a final drop can be appreciated, (Figure 15a). Adrenaline concentration shows a continuous rising tendency for the fear-related clip (V4) while for joy (V3), a stabilization is observed in the final samples (Figure 15b). In the gender-based violence clip (V2), the stressful/relieving situation may provoke a rise and a drop in the adrenaline's concentration. Finally, in the noradrenaline's concentration (Figure 15c), a similar evolution can be observed in V2 and V3 (fear with gender-based violence and joy) with a final drop in the normalized value, while V4 (intense fear) is not presenting the final drop, since the stressful situation continues to get even more stressful until the end of the clip.

#### *4.4. Artificial Intelligent Algorithms*

Considering our goal, which is to study the improvement that catecholamines measurements can bring to our fear/not-fear detection model and compare the results with physiological models, the data were normalized, reorganized, and grouped by clip for both data types to generate supervised techniques and evaluate performance metrics individually and together.

In this work the standardization selected is a modified version of self-dependent z-score; it consists of subtracting the mean value and dividing by the standard deviation of the complete experiment for each volunteer independently.
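A minimal sketch of this per-volunteer standardization is given below. It assumes that the values of one feature over the complete experiment are grouped by volunteer and that the population standard deviation is used (the text does not specify population versus sample); the identifiers and values are hypothetical.

```python
from statistics import mean, pstdev

def zscore_per_volunteer(values_by_volunteer):
    """Standardize each volunteer's values with their own mean and deviation."""
    normalized = {}
    for volunteer, values in values_by_volunteer.items():
        mu = mean(values)
        sigma = pstdev(values)  # assumption: population standard deviation
        if sigma == 0:
            # A constant signal carries no variation; map it to zeros.
            normalized[volunteer] = [0.0 for _ in values]
        else:
            normalized[volunteer] = [(v - mu) / sigma for v in values]
    return normalized

# Hypothetical feature values over the whole experiment for two volunteers.
gsr_mean = {"v01": [2.0, 4.0, 6.0], "v02": [10.0, 10.0, 10.0]}
normalized = zscore_per_volunteer(gsr_mean)
print(normalized["v01"])
```

Because each volunteer is scaled independently, differences in baseline level between volunteers are removed before the data are pooled for the classifiers.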

The algorithms tested to classify the data were support vector machine (SVM), k-nearest neighbour (KNN), and ensemble (ENS). This selection was based on the target application, a wearable device with memory and computation power constraints. In addition, these methods are the most common ones used in the literature [44].

Each model's hyper-parameters were tuned using Bayesian optimization to minimize the misclassification rate over iterations, supported by a 5-fold cross-validation strategy. Specifically, the selected technique is a sequential model-based optimization, which has shown substantial improvements over combinational space approaches [64]. This training and validation scheme was based on previous works and results in [7]. The performance values presented are the mean validation results of 10 iterations. No testing was carried out due to the lack of data.

Table 7 shows the characteristics of the different models used to generate classifiers regarding the information source, number of features, and windowing. A detailed explanation is provided in the next subsections. Videos V02, V03, and V04 were considered in all cases.

The metric selected to evaluate the classifiers' performance is the geometric mean (Gmean) of Sensitivity (true positive rate, TPR) and Specificity (true negative rate, TNR), according to Equation (1). The TPR is the ratio between true positives (TP) and the sum of true positives and false negatives (FN). The TNR is the ratio between true negatives (TN) and the sum of true negatives and false positives (FP).

$$\text{Gmean} = \sqrt{(\text{Sensitivity} \ast \text{Specificity})} = \sqrt{\left(\frac{TP}{TP + FN}\right) \ast \left(\frac{TN}{TN + FP}\right)}\tag{1}$$
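Equation (1) can be computed directly from the confusion-matrix counts, as in the following sketch. The counts used in the example are hypothetical and are not taken from the tables of this work.

```python
import math

def gmean(tp, fn, tn, fp):
    """Geometric mean of sensitivity (TPR) and specificity (TNR)."""
    sensitivity = tp / (tp + fn) if (tp + fn) else 0.0
    specificity = tn / (tn + fp) if (tn + fp) else 0.0
    return math.sqrt(sensitivity * specificity)

# Hypothetical confusion-matrix counts for a fear / no-fear classifier.
balanced = gmean(tp=5, fn=1, tn=6, fp=0)
# A classifier that always answers "no fear" has TPR = 0, hence Gmean = 0.
always_negative = gmean(tp=0, fn=6, tn=6, fp=0)
print(round(balanced, 3), always_negative)
```

The second case shows why the metric suits imbalanced data: a model that ignores the fear class entirely scores zero, regardless of how many non-fear samples it classifies correctly.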


**Table 7.** Characteristics of each configuration.

#### 4.4.1. Physiological Supervised Models

The classification of physiological data with supervised machine learning techniques is a common approach in affective computing due to the complex relationships involved. The models presented in this work are user-independent because there is not enough data for user-dependent solutions.

Two configurations were tested with the same number of features but with a different window size and overlapping. The features used are 22 for BVP, 7 for GSR, 6 for SKT, and 12 for respiration. The segmentation and windowing were applied following two strategies. The first configuration used a single 60 s window per video clip, aiming to reduce data dispersion in the video. The second one used five windows per video, each of 20 s with 10 s overlap. This strategy helped algorithm training by providing more data and more temporal resolution; however, it could also lead to information redundancy.

The results in Table 8 show that it is generally possible to classify the data between fear and no fear (Gmean above 0.5). The best performance was achieved by ENS (Adaboost) with the first model.
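The two windowing strategies can be sketched as follows. The sketch assumes that both configurations cover the same 60 s segment of a clip, which is consistent with five 20 s windows overlapping by 10 s but is not stated explicitly; the function name is hypothetical.

```python
def window_bounds(duration_s, window_s, step_s):
    """Return (start, end) pairs in seconds for windows that fit the segment."""
    bounds = []
    start = 0
    while start + window_s <= duration_s:
        bounds.append((start, start + window_s))
        start += step_s
    return bounds

# Configuration 1: a single 60 s window per clip.
config_1 = window_bounds(duration_s=60, window_s=60, step_s=60)
# Configuration 2: 20 s windows with 10 s overlap, i.e., a 10 s step.
config_2 = window_bounds(duration_s=60, window_s=20, step_s=10)
print(config_1)
print(config_2)
```

Each interior second of the segment falls into two windows in the second configuration, which is the source of the redundancy mentioned above.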


**Table 8.** Performance metrics for physiological configurations.

4.4.2. Catecholamines Supervised Models

As in the physiological section, three algorithms KNN, SVM, and ENS (RandomForest) were applied (Table 9).

**Table 9.** Performance metrics for catecholamines models.


Firstly, each observation was associated with a clip and each feature to a sample of that clip, resulting in a data matrix of 63 rows (21 volunteers × 3 clips) and 15 columns (5 samples per clip × 3 catecholamines).

After achieving in almost all cases overfitted models or poor-quality metrics, a transformation of the data was applied to compute the maximum in-video variations, considering the sign positive if this variation was increasing (minimum previous maximum) or negative if it was decreasing (maximum previous minimum). This variable was obtained and then normalized for each catecholamine, resulting in a data matrix of 63 rows (21 volunteers × 3 clips) and 3 columns (1 maximum variation per clip × 3 catecholamines).
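The signed maximum in-video variation can be sketched as below for the five samples taken after one clip. The handling of ties (the first occurrence of the minimum and of the maximum is used) is an assumption, as are the function name and the concentration values; the subsequent normalization per catecholamine is not shown.

```python
def signed_max_variation(samples):
    """Maximum minus minimum, positive if the minimum precedes the maximum."""
    lowest, highest = min(samples), max(samples)
    index_min = samples.index(lowest)    # assumption: first occurrence on ties
    index_max = samples.index(highest)
    magnitude = highest - lowest
    return magnitude if index_min < index_max else -magnitude

# Hypothetical concentrations (pg/mL) for the five samples after one clip.
adrenaline = [30, 25, 40, 38, 35]      # rises from 25 to 40: positive sign
dopamine = [50, 45, 40, 42, 41]        # falls from 50 to 40: negative sign
noradrenaline = [200, 210, 205, 190, 195]

# One row of the reduced data matrix: one value per catecholamine.
row = [signed_max_variation(s) for s in (adrenaline, dopamine, noradrenaline)]
print(row)
```

Stacking one such row per volunteer and clip yields the 63 × 3 matrix described above.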

As in the previous models, and mainly due to the lack of enough data and an imbalanced configuration, the resulting models were overfitted and the performance worsened (Gmean values between 0.33 and 0.44), showing that the model would work randomly, like flipping a coin.

#### 4.4.3. Fusion Models

The data fusion followed two strategies based on the physiological configurations. The first merged the variation in plasma catecholamine concentration levels per video clip, as explained previously (Model 5), with the physiological variables in a single 60 s window. The second one used the plasma catecholamine concentration levels directly, five samples per video clip, with each sample paired with a 20 s physiological window.

Table 10 shows the performance metrics obtained with the fusion models. The results were slightly worse than physiological models alone, i.e., the model was not learning from this data.


**Table 10.** Performance metrics for merged models.

#### **5. Discussion**

The study conducted in this work presents four main results. First, a methodology and protocol have been defined to connect the elicitation of human emotions with the variation of plasma catecholamine concentration. An immersive virtual reality environment has been arranged to provoke realistic situations where the volunteer could have intense emotional reactions. A continuous monitoring of physiological variables, with a research toolkit system (for the sake of comparison with other affective computing research works), is connected with the virtual environment, as well as a labelling procedure for discrete emotions and continuous PAD affective space dimensions. These three elements have been presented in previous works by the authors [65]. The novelty added to this method is to determine whether a person's emotions can be reliably recorded, assessing the differences or similarities between recording different physiological variables and measuring plasma catecholamine levels. The blood extraction must be performed after the video clip visualization to not interfere in the emotion elicitation but as soon as possible to detect the concentration peaks and valleys due to the emotion processed in the brain, which provokes a change in plasma catecholamine concentration. A pattern in the concentration variation has been looked for, as well as different classifiers, typical in affective computing, to determine the feasibility of using catecholamines for detecting fear emotions in a person.

Second, the emotion labels obtained during the study guaranteed the elicitation of the target emotions. The video clips selected were those with the best scores in terms of unanimity, in discrete and continuous emotion classifications, from the UC3M4Safety database [62]. The video clips' durations were between 60 s and 119 s. The 21 volunteers labelled the emotion felt during the video clip visualization very closely to the target emotion, especially for video clips V04 (fear) and V01 (calm), while for the other clips at least the PA quadrant is maintained (Figure 8). Every video clip provoked the target emotions and, except for two volunteers, every volunteer's labelling matched that of the rest (Figure 9). Therefore, the variation in the measures of physiological variables and plasma catecholamine concentration per video clip, whatever they were, can be associated with a specific emotion.

Third, the physiological variables measured during the study, and the features extracted from them (IBI, number of GSR peaks, GSR mean and HRV), behave similarly to previous works [7,65]. Statistically significant differences between the fear-related video clip V04 and the joy and calm clips (V03 and V01) were found for the GSR mean, as well as between V01 (calm), V02 (fear related to gender-based violence) and V04 (fear) for IBI. The classifiers applied to generate an artificial intelligence algorithm to detect fear emotional reactions present good results for windows of 20 s and 60 s, although the best results were obtained for the wider windows and the ENS model, with a True Negative Rate of 1 and a True Positive Rate of 0.83 (Table 8).

It should be noted that the amount of data compiled during the experiment was large due to the sampling frequency (200 Hz), which makes the training and testing processes for affective computing tasks easier.

Finally, the plasma catecholamine concentration measurements provided data with apparently no connection with the emotion elicited. The ANOVA analysis provided no significant differences between the levels of catecholamines in blood plasma after visualizing the video clips of the different emotions. Besides, the clustering analysis (fear/no-fear emotions) on the data obtained from the 21 volunteers did not produce a valid result. Moreover, the classifiers selected as artificial intelligence algorithms to detect fear emotional reactions present poor-quality metrics, mainly due to the lack of enough data for training, testing and generalizing.

This problem of insufficient data on plasma catecholamine concentration (only five samples per video, i.e., per emotion) is difficult to solve. Even in an experimental study, research ethics advise against making volunteers suffer unnecessarily. Sixteen blood samples per session per volunteer, even when taken through an intravenous line, while visualizing emotionally intense video clips within a virtual reality environment, are a fairly good number to test the hypothesis of the research work. To our knowledge, there is no similar study in the literature, with most of the proposals being theoretical hypotheses and/or based on analysing previous experimental results obtained for other purposes.

However, the data obtained should have provided some patterns of response to the different target emotions. Although a similar behaviour can be observed in the temporal evolution of the adrenaline and noradrenaline concentration levels after both fear-related clips (V02 and V04), no statistically significant relations were found, and the affective computing classifiers did not provide good results.

It is true that plasma catecholamine levels are altered by the effect of some foods, drinks, and medicines or drugs, as well as by strong physical exercise and/or recent intense stressful episodes. Amines found in banana, avocado, walnuts, beans, cheese, beer and red wine can modify the concentration of these hormones in the blood. Additionally, foods/drinks with cocoa, coffee, tea, chocolate, liquorice, or vanilla, as well as drugs (nicotine, cocaine and ethanol) and medicines (aspirin, tricyclic antidepressants, tetracycline, theophylline, blood pressure control agents, and nitro-glycerine) have similar effects.

Besides, the emotional response is altered by prior experiences during a lifetime, and so are the emotional response to stress and the conditioned response to fear. Traumatic stress-induced fear memories may affect the physiological response and plasma catecholamine levels. There is strong evidence that central catecholamines are involved in the regulation of fear memory through activation of the sympathetic nervous system, and elevated basal catecholamine levels are common in patients suffering from post-traumatic stress disorder (PTSD).

In the study presented, attention is paid to the activity of the volunteers before the experiment, as well as the different substances taken and, also, previous traumatic stressful experiences.

Although the volunteers had been informed of the recommendations beforehand, they reported the following data. With regard to medicines as regular treatment, six volunteers reported five chronic diseases: diabetes mellitus (1), hypertension (2), cardiac failure (1), ulcerative colitis (1), anaemia (1), and chronic pain (1). Additionally, one volunteer was taking contraceptives. On the other hand, four volunteers had been taking ibuprofen or another type of anti-inflammatory drug during the two days prior to the experiment. With respect to avoiding stimulants in food, drinks and drugs in the 24 h prior to the experiment, 13 volunteers had coffee or tea in that period of time, and one volunteer drank alcohol. Additionally, three of them ate citrus fruits in that period.

Only four volunteers (v06, v11, v13, v19) complied exactly with the recommendations with regard to avoiding stimulant foods, drinks and drugs, and did not take any medication. They were young women aged 23, 30, 29, and 23, respectively. Likewise, three volunteers (v01, v04, and v17) only had a coffee, complying with the rest of the recommendations, and did not take any medication either. Their ages were 21, 55, and 24, respectively. Seven volunteers (v02, v05, v09, v12, v14, v15, and v20) only had a coffee and took medication, and did not present differences in the levels of catecholamine concentrations. In summary, we can consider that 14 volunteers were fully compliant and 7 could have some objection with respect to regular catecholamine activity.

Regarding prior stressful experiences or specific fears, seven volunteers (v01, v03, v04, v12, v15, v16, and v20) reported previous traumas that are activated in situations like those in video clips V02 and V04. Two of them identified as gender-based violence victims. However, the evolution of their plasma catecholamine concentration levels was not different from that of the other volunteers (Figures 15 and 16).

Apart from the extrinsic and intrinsic factors that may be affecting the results of the study, the authors wish to highlight the low concentration levels of these catecholamines in blood plasma. We also tested the ELISA technique, which produced worse results in terms of sensitivity to these catecholamines. Nine women volunteers followed a similar experimental study, and 15 blood samples per volunteer were analysed with ELISA kits.

With respect to the hypothesis stated in this work, the measurement of the levels of dopamine, noradrenaline and adrenaline concentration in blood plasma is neither providing better classifications nor a more accurate differentiation of fear-emotion reactions in women.

#### **6. Conclusions**

In this work, a methodology and a protocol have been proposed to connect the elicitation of human emotions with the variation of plasma catecholamine concentration. To this end, an immersive virtual reality environment has been arranged to provoke realistic situations where the volunteer could have intense emotional reactions. A continuous monitoring of physiological variables, with a research toolkit system (for the sake of comparison with other affective computing research works), was connected to the virtual environment, as well as a labelling procedure for discrete emotions and continuous PAD affective space dimensions.

Using this methodology, an experimental study with 21 volunteers has been conducted, using fear as a target emotion, thus provoking fear and non-fear while measuring physiological variables and extracting blood samples after the visualization of every video stimulus. In this first study, 16 blood samples have been extracted per volunteer: 1 for the basal measure and 5 after each of the three emotion-related video clips (fear (gender-based violence related), joy and fear). These samples have been extracted at 1-min intervals after the visualization of the video clip. Along with the blood samples for catecholamine plasma analysis, physiological variables have been measured during the visualization of the video clips. Skin temperature, galvanic skin response, blood volume pulse, respiration, and trapezoidal electromyogram were the selected variables, measured with a commercial research toolkit.

Additionally, the emotion labelling for every video clip by all the volunteers has been analysed and there is a high degree of agreement in the discrete emotion, which was even better in the PAD affective space dimensions, especially for fear-related video V04. Therefore, we can affirm that the selected video clips are meaningful for the experiment.

The results for the evolution of the features extracted from the physiological variables, as well as an ANOVA statistical analysis, are in accordance with previous works. Differences between features measured during fear-related and during calm and joy-related video clips have been found for the mean of GSR (60 s windows). Additionally, differences have been found between calm-related and fear/gender-based-violence fear-related video clips for the IBI (for heart rate). Furthermore, the temporal evolution of these features has been analysed and correlated with the fear-related video clips, identifying precise moments where the features' behaviour can be associated with the scene development.

We can conclude that there are no significant *p*-values (ANOVA statistical analysis performed) that allow differentiating the emotion elicited using only the evolution of the plasma catecholamine concentration levels as a variable. Additionally, the temporal evolution of these levels has been analysed, not identifying precise patterns for fear-related video clips different from the joy-related video clip.

Finally, artificial intelligence algorithms for fear classification with physiological variables and plasma catecholamine concentration levels (separately and together) have been tested. The best results have been obtained with the features extracted from the physiological variables. Adding the maximum variation of catecholamines during the five minutes after the video clip visualization, or adding the five measurements (1-min interval) of these levels, does not improve the performance of the classifiers.

The small number of samples, together with the low concentration of catecholamines in blood plasma, makes it impossible to use these data in machine learning techniques for fear classification in this experiment.

Finally, we can state that research on this topic should continue considering the following future actions:


means algorithms four clear groups were observed, two of them being a symmetrical representation of the other two. In two of the groups, the third clip contains a negative variation, which is below the other two clips. On the other hand, the other two groups have a peak in the third clip (V3) which is above the values representing the other two videos.

**Author Contributions:** Conceptualisation, L.G.-M., E.R.-P., C.S.d.B.A., M.F.C.-B., G.E.R.-R., R.T.-F., S.L.-O. and C.L.-O.; methodology L.G.-M., E.R.-P., C.S.d.B.A., M.F.C.-B., G.E.R.-R., R.T.-F., S.L.-O. and C.L.-O.; software L.G.-M. and M.F.C.-B.; formal analysis L.G.-M., E.R.-P., C.S.d.B.A., M.F.C.-B., G.E.R.-R., R.T.-F., S.L.-O. and C.L.-O.; data curation, L.G.-M., E.R.-P., M.F.C.-B., G.E.R.-R., S.L.-O. and C.L.-O.; writing—original draft preparation, L.G.-M., E.R.-P., M.F.C.-B., G.E.R.-R., S.L.-O. and C.L.-O.; writing—review and editing, L.G.-M., E.R.-P., C.S.d.B.A., M.F.C.-B., G.E.R.-R., R.T.-F., S.L.-O. and C.L.-O.; visualization, L.G.-M., M.F.C.-B. and C.L.-O.; supervision, E.R.-P., C.S.d.B.A., S.L.-O. and C.L.-O.; project administration, E.R.-P., C.S.d.B.A. and C.L.-O.; funding acquisition, E.R.-P., C.S.d.B.A. and C.L.-O. All authors have read and agreed to the published version of the manuscript.

**Funding:** This research has been supported by the Madrid Government (Comunidad de Madrid, Spain) under the ARTEMISA-UC3M-CM research project (reference 2020/00048/001), the EMPATIA-CM research project (reference Y2018/TCS-5046) and the Multiannual Agreement with UC3M in the line of Excellence of University Professors (EPUC3M26), and in the context of the V PRICIT (Regional Programme of Research and Technological Innovation).

**Institutional Review Board Statement:** Not applicable.

**Informed Consent Statement:** Not applicable.

**Data Availability Statement:** Not applicable.

**Acknowledgments:** The authors acknowledge the technical help given by José Ángel Miranda.

**Conflicts of Interest:** The authors declare no conflict of interest.

#### **References**


## *Article* **Non-Contact Measurement of Empathy Based on Micro-Movement Synchronization**

**Ayoung Cho <sup>1</sup>, Sung Park <sup>2</sup>, Hyunwoo Lee <sup>1</sup> and Mincheol Whang <sup>3,</sup>\***


**\*** Correspondence: whang@smu.ac.kr; Tel.: +82-2-2287-5293

**Abstract:** Tracking consumer empathy is one of the biggest challenges for advertisers. Although numerous studies have shown that consumers' empathy affects purchasing, there are few quantitative and unobtrusive methods for assessing whether the viewer is sharing congruent emotions with the advertisement. This study suggested a non-contact method for measuring empathy by evaluating the synchronization of micro-movements between consumers and people within the media. Thirty participants viewed 24 advertisements classified as either empathy or non-empathy advertisements. For each viewing, we recorded the facial data and subjective empathy scores. We recorded the facial micro-movements, which reflect the ballistocardiography (BCG) motion, through the carotid artery remotely using a camera without any sensory attachment to the participant. Synchronization in cardiovascular measures (e.g., heart rate) is known to indicate higher levels of empathy. We found that through cross-entropy analysis, the more similar the micro-movements between the participant and the person in the advertisement, the higher the participant's empathy scores for the advertisement. The study suggests that non-contact BCG methods can be utilized in cases where sensor attachment is ineffective (e.g., measuring empathy between the viewer and the media content) and can be a complementary method to subjective empathy scales.

**Keywords:** video content empathy; micro-movement synchronization; non-contact empathy measurement; empathic advertisement

#### **1. Introduction**

Empathy, a crucial factor in successful digital content marketing [1], is generally conceptualized as a multidimensional construct that includes both cognitive and affective responses to others in dyadic interactions [2–4]. However, empathy for digital content involves the emotional engagement of a viewer with a character in a causal and probable narrative [5]. For example, eliciting a consumer's emotions congruent to content emotions may maximize an advertisement's effect. Viewers empathizing with content tend to better understand the story and have more positive attitudes. They are more attentive and engaged [6–8], feel favorably toward products and brands [9,10], and are less likely to skip an advertisement [11,12]. Moreover, heightened empathy promotes the consumption of content in addition to attitudinal acceptance [13,14]. Such behavioral acceptance implies that viewer empathy is a critical predictor of the success of media content.

Empathy has been measured to predict the success of commercials. Escalas and Stern developed a battery of scale items to measure empathy toward advertisements, which has been widely used in consumer research [15]. Other prominent subjective measures include Schlinger's Viewer Response Profile [16–18], the Balanced Emotional Empathy Scale [19], the Empathy Quotient [20], the Toronto Empathy Questionnaire [21], the Interpersonal Reactivity Index [22,23], the Basic Empathy Scale [24], and the Hogan Empathy Scale [25].

**Citation:** Cho, A.; Park, S.; Lee, H.; Whang, M. Non-Contact Measurement of Empathy Based on Micro-Movement Synchronization. *Sensors* **2021**, *21*, 7818. https://doi.org/10.3390/s21237818

Academic Editor: Mario Munoz-Organero

Received: 14 October 2021 Accepted: 20 November 2021 Published: 24 November 2021

**Publisher's Note:** MDPI stays neutral with regard to jurisdictional claims in published maps and institutional affiliations.

**Copyright:** © 2021 by the authors. Licensee MDPI, Basel, Switzerland. This article is an open access article distributed under the terms and conditions of the Creative Commons Attribution (CC BY) license (https://creativecommons.org/licenses/by/4.0/).

However, such subjective evaluations cannot measure the dynamics of empathy over time. Empathic questionnaires are limited to assessing dispositional empathy, which refers to an individual's capability (i.e., personality trait) to empathize with others.

The dynamics of empathy when consuming digital content require a novel measurement that can capture the fluctuation of emotions over time. The ever-changing interplay between the viewer's emotions and the content emotions demands a more direct, sensitive, and real-time measurement, such as physiological measures, to properly assess the degree of empathy. The unconscious level of empathy that is not verbally reportable (i.e., subjective evaluation) can be acquired through more direct physiological measures.

#### *1.1. Psychophysiological Basis of Empathy*

Empathy includes motor mimicry and emotional contagion associated with autonomically activated neural mechanisms of the other's feelings [26–29]. It also includes mirroring responses between people, in which explicit and implicit physiology become synchronized [30–32]. Explicit responses from empathy involve the synchronization of faces, gestures, and body movements. Changes in body motion synchronization are associated with the degree of empathy during face-to-face communication [33,34]. Greater synchronization of head motion was observed when a listener empathized with a speaker in a lecture [35]. In addition, body synchronization was reported between counselors and clients when they shared empathy [36,37].

Such observable synchronized behavior is a result of an implicit empathic response. The implicit process constitutes the synchronization of physiological activities between individuals [38], which can be measured through electroencephalography (EEG) [39,40], electrocardiography (ECG) [41–44], and skin conductance [45,46]. For example, the synchronization of electrodermal activities (i.e., skin response) between a therapist and a patient correlates to the patient's perceived empathy toward the therapist [47,48].

Neuroscientific bases have been identified for the synchronization of brain activity among participants during empathic communication [49–51]. Empathy research using EEG has mainly focused on understanding the sharing of painful experiences. Several asymmetries or activations in the pain-related brain areas have been reported, which were elicited by empathy. The left frontal asymmetry has been related to the suffering of the other, and the right frontal asymmetry has been associated with the pain and sorrow of the other [52]. Moreover, empathy-related activation in fronto-insula and anterior cingulate cortices was reported, which have been related to pain [53]. Peng et al. have shown that brain-to-brain synchronization could be triggered by sharing painful experiences and could strengthen social bonds [54].

#### *1.2. Cardiovascular Measures of Empathy*

Measures of cardiovascular activity reflect both attentional and affective states [55]. Cardiovascular measures can be obtained using a piezoelectric transducer, ECG, or analysis of facial micro-movements. Cardiovascular activity in empathy research has been understudied compared to other physiological measures [56], but recent advances in vision technology have shed light on novel and innovative methodologies such as remote ballistocardiography (rBCG).

Kodama et al. [57] examined a psychotherapy session between a counselor and a client and found synchronization in heart rate, suggesting a promising indicator that leads to the building of rapport and empathy. Salminen et al. [58] found that higher synchrony in respiration rates, which has a positive relationship with heart rate, is associated with higher empathy. The synchronization of the heart rate can also enhance closeness [59] and intimacy [60].

However, the measurement of synchronization between the cardiovascular activities of viewers and people in media content has been less studied, mainly due to multiple technical issues.

First, viewers need a sensor attachment to capture physiological measurements, which is a significant barrier to general adoption. Second, to evaluate empathy, measuring dyadic synchronization is paramount. The cardiovascular information of both the viewer and the person in the media content must be obtained and analyzed. Obviously, acquiring the latter is impossible with sensor attachment because it is digital content.

However, advances in vision technology for cardiac measurements, such as remote photoplethysmography (rPPG) and rBCG, suggest promising methods for overcoming these challenges. The rPPG evolved to detect changes in blood volume remotely without direct contact between the photosensor (i.e., PPG) and the skin [61]. Non-contact data acquisition is possible through various means, including infrared [62], thermal [63], and RGB [64] cameras. The rPPG uses band-pass filters to eliminate motion components in images [65] but has less effect on cardiovascular activities that include the motion itself, referred to as ballistocardiography motion [66]. The rBCG is a measurement of ballistocardiographic head movements through remote means using a camera and vision-based analysis. These vision technologies have improved considerably in recent years, enabling the estimation of the heartbeat signals of both the viewer and the person in the digital content without needing skin contact.

Specifically, BCG motion causes microscopic vibration (i.e., micro-movement), which appears in the face through the carotid artery [67]. Micro-movement implies the subtle movement of a face that the human eye cannot easily see. This is caused by regular vibrations from the heart that are transmitted to the face. Micro-movement can be obtained by filtering the frequency corresponding to the regular heart rate band from the frontal facial video capture [68–71]. Analyzing the similarity of micro-movement between viewers and digital content (e.g., advertisements) may provide insights into whether the viewer is empathizing with the content. We intended to analyze the similarity of micro-movements through cross-entropy analysis and compare it to the participants' subjective empathy through a questionnaire. To our knowledge, no study has investigated the relationship of micro-movements through an rBCG method for a participant and a person in real-world media content, such as an advertisement.

#### **2. Materials and Methods**

*2.1. Research Hypothesis*

This study sought to verify the following hypothesis:

**Hypothesis 1 (H1).** *The more similar the micro-movements between the participant and the person in the advertisement, the higher the participant's empathy scores for the advertisement.*

The following section explains our operational definition of micro-movement signals, how the signals were measured from the participant and the advertisement, and how the participant's subjective assessment of empathy was acquired.

#### *2.2. Experimental Design*

The main experiment was a one-factor design (empathy factor) with two levels (empathy and non-empathy). Each participant viewed two empathy conditions (i.e., within-subject design), manifested in an empathy or non-empathy advertisement, and responded to an empathy questionnaire. The design of the stimuli (i.e., advertisement) and the questionnaire are explained in Section 2.3.

The dependent measurements involved the similarity of micro-movements between the participant and the stimulus, specifically, the similarity between the micro-movement signals extracted from the participant and those from the person in the advertisement. Cross-entropy was used as a similarity metric. Cross-entropy is suitable for the comparison of periodic distributions. The more similar the two distributions, the closer the cross-entropy is to zero [72]. This study extracted the micro-movement signals by filtering the power spectrum between 0.75 Hz and 2.5 Hz, corresponding to 45–150 bpm when static. However, this filtering range may vary according to the context, situation, and use cases. The details of the analysis are explained in Section 3.
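The article does not spell out the formula used for the similarity metric, so the following is only a minimal sketch, assuming that each micro-movement power spectrum is normalized to sum to one and compared bin by bin. The function names and the example spectra are hypothetical. Note that the plain cross-entropy H(p, q) reaches its minimum H(p) when the two distributions coincide; the quantity that reaches exactly zero is the relative entropy H(p, q) − H(p), which is also shown.

```python
import math

def normalize(power):
    """Scale a non-negative power spectrum so that it sums to 1."""
    total = sum(power)
    if total <= 0:
        raise ValueError("power spectrum must have positive total power")
    return [p / total for p in power]

def cross_entropy(p, q, eps=1e-12):
    """H(p, q) = -sum_i p_i * log(q_i), in nats; eps guards against log(0)."""
    return -sum(pi * math.log(qi + eps) for pi, qi in zip(p, q))

def relative_entropy(p, q, eps=1e-12):
    """H(p, q) - H(p, p): zero when the two distributions are identical."""
    return cross_entropy(p, q, eps) - cross_entropy(p, p, eps)

# Hypothetical spectra sampled on the same frequency bins (0.75-2.5 Hz band).
viewer = normalize([1.0, 4.0, 2.0, 1.0])
similar_ad = normalize([1.0, 3.5, 2.5, 1.0])
different_ad = normalize([4.0, 1.0, 1.0, 2.0])

print(relative_entropy(viewer, similar_ad))    # small value
print(relative_entropy(viewer, different_ad))  # larger value
```

Under these assumptions, a lower value means that the viewer's and the advertisement character's heartbeat-band spectra are more alike.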

#### *2.3. Participants*

Thirty participants (15 males and 15 females) voluntarily participated in the experiment. The mean age of participants was 22 (±2) years. None of the participants had a medical history of cardiovascular disease. The participants had an uncorrected or corrected visual acuity of 0.6 or better and were able to wear soft contact lenses but not glasses. Written informed consent was obtained from all the participants prior to the experiment. All participants were compensated for their participation.

Empathy varies with demographic characteristics, such as age [28], race [73], education [74], and gender [75]. Researchers have suggested an inverse-U-shaped pattern as a function of age, with middle-aged adults showing higher empathy than young adults [28]. Meta-analyses of gender differences in empathy support that women have more empathy than men [28,75,76]. One study reported a decline in empathy among undergraduate nursing students as they advanced through training [74]. The empathic neural response is increased for members of the same race, but not for other races [73]. Due to such demographic variance, the most recent (2021) massive survey (*n* = 3486) on the experience of empathy [77] quota sampled to reflect the U.S. population on demographic parameters. However, all empirical lab studies on empathy, including ours, have limitations when generalizing. We balanced the N of gender (15) and confirmed that gender did not have an effect on the dependent measures and ensured that the ethnicity of the participants (i.e., Korean) was consistent with the characters in the video stimuli. However, we acknowledge the limitation for generalizing the findings, such that the results may only apply to younger adults. Further studies are needed to confirm this hypothesis.

#### *2.4. Procedures and Materials*

The experimental procedure is shown in Figure 1. The participants stared at the blank screen for four minutes to stabilize their physiological state. For each stimulus, participants viewed an advertisement video and responded to a self-report questionnaire. Each condition (empathy and non-empathy) had 12 stimuli, so participants viewed 24 advertisements in total. The stimuli were presented in random order.

**Figure 1.** Experimental procedure.

Participants' frontal views, which were necessary for extracting the micro-movement signals, were recorded at 30 fps, 1920 × 1080 pixels, using a web camera installed on the monitor while they viewed the stimuli, as shown in Figure 2.

#### 2.4.1. Video Stimuli (Advertisements)

Marketing researchers have explored empathy as a construct for estimating advertising effects. Escalas and Stern suggested that well-developed stories elicit higher levels of empathy than poorly developed ones [15]. Classical drama advertisements that have clear causality have been better able to hook viewers into commercials than vignettes. Emotionally driven advertisements have a positive impact on consumers' engagement and empathy [8,78,79]. In short, advertisements that elicit viewers' empathy tend to provide a clear context behind the story, in addition to an emotional appeal [14,79,80]. As a result, we chose three criteria for selecting the video stimuli: (1) causality of the storyline, (2) advertising appeal type, and (3) the degree of empathy.

**Figure 2.** Experimental environment.

Nine emotion researchers viewed and assessed 50 candidate advertisements. The candidates were limited to those targeting the younger generation in their 20s and 30s, consistent with the participant pool. For each criterion related to the candidate, the researchers responded from −3 to +3 on a six-point Likert scale. Per criterion 1, researchers scored from −3 (ambiguous causality) to +3 (clear causality) for the story of the advertisement. Per criterion 2, they scored from −3 (rational appeal) to +3 (emotional appeal) for the advertising appeal type. Finally, according to criterion 3, they scored from −3 (not empathetic) to +3 (empathetic).

We classified the candidates into empathy advertisements if the average score for the evaluators was above zero for all three criteria. Conversely, we classified them into non-empathy advertisements if the score was below zero. For each advertisement group (empathy and non-empathy), we sorted the advertisements into four product advertisements (energy boosters, snacks, computer peripheral devices (e.g., printer)) and selected the three best advertisements for each product group. That is, we selected 12 advertisements for each condition (empathy and non-empathy).
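The classification rule can be sketched as follows. This is an illustration only: it assumes that an advertisement counts as non-empathy when the evaluator mean is below zero on all three criteria (the text states this explicitly only for the empathy group), and the function name and the ratings are hypothetical.

```python
from statistics import fmean

def classify_ad(causality, appeal, empathy):
    """Classify one candidate advertisement from the evaluators' ratings.

    Each argument is a list of ratings in [-3, +3], one per evaluator.
    Assumption: "non-empathy" requires a negative mean on all three criteria.
    """
    means = [fmean(causality), fmean(appeal), fmean(empathy)]
    if all(m > 0 for m in means):
        return "empathy"
    if all(m < 0 for m in means):
        return "non-empathy"
    return "unclassified"

# Hypothetical ratings from nine evaluators for one candidate.
print(classify_ad([2, 3, 1, 2, 2, 3, 1, 2, 2],
                  [1, 2, 2, 3, 1, 1, 2, 2, 1],
                  [2, 2, 3, 1, 2, 2, 1, 3, 2]))
```

Candidates with mixed signs across the criteria fall into neither group under this reading.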

Empathy advertisements tend to be longer than non-empathy advertisements because the viewer requires some time for the narrative to "sink in". In contrast, non-empathy advertisements focus on the presentation of prominent models and products. For example, an energy booster's empathy advertisement has a story involving a student exhausted from studying being revitalized after drinking an energy drink. The non-empathy advertisement, however, featured a character dancing with an energy drink and did not have a particular narrative.

#### 2.4.2. Subjective Evaluations

As empathy is a multifaceted construct that includes both cognitive and affective processes, we adopted a comprehensive and empirically validated questionnaire with the participants' ethnicity (i.e., Korean). We used the Consumer Empathic Response to Advertising Scale [81,82], which consists of 11 items, as shown in Table 1. The factor loading exceeded 0.4 and Cronbach's alpha exceeded 0.8. The questionnaire included three empathy factors: cognitive empathy, affective empathy, and identification empathy. The dependent variable for analysis was the sum of all 11 items.

**Table 1.** Questionnaire about Empathy to Video Contents.


All questions were rated on a seven-point Likert scale. We asked for the degree of agreement with each empathy statement, with the lowest scale labeled "strongly disagree" and the highest scale labeled "strongly agree". The survey was collected through a web survey rather than a paper questionnaire.

#### **3. Analysis**

This study aimed to analyze whether the similarity of micro-movement signals between participants and advertisements differs according to the user's perceived empathy (i.e., subjective evaluation) with the advertisement video. The signal processing to filter only the micro-movements caused by the heartbeat is described in detail in Section 3.1. In addition, a method for calculating the cross-entropy, an indicator of similarity between the two signals, is described. Section 3.2 describes the statistical difference in the similarity between the participant and advertisement measured by cross-entropy according to the empathy score.

#### *3.1. Signal Processing*

The micro-movement signals were measured from the participant's facial videos, as shown in Figure 3. Ballistocardiographic changes are reflected to the face and can be measured at a distance, as validated by Balakrishnan [68]. The face was detected from the facial video using the Viola-Jones face detector and was defined as a region of interest (ROI). As the forehead and nose were more robust to facial expressions than other facial regions, the ROI was divided into multiple ROIs by cropping to the middle 60% of the width and top 12% of the height (i.e., forehead region) and the middle 10% of the width and middle 30% of the height (i.e., nose region).
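The cropping percentages above can be turned into pixel coordinates as in the sketch below. It assumes a face bounding box given as `(x, y, width, height)` with the origin at the top-left corner, as face detectors commonly return; the helper name `crop_roi` and the example box are hypothetical.

```python
def crop_roi(face, width_frac, height_frac, vertical="middle"):
    """Return (x, y, w, h) of a sub-region of the face bounding box.

    The sub-region is always centred horizontally. Vertically it is either
    anchored at the top of the face box or centred in it.
    """
    x, y, w, h = face
    roi_w = w * width_frac
    roi_h = h * height_frac
    roi_x = x + (w - roi_w) / 2
    roi_y = y if vertical == "top" else y + (h - roi_h) / 2
    return (roi_x, roi_y, roi_w, roi_h)

# Hypothetical face box detected in a 1920 x 1080 frame.
face = (100, 50, 200, 200)
forehead = crop_roi(face, 0.60, 0.12, vertical="top")   # middle 60% width, top 12% height
nose = crop_roi(face, 0.10, 0.30, vertical="middle")    # middle 10% width, middle 30% height
print(forehead, nose)
```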

Determining the feature point within multiple ROIs was necessary to measure the movements induced by the BCG. Although several studies on remote BCG employed the good-feature-to-track (GFTT) algorithm [83,84], their feature point numbers were not fixed because the algorithm determined the feature points based on the solid edge components. It was difficult to employ the GFTT algorithm in this study because the feature points needed to be re-determined quickly owing to the frequent change of the screen and the face movement.

**Figure 3.** Signal processing of the micro-movements [71]. (**a**) Face detection using Viola-Jones algorithm; (**b**) Area selection using the forehead and nose defined as ROIs; (**c**) Feature extraction using the GFTT algorithm; (**d**) Feature tracking using the KLT tracker; (**e**) Bandpass filtering for signals in 30 s window buffer using the second order Butterworth filter; (**f**) Decomposition of noise using PCA.

Thus, the ROIs of the forehead and nose regions were divided into cells using 16 × 2 and 2 × 8 grids, respectively. This study employed 48 feature points by determining the centroid of each cell as a feature point. The movements were measured by tracking the y-coordinate difference between frames of each feature point using the Kanade-Lucas-Tomasi (KLT) tracker because the BCG movements were generated up and down by the heartbeat [85–87].
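A minimal sketch of the fixed-grid feature points follows, assuming the grids are read as columns × rows (16 × 2 for the forehead and 2 × 8 for the nose) and that each ROI is an `(x, y, width, height)` tuple. The function name and the ROI values are hypothetical. Whatever the orientation of the grids, the count is 32 + 16 = 48 points.

```python
def grid_centroids(roi, cols, rows):
    """Split an ROI into cols x rows equal cells and return each cell centre."""
    x, y, w, h = roi
    cell_w = w / cols
    cell_h = h / rows
    return [(x + (c + 0.5) * cell_w, y + (r + 0.5) * cell_h)
            for r in range(rows)
            for c in range(cols)]

# Hypothetical ROIs in pixel coordinates.
forehead_roi = (140.0, 50.0, 120.0, 24.0)
nose_roi = (190.0, 120.0, 20.0, 60.0)

feature_points = (grid_centroids(forehead_roi, cols=16, rows=2)
                  + grid_centroids(nose_roi, cols=2, rows=8))
print(len(feature_points))
```

Because the points are derived from the ROI geometry alone, they can be recomputed immediately whenever the face box changes, which is the property the text contrasts with edge-based feature selection.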

The movements measured from the face are a combination of facial expressions, voluntary head movements, and micro-movements. Therefore, it is essential to remove motion artifacts due to facial expressions and voluntary head movements from the measured movements. First, the movements were filtered by a second-order Butterworth bandpass filter with cut-off frequencies of 0.75–2.5 Hz, corresponding to 45–150 bpm. Then, the movements were normalized from their mean value (i.e., *μ*) and standard deviation (*σ*) by z-score. If the movements exceeded *μ* ± 2*σ*, they were determined to be noise, due to the subtle movements, and their mean value (i.e., *μ*) was corrected. Finally, principal component analysis (PCA) was performed to estimate the micro-movement from the mixed movements by decomposing the noise from facial expressions and voluntary head movements. This study extracted five components using PCA and then selected one component with the highest peak in their power spectrum converted using a fast Fourier transform. The selected component was finally determined to be micro-movements.
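Two of these steps can be illustrated with the standard library alone. The sketch below is not the authors' implementation: it omits the Butterworth filtering and the PCA itself, it assumes that samples beyond two standard deviations are replaced by the mean (one possible reading of the correction step), and it uses a naive discrete Fourier transform in place of a fast one. All function names and test signals are hypothetical.

```python
import cmath
import math
import statistics

def zscore(signal):
    """Normalize a signal to zero mean and unit (population) standard deviation."""
    mu = statistics.fmean(signal)
    sigma = statistics.pstdev(signal)
    if sigma == 0:
        raise ValueError("signal is constant")
    return [(s - mu) / sigma for s in signal]

def suppress_outliers(z, limit=2.0):
    """Assumed correction: replace z-scores beyond +/-limit by the mean (0)."""
    return [0.0 if abs(v) > limit else v for v in z]

def power_spectrum(signal, fps):
    """Naive DFT: list of (frequency in Hz, power) for non-negative frequencies."""
    n = len(signal)
    out = []
    for k in range(n // 2 + 1):
        coeff = sum(s * cmath.exp(-2j * math.pi * k * i / n)
                    for i, s in enumerate(signal))
        out.append((k * fps / n, abs(coeff) ** 2))
    return out

def select_component(components, fps, low_hz=0.75, high_hz=2.5):
    """Index of the component with the highest spectral peak in the heart band."""
    def peak(sig):
        return max(p for f, p in power_spectrum(sig, fps) if low_hz <= f <= high_hz)
    return max(range(len(components)), key=lambda i: peak(components[i]))

# Hypothetical components sampled at 30 fps for 10 s.
fps = 30
t = [i / fps for i in range(300)]
mixed = [0.5 * math.sin(2 * math.pi * 0.9 * x) + 0.5 * math.sin(2 * math.pi * 2.0 * x) for x in t]
heartbeat_like = [math.sin(2 * math.pi * 1.2 * x) for x in t]   # 1.2 Hz = 72 bpm
print(select_component([mixed, heartbeat_like], fps))
```

The band limits convert to beats per minute by multiplying by 60, so 0.75 Hz and 2.5 Hz correspond to 45 bpm and 150 bpm.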

#### *3.2. Statistical Analysis*

As empathy is an individualized experience, the manner in which each stimulus affects each participant varies. Individual responses are affected by factors such as the individual's empathy capability, predisposed tendency, and past experience (for an extensive review of empathy as a concept, see [88]). The observer's (i.e., the person who empathizes) mood and personality are also important modulating factors [89]. Such individual differences mean that, in our study, the empathy stimuli selected by the emotion experts do not necessarily elicit empathy from the participants. Therefore, we applied an inclusion criterion to the participants' subjective empathy scores to select response sets from certain stimuli for analysis. For the empathy condition, we selected data obtained from stimuli that scored four or higher on average (i.e., the mean of all 30 participants). On the seven-point Likert scale, four was the midpoint, labeled "Neutral". Conversely, for the non-empathy condition, we selected data obtained from stimuli that scored less than four on average. This selection process yielded response sets from four out of the original 12 stimuli in the empathy condition and six out of the original 12 stimuli in the non-empathy condition.
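The inclusion criterion can be sketched as a simple filter. The example assumes that each participant's response to a stimulus is expressed on the 1–7 scale (for instance, as the mean of the questionnaire items), so that it can be compared with the midpoint of four; the stimulus identifiers and scores are hypothetical.

```python
from statistics import fmean

def select_stimuli(scores_by_stimulus, condition, midpoint=4.0):
    """Keep stimuli whose mean score across participants matches the condition.

    scores_by_stimulus maps a stimulus id to a list of per-participant scores.
    Empathy stimuli are kept when the mean is >= midpoint,
    non-empathy stimuli when the mean is < midpoint.
    """
    kept = []
    for stimulus, scores in scores_by_stimulus.items():
        mean_score = fmean(scores)
        if condition == "empathy":
            keep = mean_score >= midpoint
        else:
            keep = mean_score < midpoint
        if keep:
            kept.append(stimulus)
    return kept

# Hypothetical scores from three participants per stimulus.
empathy_ads = {"E01": [5.2, 4.8, 5.5], "E02": [3.9, 3.5, 4.1], "E03": [4.0, 4.0, 4.0]}
non_empathy_ads = {"N01": [3.1, 2.8, 3.5], "N02": [4.4, 4.1, 4.6]}

print(select_stimuli(empathy_ads, "empathy"))
print(select_stimuli(non_empathy_ads, "non-empathy"))
```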

In short, we analyzed 60 samples (30 participants in two empathy conditions) consisting of subjective empathy scores and cross-entropy data. A paired *t*-test was used to test this hypothesis.
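For reference, the paired *t* statistic is the mean of the within-participant differences divided by its standard error, with *n* − 1 degrees of freedom (29 for 30 participants). The sketch below uses hypothetical values, not the study data, and leaves out the *p*-value because that requires the *t* distribution from a statistics library.

```python
import math
from statistics import fmean, stdev

def paired_t(cond_a, cond_b):
    """Return (t, degrees of freedom) for paired samples of equal length."""
    if len(cond_a) != len(cond_b):
        raise ValueError("paired samples must have the same length")
    diffs = [a - b for a, b in zip(cond_a, cond_b)]
    n = len(diffs)
    standard_error = stdev(diffs) / math.sqrt(n)   # stdev uses n - 1
    return fmean(diffs) / standard_error, n - 1

# Hypothetical per-participant scores in two conditions.
non_empathy = [1.0, 2.0, 3.0, 4.0]
empathy = [2.0, 4.0, 5.0, 5.0]
t_value, dof = paired_t(non_empathy, empathy)
print(round(t_value, 3), dof)
```

The sign of *t* depends only on the order in which the two conditions are subtracted.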

#### **4. Results**

The study analyzed differences in the micro-movement similarity between empathy and non-empathy conditions using a *t*-test. The results showed that there was a significant difference in the subjective empathy score between empathy and non-empathy conditions induced by advertisements (*t*(29) = −11.754, *p* < 0.001), as shown in Figure 4. The subjective empathy score was significantly higher when watching empathy advertisements (*μ* = 5.149, *σ* = 0.564) than non-empathy advertisements (*μ* = 3.341, *σ* = 0.759).

**Figure 4.** A comparison of empathy scores for non-empathy and empathy advertisements by paired *t*-test.

There was a significant statistical difference in cross-entropy between empathy and non-empathy advertisements (*t*(29) = 61.019, *p* < 0.001), as shown in Figure 5. As predicted, cross-entropy was significantly lower when watching empathy advertisements (*μ* = 0.00317, *σ* = 0.00005) than non-empathy advertisements (*μ* = 0.00392, *σ* = 0.00005). This supported hypothesis *H*1, which stated that the more similar the micro-movements (i.e., the lower the cross-entropy) between the participant and person in the advertisement, the higher the participant's empathy scores for the advertisement (i.e., empathy advertisements).

**Figure 5.** A comparison of cross-entropy between non-empathy and empathy advertisements by paired *t*-test.

The Pearson correlation indicated that cross-entropy was also significantly associated with the empathy score (*r* = −0.796, *p* < 0.001), indicating an inverse relationship between cross-entropy and the empathy scores. That is, the lower the cross-entropy, the higher the empathy scores.
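The Pearson coefficient itself can be computed as in the following sketch. The paired values are hypothetical and only illustrate how a negative coefficient arises when higher empathy scores go with lower cross-entropy.

```python
import math

def pearson_r(x, y):
    """Pearson product-moment correlation of two equally long sequences."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)

# Hypothetical pairs: (cross-entropy, subjective empathy score).
cross_entropy_values = [0.0040, 0.0039, 0.0033, 0.0031, 0.0032]
empathy_scores = [3.1, 3.6, 4.9, 5.4, 5.0]
print(pearson_r(cross_entropy_values, empathy_scores))
```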

#### **5. Discussion**

In summary, our study invited participants to view advertisements classified as empathy or non-empathy advertisements by experts. During each viewing of the advertisement, we recorded their facial data and obtained their subjective empathy scores after each viewing. We analyzed the cross-entropy between the participant's and the person's facial data and found that it was significantly lower when viewing empathy advertisements than when viewing non-empathy advertisements.

To the best of our knowledge, this is the first study to apply remote BCG methods to understand empathy-based micro-movement synchronization in a real-world use case (i.e., viewing an advertisement). Our research confirmed that the higher the similarity of micro-movement between the participants and the advertisements, the higher the subjective empathy. The results validate the remote BCG methods with the accompanying analysis process (e.g., cross-entropy analysis), suggesting an alternative or complementary method to the subjective empathy scales.

Our findings also provide implications for understanding the empathic interactions of human dyads. In human communication, information is shared through natural language (i.e., explicit channels), whereas empathy is mainly shared through embodied synchrony (i.e., implicit channels). The latter synchronization is widely observed in human communication and is reflected in the harmonization of the heart rhythm. In other words, the heartbeat tends to follow the rhythm of someone who empathizes. Such mutual entrainment has been defined as two interacting nonlinear oscillating systems with different periods becoming a common period [90]. Although challenging, advances in technology enable us to tap into heartbeat traces through the carotid artery, reflected in the facial micro-movement. Our study confirmed that microscopic vibration is a valid indicator of dyadic empathy synchronization in an ecologically valid scenario.

In previous studies that measured empathy based on unconscious physiological responses [41,43], it was also verified that the correlation between the heartbeat patterns of two people was higher in the empathy condition than in the non-empathy condition. They measured heart rate patterns by attaching an ECG sensor to the participant's skin. The task of eliciting empathy was overly simplified, such as facing each other, and only momentary emotions were of concern, resulting in limitations to generalization. Although they can effectively elicit a definite empathic response, the emotion dynamics were not considered.

In addition, there have been few applications measuring empathy for digital content because of the difficulty of overcoming obtrusive measurement and of accounting for the dynamic nature of empathy. This study suggested a practical method for measuring empathy that addresses the drawback of contact-based empathy measurement, which obstructs users' immersion in the content.

The hypothesis of the present study was tested under experimental conditions by manipulating product advertisements. This study acknowledges that there were large differences among the durations of the stimuli, and the stimuli were only focused on product commercials. However, the differences in time duration among stimuli did not affect the similarity because the similarity between the two signals was analyzed in the frequency domain. That is, because the similarity of the periodicity of the two signals was analyzed, the time length of the signal did not have a significant effect. Even if there was an effect, the empathy stimuli, which had a long duration, were difficult to make similar to the non-empathy stimuli, which had a short duration, because they had to vibrate at a similar frequency for a longer period of time.
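The claim above, that comparing periodicity makes the comparison largely independent of signal length, can be illustrated with a minimal sketch. It is not the analysis pipeline used in this study: the sinusoidal test signals, the 10 Hz sample rate, the 1.2 Hz rhythm, and the helper `dominant_frequency` are all assumptions chosen for illustration.

```python
import cmath
import math


def dominant_frequency(signal, sample_rate_hz):
    """Return the frequency (Hz) of the strongest non-DC component of a signal.

    Uses a plain discrete Fourier transform so that the sketch needs only the
    standard library; it is meant for short illustrative signals.
    """
    n = len(signal)
    mean = sum(signal) / n
    centered = [x - mean for x in signal]  # remove the DC offset
    best_bin, best_power = 1, -1.0
    for k in range(1, n // 2 + 1):
        coeff = sum(
            x * cmath.exp(-2j * math.pi * k * i / n)
            for i, x in enumerate(centered)
        )
        power = abs(coeff) ** 2
        if power > best_power:
            best_bin, best_power = k, power
    return best_bin * sample_rate_hz / n


def make_rhythm(freq_hz, duration_s, sample_rate_hz):
    """Hypothetical periodic signal standing in for a micro-movement trace."""
    n = int(duration_s * sample_rate_hz)
    return [math.sin(2 * math.pi * freq_hz * i / sample_rate_hz) for i in range(n)]


SAMPLE_RATE_HZ = 10  # assumed value, for illustration only
short_signal = make_rhythm(1.2, duration_s=10, sample_rate_hz=SAMPLE_RATE_HZ)
long_signal = make_rhythm(1.2, duration_s=40, sample_rate_hz=SAMPLE_RATE_HZ)

short_peak = dominant_frequency(short_signal, SAMPLE_RATE_HZ)
long_peak = dominant_frequency(long_signal, SAMPLE_RATE_HZ)
```

Both signals share the same dominant frequency although one is four times longer than the other; the longer recording only gives a finer frequency resolution.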

This study suggests an application framework for evaluating empathy in interaction (e.g., viewing) with digital content. As our suggested method is non-contact and unobtrusive to real-life behavior (e.g., consuming media), future research agendas seem promising. Specifically, future studies may investigate content in other media domains (e.g., movies, TV shows, video games).

However, we acknowledge that our N is small (30) and that a larger N would be needed to achieve the appropriate power to completely rule out false positives. We therefore conducted a post hoc power analysis with the program G\*Power [91], with power set at 0.8, α = 0.05, and d = 0.5 (two-tailed). The results suggest that an N of approximately 34 would be needed to achieve appropriate statistical power.
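As a rough cross-check of the reported sample size, the sketch below reproduces the calculation under stated assumptions. The text does not name the test family entered into G\*Power, so a one-sample (or paired) two-tailed *t*-test is assumed here. The function `required_sample_size` is hypothetical and uses a normal approximation with a commonly used small-sample correction term, whereas G\*Power works with the noncentral *t* distribution.

```python
import math
from statistics import NormalDist


def required_sample_size(effect_size, alpha=0.05, power=0.8):
    """Approximate N for a two-tailed one-sample or paired t-test.

    Assumption: normal approximation plus the correction term z_alpha^2 / 2.
    This is an illustrative approximation, not G*Power's exact routine.
    """
    standard_normal = NormalDist()
    z_alpha = standard_normal.inv_cdf(1 - alpha / 2)  # two-tailed critical value
    z_beta = standard_normal.inv_cdf(power)
    n_normal = ((z_alpha + z_beta) / effect_size) ** 2
    n_corrected = n_normal + z_alpha ** 2 / 2
    return math.ceil(n_corrected)


n_needed = required_sample_size(effect_size=0.5, alpha=0.05, power=0.8)
```

Under these assumptions the approximation agrees with the value of approximately 34 stated above.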

Empathy is a multifaceted social psychological construct that is affected by many factors, such as the relationship and history between the observer (i.e., empathizer) and the observed. Such social relationships are also shaped by intimacy, while favorability also comes into play. As empathy is dependent on context and task [89,92], our study has an inherent limitation in generalization.

We also acknowledge that empathic expression is a result of a combination of many nonverbal modalities (e.g., voice, facial expression, posture). We focused on a singular modality, the facial movements captured from the involuntary heartbeat, because such measures could also be confounded by noise. Moreover, there can be a gap between the actual emotion the actor felt and the physiological measurement we acquired. Such a gap can be measured through a combination of expressive measures (facial muscle movement, gestures) and implicit measures (heart rate, GSR). Future studies may investigate multimodal recognition of empathy, in addition to facial micro-movements.

We strived to filter out the signals that represent empathy from the signal spectrum as closely as possible to the target population by guiding the participant not to move and to refrain from exaggerating facial expressions. We did not include any participants who may have made significant movements that would confound our results, such as participants with Tourette syndrome or a person with bruxism.

Privacy issues that may arise from identifying individuals can be crucial in research that considers prosocial behaviors. However, the suggested method of recognizing empathy can enhance privacy by not saving personal identification data (i.e., the originally recorded video) in the database. Only the processed secondary data (i.e., micro-movement signals) need to be saved in the database, because video frames can be analyzed in real time without recording the face images. The synchronization data can then be analyzed only if a key (i.e., a random number) matches an advertisement and a viewer. The analyzed micro-movement features can hardly be restored to the original facial image, so individuals cannot be identified from the data.

**Author Contributions:** A.C. designed the study with an investigation of previous studies and performed the experiments and original draft preparation; S.P. wrote the review and edited with an investigation of previous studies; H.L. analyzed raw data and developed micromovement algorithm; M.W. conceived the study and was in charge of overall direction and planning. All authors have read and agreed to the published version of the manuscript.

**Funding:** "This work was supported by the National Research Foundation of Korea (NRF) grant funded by the Korea government (MSIT) (NRF-2020R1A2B5B02002770)" and "This work was supported by Basic Science Research Program through the National Research Foundation of Korea (NRF) funded by the Ministry of Education (NRF-2021R1I1A1A01045641)".

**Institutional Review Board Statement:** The study was conducted according to the guidelines of the Declaration of Helsinki, and approved by the Institutional Review Board of the Sangmyung University, Seoul, Korea (SMUIRB C-2019-015).

**Informed Consent Statement:** Informed consent was obtained from all subjects involved in the study.

**Conflicts of Interest:** The authors declare no conflict of interest.

#### **References**


## *Article* **Multimodal Data Collection System for Driver Emotion Recognition Based on Self-Reporting in Real-World Driving**

**Geesung Oh <sup>1</sup>, Euiseok Jeong <sup>1</sup>, Rak Chul Kim <sup>1</sup>, Ji Hyun Yang <sup>2</sup>, Sungwook Hwang <sup>3</sup>, Sangho Lee <sup>3</sup> and Sejoon Lim <sup>2,</sup>\***


**Abstract:** As vehicles provide various services to drivers, research on driver emotion recognition has been expanding. However, current driver emotion datasets are limited by inconsistencies in collected data and by emotional state annotations inferred by others. To overcome this limitation, we propose a data collection system that collects multimodal datasets during real-world driving. The proposed system includes a self-reportable HMI application into which a driver directly inputs their current emotional state. Data collection was completed without any accidents for over 122 h of real-world driving using the system, which also considers the minimization of behavioral and cognitive disturbances. To demonstrate the validity of our collected dataset, we also provide case studies for statistical analysis, driver face detection, and personalized driver emotion recognition. The proposed data collection system enables the construction of reliable large-scale datasets on real-world driving and facilitates research on driver emotion recognition. The proposed system is available on GitHub.

**Keywords:** driver emotion recognition; multimodal; self-report; real-world driving

#### **1. Introduction**

In recent decades, the use of data-driven state-of-the-art techniques such as deep learning has increased interest in and performance of human affect recognition [1]. This has increased interest in the development of driver emotion recognition systems. Since driving is significantly affected by the driver's emotions [2–4], driver emotion recognition studies have been conducted for various purposes such as driving safety, adjusting vehicle dynamics, and emotion elicitation of drivers [4–6]. All studies are affected by the quality and quantity of data. Therefore, research on quantitative and qualitative datasets for driver emotion recognition is being actively conducted [7–14].

Although large-scale and high-quality datasets are collected through various studies, the collection conditions vary significantly. First, the experimental environment is largely divided into simulation and real-world driving. Second, the modalities of collected signals are also diverse. When broadly classified, there are video, audio, biophysiological, and controller area network (CAN) data. In detail, the positions of cameras and microphones differ, and the list of collected biophysiological or CAN signals is not unified. Lastly, the annotation of emotional states, which is critical for emotion recognition, also varies. The simplest way to classify a driver's emotional state is by the driving experiment context (e.g., assuming that heavy traffic on urban roads is high stress and light traffic on the highway is low stress) [7–9]. There is also an approach in which external annotators judge a driver's emotional state based on the collected information about the driver. However, this approach has limitations in that it is costly and requires others to report the driver's emotional states [10,11]. In the self-reporting approach, drivers report their own emotional states, but this should not interfere with the main task of driving. Hence, it is restricted to simulation experiments, or drivers have to report their emotional states after the completion of the experiments [12–14]. As previously stated, since data collection environments, measured data types, and annotation methods vary, Zepf et al. have argued that a consistent dataset is needed to facilitate research on driver emotion recognition [15].

**Citation:** Oh, G.; Jeong, E.; Kim, R.C.; Yang, J.H.; Hwang, S.; Lee, S.; Lim, S. Multimodal Data Collection System for Driver Emotion Recognition Based on Self-Reporting in Real-World Driving. *Sensors* **2022**, *22*, 4402. https://doi.org/10.3390/s22124402

Academic Editors: Mincheol Whang and Sung Park

Received: 29 April 2022 Accepted: 7 June 2022 Published: 10 June 2022

**Publisher's Note:** MDPI stays neutral with regard to jurisdictional claims in published maps and institutional affiliations.

**Copyright:** © 2022 by the authors. Licensee MDPI, Basel, Switzerland. This article is an open access article distributed under the terms and conditions of the Creative Commons Attribution (CC BY) license (https://creativecommons.org/licenses/by/4.0/).

In this paper, we propose a data collection system that can be used for a variety of driver emotion recognition studies. The proposed system collects the multimodal data that are representatively used for driver emotion recognition: videos from various views, audio, biophysiological data, CAN data, and drivers' emotional states. A driver's emotional state is collected through a human–machine interaction (HMI) application with which the driver self-reports their emotional state while driving. To realize a universal dataset, the collection experiment should be conducted in a real-world environment, not in a simulator. To conduct a real-world driving experiment, it is necessary to prevent behavioral and cognitive disturbances of drivers in advance to avoid potential traffic accidents. To prevent behavioral disturbance, the proposed system collects biophysiological data using wearable sensors instead of biometric sensors attached to the body. To minimize cognitive disturbance, the self-reporting application combines haptic and acoustic responses with a graphical user interface (GUI) based on user experience (UX). In addition, there are concerns that self-reports reflect strong bias due to false memories or the desire to impress others [15]. To address these concerns, we made the self-reporting interaction occur periodically. All considerations for reliable data are detailed in Section 3. The data collection system is installed on a vehicle, and data collection is performed under real-world driving conditions. Figure 1 shows the data collection vehicle during real-world driving.

**Figure 1.** A scene in which a driver's emotional state data is being collected during real-world driving using the proposed data collection system. The driver is self-reporting their emotional state by touching the HMI application mounted on the vehicle center fascia. The screenshot on the right is the English translation of the GUI of the HMI application implemented in Korean.

In the real-world data collection experiment using the proposed system, the experiment was completed without any accidents over four months. A large-scale dataset of over 122 h, 4446 km, and 787 GB was collected, along with 6356 self-reporting data points of drivers while driving. Through the statistical analysis of the collected data, the imbalance of self-reported emotion labels and the need for personalized driver emotion recognition were confirmed. In addition, case studies of driver face detection and of personalized single-modal and multimodal driver emotion recognition are presented to provide a comprehensive understanding.

Our main contributions can be described as follows:


The rest of this paper is organized as follows. Section 2 introduces related works on the data collection system for driver emotion recognition. Section 3 discusses the proposed data collection system in real-world driving. Section 4 provides data collection experiments, analysis of collected data, and case studies using the collected data. Section 5 concludes this work and describes further work. Appendix A describes details of terminologies and variables used in this paper.

#### **2. Related Works**

Driver state recognition research is being conducted from various viewpoints, ranging from the recognition of inattention [16], distraction [17], stress [5], and behavior [18] for safety to readiness [19] for autonomous driving. Together with the improvement of data-based human emotion recognition performance, this has led to research on driver emotion recognition [20–22]. Data used for driver emotion recognition are classified into video [11], audio [10], biophysiological [12], and CAN data [15]. In most cases, these data are not used alone but are fused to recognize the driver's emotional states [6–9]. However, no real-world driving data resource covers all of these data types. Ma et al. [11] collected only the video of a driver's face, and CIAIR [23] and DriveDB [7] collected video, audio, and biophysiological data, excluding CAN data. UTDrive DB collected various CAN data along with video and audio but did not collect biophysiological data [8]. In this study, we propose a data collection system that gathers all of these multimodal data in real-world driving.

Emotion annotation data are as important as sensor data in driver emotion recognition. To annotate a driver's emotional state, three major methods are employed: experimental context, external annotators, and self-reports. The experimental context is the simplest way to annotate an emotional state: the driver's emotional state is estimated from the driving situation or environment, e.g., the driver's stress level is annotated by road type or congestion level [7–9]. Since this approach rests on strong assumptions, it is limited in how accurately it can annotate an emotional state. Although using external annotators requires additional manpower and cost, it enables objective annotation. Jones and Jonsson recorded a driver's speech while driving using a simulator, and an external annotator annotated the driver's emotional state by listening to the recorded speech for driver emotion recognition [10]. Ma et al. developed an annotation tool to allow external annotators to annotate two emotion categories at five levels each based on driver face images collected during real-world driving [11]. This approach also has limitations in that experienced and trained annotators are required. Because self-reporting lets drivers report how they themselves feel while driving, it can overcome the limitations of the other approaches. However, driving is a task that requires considerable concentration, and drivers' self-reporting while driving affects the experiment. Hence, most self-reporting is performed immediately after the driving experiments. Taib et al. [13] and Ihme et al. [14] conducted driving simulation experiments on driver frustration and asked the participating drivers to self-report after the experiment. Taib et al. used a 9-point Likert scale, and Ihme et al. used a self-assessment manikin (SAM) [24] for self-reporting. Kato et al. proposed a self-report application that can visualize data and with which the driver selects their emotional state while driving [12]. Their application enables a driver's self-reporting to be performed in real time while driving, not after the experiment. However, this application was only used in a simulation experiment, and additional safety considerations are required to use it in real-world driving experiments. In addition, concerns about subjective biases that may be included in self-reports are another challenge to overcome [15]. In this study, we propose an HMI application that allows drivers to safely report their emotional states during real-world driving.

#### **3. Proposed Work**

In this section, a system that enables the simultaneous collection of video, audio, biophysiological, and CAN data during real-world driving is described. The system also includes an HMI application that interacts with the driver and collects the driver's emotional state. In other words, this section presents the hardware and software developed to build a self-report-based multimodal dataset for driver emotion recognition in real-world driving. All systems are built into the vehicle, as the data collection is performed under driving conditions. We used an IONIQ 1.6 Hybrid vehicle (Hyundai, Seoul, KR, https://www.hyundai.com/, accessed on 31 March 2022), shown in Figure 2a, as the base environment for building the proposed system.

Figure 3 shows the flowchart of the entire system. When the system starts, it first checks whether the vehicle is ignited. The system is designed to start after ignition because the surge voltage generated when the vehicle is ignited can reduce the quality of data collected using electronic sensors. In addition, for safety reasons, the system checks whether the vehicle is stopped before it starts and before it ends (blue rhombus in Figure 3). This prevents the driver from operating the system while driving. After confirming that data collection is possible, two types of metadata are requested before the main data collection. One is the name of the driver, which must be input manually by the driver. The other is the current odometer reading, which can be obtained automatically via vehicle CAN data. After the current odometer reading is obtained and stored as the starting odometer, the main data collection process starts. The main data collection process uses multiprocessing to efficiently collect the different multimodal data (orange rectangle in Figure 3). When a suitable end request is input into the system by the driver, the main data collection process is terminated, and if the vehicle is stopped, the vehicle odometer is read once more and stored as the ending odometer. Finally, all data, including the metadata and the collected data (green box in Figure 3), are integrated into one dataset (red rectangle in Figure 3), and the entire system is shut down. All processes in the proposed system are performed using a computer, shown in Figure 2d. The proposed system is released as an open source repository on GitHub (https://github.com/KMUIMLAB/DMS, accessed on 27 May 2022), and the details of each data type for multimodal data collection are discussed in the following sections.
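The start and end checks of the flowchart can be sketched as follows. This is a minimal illustration of the described gating logic, not code from the released repository: the names `VehicleState`, `Session`, `try_start`, and `try_end` are hypothetical, and the parallel collection processes are deliberately left out.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class VehicleState:
    """Hypothetical snapshot of the values the flowchart checks."""
    ignition_on: bool
    speed_kph: float
    odometer_km: float


@dataclass
class Session:
    """Metadata stored alongside the collected multimodal data."""
    driver_name: str
    start_odometer_km: float
    end_odometer_km: Optional[float] = None


def try_start(state: VehicleState, driver_name: str) -> Optional[Session]:
    """Start a session only if the vehicle is ignited and standing still."""
    if not state.ignition_on or state.speed_kph > 0:
        return None
    return Session(driver_name=driver_name, start_odometer_km=state.odometer_km)


def try_end(session: Session, state: VehicleState, end_requested: bool) -> bool:
    """End a session only on request and only while the vehicle is stopped."""
    if not end_requested or state.speed_kph > 0:
        return False
    session.end_odometer_km = state.odometer_km
    return True


# Example: a start attempt while moving is rejected, a start at standstill succeeds.
rejected = try_start(VehicleState(True, 12.0, 100.0), "Driver A")
session = try_start(VehicleState(True, 0.0, 100.0), "Driver A")
ended_while_moving = try_end(session, VehicleState(True, 30.0, 110.0), True)
ended_at_standstill = try_end(session, VehicleState(True, 0.0, 112.5), True)
```

Keeping the checks as pure functions of a vehicle snapshot makes the safety conditions easy to test independently of the sensors; in the described system, the collectors for each modality would run as separate processes between the two calls.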

#### *3.1. Video*

We use two RealSense D435i cameras (Intel, Santa Clara, CA, USA, https://www.intel.com/, accessed on 31 March 2022) to collect video data composed of various modalities. The RealSense camera provides a maximum of three video modalities: red, green, and blue (RGB), infrared (IR), and depth. In addition to the RGB image, the IR image, which is robust to environment changes, such as illumination changes, is essential in real-world driving. One camera is installed on the dashboard to capture the driver's face, as shown in Figure 2b, and the other is installed on the top of the passenger seat window to capture the driver's posture, as shown in Figure 2c. Since the sample rate of the camera can be set, we set it as *Rv* Hz. In other words, each camera sequentially captures *Rv* individual images per second.

**Figure 2.** Hardware of the data collection system built into the vehicle. (**a**) Vehicle exterior; (**b**) Inside view of the vehicle center fascia; (**c**) Inside view of the vehicle passenger seat; (**d**) Vehicle trunk. Two cameras are installed to collect the image data of a driver's face and posture (green). A microphone is installed on the right side of the driver seat's headrest to collect audio data in the cabin (blue). A wristband-type wearable sensor is worn on the driver's wrist to collect the driver's biophysiological data, and the collecting status can be monitored through a smartphone (orange). The CAN interface device supports the collection of vehicle CAN data (red). The monitor installed on the center fascia is a touch screen for interaction with the driver (yellow). The computer installed in the trunk of the vehicle integrates the collected data (magenta).

**Figure 3.** Flow chart of the proposed data collection system during real-world driving.

#### *3.2. Audio*

The CVM-VM10 II microphone (CoMica Technology, Shenzhen, Guangdong, CN, https://www.comica-audio.com/, accessed on 31 March 2022) was used to collect audio information in the cabin while driving. To collect data with audio information similar to what the driver hears, the cardioid condenser microphone was selected and placed close to the driver's ear. To minimize noise and vibrations that occur during real-world driving, the microphone was installed on the right side of the driver's seat headrest, along with the shock mount and wind muff, as shown in Figure 2c. The audio data collection system collects *Ra* audio data samples per second, according to the sample rate of *Ra* Hz, until the system stops.

#### *3.3. Biophysiological*

To collect biophysiological data of the driver, the biometric sensor must be in contact with the driver's body. The attached sensor may cause behavioral disturbances, resulting in potential accidents. For safe biophysiological data collection during real-world driving, it is necessary to prevent behavioral disturbances in advance, and we used an E4 wristband (Empatica, Boston, MA, USA, https://www.empatica.com/, accessed on 31 March 2022) as a solution. The E4 wristband (E4) is a wearable biometric sensor that can serve as an alternative to clinical devices, exhibiting similar data quality 85% of the time compared to the clinical standard device [25]. A comparison of the E4 and laboratory biometric sensors in terms of emotion recognition performance also showed similar accuracy [26]. Hence, we used the E4 for biophysiological data collection during real-world driving. The E4 provides skin temperature, electrodermal activity (EDA), photoplethysmography (PPG), and 3-axis acceleration of the band, along with interbeat interval (IBI) and heart rate (HR) through postprocessing. As shown in Figure 2b, biophysiological data collection is possible by simply wearing the E4 on the wrist while driving, and real-time monitoring is also possible using a mobile device through the application provided with the E4. Unlike the video and audio devices, the E4 collects each signal at its own optimized sampling rate, so no separate setting is required. Each sample rate is shown in Table 1.

#### *3.4. CAN*

The method of mounting additional sensors or collecting on-board diagnostics (OBD) signals can also be used to access vehicle signals; however, since we can access vehicle CAN, we can collect vehicle signals with the CAN interface device. CAN is a message-based protocol designed to allow vehicle controllers to communicate with each other. The USBcan Pro 2xHS v2 (KVASER, Mission Viejo, CA, USA, https://www.kvaser.com/, accessed on 31 March 2022) is a CAN interface device used to access vehicle CAN signals to collect vehicle data. As shown in Figure 2d, the device is located in the trunk of the vehicle and connects the vehicle CAN line to the computer. Among the many signals on CAN, we select key signals closely related to the driver. Since the selected key signals are updated according to the set cycle time, the sample rate of CAN data, *Rc*, is set according to the cycle time. The collected key data and the sample rate are presented in Table 1.

#### *3.5. HMI*

Drivers' emotion annotation is essential in datasets for driver emotion recognition. Although external annotators or the experimental context can be employed to estimate and annotate drivers' emotional states, we focused on annotating the driver's emotional state using reports from the driver rather than via estimation. This method is called self-reporting. Because it is performed during real-world driving experiments, it must be designed with an emphasis on safety: requiring drivers to report while driving may cause cognitive disturbances, which could lead to severe traffic accidents on the road.

To minimize cognitive disturbances, we propose an HMI application that periodically interacts with the driver through haptic and acoustic responses and receives the emotional state response from the driver. We used a TFX133T DEX monitor (HANSUNG, Seoul, KR, https://www.monsterlabs.co.kr/, accessed on 31 March 2022); the touch screen has a built-in speaker to realize haptic and acoustic responses. The screen was installed on the center fascia of the vehicle, as shown in Figure 2b. When data collection starts, the HMI application requests that the driver report their emotional state with a sound announcement as follows: "Please enter your current state". If there is no response from the driver for *Irr* seconds from the request, the application requests once more with the same sound announcement. If there is no response from the driver within *Is* seconds from the first request, the request is treated as a nonresponse so as not to disturb the driver, with a sound announcement as follows: "The input is delayed, so it enters in a nonresponse state". This skipping process is essential, as frequent response requests can interfere with safe driving. The driver can input an answer simply by touching the screen. When the input is completed, the input emotional state is displayed on the screen in large fonts, and at the same time, a sound announcement is provided as follows: "Your input is complete". This feedback minimizes confusion for the driver.

In addition to cognitive disturbances, self-reported emotion labels have limitations in that they reflect strong bias because of false memories or the desire to impress other people [15]. Repeated sampling in real time is necessary to minimize this bias [27]. That is, the self-reporting requests should be made continuously at periodic intervals. Hence, the proposed HMI application continuously requests a response at an interval, *Ir*, from when driving starts to when it ends. The interval between response requests, *Ir*, is tuned through test driving. Moreover, our system allows the driver to report their emotional state at any time by touching the screen, even between response requests. This feature enables logging drivers' rapid emotional changes under varying real-world driving conditions.
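The request, re-request, and skip timing described above can be sketched as a small simulation. This is an illustrative model only, not the released application: the names `HmiTiming`, `request_times`, and `run_request_cycle` are hypothetical, the default values are the settings reported for the experiment in Section 4.1, and haptic feedback and spontaneous reports between requests are not modeled.

```python
from dataclasses import dataclass
from typing import List, Optional, Tuple

REQUEST = "Please enter your current state"
SKIPPED = "The input is delayed, so it enters in a nonresponse state"
DONE = "Your input is complete"


@dataclass(frozen=True)
class HmiTiming:
    request_interval_s: float = 60.0  # I_r: time between periodic requests
    rerequest_after_s: float = 10.0   # I_rr: repeat the request if still unanswered
    skip_after_s: float = 20.0        # I_s: record a nonresponse after this long


def request_times(drive_duration_s: float, timing: HmiTiming) -> List[float]:
    """Times (s after driving starts) at which a periodic request is issued."""
    count = int(drive_duration_s // timing.request_interval_s) + 1
    return [k * timing.request_interval_s for k in range(count)]


def run_request_cycle(
    timing: HmiTiming, touch: Optional[Tuple[float, str]] = None
) -> Tuple[List[Tuple[float, str]], str]:
    """Simulate one request cycle.

    `touch` is (seconds after the first request, selected state), or None if
    the driver never answers. Returns the sound announcements as
    (time, text) pairs and the label that is recorded.
    """
    events = [(0.0, REQUEST)]
    if touch is not None and touch[0] < timing.skip_after_s:
        touch_time, state = touch
        if touch_time >= timing.rerequest_after_s:
            events.append((timing.rerequest_after_s, REQUEST))
        events.append((touch_time, DONE))
        return events, state
    events.append((timing.rerequest_after_s, REQUEST))
    events.append((timing.skip_after_s, SKIPPED))
    return events, "Nonresponse"


timing = HmiTiming()
quick_events, quick_label = run_request_cycle(timing, (4.0, "Happy|Neutral"))
late_events, late_label = run_request_cycle(timing, (14.0, "Sad|Fatigued"))
missed_events, missed_label = run_request_cycle(timing)
```

Separating the timing parameters from the cycle logic mirrors the way the intervals are tuned through test drives without changing the interaction itself.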

The proposed HMI application can apply any representative emotional states as long as they are discretely expressed states. However, since the driver has to choose the most similar to their current emotional state among them, cognitive disturbances can occur if there is difficulty in choosing an emotion no matter how well the interaction with the driver is completed. Therefore, the discrete representative emotional states should be simple, not numerous, and suitable for the driving situation.

#### *3.6. GUI*

We propose a GUI design to reduce drivers' cognitive disturbance in self-reporting through the HMI while driving. For the UX-based GUI of the HMI application, we suggest the following four representative driver emotional states, chosen by referring to the emotions that can be induced in a driving situation [28]: "Happy|Neutral", "Angry|Disgusting", "Excited|Surprised", and "Sad|Fatigued".


The proposed GUI designs are shown in Figure 4. There are two factors to consider in the GUI design: the layout and the color of the emotional states. The layout of the emotional states follows the valence–arousal plane, a popular concept used in emotional representation [29]. Based on the division of the x-axis into pleasure and misery in the valence–arousal plane, we placed "Happy|Neutral" on the right of the screen and "Angry|Disgusting" on the left. Based on the division of the y-axis into arousal and sleepiness, we placed "Excited|Surprised" on the top of the screen and "Sad|Fatigued" on the bottom. The overall layout of the emotional states is in the form of a rhombus, as shown in Figure 4. In the GUI shown in Figure 4, each emotional state is expressed in a different color. The correlation between basic colors and human psychological state has been identified, and states that can be felt by humans have been classified according to color characteristics [30]. Based on this, appropriate colors were used for each emotional state. The GUI design provides not only a default GUI, as shown in Figure 4a, but also a touch GUI, as shown in Figure 4b. Therefore, when the driver inputs the current emotional state by touching the screen, the GUI provides visual feedback, as shown in Figure 4c, along with the sound announcement. The UX-based GUI of the HMI application gives the driver a more accurate intuition about the proposed representative emotional states.
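The rhombus layout can be expressed as a small coordinate mapping. The sketch below is only an illustration of the placement rule described above: the function `rhombus_layout`, the quarter-screen offsets, and the example resolution are assumptions, not values taken from the released application.

```python
from typing import Dict, Tuple


def rhombus_layout(width_px: int, height_px: int) -> Dict[str, Tuple[float, float]]:
    """Return button centers (x, y) in screen pixels for the four states.

    Screen coordinates are assumed to grow to the right (x) and downward (y),
    so high arousal is placed at a smaller y value.
    """
    center_x, center_y = width_px / 2, height_px / 2
    dx, dy = width_px / 4, height_px / 4  # hypothetical offsets from the center
    return {
        "Happy|Neutral": (center_x + dx, center_y),      # positive valence: right
        "Angry|Disgusting": (center_x - dx, center_y),   # negative valence: left
        "Excited|Surprised": (center_x, center_y - dy),  # high arousal: top
        "Sad|Fatigued": (center_x, center_y + dy),       # low arousal: bottom
    }


layout = rhombus_layout(1920, 1080)
```

Because each state occupies one corner of the rhombus, the driver can locate a state by direction alone, which is the intent of following the valence–arousal plane.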

**Figure 4.** GUI of HMI application for self-reporting of driver emotional state. (**a**) GUI in default; (**b**) GUI in touch; (**c**) GUI example where "Angry|Disgusting" state is touched.

#### **4. Experiments**

This section presents the details of the data collection experiment conducted on the basis of the proposed data collection system and some case studies based on the collected data from the experiment.

#### *4.1. Data Collection Experiment*

Motivated by the need for a dataset in real-world driving, the data collection experiment with the proposed system described in Section 3 was conducted on the road. During real-world driving, the cameras are used to capture RGB and IR modalities at the sample rate, *Rv*, of 15 Hz, and audio data are collected at the sample rate, *Ra*, of 44,100 Hz. Biophysiological data are collected, as described in Section 3.3. The following CAN data signals are collected: accelerator pedal position, brake pedal position, steering wheel angle, yaw rate, longitudinal acceleration, and lateral acceleration. All CAN data are collected at the sample rate, *Rc*, of 100 Hz. The self-reportable application collected the driver's emotional state in five states: the four representative emotional states mentioned in Section 3.5 and nonresponse. The response request time interval, *Ir*, is set to 60 s, so the sample rate of the self-reported emotion label, *Rs*, is 1/60 Hz. Because the driver is encouraged to self-report whenever there is a change in their emotional state, even without a response request, the self-reported emotional state annotation includes information on the driver's emotional change for unexpected or urgent events. The rerequest time interval, *Irr*, and the skip time interval, *Is*, are set to 10 and 20 s, respectively. All interval times have been adjusted through several test drives in real-world driving so that there is no safety issue. Details, including the save format and unit for all data collected through the experiment, are described in Table 1.
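To make the relation between these sample rates concrete, the following sketch counts how many samples of each stream fall within one response request interval. It only restates the rates given above; the dictionary keys and variable names are illustrative, and the E4 streams are omitted because their rates are listed in Table 1 rather than in the text.

```python
REQUEST_INTERVAL_S = 60  # I_r, so the label rate R_s is 1/60 Hz

SAMPLE_RATES_HZ = {
    "video frames (per camera and modality)": 15,  # R_v
    "audio samples": 44_100,                       # R_a
    "CAN samples (per signal)": 100,               # R_c
}

# Number of samples of each stream between two periodic self-report requests.
samples_per_interval = {
    name: rate_hz * REQUEST_INTERVAL_S for name, rate_hz in SAMPLE_RATES_HZ.items()
}

label_rate_hz = 1 / REQUEST_INTERVAL_S
```

Each periodic label therefore corresponds to 900 video frames per camera and modality, 2,646,000 audio samples, and 6000 samples of each CAN signal.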

To address the lack of long-term datasets, the experiment was conducted with a few people who could participate continuously for a long period. Four males participated in the experiment for four months from July 2021 to October 2021. The detailed information of these participants is described in Table 2.

During these four months, a large-scale dataset was collected by the participants' driving in wild, uncontrolled conditions. The weather conditions were divided into four categories, and the proportions are as follows: Sunny: 20.4%, Cloudy: 40.6%, Overcast: 11.8%, Rainy: 27.3%. Because safety is considered in the proposed data collection system, no accidents occurred during this period, and according to the data collection experiment results, the total experiment time was 122 h 15 min, the total driving mileage was 4446 km, the total number of self-reported emotion labels was 6356, and 787 GB data were collected.


**Table 1.** Details of data collected by experiment.

**Table 2.** Detailed information of participated drivers.


#### *4.2. Case Studies*

This section presents some case studies using the collected multimodal dataset for driver emotion recognition. Section 4.2.1 discusses the detailed analysis of the dataset collected in real-world driving. Sections 4.2.2 and 4.2.3 present case studies of driver emotion recognition using single-modal or multimodal inputs.

#### 4.2.1. Statistical Analysis

In this section, we discuss the detailed analysis results for the dataset collected in the real-world driving experiment. Figure 5 depicts the self-report proportion for each driver as a pie chart. The emotion with the highest proportion was "Happy|Neutral". More than 50% of the drivers' self-reported emotion labels are "Happy|Neutral", and the proportion reaches up to approximately 82%. The proportion of the other three emotions varies by driver, but each accounts for a small proportion compared to "Happy|Neutral".

**Figure 5.** Pie charts for self-reported emotion label proportion by driver. (**a**) Driver A; (**b**) Driver B; (**c**) Driver C; (**d**) Driver D; (**e**) Legend of the pie charts.

To confirm the self-reported emotion label tendency of each emotion, the distribution of self-reports and vehicle speed by emotion for all drivers is depicted in Figures 6 and 7. In Figure 6, the start and end of all individuals driving were normalized from 0 to 100 steps and divided into 50 sections. The number of self-reported emotion labels for each section is displayed as a histogram and kernel density estimate plot to evaluate the distribution by emotion. "Happy|Neutral" had several distributions at the start and end of the driving, and had an even distribution throughout the driving process, as shown in Figure 6a. Overall, "Excited|Surprised" and "Angry|Disgusting" had an irregular distribution. "Excited|Surprised" seemed to have a greater variance than "Angry|Disgusting", as shown in Figure 6b,c, and it is judged that "Excited|Surprised" was more maintained when the emotion was induced than "Angry|Disgusting". As shown in Figure 6d, the distribution of "Sad|Fatigued" emotion increases toward the middle and late stages of driving. Figure 7 shows the number of self-reported emotion labels at that vehicle speed with a histogram and kernel density estimate plot to evaluate the distribution of vehicle speed by self-reported emotion labels. "Happy|Neutral" had high distributions from 0 to about 15 kph, and an even distribution throughout the driving process, as shown in Figure 7a. In Figure 7b,c, the fact that the vehicle speed had a relatively irregular distribution compared to "Happy|Neutral" and "Sad|Fatigued" in "Excited|Surprised" and "Angry|Disgusting" is a common feature with the distribution of self-reported emotion labels in Figure 6. As shown in Figure 7d, the distribution of the "Sad|Fatigued" emotion had a particularly high distribution from 0 to about 30 kph. Based on the distribution of self-reports and vehicle speed by emotion (especially in Figure 6a), "Happy|Neutral" was the default emotion and the others were induced while driving.
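The normalization and binning used for Figure 6 can be sketched as follows. This is a minimal illustration under assumptions: the function names and the example timestamps are hypothetical, and the kernel density estimate is omitted.

```python
def normalized_position(t, t_start, t_end):
    """Map a timestamp to the 0-100 scale of one drive."""
    return 100.0 * (t - t_start) / (t_end - t_start)


def bin_counts(positions, n_bins=50, lo=0.0, hi=100.0):
    """Count positions in n_bins equal-width sections of [lo, hi]."""
    counts = [0] * n_bins
    width = (hi - lo) / n_bins
    for p in positions:
        # A report exactly at the end of the drive goes into the last section.
        idx = min(int((p - lo) / width), n_bins - 1)
        counts[idx] += 1
    return counts


# Hypothetical drive: starts at t = 1000 s, ends at t = 4600 s, four reports.
reports = [1000.0, 1900.0, 2800.0, 4600.0]
positions = [normalized_position(t, 1000.0, 4600.0) for t in reports]
counts = bin_counts(positions)
```

Normalizing each drive separately before pooling lets drives of different duration contribute to the same 50 sections.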

**Figure 6.** Distribution of self-reported emotion labels in real-world driving. (**a**) Happy|Neutral; (**b**) Excited|Surprised; (**c**) Angry|Disgusting; (**d**) Sad|Fatigued.

In addition to self-reported emotion label data, we used statistical hypothesis tests to analyze the significance of the collected sensor data. We formulated the null hypothesis (*H*0) that the collected structured data do not differ according to the self-reported emotion label and tested each type of structured data for differences between emotions with a Kruskal–Wallis H test [31,32]. If the *p*-value of the Kruskal–Wallis H test is less than the significance level of 0.05, the null hypothesis (*H*0) is rejected in favor of the alternative hypothesis (*H*1). Table 3 reports, for each type of data, the *p*-value and which hypothesis was accepted. When the Kruskal–Wallis H test confirms a statistically significant difference among the four self-reported emotion labels, a post-hoc test is also needed to determine how many label pairs differ significantly. We tested all six self-reported emotion label pairs with the Mann–Whitney U test [33,34], a nonparametric statistical hypothesis test, and the total number of pairs for which the null hypothesis (*H*0) was rejected is also listed in Table 3. As shown in Table 3, all collected structured data had statistically different distributions across self-reported emotion labels, and three or more of the six pairs were statistically significant.
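The testing procedure can be sketched in plain Python as below. This is an illustrative sketch, not the authors' code: the sample values are hypothetical, the H statistic is computed without tie correction, and the decision uses the chi-square critical value for 3 degrees of freedom at the 0.05 level (7.815) rather than an exact *p*-value. In practice a statistics library would normally be used for both tests.

```python
from itertools import combinations


def average_ranks(values):
    """Ranks starting at 1; tied values share the mean of their ranks."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        mean_rank = (i + j) / 2.0 + 1.0
        for k in range(i, j + 1):
            ranks[order[k]] = mean_rank
        i = j + 1
    return ranks


def kruskal_wallis_h(groups):
    """Kruskal-Wallis H statistic (no tie correction)."""
    pooled = [v for g in groups for v in g]
    ranks = average_ranks(pooled)
    n_total = len(pooled)
    weighted, start = 0.0, 0
    for g in groups:
        rank_sum = sum(ranks[start:start + len(g)])
        weighted += rank_sum ** 2 / len(g)
        start += len(g)
    return 12.0 / (n_total * (n_total + 1)) * weighted - 3.0 * (n_total + 1)


CHI2_CRIT_DF3 = 7.815  # chi-square critical value, df = 3, alpha = 0.05

# Hypothetical sensor values grouped by the four emotion labels.
samples = {
    "Happy|Neutral": [1.0, 2.0, 3.0],
    "Excited|Surprised": [4.0, 5.0, 6.0],
    "Angry|Disgusting": [7.0, 8.0, 9.0],
    "Sad|Fatigued": [10.0, 11.0, 12.0],
}
h_stat = kruskal_wallis_h(list(samples.values()))
reject_h0 = h_stat > CHI2_CRIT_DF3

# The six label pairs that the post-hoc pairwise test would examine.
label_pairs = list(combinations(samples, 2))
```

Four labels give six unordered pairs, which is why the post-hoc results are reported as a count out of six.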

**Figure 7.** Distribution of vehicle speed by self-reported emotion labels in real-world driving. (**a**) Happy|Neutral; (**b**) Excited|Surprised; (**c**) Angry|Disgusting; (**d**) Sad|Fatigued.


**Table 3.** Statistical hypothesis test results of structured data by self-reported emotion label.

Although the statistical hypothesis test results can explain the significance of the emotion recognition of the collected sensor data, another aspect that requires analysis is whether there is a significant distribution difference according to the driver. Therefore, the same statistical hypothesis test as above was repeated by separating the data for each driver, and the results are shown in Table 4. EDA and steering wheel angle are the only structured data with the same results for all drivers. Not only were the post-hoc results different, but also the results of determining whether to reject the null hypothesis were different for each driver. That means the collected data significantly vary from driver to driver. This may be because each driver has a different way of expressing their emotions while driving. Therefore, different data will be required to recognize each driver's emotion. In other words, emotion recognition research requires personalization.


**Table 4.** Statistical hypothesis test results of structured data by self-reported emotion label according to driver.

#### 4.2.2. Driver Face Detection

One of the most common approaches to recognizing a driver's emotional state is using face images. Studies adopting this approach generally use a well-known face detector to crop only the face image from the driver's frontal image and use it as input data. The most popular face detectors have proven their performance only on in-the-wild datasets such as FDDB [35] or WIDER FACE [36]. Thus, we evaluate the performance of five popular face detectors, Haar [37], Dlib [38], OpenCV [39], MMOD [40], and MTCNN [40], on detecting the driver's front image in the collected real-world driving dataset. First, the detection results of the five detectors for the collected IR-front images were output and qualitatively compared. Figure 8 is an example of the detection results of the five detectors. According to the results, Haar has a high false positive rate, i.e., nonfaces are detected, and Dlib has a high false negative rate, i.e., faces are not detected. In contrast to Haar and Dlib, other detectors are capable of detecting the driver's face to a similar degree.

**Figure 8.** Example of the detection results of five face detectors. The bounding boxes (red) are face detection results. (**a**) Haar; (**b**) Dlib; (**c**) OpenCV; (**d**) MMOD; (**e**) MTCNN.

To compare the three similar face detectors accurately, we selected 200 different images and labeled the face bounding boxes. If the intersection over union (IoU) between the labeled bounding box and the detected bounding box is greater than or equal to the threshold, the detection is considered a true positive (TP); if the IoU is less than the threshold, it is considered a false positive (FP). Figure 9 shows the precision–recall (PR) curves drawn using these TPs and FPs. The face detectors can be compared quantitatively using the average precision (AP), calculated as the area under the PR curve. Depending on whether the threshold is 0.5, 0.75, or 0.95, AP is expressed as AP50, AP75, or AP95, respectively. Refer to Table 5 for detailed comparison results. Since the inference speed of the face detector is as important as detection accuracy, Table 5 also describes the inference speed and the GPU specifications.
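The evaluation described above can be sketched as follows. The box format, the function names, and the example detections are assumptions for illustration, and the area under the PR curve is accumulated with a simple rectangle rule, which is only one of several conventions and may differ from the one used for Table 5.

```python
def iou(box_a, box_b):
    """IoU of two boxes given as (x_min, y_min, x_max, y_max)."""
    ix = max(0.0, min(box_a[2], box_b[2]) - max(box_a[0], box_b[0]))
    iy = max(0.0, min(box_a[3], box_b[3]) - max(box_a[1], box_b[1]))
    inter = ix * iy
    area_a = (box_a[2] - box_a[0]) * (box_a[3] - box_a[1])
    area_b = (box_b[2] - box_b[0]) * (box_b[3] - box_b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0


def average_precision(detections, n_labeled, threshold):
    """Area under the PR curve for (confidence, iou_with_label) detections."""
    tp = fp = 0
    ap, prev_recall = 0.0, 0.0
    # Sweep detections from most to least confident.
    for _, overlap in sorted(detections, key=lambda d: d[0], reverse=True):
        if overlap >= threshold:
            tp += 1  # IoU at or above the threshold counts as TP
        else:
            fp += 1  # otherwise FP
        precision = tp / (tp + fp)
        recall = tp / n_labeled
        ap += precision * (recall - prev_recall)
        prev_recall = recall
    return ap


labeled = (0.0, 0.0, 10.0, 10.0)
detected = (0.0, 0.0, 10.0, 8.0)
overlap = iou(labeled, detected)

# Hypothetical detections on three labeled images.
dets = [(0.9, 0.80), (0.8, 0.60), (0.7, 0.30)]
ap50 = average_precision(dets, n_labeled=3, threshold=0.5)
ap75 = average_precision(dets, n_labeled=3, threshold=0.75)
```

Raising the threshold turns the second detection from a TP into an FP, so the same detector output yields a lower AP75 than AP50.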

**Figure 9.** PR curve for face detectors capable of detecting the driver's face. The thresholds are 0.5 and 0.75. (**a**) OpenCV, threshold is 0.5; (**b**) MMOD, threshold is 0.5; (**c**) MTCNN, threshold is 0.5; (**d**) OpenCV, threshold is 0.75; (**e**) MMOD, threshold is 0.75; (**f**) MTCNN, threshold is 0.75.

**Table 5.** Driver's face detection performance comparison of face detectors.


OpenCV has the fastest inference speed, but its detection performance is low. For MMOD and MTCNN, AP50 is at a similar level, but at AP75, the detection performance of MMOD decreases rapidly. Although the AP75 of MTCNN is lower than its AP50, the difference is insignificant. Conversely, in terms of inference speed, MMOD significantly outperforms MTCNN. Since the inference speed of MTCNN is insufficient, it seems appropriate to choose the driver face detector according to the purpose or the available computational resources. In terms of AP95, the performance of all detectors is 0.0. This is because the driver's face occupies a small area in the driver's front image, so the IoU may not exceed the threshold of 0.95 owing to differences in how the bounding box is labeled, such as whether it includes only the eyes and nose or also the forehead and chin. Figure 10 shows an example image in which the detected and labeled driver face bounding boxes have an IoU of 0.68; the detection still captures the driver's facial expression sufficiently. In face detection for driver emotion recognition, the threshold should therefore not be as high as 0.75 or 0.95. Accordingly, we crop the face image using the MMOD face detector, which achieved the highest detection performance in AP50, for driver emotion recognition, as discussed in Section 4.2.3.

**Figure 10.** Example image with IoU of 0.68. Area of union (green and red) is 7441, and area of overlap (blue) is 5040.
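As a worked check of the caption values, the IoU is the ratio of the two reported areas:

$$\text{IoU} = \frac{\text{area of overlap}}{\text{area of union}} = \frac{5040}{7441} \approx 0.677$$

This value passes a threshold of 0.5 but fails thresholds of 0.75 and 0.95.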

#### 4.2.3. Personalized Driver Emotion Recognition

This section discusses the results of personalized driver emotion recognition utilizing single or multimodal data. Since individual driver data are required for personalized driver emotion recognition training, the data required to complete the training should be as small as possible, and the performance of the trained recognition model should be preserved for as long as possible. Therefore, the collected data are sorted in ascending order of mileage, and the mileage for completing the collection of training data, *K*, is determined. The data collected during *K* km driving from the initial mileage for each individual are used as training data, and the data from thereafter to the last data are used as test data. We set the completing mileage for the training data, *K*, to 500 km, and to obtain more test data than training data, we experimented with drivers A and B, who collected data over 1000 km.
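The mileage-based split can be sketched as below. The record layout and the function name are hypothetical, and treating a sample recorded at exactly *K* km as training data is an assumption.

```python
def split_by_mileage(records, k_km=500.0):
    """Split (cumulative_km, sample) records into training and test sets.

    Samples collected within the first k_km of a driver's cumulative
    mileage form the training set; everything after forms the test set.
    """
    ordered = sorted(records, key=lambda r: r[0])  # ascending mileage
    train = [r for r in ordered if r[0] <= k_km]
    test = [r for r in ordered if r[0] > k_km]
    return train, test


# Hypothetical records for one driver: (cumulative mileage in km, sample id).
records = [(820.0, "s4"), (120.0, "s1"), (499.5, "s2"), (500.5, "s3"), (1100.0, "s5")]
train, test = split_by_mileage(records)
```

Because the split is chronological rather than random, the test set measures how well a model trained on early drives holds up on later ones.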

We proposed a personalized driver emotion recognition model based on deep learning networks that recognize a driver's emotional state using four multimodal inputs: front and side image, biophysiological, and CAN data. The proposed model is trained and verified using only individual data, and, as shown in Figure 11, each multimodal input performs single-modal emotion recognition and multimodal emotion recognition through an ensemble model. Each single-modal model and multimodal recognition model are described as follows.

• Single-modal of front image (*Sf*): The single-modal recognition model of the front image uses 2 s of front IR images, from 4 s to 2 s before the driver's self-report. Because RGB images are vulnerable to changes in illuminance, IR images, which can always capture a stable image, are used as input. The final 2 s before self-reporting are excluded from the input data because the driver makes a uniform motion to self-report during that time. The input images are divided in time into six equal parts and fed to a face detector; for each part, the MMOD-based face detector outputs the one cropped face image with the highest confidence value. The cropped images are resized to the input shape of the feature extractor and sequentially fed into a feature extractor and a classifier based on CAPNet [41]. Because the classification form is different from that of CAPNet, only the number of units in the top layer of the classifier is modified to the number of representative driver emotional states. The last activation function is softmax, which outputs the probability of each representative driver emotional state.
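The frame selection for the front-image input can be sketched as follows, assuming the 15 Hz video rate from Section 4.1. The function names, the frame indexing, and the confidence dictionary are hypothetical stand-ins for the actual detector output.

```python
FRAME_RATE_HZ = 15  # R_v from Section 4.1
N_PARTS = 6


def window_frame_indices(report_frame, start_s=4.0, end_s=2.0, rate_hz=FRAME_RATE_HZ):
    """Frame indices from start_s to end_s before the self-report frame."""
    first = report_frame - int(start_s * rate_hz)
    last = report_frame - int(end_s * rate_hz)  # exclusive
    return list(range(first, last))


def split_equal(indices, n_parts=N_PARTS):
    """Split indices into n_parts consecutive parts (any remainder is dropped)."""
    size = len(indices) // n_parts
    return [indices[i * size:(i + 1) * size] for i in range(n_parts)]


def best_frame_per_part(parts, confidence):
    """Pick, in each part, the frame whose face detection is most confident."""
    return [max(part, key=lambda f: confidence.get(f, 0.0)) for part in parts]


frames = window_frame_indices(report_frame=900)
parts = split_equal(frames)
# Hypothetical detector confidences, one per frame.
conf = {f: (f % 5) / 10.0 for f in frames}
selected = best_frame_per_part(parts, conf)
```

At 15 Hz the 2 s window holds 30 frames, so each of the six parts covers 5 frames and contributes one cropped face to the input sequence.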


**Figure 11.** Deep learning-based personalized driver emotion recognition model.

It is necessary to define a loss function when training the proposed models. Because the self-reported emotion labels are imbalanced, as described in Section 4.2.1, high performance cannot be expected if a typical loss function such as cross entropy is used. We overcome the data imbalance problem by making the precision and recall differentiable, computing the likelihood values of TP, FP, and false negative (FN) from the predicted probabilities. The loss function we used is as follows:

$$L(\mathbf{y}, \mathbf{\hat{y}}) = 1 - \frac{1}{N} \left( \frac{p_1^{\text{TP}}}{p_1^{\text{TP}} + p_1^{\text{FP}} + \epsilon} + \sum_{i=2}^{N} \frac{p_i^{\text{TP}}}{p_i^{\text{TP}} + p_i^{\text{FN}} + \epsilon} \right) \tag{1}$$

$$\mathbf{p}^{\text{TP}} = \mathbf{y} \circ \mathbf{\hat{y}} \tag{2}$$

$$\mathbf{p}^{\text{FP}} = (\begin{bmatrix} 1.\\ 1.\\ 1.\\ 1. \end{bmatrix} - \mathbf{y}) \circ \mathbf{\hat{y}} \tag{3}$$

$$\mathbf{p}^{\text{FN}} = \mathbf{y} \circ (\begin{bmatrix} 1. \\ 1. \\ 1. \\ 1. \end{bmatrix} - \mathbf{\hat{y}}) \tag{4}$$

where **y** and **ŷ** represent the one-hot vector of the self-reported emotion and the vector of the predicted emotion, respectively, and the first element of each vector represents the default emotion, "Happy|Neutral". **p**<sup>TP</sup>, **p**<sup>FP</sup>, and **p**<sup>FN</sup> are the likelihood values of TP, FP, and FN, respectively, and ∘ is the element-wise product.

Equation (1) is a loss function for increasing the precision of the default emotion and the recall of the induced emotions, where *N* represents the total number of representative emotions, and *ε* represents a very small value that prevents the precision or recall terms from diverging when a denominator is zero. Because this loss function, *L*(**y**, **ŷ**), expresses the precision and recall of each prediction class probabilistically, it can be used for backpropagation. It increases precision for the majority class, the default emotional state, and increases recall for the minority classes, the inducible emotional states.
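A plain-Python sketch of this loss is given below. It is an illustration under assumptions rather than the authors' implementation: the likelihoods are summed over a batch before the ratios are taken, the false-negative likelihood is taken as the label mass that the prediction misses, that is, *y* times (1 minus the predicted probability), and the function name and `eps` value are hypothetical. Actual training would use tensors in an automatic differentiation framework.

```python
def precision_recall_loss(y_true, y_pred, eps=1e-7):
    """Soft precision/recall loss in the form of Equation (1).

    y_true: list of one-hot lists; y_pred: list of softmax probability lists.
    Index 0 is the default emotion; the other indices are induced emotions.
    """
    n_classes = len(y_true[0])
    p_tp = [0.0] * n_classes
    p_fp = [0.0] * n_classes
    p_fn = [0.0] * n_classes
    for y, y_hat in zip(y_true, y_pred):
        for i in range(n_classes):
            p_tp[i] += y[i] * y_hat[i]          # likelihood of a true positive
            p_fp[i] += (1.0 - y[i]) * y_hat[i]  # likelihood of a false positive
            p_fn[i] += y[i] * (1.0 - y_hat[i])  # label mass the prediction misses
    # Soft precision for the default emotion (class 1 in the text).
    score = p_tp[0] / (p_tp[0] + p_fp[0] + eps)
    # Soft recall for each induced emotion.
    for i in range(1, n_classes):
        score += p_tp[i] / (p_tp[i] + p_fn[i] + eps)
    return 1.0 - score / n_classes


one_hot = [[1.0, 0.0, 0.0, 0.0], [0.0, 1.0, 0.0, 0.0],
           [0.0, 0.0, 1.0, 0.0], [0.0, 0.0, 0.0, 1.0]]
uniform = [[0.25, 0.25, 0.25, 0.25] for _ in range(4)]

loss_perfect = precision_recall_loss(one_hot, one_hot)
loss_uniform = precision_recall_loss(one_hot, uniform)
```

With perfect predictions the loss approaches 0, while a uniform prediction over four classes gives 0.75, since every soft precision or recall term equals 0.25.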

The evaluation results on the test data are reported in terms of F1 score, precision, and recall for each driver. As mentioned in Section 4.2.1, since the representative driver emotional states are divided into default and inducible emotions, the recognition performance of inducible emotions is evaluated first. Tables 6 and 7 summarize, for each driver, the performance of distinguishing inducible emotions from the default emotion. The highest recognition performance is the F1 score of 0.698 of *Ss* for Driver A and 0.667 of *Msbc* for Driver B. As expected in Section 4.2.1, the input modals with the best performance differed for each driver. Driver A achieved the best performance in a single front image, and Driver B achieved the best with the combination of side image, biophysiological, and CAN data. However, their performance was similar. For Driver B, all evaluated models performed similarly, from 0.562 to 0.667. For Driver A, models without CAN data performed similarly, from 0.613 to 0.696, but models with CAN data, such as *Sc*, *Mfsc*, *Mfbc*, *Msbc*, and *Mfsbc*, performed significantly worse, from 0.228 to 0.469. This can be interpreted as follows: when inducible emotions are induced in Driver B while driving, they are expressed across the front and side images and the biophysiological and CAN data, whereas for Driver A the induction of emotion is not expressed in the CAN data. These results may support the view that driver emotion recognition necessitates personalization.
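For reference, the three reported metrics follow from the TP, FP, and FN counts as in the sketch below. The counts in the example are hypothetical and are not taken from Tables 6 and 7.

```python
def precision_recall_f1(tp, fp, fn):
    """Precision, recall, and F1 score from confusion counts."""
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    if precision + recall == 0.0:
        return precision, recall, 0.0
    f1 = 2.0 * precision * recall / (precision + recall)  # harmonic mean
    return precision, recall, f1


# Hypothetical counts for the inducible-emotion class.
p, r, f1 = precision_recall_f1(tp=30, fp=10, fn=20)
```

Because F1 is the harmonic mean of precision and recall, it stays low when either of the two is low, which makes it suitable for imbalanced labels.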


**Table 6.** Performance of inducible emotion recognition of Driver A.

**Table 7.** Performance of inducible emotion recognition of Driver B.


The performance of driver emotion recognition among the inducible emotions for each driver is also summarized. The recognition performance for each of the three inducible emotions and the average of the three F1 scores are described in Tables 8 and 9. Comparing the recognition performance using the F1 score of each emotion and the average value, none of the input models with the best performance matched between the drivers. The common results, regardless of the driver, were that the "Sad|Fatigued" emotion had the best recognition performance and the "Angry|Disgusting" emotion had the worst. The "Sad|Fatigued" emotion recognition performance was 0.835 and 0.859, and the "Excited|Surprised" emotion recognition performance was 0.653 and 0.583, for Drivers A and B, respectively; both emotions were thus recognized at similar levels for the two drivers. However, for the "Angry|Disgusting" emotion, recognition performance differed, at 0.571 and 0.373 for Drivers A and B, respectively. Notably, there was very little performance difference between all evaluated models. The difference between the highest and lowest average F1 score was 0.163 and 0.061 for Drivers A and B, respectively. This can serve as a fail-safe property of the driver emotion recognition model, since the input modals provide redundancy for one another.


**Table 8.** Performance of driver emotion recognition among inducible emotions of Driver A.


**Table 9.** Performance of driver emotion recognition among inducible emotions of Driver B.

#### **5. Conclusions**

Real-world datasets for driver emotion recognition are diverse but lack consistency in the collected data. To overcome this limitation, we proposed a data collection system capable of collecting multimodal datasets during real-world driving. The proposed system was installed in a vehicle and collected the following multimodal data while driving on real roads: videos captured from two viewpoints, audio inside the cabin, the driver's biophysiological data, and vehicle sensor signals via CAN. We designed a self-reportable HMI application to annotate driver emotional states, which are used as labels for driver emotion recognition. This application allows the driver to select the emotion most similar to their current emotional state among representative emotions. Thus, emotion labels are collected as self-reported emotion labels and are no longer inferred by others. In addition, report requests were made continuously and repeatedly over a long-term period so that the driver's bias is not reflected in the self-reported emotion labels. Since safety is the most important factor in real-world driving, we focused on minimizing drivers' behavioral and cognitive disturbances in all processes, including sensor selection, flow, and GUI design, while designing the data collection system.

According to the results of the data collection experiment in real-world driving, more than 122 h, 4446 km of driving, and 787 GB of data were collected without any accidents. Through statistical analysis of the collected data, the imbalance and report characteristics of self-reported emotion labels were identified, and default and inducible emotions were distinguished. Based on the statistical hypothesis test, the null hypothesis (*H*0) that there is no difference according to the self-reported emotion label was rejected for all collected structured data. The significance of the difference varied by driver, suggesting the need for personalization of driver emotion recognition. We compared the state-of-the-art face detectors using the collected front images and presented the most suitable face detector and performance evaluation metric for driver face detection. Finally, we conducted a personalized driver emotion recognition study using the collected images and biophysiological and CAN data. The evaluation results of the single-modal and multimodal models using the above data suggested that multimodal data and personalization are necessary for driver emotion recognition.

Although several case studies were conducted by collecting a large-scale dataset using the proposed system design, enabling safe data collection in real-world driving, the dataset was collected by few drivers over a long period. Because the number of drivers is insufficient to generalize the case studies, these may be treated as particular cases. Based on further collected data, we will continue to study the generalization performance of multimodal personalized driver emotion recognition.

**Author Contributions:** Conceptualization, S.L. (Sejoon Lim) and S.L. (Sangho Lee); methodology, G.O., E.J. and R.C.K.; software, G.O. and E.J.; validation, J.H.Y., S.L. (Sejoon Lim) and S.L. (Sangho Lee); formal analysis, G.O.; investigation, G.O., J.H.Y. and S.L. (Sejoon Lim); resources, G.O., J.H.Y. and S.H.; data curation, G.O., E.J., R.C.K. and S.H.; writing—original draft preparation, G.O.; writing—review and editing, J.H.Y. and S.L. (Sejoon Lim); visualization, G.O. and E.J.; supervision, S.L. (Sejoon Lim); project administration, S.L. (Sejoon Lim) and S.L. (Sangho Lee); and funding acquisition, J.H.Y., S.L. (Sejoon Lim) and S.L. (Sangho Lee). All authors have read and agreed to the published version of the manuscript.

**Funding:** This research was supported by the Hyundai Motor Group, the Knowledge Service Industry Core Technology Development Program funded by the Ministry of Trade, Industry, and Energy of Korea (No. 20003519), the Basic Science Research Program of the National Research Foundation of Korea funded by the Ministry of Science, ICT, and Future Planning (No. 2021R1A2C1005433), the BK21 Program through the National Research Foundation of Korea (NRF) funded by the Ministry of Education (No. 5199990814084), and the Korea Institute of Police Technology (KIPoT) grant funded by the Korea government (KNPA) (No. 092021C26S03000, Development of infrastructure information integration and management technologies for real time traffic safety facility operation).

**Institutional Review Board Statement:** The study was conducted according to the guidelines of the Declaration of Helsinki and approved by the Institutional Review Board of Kookmin University (protocol code: KMU-202104-HR-264; date of approval: 2 June 2021).

**Informed Consent Statement:** Informed consent was obtained from all subjects involved in the study.

**Acknowledgments:** The authors thank Junghwan Ryu, Taesan Kim, and Joonghoo Park for building a vehicle with a data collection system and Youngdong Kwon and Myengkyu Lee for setting representative emotions and GUI design.

**Conflicts of Interest:** The authors declare no conflict of interest.

#### **Abbreviations**

The following abbreviations are used in this manuscript:


#### **Appendix A**

This part describes the terminologies and variables used in the main text. Table A1 contains their details.


**Table A1.** Definition of terminologies and variables used in the main text.

#### **References**


## *Article* **Identification of Video Game Addiction Using Heart-Rate Variability Parameters**

**Jung-Yong Kim <sup>1</sup>, Hea-Sol Kim <sup>1</sup>, Dong-Joon Kim <sup>2,</sup>\*, Sung-Kyun Im <sup>2</sup> and Mi-Sook Kim <sup>3</sup>**


**Abstract:** The purpose of this study is to determine heart rate variability (HRV) parameters that can quantitatively characterize game addiction by using electrocardiograms (ECGs). Twenty-three subjects were classified into two groups prior to the experiment, 11 game-addicted subjects and 12 non-addicted subjects, using questionnaires (CIUS and IAT). Various HRV parameters were tested to identify the addicted subjects. The subjects played the *League of Legends* game for 30–40 min. The experimenter measured ECG during the game at various window sizes and specific events. Moreover, correlation and factor analyses were used to find the most effective parameters. A logistic regression equation was formed to calculate the accuracy in diagnosing addicted and non-addicted subjects. The most accurate set of parameters was found to be pNNI20, RMSSD, and LF in the 30 s after the "being killed" event. The logistic regression analysis provided an accuracy of 69.3% to 70.3%. AUC values in this study ranged from 0.654 to 0.677. This study can be noted as an exploratory step in the quantification of game addiction based on the stress response that could be used as an objective diagnostic method in the future.

**Keywords:** HRV parameter; game addiction; *League of Legends*; stress response; sensitivity; specificity; logistic regression

#### **1. Introduction**

The game industry is growing, with a market size of more than US \$123.4 billion worldwide. South Korea is ranked fifth in the world, with 6.7% of the world market share [1], and accounts for 55.8% of Korea's content industry exports in 2018 [2]. Ryu and Lee [3] stated that such booming of the game industry has a positive influence on society, including stress management, the realization of the ideal self, and physical ability improvement. In particular, in the current COVID-19 environment, online games are recognized as a complementary means of social distancing [4,5]. However, Internet game players are not protected from becoming addicted to gaming. This addiction problem could adversely affect personal life as well as family and society, and has become a serious public health issue. Byun and Lee [6] found that Internet addiction is closely related to the increased frequency and duration of Internet use, and leads to anxiety, fear, depression, and obsessive-compulsive disorder, with adolescents being vulnerable target users. Koepp et al. [7] observed that dopamine is secreted from the brains of addicted adolescents with a similar pattern to that of drug addiction.

Adverse effects on adolescents have been studied by many authors [8–10]. In particular, it is notable that the most influential factor causing Internet addiction is stress due to excessive competition, and that adolescents exposed to excessive stress sources were readily immersed in the Internet [11]. Adolescents often experience alienation or loneliness when they are addicted to Internet games [12]. They relieve the stress related to daily life and loneliness by using Internet games, which are easily accessible [13]. The higher the level of stress, the more they tended to fall into game addiction [14]. According to a study by Lee [15], game addiction prevents adolescents from coping with stress sources properly, causing various psychological problems and stress responses. Likewise, the literature indicates that Internet game addiction and mental stress are closely related.

**Citation:** Kim, J.-Y.; Kim, H.-S.; Kim, D.-J.; Im, S.-K.; Kim, M.-S. Identification of Video Game Addiction Using Heart-Rate Variability Parameters. *Sensors* **2021**, *21*, 4683. https://doi.org/10.3390/s21144683

Academic Editors: Yvonne Tran and Ki H. Chon

Received: 15 April 2021 Accepted: 28 June 2021 Published: 8 July 2021

**Publisher's Note:** MDPI stays neutral with regard to jurisdictional claims in published maps and institutional affiliations.

**Copyright:** © 2021 by the authors. Licensee MDPI, Basel, Switzerland. This article is an open access article distributed under the terms and conditions of the Creative Commons Attribution (CC BY) license (https://creativecommons.org/licenses/by/4.0/).

In recent years, heart rate variability (HRV) has been used in many studies to evaluate stress levels [16–19]. Since stress affects the autonomic nervous system (ANS), HRV controlled by the ANS is often referenced as a stress indicator. A number of studies on HRV parameters have been conducted in this regard. Taelman et al. [20] and Vuksanović and Gal [21] observed that the mean of the NN interval, which is often expressed as the RR interval, and the standard deviation of all NN intervals (SDNN) decreased significantly under mental stress. Taelman et al. [20] and Tharion et al. [22] showed that pNN50 (percentage of successive RR intervals that differ by more than 50 ms) is significantly decreased under stress. Papousek et al. [23] and Traina et al. [24] reported an increase in the low-frequency power range (LF), a decrease in the high-frequency power range (HF), and a significant increase in the LF/HF ratio when subjects experience stress. Park et al. [25] tested a newly developed measuring system to examine electrocardiograms (ECGs) and found a consistent increase in HR and SDNN as the level of addiction increased. At the same time, the LF and LF/HF parameters showed an obvious increasing trend at a high level of addiction.
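The time-domain parameters mentioned above can be computed from a series of NN intervals as in the following sketch. The function name and the example intervals are hypothetical, and SDNN is computed here as the sample standard deviation (dividing by *n* − 1), which is one common convention and an assumption about this study.

```python
import math


def hrv_time_domain(nn_ms):
    """Time-domain HRV parameters from NN (RR) intervals in milliseconds."""
    n = len(nn_ms)
    mean_nn = sum(nn_ms) / n
    # SDNN: standard deviation of all NN intervals (sample form, n - 1).
    sdnn = math.sqrt(sum((x - mean_nn) ** 2 for x in nn_ms) / (n - 1))
    # Differences between successive NN intervals.
    diffs = [b - a for a, b in zip(nn_ms, nn_ms[1:])]
    # RMSSD: root mean square of successive differences.
    rmssd = math.sqrt(sum(d * d for d in diffs) / len(diffs))
    # pNN50: percentage of successive differences larger than 50 ms.
    pnn50 = 100.0 * sum(abs(d) > 50 for d in diffs) / len(diffs)
    return {"mean_nn": mean_nn, "sdnn": sdnn, "rmssd": rmssd, "pnn50": pnn50}


# Hypothetical NN intervals in ms.
nn = [800, 860, 790, 810, 800]
params = hrv_time_domain(nn)
```

In this example two of the four successive differences exceed 50 ms, so pNN50 is 50%.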

On the other hand, Hafeez et al. [26] used EEGs to classify game addicts and non-addicts using cluster analysis and pattern discrimination. They introduced a statistical method to quantify the addiction phenomenon, and Hafeez et al. [27] and Kim et al. [28] identified the theta and theta/alpha parameters of the right occipital region as the discriminating variables between addicts and non-addicts. Thus, the attempt to quantify the particular characteristics of addiction is an ongoing topic for researchers. If such a numerically quantifiable approach succeeds and assists physicians in identifying an addicted patient, they will be able to treat the patient more efficiently and objectively. Therefore, in this study, the authors search for a quantifiable indicator of addiction in the ECG response by investigating various HRV parameters. The purpose of this study is to extract quantitative HRV parameters that characterize the particular stress response of game addicts. To achieve this research goal, an exhaustive approach was taken by testing all the candidate parameters collected using window sizes of 30, 60, 90, and 120 s.

#### **2. Methods**

#### *2.1. Subjects*

A total of 23 male students participated in the experiment. The mean age was 23 years (±3 years). Eleven participants were addicted, and 12 were non-addicted. They were categorized using the Compulsive Internet Use Scale (CIUS) by Meerkerk et al. [29] and the Internet Addiction Test (IAT) by Young and De Abreu [30]. Based on the CIUS, subjects with scores of 2.5 or higher were categorized as addicted, and those with scores less than 1.5 were categorized as non-addicted [31]. An IAT score of 50 or higher has been used to classify the game-addicted by many researchers [10,32–35]. In this study, a subject was categorized as a game addict only when the subject met both the IAT and CIUS standards. For non-addicted subjects, an IAT score of 40 or lower was required. Initially, 14 addicted subjects and 14 non-addicted subjects were selected; 3 addicted subjects and 2 non-addicted subjects were discarded due to a technical error in the measurement system. To control the confounding effect of gender, only male participants were tested. Alcohol consumption was prohibited for 24 h before the start of the experiment, and smoking and coffee consumption were prohibited for 1 h before the start of the experiment. A fee was paid to the participants. The experiment was conducted in accordance with the regulations under consideration by the Institutional Review Board of Hanyang University in the Republic of Korea (IRB approval number: HYU-2019-08-004-1).
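The grouping rule can be written as a small function, sketched below. The function name and the "excluded" label are hypothetical, and requiring both the CIUS and IAT criteria for the non-addicted group is an assumption; the text states the joint requirement explicitly only for the addicted group.

```python
def classify_subject(cius, iat):
    """Assign a subject to a group from CIUS and IAT scores."""
    if cius >= 2.5 and iat >= 50:
        return "addicted"
    # Assumption: both criteria must also hold for the non-addicted group.
    if cius < 1.5 and iat <= 40:
        return "non-addicted"
    return "excluded"  # hypothetical label for subjects meeting neither rule


# Hypothetical scores.
groups = [classify_subject(2.8, 62), classify_subject(1.0, 30), classify_subject(2.8, 45)]
```

Leaving a gap between the two cut-offs keeps borderline subjects out of both groups, which sharpens the contrast between them.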

#### *2.2. Apparatus*

The questionnaires used to categorize subjects into two groups prior to the experiment were the CIUS by Meerkerk et al. [29] and the IAT by Young and De Abreu [30,36].

*League of Legends* by Riot Games Inc. (Los Angeles, CA, USA) was chosen for the experiment. This game was one of the most frequently played games among internet game players [37], and the frequent battles in the game made players experience a simulated life and death situation associated with probable stress reactions.

For data collection, an auxiliary channel of QEEG-64FX by LAXTHA Inc. (Daejeon, Republic of Korea) was used for ECG measurements (Figure 1). A data collection program called Telescan was used. The data sampling rate was set to 500 Hz. The experiment was conducted in a room equipped with a computer, a table, and a chair, where other external stimuli were restrained.

**Figure 1.** ECG measurement equipment. (**a**) Top view of experimental set-up, (**b**) The experimental scene.

#### *2.3. Experimental Design*

The experiment was designed to test whether HRV parameters could differentiate subjects into two groups: addicted and non-addicted. A between-subjects design was used. The independent variable was the addiction status of the group, and the dependent variables were 14 parameters, including 7 time-domain variables and 7 frequency-domain variables. The time-domain parameters were NN interval average (RR interval average), SDNN, SDSD, pNNI50, pNNI20, RMSSD, and heart rate average (Table 1). The frequency-domain parameters were LF, HF, LF/HF ratio, LFnu, HFnu, total power, and VLF (Table 2). This study observed specific events during gameplay, including a "killed event", when a player's character was killed by an opponent, and a "killing event", when the player killed an opponent. Data were collected in windows of 30 s, 60 s, 90 s, and 120 s after these events to account for a possible delay in the response.
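The event-locked windowing described above can be illustrated with a minimal sketch. It assumes that each beat has a timestamp in seconds and an associated RR interval in milliseconds; the function names and the synthetic data are hypothetical and are not taken from the authors' analysis code.

```python
from bisect import bisect_left

# Window lengths (seconds) named in the experimental design.
WINDOW_SIZES_S = (30, 60, 90, 120)


def rr_in_window(beat_times_s, rr_ms, event_time_s, window_s):
    """Return RR intervals of beats in [event_time_s, event_time_s + window_s).

    beat_times_s must be sorted in ascending order (hypothetical input format).
    """
    start = bisect_left(beat_times_s, event_time_s)
    stop = bisect_left(beat_times_s, event_time_s + window_s)
    return rr_ms[start:stop]


def windows_after_event(beat_times_s, rr_ms, event_time_s):
    """Collect the post-event RR intervals for every window size."""
    return {
        w: rr_in_window(beat_times_s, rr_ms, event_time_s, w)
        for w in WINDOW_SIZES_S
    }


# Synthetic example: one beat per second for 200 s, event at t = 50 s.
beat_times = [float(t) for t in range(200)]
rr = [1000.0] * 200
windows = windows_after_event(beat_times, rr, 50.0)
```

Each window can then be passed to whichever HRV parameter computation is being tested, which keeps the comparison across the four window sizes uniform.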

#### *2.4. Procedure*

The positive electrode was placed in the V1 location (between right ribs 3 and 4), and the negative electrode was placed in the left infraclavicular fossa according to the standard limb guidance method [39]. The experimental procedure was briefly explained to the subject, and the ECG sensors were attached and tested to ensure that stable signals were obtained for 1 min while the subjects were relaxing. A "normal game", which is a practice game that does not affect the player's score, was played for familiarization; a "ranked game", which is a competing game affecting the player's score, was played for 30–40 min. For players' immersion in the game, the ranked game was played based on the individual skill level. Subjects played a "normal game" once and a "ranked game" twice, while the ECG was obtained. Subjects were not informed about the addiction test score; thus, they did not know whether they were categorized in the addicted group or not. The detailed experimental procedure is shown in Figure 2.

**Table 1.** Time-domain variables for heart rate variability [38].


**Table 2.** Frequency-domain variables of heart rate variability [38].


**Figure 2.** Experimental process.

#### *2.5. Data Analysis*

The data were analyzed in batches using Python, and time series analysis and frequency analysis were performed at the same time. The parameters used for time series analysis were derived from the R peaks detected with the Christov ECG R-peak segmentation algorithm. The extracted parameters were NN interval average, SDNN, RMSSD, pNNI50, pNNI20, SDSD, and heart rate average. The signal was also extracted and transformed into frequency parameters using the fast Fourier transform. Welch's periodogram was applied to estimate the spectral properties of the HRV signals, using a Hanning window. VLF (power in the very-low-frequency range, 0.0033–0.04 Hz), LF (power in the low-frequency range, 0.04–0.15 Hz), HF (power in the high-frequency range, 0.15–0.4 Hz), and total power (power in all frequency ranges, ≤0.4 Hz) were obtained by summing the power in the relevant frequency range of the spectrum. Based on these power values, the values of LF/HF ratio, LFnu, and HFnu were calculated.
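As an illustration of the parameters listed above, the following sketch computes them from a list of NN intervals and from an already estimated power spectrum. It is not the authors' code. The definitions follow common HRV conventions; the use of the sample standard deviation, the normalization LFnu = 100 × LF/(LF + HF), and the averaging of instantaneous heart rate are assumptions, because Tables 1 and 2 are not reproduced here.

```python
import math
import statistics


def time_domain_hrv(nn_ms):
    """Time-domain HRV parameters from NN intervals given in milliseconds."""
    diffs = [b - a for a, b in zip(nn_ms, nn_ms[1:])]  # successive differences
    return {
        "mean_nn": statistics.fmean(nn_ms),
        "sdnn": statistics.stdev(nn_ms),    # sample SD (assumed convention)
        "sdsd": statistics.stdev(diffs),    # sample SD of the differences
        "rmssd": math.sqrt(statistics.fmean([d * d for d in diffs])),
        "pnni50": 100.0 * sum(abs(d) > 50 for d in diffs) / len(diffs),
        "pnni20": 100.0 * sum(abs(d) > 20 for d in diffs) / len(diffs),
        # Mean of instantaneous heart rates in beats per minute (assumption).
        "mean_hr": statistics.fmean([60000.0 / nn for nn in nn_ms]),
    }


def band_power(freqs_hz, psd, low_hz, high_hz):
    """Sum the spectral power of all bins with low_hz <= f < high_hz."""
    return sum(p for f, p in zip(freqs_hz, psd) if low_hz <= f < high_hz)


def frequency_domain_hrv(freqs_hz, psd):
    """Band powers and derived ratios from a precomputed spectrum."""
    vlf = band_power(freqs_hz, psd, 0.0033, 0.04)
    lf = band_power(freqs_hz, psd, 0.04, 0.15)
    hf = band_power(freqs_hz, psd, 0.15, 0.4)
    total = sum(p for f, p in zip(freqs_hz, psd) if f <= 0.4)
    return {
        "vlf": vlf,
        "lf": lf,
        "hf": hf,
        "total_power": total,
        "lf_hf": lf / hf,
        "lfnu": 100.0 * lf / (lf + hf),  # assumed normalization
        "hfnu": 100.0 * hf / (lf + hf),  # assumed normalization
    }


# Hypothetical inputs for illustration only.
td = time_domain_hrv([800, 850, 790, 820, 810])
fd = frequency_domain_hrv([0.05, 0.10, 0.20, 0.30], [2.0, 2.0, 1.0, 1.0])
```

In practice the spectrum passed to `frequency_domain_hrv` would come from a Welch estimate of an evenly resampled NN series; that step is omitted here to keep the sketch dependency-free.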

Normality was tested using the Kolmogorov–Smirnov test for each data set. Data sets with a low normality value were examined graphically to ensure an adequate level of normality. During this process, outliers were treated. The *t*-test was performed (*p* < 0.1) to find the parameters and window size that statistically differentiated the two groups: the addicted and non-addicted. The statistical analysis was an exhaustive process used to identify the most effective set of parameters and the window size. A correlation analysis was also performed to determine the redundancy of parameters, and a factor analysis was performed to choose the main parameters representing the characteristics of each group. Finally, a logistic regression analysis was conducted to test the sensitivity and specificity of the statistical model in identifying addicted or non-addicted subjects based on the current experimental data. The analysis process is illustrated in Figure 3. Statistical analysis was performed using SPSS Statistics 24.

**Figure 3.** Data analysis process.

#### **3. Results**

An elimination process was used to sort out the best combination of parameters out of 14 parameters from 4 window sizes through statistical analyses.

#### *3.1. The t-Test Results between Groups by Window Size*

There were no significant differences in average parameter values between the addicted and non-addicted groups for any of the window sizes when the entire experiment was considered (*p* > 0.1).

#### *3.2. The t-Test Results between Groups after Specific Event*

There was no significant difference in HRV parameters between groups for window sizes of 30 s, 60 s, 90 s, and 120 s after "killing events" (*p* > 0.1). However, as shown in Tables 3 and 4, the HRV parameters measured for window sizes of 30 s and 60 s after "killed events" showed a significant difference in some parameters between the two groups. In particular, pNNI20 and LF showed a significant difference (*p* < 0.05), and a marginally significant difference was observed for SDSD, RMSSD, and total power (*p* < 0.1).

#### *3.3. Correlation Analysis and Factor Analysis with HRV Parameters*

A correlation analysis was performed to examine the redundancy of the parameters in differentiating between the two groups. LF and pNNI20, both of which had significant *p*-values (*p* < 0.05) in the *t*-test, showed a low correlation coefficient (0.264); thus, both could be used together to improve statistical power in differentiating the two groups. On the other hand, SDSD and RMSSD showed a correlation coefficient of 1.000, and total power and LF showed a coefficient of 0.958. Thus, only one parameter from each of these highly correlated pairs was used to build the statistical model. Therefore, the correlation analysis suggested that the [pNNI20, LF, SDSD] or [pNNI20, LF, RMSSD] parameter set could be the best combination of parameters with the least redundancy.
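The near-perfect correlation between SDSD and RMSSD is expected from their definitions. As a brief worked derivation, assume $n$ successive differences $d_i = NN_{i+1} - NN_i$ and the population form of the standard deviation (an assumption; the sample form differs only by the factor $n/(n-1)$):

$$
\mathrm{RMSSD}^2 = \frac{1}{n}\sum_{i=1}^{n} d_i^2, \qquad
\mathrm{SDSD}^2 = \frac{1}{n}\sum_{i=1}^{n} \left(d_i - \bar{d}\right)^2 = \mathrm{RMSSD}^2 - \bar{d}^{\,2}, \qquad
\bar{d} = \frac{NN_{n+1} - NN_1}{n}.
$$

Because the sum of successive differences telescopes, $\bar{d}$ depends only on the first and last NN intervals and becomes very small for windows containing many beats, so that $\mathrm{SDSD} \approx \mathrm{RMSSD}$. This is consistent with the two parameters being treated as interchangeable in the model.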

Factor analysis was also performed to examine whether the selected parameters covered various factors of the data (Figure 4). The parameters with high eigenvalues for Factor 1 were RMSSD, SDSD, pNNI50, and pNNI20, and the parameters with high eigenvalues for Factor 2 were LF, total power, and SDNN. That is, the [pNNI20, LF, SDSD] and [pNNI20, LF, RMSSD] parameter sets from the correlation analysis (Table 5) were found to have the highest eigenvalues for both Factor 1 and Factor 2 (Table 6). Therefore, the final combination of parameters for statistical modeling was [pNNI20, LF, RMSSD] or [pNNI20, LF, SDSD]. In logistic regression modeling, [pNNI20, LF, RMSSD] was arbitrarily selected to test the model performance in this study because RMSSD and SDSD were highly correlated with each other (r = 1.000).

**Figure 4.** Factor analysis results.

**Table 3.** The *t*-test results for data from 30 s window size after "killed event"; mean (±standard deviation).


\*\* *p* < 0.05, \* *p* < 0.1.


**Table 4.** The *t*-test results for data from 60 s window size after "killed event"; mean (±standard deviation).

\* *p* < 0.1.

**Table 5.** Correlation coefficients among heart rate variability parameters.


\*\* Correlation is significant at the 0.01 level (both sides).

**Table 6.** The eigenvalues from factor analysis.


#### *3.4. Logistic Regression Models*

Logistic regression models were developed using the selected parameters. A total of 15 mathematical equations were designed to test the maximum sensitivity and specificity of the parameters using natural logarithms and squares. In terms of identifying the addicted group, the sensitivity ranged from 0.324 to 0.400, and the specificity ranged from 0.828 to 0.922. The overall accuracy ranged from 67.7% to 70.3%. The model with the highest specificity of 0.922 was constructed using pNNI20, ln(RMSSD), and LF. The model with the highest sensitivity of 0.400 was obtained using ln(pNNI20), (RMSSD)<sup>2</sup>, and ln(LF). The model with the highest overall accuracy of 70.3% was obtained using pNNI20, ln(RMSSD), and LF. The second-highest overall accuracy model (69.7%) was obtained using ln(pNNI20), ln(RMSSD), and (LF)<sup>2</sup>. The results are summarized in Table 7.


**Table 7.** Summary results of four logistic regression models with the highest accuracy.

#### *3.5. Characteristics of Distributions Affecting the Sensitivity and Specificity*

The true positive rate (sensitivity) was less than 0.4 in the above analysis, which is not good enough to support a diagnosis of addiction for medical treatment. Such a relatively low sensitivity could partly result from the logistic regression model being fitted to maximize the total accuracy. To show the characteristics of the probability distributions of the data, Figures 5–7 are plotted under the assumption of a normal distribution. As shown, there is a substantial overlap between the distributions, which could make either sensitivity or specificity low. From these observations, the criterion beta used for decision-making seemed to be biased toward a conservative standard rather than a liberal one, considering that the specificity was much higher than the sensitivity. For example, for Model 1, with a maximum accuracy of 72.3%, the beta value associated with the cut-off point was set to 0.523, and the sensitivity and specificity were computed as 0.324 and 0.953, respectively. If a different cut-off value were used, such as 0.372 in Model 2, the sensitivity and specificity would be 0.656 and 0.703, respectively, with 67.3% accuracy.
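The trade-off between sensitivity and specificity as the cut-off moves can be sketched as follows. The predicted probabilities and labels are hypothetical values chosen for illustration, not the study data; only the two cut-off values (0.523 and 0.372) are taken from the text.

```python
def classification_rates(probs, labels, cutoff):
    """Sensitivity, specificity, and accuracy at a given probability cut-off.

    labels: 1 = addicted, 0 = non-addicted.
    A subject is predicted as addicted when the probability is >= cutoff.
    """
    tp = sum(1 for p, y in zip(probs, labels) if p >= cutoff and y == 1)
    fn = sum(1 for p, y in zip(probs, labels) if p < cutoff and y == 1)
    tn = sum(1 for p, y in zip(probs, labels) if p < cutoff and y == 0)
    fp = sum(1 for p, y in zip(probs, labels) if p >= cutoff and y == 0)
    return {
        "sensitivity": tp / (tp + fn),
        "specificity": tn / (tn + fp),
        "accuracy": (tp + tn) / len(labels),
    }


# Hypothetical model outputs: four addicted and four non-addicted subjects.
probs = [0.90, 0.60, 0.45, 0.40, 0.55, 0.38, 0.30, 0.20]
labels = [1, 1, 1, 1, 0, 0, 0, 0]

conservative = classification_rates(probs, labels, 0.523)
liberal = classification_rates(probs, labels, 0.372)
```

With these made-up values, lowering the cut-off from 0.523 to 0.372 raises the sensitivity and lowers the specificity, which is the same direction of change reported in the paragraph above.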

#### *3.6. Area under the Curve (AUC) Values*

Figure 8 shows the ROC curves of the four models. The AUC values were 0.677 for Model 1, 0.655 for Model 2, 0.673 for Model 3, and 0.654 for Model 4. According to Hosmer and Lemeshow [40], models having an AUC value of 0.5 or less have no discriminating power. A model can be considered acceptable only if the AUC value is between 0.7 and 0.8, and a model has excellent discriminating power if the AUC value is between 0.8 and 0.9. Thus, the AUC values of the current logistic regression models are close to the acceptable level, but further refinement is required for the models to be acceptable.
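For readers who wish to reproduce this kind of evaluation, the AUC can be computed directly as the probability that a randomly chosen addicted subject receives a higher predicted score than a randomly chosen non-addicted subject. The sketch below uses that rank-based formulation with hypothetical scores. The category labels follow the thresholds quoted above, while the handling of exact boundary values and of the 0.5 to 0.7 range is an assumption.

```python
def auc(probs, labels):
    """Rank-based AUC: fraction of (positive, negative) pairs ordered correctly.

    Ties between a positive and a negative score count as half a correct pair.
    """
    pos = [p for p, y in zip(probs, labels) if y == 1]
    neg = [p for p, y in zip(probs, labels) if y == 0]
    correct = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return correct / (len(pos) * len(neg))


def discrimination_level(auc_value):
    """Map an AUC value to the categories quoted from Hosmer and Lemeshow."""
    if auc_value <= 0.5:
        return "no discrimination"
    if auc_value < 0.7:
        return "below acceptable"  # label for this range is an assumption
    if auc_value < 0.8:
        return "acceptable"
    if auc_value <= 0.9:
        return "excellent"
    return "above 0.9 (not characterized in the text)"


# Hypothetical scores: four addicted (1) and four non-addicted (0) subjects.
probs = [0.90, 0.60, 0.45, 0.40, 0.55, 0.38, 0.30, 0.20]
labels = [1, 1, 1, 1, 0, 0, 0, 0]
example_auc = auc(probs, labels)
```

Applied to the AUC values reported for the four models (0.654 to 0.677), `discrimination_level` returns the "below acceptable" category.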

**Figure 5.** Probability density distribution of RMSSD parameter.

**Figure 6.** Probability density distribution of LF parameter.

**Figure 7.** Probability density distribution of pNNI20 parameter.

**Figure 8.** Receiver operating characteristic (ROC) curves of four representative models.

#### **4. Discussion**

The study showed that "being killed" in a virtual situation generated a greater signal response among the addicted subjects than non-addicted subjects. Klimt et al. [41] mentioned that a shift in self-perception would occur while enjoying the game and identifying oneself with the game character or when playing games experiencing flow or psychological mastery. Turkay and Kinzer [42] stated that the customization process of avatars by players could greatly influence players to identify themselves as game characters. Therefore, it is reasonable to think that such an affective attachment with an avatar could psychologically influence the players, and this phenomenon could be even more severe among addicted subjects than non-addicted ones.

Regarding the model building, three different statistical methods were used to select the parameters to build the best logistic regression model. Through the *t*-test, the pNNI20 and LF parameters were selected because they showed the most significant results (*p* < 0.05) in differentiating the two groups 30 s after the "being killed" event. This means that both time-domain and frequency-domain parameters could be effective in statistically discriminating between the two groups. The total power parameter showed a marginally significant *p*-value (<0.072); however, to avoid redundancy, it was not selected for the final logistic model because it was highly correlated with the LF parameter (r = 0.958). In addition, the RMSSD (or SDSD) parameter was used for the logistic regression model because it showed the highest eigenvalue (0.86) for Factor 1 in the factor analysis. The LF parameter, which had a significant *p*-value in the *t*-test, also showed the highest eigenvalue (0.715) for Factor 2 and was therefore used in the final logistic regression model.

The final parameters selected in this study were found to be associated with the stress response based on previous studies. Bernardi et al. [43] evaluated HRV parameters under the mentally stressful situation of a subject performing arithmetic while speaking or reading, and they observed the increased power of LF when subjects were hurrying to perform the calculation task. Huang et al. [44] found that RMSSD and the combination of various variables had a positive correlation with mental fatigue induced by mental stress. According to a study by Jang et al. [45], RMSSD was also found to have a marginal correlation with tension (r = 0.268, *p* = 0.039), depression (r = 0.356, *p* = 0.005), fatigue (r = 0.259, *p* = 0.041), and frustration (r = 0.304, *p* = 0.018). Lee et al. [46] observed changes in HRV during physical and mental stress in patients with depression, and they reported a significant increase in RMSSD during the stress period compared with the rest period. Mallinani et al. [47] explained that increased sympathetic activity could be functionally characterized by an increase in the LF component in terms of LF–HF balance. Kim et al. [48] reviewed the function of HRV parameters and concluded that low parasympathetic activity was frequently related to a decrease in HF and an increase in LF.

To investigate the efficacy of the regression model in diagnosing game addiction patients, the AUC values were calculated and compared with the reference values. The computed AUC values in this study ranged from 0.654 to 0.677, which is considered insufficient accuracy for field applications. This indicated that the increased stress response of the addicted subjects during a "killed event" was statistically meaningful, but it might not fully reflect the symptoms of addiction that the players were experiencing. Regarding the sensitivity and specificity, the sensitivity ranged from 0.324 to 0.400, and the specificity ranged from 0.828 to 0.922 based on the logistic regression model with the default cut-off point used as the decision criterion. However, these values could change if different cut-off points were used. For now, the AUC value was less than 0.7, so only less-than-accurate decision-making can be expected from the model. Therefore, it is necessary to test the model performance under various experimental conditions. At any rate, it is important to understand the nature of HRV parameters among addicted game players, who were highly responsive to stressful stimuli; this is worth investigating further for the quantification of addictive symptoms during game playing.

#### **5. Conclusions**

In this study, the difference in HRV parameters between the addicted and non-addicted groups was measured during game playing, and it was found that pNNI20, RMSSD, and LF sensitively reflected the difference in stress response for a window size of 30 s after a "being killed" event. To evaluate how well the model distinguished game-addicted from non-addicted subjects, the AUC score was computed and found to be below the acceptable level of accuracy. As shown in this study, quantifying the psychophysiological response to addictive gaming is a challenging task, but it is worth pursuing for the prevention and rehabilitation of addicted patients in the future. For further study, more subjects of various types need to be tested for a better representation of the addiction symptoms. Additional mathematical exploration using artificial intelligence techniques could be another option for analyzing bio-information with a high level of variability and probable irregularity. It would also be intriguing to examine and compare the HRV parameters with other psychophysiological signals to identify the unknown patterns of game addiction.

**Author Contributions:** J.-Y.K., H.-S.K., D.-J.K., S.-K.I. and M.-S.K. drafted parts of the manuscript and reviewed and edited the full manuscript. All authors have read and agreed to the published version of the manuscript.

**Funding:** This research was funded by the Ministry of Education and the Ministry of Science and ICT, grant numbers No.NRF-2018R1D1A1B07050786 and No. 2020-0-01343.

**Institutional Review Board Statement:** This experiment was conducted in accordance with the regulations under consideration by the Institutional Review Board of Hanyang University in the Republic of Korea (IRB approval number: HYU-2019-08-004-1).

**Informed Consent Statement:** Informed consent was obtained from all subjects involved in the study.

**Data Availability Statement:** Not applicable.

**Acknowledgments:** This research was partly supported by a National Research Foundation of Korea (NRF) grant funded by the Ministry of Education (No.NRF-2018R1D1A1B07050786, Development of the algorithm to identify the EEG pattern of game addicts) and the Institute of Information & Communications Technology Planning & Evaluation (IITP) grant funded by the Korean government (MSIT) (No. 2020-0-01343, Artificial Intelligence Convergence Research Center (Hanyang University ERICA)).

**Conflicts of Interest:** The authors declare no conflict of interest.

#### **References**


statement from the Councils on Cardiovascular Nursing, Clinical Cardiology, and Cardiovascular Disease in the Young: Endorsed by the International Society of Computerized Electrocardiology and the American Association of Critical-Care Nurses. *Circulation* **2004**, *110*, 2721–2746.


## *Article* **Individual's Social Perception of Virtual Avatars Embodied with Their Habitual Facial Expressions and Facial Appearance**

**Sung Park <sup>1,</sup>\*, Si Pyoung Kim <sup>2</sup> and Mincheol Whang <sup>3</sup>**


**Abstract:** With the prevalence of virtual avatars and the recent emergence of metaverse technology, there has been an increase in users who express their identity through an avatar. The research community has focused on improving the realistic expressions and non-verbal communication channels of virtual characters to create a more customized experience. However, there is a lack of understanding of how avatars can embody a user's signature expressions (i.e., user's habitual facial expressions and facial appearance) that would provide an individualized experience. Our study focused on identifying elements that may affect the user's social perception (similarity, familiarity, attraction, liking, and involvement) of customized virtual avatars engineered considering the user's facial characteristics. We evaluated the participant's subjective appraisal of avatars that embodied the participant's habitual facial expressions or facial appearance. Results indicated that participants felt that the avatar that embodied their habitual expressions was more similar to them than the avatar that did not. Furthermore, participants felt that the avatar that embodied their appearance was more familiar than the avatar that did not. Designers should be mindful about how people perceive individuated virtual avatars in order to accurately represent the user's identity and help users relate to their avatar.

**Keywords:** virtual avatar; virtual human; virtual character; embodied conversational agent; social interaction; empathy

#### **1. Introduction**

Humans communicate with others via verbal and non-verbal communication. Through dyadic social interaction, people elicit the other's intention and emotion [1]. Facial expressions represent non-verbal communication channels [2]. The face is the most recognizable region and has unique characteristics that represent an individual [3]. Humans are born with an innate capability to sense and perceive the most important person (i.e., mother) at the early stage of life. Infants are known to discriminate facial features starting at two months after birth [4], and they also prefer facial features over other shapes and forms [5]. Hiding one's face implies the concealment of one's identity. For example, covering a face with a mask may be considered negative social behavior [6].

The rapid advancement of VR (Virtual Reality) technology facilitates the introduction of expressive services tailored to the metaverse. Virtual experiences using HMD (Head-Mounted Display) are now prevalent in households due to video games. In addition, the AR (Augmented Reality) industry is growing through mobile platforms with the availability of engaging entertainment services. Naturally, virtual avatars, a conduit that connects the virtual world to the user, have gained much attention. Many users are interested in projecting or extending their identities through avatars in the internet's social landscape.

**Citation:** Park, S.; Kim, S.P.; Whang, M. Individual's Social Perception of Virtual Avatars Embodied with Their Habitual Facial Expressions and Facial Appearance. *Sensors* **2021**, *21*, 5986. https://doi.org/10.3390/s21175986

Academic Editor: Stefanos Kollias

Received: 20 August 2021 Accepted: 4 September 2021 Published: 6 September 2021

**Publisher's Note:** MDPI stays neutral with regard to jurisdictional claims in published maps and institutional affiliations.

**Copyright:** © 2021 by the authors. Licensee MDPI, Basel, Switzerland. This article is an open access article distributed under the terms and conditions of the Creative Commons Attribution (CC BY) license (https://creativecommons.org/licenses/by/4.0/).

There are various ways to express oneself through a virtual avatar. The most direct way is to apply one's physical characteristics to an avatar that embodies the user's facial appearance or proportions [7]. Studies are also considering the application of a user's habitual expressions based on facial muscle movement [8]. A virtual avatar with the user's unique signature may elicit social responses such as perceived similarity and familiarity.

#### *1.1. Habitual Facial Expressions and Facial Appearance*

The human face consists of 20 facial muscles. Humans communicate through an interplay of these muscles, which produce expressions. Facial expressions enable social communication, which abides by shared rules [9]. They are a powerful source of visual information that embodies the individual's emotions, behavioral predisposition, and intention [10]. Humans can infer the interaction partner's psychological state through facial expressions and identify their traits [11]. In psychology, an individual's traits are, by definition, their habitual pattern of thoughts or affect.

Facial expressions are individual behavioral habits that consist of patterned muscle movement. Such patterns include unique muscle characteristics (e.g., the intensity of the movement of each facial muscle). As a result of these individual differences, people can reliably discriminate themselves from others [8].

On the other hand, facial appearance provides a person's unique identity from the physical features, specifically face and head. Although the perception of appearance relies on many environmental factors (e.g., head pose, lighting conditions), there are descriptive characteristics of a particular individual, such as the location of the eye, nose, and mouth. In our study, we used such facial landmarks to identify critical regions of the face by defining their coordinates (x,y) on the facial image.

Visual perception plays an integral part in facial recognition, which also applies to recognizing oneself. The easiest way to look at oneself is through a mirror. Being able to recognize one's own face is one of the critical prerequisites of self-consciousness and self-identity. Only humans and a few animals may recognize themselves through a mirror [12]. For humans, this ability develops at the age of two. This ability correlates with empathic and altruistic behavior.

Humans feel a sense of closeness to familiar entities. They also feel more intimate with objects that they are repeatedly exposed to, even without interacting with these (i.e., mere exposure) [13]. An object to which a person is familiarized through repetitive exposure may elicit positive responses [14,15]. For example, stimuli such as names [16] or photos [17] may elicit positive responses after repeated exposure. This phenomenon may also be observed with facial perception. When participants viewed a specific face repetitively, they described it as more familiar, similar, and attractive than those who did not [18].

Humans belong to social circles of varying size. Individuals have a higher chance of getting exposed to a member in the same group than to a member in a different group. When exposed to identical situations, people in the same group tend to exhibit similar responses. The more members express different responses, the lesser the probability of sustaining the group [19].

Exhibiting a similar response to an identical stimulus is related to empathy. In a dyadic interaction, an empathic response is manifested by mimicking the other's facial expressions or gestures [20]. Sustaining a similar expression or empathic response for a long time results in the repeated utilization of the respective muscles responsible for empathic expressions. Repetitive use of certain muscles affects bone structure and as a result, leads to an appearance that is similar to that of the significant other [21].

Furthermore, perceived similarity is known to entail a positive face-to-face interaction. People are predisposed to think that in dyadic socialization, a part of their partner's attitude, values, and beliefs is similar to theirs [22,23]. People tend to like and trust people who have a similar physical appearance more than those who do not [24].

#### *1.2. Virtual Avatar*

The term avatar is derived from a Sanskrit word and connotes the incarnation of a deity. In modern society, the user's mental model of an avatar is that it is an alter ego of the user that can interact with other virtual avatars in a virtual world [25]. Recently, the need for a virtual avatar has not only come from games, movies, advertisements, and remote collaboration but has extended to medical practice and crime investigation. Research, design, and development explore the avatar model and how it can imitate users in real time. Realistic animation is possible by depicting the movement based on bone and muscle structure, considering the real-world laws of physics.

In general, the more similar the illustration of a virtual avatar is to the user, the more immersive their experience [26,27]. Nevertheless, a very realistic but imperfect depiction of a user may lead to negative feelings [28]. Virtual characteristics that reach a certain point of human likeness tend to elicit a feeling of eeriness.

Much research has been conducted on the interaction channels of virtual avatars. There has been much attention on non-verbal expressions such as the gaze, the facial expression, and gestures of an avatar. For example, minute movements of the pupil add a sense of immersion and social presence. Studies found that participants perceived a higher level of social presence when communicating via richer media than through a text-based medium [29–31].

In a virtual environment, users may use their virtual avatar to represent themselves. Users tend to prefer an avatar that embodies their unique and exclusive characteristics that differentiate them from the others. Some people prefer an avatar that is similar to themselves, while others prefer their avatar to be an idealized version of themselves. Users who adopted such avatars reported higher satisfaction and attachment [32]. Users are more motivated to use avatars that have a facial appearance similar to theirs than those that do not [24].

However, the majority of avatar illustrations and expressions do not consider the individual's facial characteristics. Applying individualized facial habits or appearances does not require sophisticated technology and is viable with the computer systems currently available to the general public. Nevertheless, software that can animate such virtual avatars still needs to be developed, which requires investment and resources.

Another reason why individuated avatars are not prevalent involves the users. Many users do not recognize their own facial habits and would have trouble customizing the facial characteristics by themselves. It would be necessary for the application to capture and analyze the user's facial movements and suggest a personalized avatar for approval before use. The users may feel that this is a hassle, not to mention that there is resistance from users against taking a video of their own face. Most importantly, research lacks an understanding of common elements applicable to individuated virtual avatars. Specifically, we do not clearly understand the social effects of personalized virtual avatars with individualized features. Would people prefer avatars with their appearance or habitual expressions? Would people perceive a similarity between the avatar and themselves? Would people be able to relate to the avatar and use it for their profile in a social networking service?

#### *1.3. Research Goal*

Humans have universally recognizable expressions. Ekman found a universal relationship between facial muscle movements and specific emotions (e.g., happiness, sadness, anger, fear, surprise, disgust, interest) [33]. Despite the universality, individual differences exist in the *intensity* of each muscle movement. Researchers also found that the asymmetrical measures of facial regions identify stable individual differences [34].

A facial habit results from a habitual personal pattern that exhibits a unique individual signature. Facial recognition based on these individual differences in expression analyzes the movement pattern of facial muscles to discriminate individuals [8].

Another factor to consider is the individual's appearance. The perception of a form is necessary to identify an object [35]. The holistic form is a pivotal component required to distinguish an individual [36].

In summary, our research aims to evaluate the perceived social effect of a virtual avatar using two markers: (1) habitual facial expressions captured through the *intensity* of muscle movement and (2) facial appearance identified using facial *landmarks*. The research hypotheses are summarized accordingly in Table 1. We added the third hypothesis because both facial habit and facial appearance involve the facial muscle, and therefore, an interaction may occur. Thus, we intend to analyze whether facial habits (independent variable) have a different effect on the social constructs (dependent variables) depending on facial appearance (independent variable).

**Table 1.** Research hypotheses.


In short, the study aims to evaluate people's social perception of an avatar that embodies the unique and individual characteristics of the user. We planned to investigate the interaction of the two independent variables (facial appearance, facial habit) and their respective main effects.

#### **2. Methods**

#### *2.1. Participants*

Forty-five university students were recruited as participants. The participants' average age was 23.78 years (SD = 2.88), with 20 males and 25 females. We recommended that the participants get sufficient sleep the day before the experiment. We selected participants with a corrected vision of 0.7 or above to ensure the participants' reliable recognition of visual stimuli. All participants were briefed on the purpose and procedure of the experiment and signed a consent form. Participants were paid a participation fee as compensation.

#### *2.2. Materials*

#### 2.2.1. Video Stimulus

The current study used a video stimulus to elicit participants' facial responses to produce data to create an individuated avatar. We used video materials known to evoke emotions, which were empirically verified by an experiment conducted in and provided by Stanford University (*n* = 411, [37]).

For each emotional state (positive and negative), we selected two candidate stimuli from Stanford's materials [37]. We conducted a manipulation check on all candidate materials. With regard to the positive stimuli, participants perceived the two video stimuli as positive. The results did not show a significant difference from those of the Stanford study. However, there was no significant change in the participants' facial expressions when they were exposed to the negative stimuli. In a follow-up questionnaire, participants reported having a negative emotional state but did not display a negative facial expression. Since the current experiment requires valid participant data on emotional expression to be applied to a virtual avatar, we decided not to include stimuli evoking a negative emotional state.

#### 2.2.2. Video Analysis

We used OpenFace, open-source software that enables face recognition with deep neural networks [38]. We used action units (AUs) from the Facial Action Coding System (FACS) as the basic unit of appraisal [39]. Figure 1 depicts the process. We first normalized the facial region from the participants' videos. The video was organized as a sequence of images of fixed size (200 × 200 pixels). From this image sequence, we extracted the intensity of AU movement and the 68 facial landmarks (see Figure 2). The landmarks provide the (x, y) coordinates of key facial regions (e.g., the eyes, eyebrows, nose, lips, and chin). The movement and intensity of the AUs were identified from the AU vector data in HOG (Histograms of Oriented Gradients) [40]. We derived each individual's habitual expression data from the AU movement intensity and each individual's facial appearance from the landmark data.

**Figure 1.** The analytical process of identifying individual muscle movements and facial appearance.

**Figure 2.** The 68 facial landmarks used to identify the participant's facial appearance.
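The two per-participant summaries described above (AU intensity for habitual expression, landmark coordinates for facial appearance) can be illustrated with a minimal sketch. It assumes per-frame data in a CSV whose columns follow the naming commonly seen in OpenFace output (`AU06_r` for an AU intensity, `x_0` … `x_67` and `y_0` … `y_67` for landmarks). Averaging over frames is an illustrative assumption; the study does not state how the frames were aggregated, and `summarize_face_csv` is a hypothetical helper.

```python
import csv
import io
from statistics import mean


def summarize_face_csv(csv_text, n_landmarks=68):
    """Return (mean intensity per AU, mean (x, y) per landmark) over all frames.

    Assumption: column names look like 'AU06_r', 'x_0', 'y_0'.
    Averaging over frames is illustrative, not the authors' stated method.
    """
    reader = csv.DictReader(io.StringIO(csv_text), skipinitialspace=True)
    rows = [
        {k.strip(): v for k, v in row.items() if k is not None}
        for row in reader
    ]
    if not rows:
        raise ValueError("no frames found")

    # AU intensity columns end in '_r' (regression output, as opposed to presence).
    au_cols = [c for c in rows[0] if c.startswith("AU") and c.endswith("_r")]
    au_mean = {c: mean(float(r[c]) for r in rows) for c in au_cols}

    # Mean position of each landmark that is present in the file.
    landmarks = []
    for i in range(n_landmarks):
        xk, yk = f"x_{i}", f"y_{i}"
        if xk in rows[0] and yk in rows[0]:
            landmarks.append(
                (mean(float(r[xk]) for r in rows), mean(float(r[yk]) for r in rows))
            )
    return au_mean, landmarks


SAMPLE = """frame, AU06_r, AU12_r, x_0, y_0, x_1, y_1
1, 0.0, 0.5, 10.0, 20.0, 30.0, 40.0
2, 1.0, 1.5, 12.0, 22.0, 30.0, 40.0
"""

if __name__ == "__main__":
    aus, pts = summarize_face_csv(SAMPLE)
    print(aus)
    print(pts)
```

The sample data contain only two landmarks to keep the example short; a real file would carry all 68.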

#### 2.2.3. Virtual Avatar

We designed two baseline avatars, male and female, to embody the participant's expressive habits and facial appearances (see Figure 3). For the female model, we modified a public model available from an open source [41]. To visualize the muscle movement, we produced AU-based blend shapes using the animation software Maya (Autodesk). We used blend shapes that morphed the lower face of the virtual avatar for a more natural look. Table 2 shows the relationship between blend shapes and facial muscles.

**Figure 3.** The baseline virtual avatar models in the study.



We used the Unity 3D engine to render and animate the virtual avatar [42]. Figure 4 depicts the two versions of the avatar with the participant's facial signature (facial appearances, habitual expression) applied. How participants viewed such variations and what was measured will be explained in Section 2.3 (Experiment Procedure).

**Figure 4.** An example of the baseline virtual human morphed based on the participant's (**a**) facial appearances and (**b**) habitual expression.

#### 2.2.4. Subjective Appraisal of Social Constructs

The current study investigated participants' perceptions (similarity, familiarity, attraction, liking, and involvement) of virtual avatars. All constructs involve the subjective appraisal by participants rather than an objective quantitative measurement. Table 3 depicts their operational definition. Each construct was measured on a 7-point Likert scale. For example, the seven items of similarity were *slightly*, *somewhat*, and *extremely* toward both ends (dissimilar and similar) with *neutral* in the middle.
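As a small illustration of the bipolar 7-point scale described above, the sketch below builds the seven response labels for a construct and maps a chosen label to a number. The 1 to 7 numeric coding and the helper names `likert_labels` and `score` are assumptions made for illustration; the paper does not state how responses were coded.

```python
# Intensity qualifiers, ordered from the pole inward.
LEVELS = ("extremely", "somewhat", "slightly")


def likert_labels(negative, positive):
    """Return the seven labels from the negative pole to the positive pole."""
    low = [f"{lvl} {negative}" for lvl in LEVELS]
    high = [f"{lvl} {positive}" for lvl in reversed(LEVELS)]
    return low + ["neutral"] + high


def score(label, negative="dissimilar", positive="similar"):
    """Map a label to 1..7 (assumed coding: 1 = extreme negative, 4 = neutral)."""
    return likert_labels(negative, positive).index(label) + 1


if __name__ == "__main__":
    for lab in likert_labels("dissimilar", "similar"):
        print(score(lab), lab)
```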

**Table 3.** The operational definition of the social constructs of interest.


Similarity connotes the degree to which the user sees themselves as similar to the avatar. Some research includes attitudinal similarity (e.g., personality, attitude, belief system) in the definition [18,43]. However, in this study, we limited the definition to only include the physical likeness to the participant and formulated the survey question accordingly. We purposely designed the study to eliminate interaction with the virtual avatar to investigate the effect of its mere presence without any confounding variables that may arise from interactions. Since there is no interaction with the virtual avatar, it is extremely difficult to validly assess attitudinal similarity.

It is important to emphasize that we investigated *perceived* similarity as opposed to actual similarity. Researchers have made a clear distinction between the two constructs [44]. *Actual* similarity is measurable and quantifiable using standardized personality assessment. As the paper will discuss later, the relationship between similarity and attraction is critical. Some research studies suggest that only perceived similarity is a prerequisite to eliciting attraction [45–47]; other research emphasizes the importance of actual similarity [48]. In this study, mainly for consistency with other perceived constructs, we investigated the perceived similarity.

Perceived familiarity was measured to assess the degree to which participants were familiar with the virtual avatar that had the participant's facial characteristics applied. In interpersonal and social science literature, this construct connotes "being knowledgeable" or acquainted with a person [18,49] or a concept [50,51]. That is, a priori knowledge is necessary to measure perceived familiarity. For example, in psychology, after an interaction (e.g., phone call, discussion) with a person, the participant felt subjective familiarity with the person similar to what they would feel with a close friend [49]. Other studies measured familiarity using objective quantitative measures, such as the amount of exposure to a person's photo and not just focusing on perception [18].

Some studies use the terms perceived familiarity and resemblance (perceived similarity) interchangeably [49]; however, we measured the two constructs (perceived similarity and perceived familiarity) independently. The literature suggests that the two constructs correlate and have a causal relationship, with attraction as a mediating variable [18]. In our study, we minimized interaction with the virtual humans (e.g., conversation) to test the mere exposure effect.

Since the pioneering work of Byrne [52] (for a review of attraction as a research paradigm, see [53]), researchers have investigated interpersonal attraction in relationships [54]. Researchers widely accept Newcomb's definition of attraction as the most comprehensive one: "Attraction refers to any direct orientation (on the part of one person toward another) which may be described in terms of sign and intensity" (p. 6) [55].

Studies on attraction generally investigate the relationship between the independent variables (e.g., attitudinal similarity, physical attractiveness) and the attraction response as a dependent variable. It is critical to note that attraction is distinguished from attractiveness, i.e., characteristics (e.g., attractive personality, good looks) that attract others [56]. In our study, we obtained the participant's perceived attraction (dependent variable) to the virtual avatar, which varied according to different facial features (independent variable). The intensity of attraction depends on many factors such as their relationship (e.g., parent–child, wife–husband) and the duration of interaction (e.g., long-term, first acquaintance) [57].

Perceived liking, as a construct, is defined as the degree to which the participant likes or dislikes the other person in a dyad. A causal pattern exists between the perception of being liked and liking the other [58]. Compared to attraction, perceived liking has a corresponding place on a like–dislike spectrum, whereas attraction is located on an attraction–repulsion spectrum [59].

In psychology, involvement connotes *approach* predispositions (e.g., empathy, sympathy, challenge) as opposed to distance, which refers to *avoidance* predispositions (e.g., antipathy, irritation, boredom) [24]. The two constructs are unipolar. Involvement refers to the degree to which the participants relate to and empathize with the virtual avatar. Since empathy is mainly dependent on the task and context [60,61], we provided the context that the virtual agent would be used in a profile for a social networking service.

#### *2.3. Procedure*

Figure 5 outlines the experiment procedure. The experiment was conducted twice, with an interval of one week between the two sessions (i.e., Session #1 and Session #2).

In the first session, the participants were briefed about the purpose of the experiment and the procedures. Then, participants viewed the two affective stimuli on the display in a relaxed position (see Figure 6). Participants were instructed not to force any expression but to display the natural expression they felt while viewing. The web camera mounted on the display recorded a video of the participant's facial responses for 90 s. Then, the participants left after a brief explanation of the second session.

**Figure 5.** The experiment consists of two sessions, with one week in between for each participant.

**Figure 6.** The experiment environment.

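In between the two sessions, we produced the following four virtual avatars for the second experiment session based on the data acquired from the participants:

1. An avatar with both the participant's facial appearance and habitual expression applied.
2. An avatar with only the participant's facial appearance applied.
3. An avatar with only the participant's habitual expression applied.
4. An avatar with neither the facial appearance nor the habitual expression applied.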


For an avatar without any habitual facial expression applied (2 and 4), AU movement based on the literature was applied instead [39]. For an avatar without any facial appearance applied (3 and 4), the original baseline appearance of the avatar was used (Figure 3).

Then, the participants viewed these virtual avatar stimuli. The study used a 2 × 2 within-subject design. There were two levels of habitual expression (applied or not) and facial appearance (applied or not), respectively.

Every participant viewed all four virtual avatar types. The order of the virtual avatar was randomized using a Latin square to counter the potential learning and fatigue effect. After viewing the avatar for 30 s, the participant responded to a subjective questionnaire.
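The 2 × 2 conditions and a Latin-square presentation order can be sketched as follows. This assumes a simple cyclic Latin square; the paper does not say which square was used, and a cyclic square balances serial position but not carryover between specific pairs of conditions. The names `conditions`, `latin_square`, and `order_for` are hypothetical.

```python
from itertools import product

# Two within-subject factors, each either applied (True) or not (False).
FACTORS = {
    "facial_appearance": (True, False),
    "habitual_expression": (True, False),
}


def conditions():
    """Enumerate the four avatar conditions of the 2 x 2 design."""
    return [dict(zip(FACTORS, levels)) for levels in product(*FACTORS.values())]


def latin_square(n):
    """Cyclic n x n Latin square: row r is r, r+1, ..., wrapping modulo n."""
    return [[(r + c) % n for c in range(n)] for r in range(n)]


def order_for(participant_index, items):
    """Presentation order for one participant, cycling through the square's rows."""
    square = latin_square(len(items))
    row = square[participant_index % len(items)]
    return [items[i] for i in row]


if __name__ == "__main__":
    avatars = conditions()
    for p in range(4):
        print(p, order_for(p, ["A1", "A2", "A3", "A4"]))
```

Each condition appears exactly once in every serial position across any four consecutive participants, which is the property that counters order effects such as learning and fatigue.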

Interaction with the virtual human was limited to mere exposure as opposed to an interactive task (e.g., conversation). The strength of the subjective response is contingent on the nature of the task [62], so an interactive task might have introduced a confounding effect that would be difficult to identify.

#### *2.4. Statistical Analysis*

To understand the effects of the two independent variables (habitual facial expression, facial appearance), we conducted a two-way ANOVA on the participant's subjective evaluation of the four avatars.
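For readers who want to see how the F statistics of a two-way ANOVA arise, the following is a minimal sketch for a balanced factorial design. It treats every rating as an independent observation, which is a simplifying assumption; a within-subject design would ordinarily call for a repeated-measures model, and the paper does not state which model or software was used. The function name `two_way_anova` and the toy data are hypothetical.

```python
from statistics import mean


def two_way_anova(cells):
    """F statistics for a balanced two-factor design.

    cells maps (a_level, b_level) -> list of ratings, same length per cell.
    Assumption: observations are independent (no repeated-measures term).
    """
    a_levels = sorted({a for a, _ in cells})
    b_levels = sorted({b for _, b in cells})
    n = len(next(iter(cells.values())))
    if any(len(v) != n for v in cells.values()):
        raise ValueError("balanced design required")

    grand = mean(x for v in cells.values() for x in v)
    a_mean = {a: mean(x for b in b_levels for x in cells[(a, b)]) for a in a_levels}
    b_mean = {b: mean(x for a in a_levels for x in cells[(a, b)]) for b in b_levels}
    cell_mean = {k: mean(v) for k, v in cells.items()}

    # Sums of squares for main effects, interaction, and residual error.
    ss_a = n * len(b_levels) * sum((a_mean[a] - grand) ** 2 for a in a_levels)
    ss_b = n * len(a_levels) * sum((b_mean[b] - grand) ** 2 for b in b_levels)
    ss_ab = n * sum(
        (cell_mean[(a, b)] - a_mean[a] - b_mean[b] + grand) ** 2
        for a in a_levels
        for b in b_levels
    )
    ss_err = sum((x - cell_mean[k]) ** 2 for k, v in cells.items() for x in v)

    df_a, df_b = len(a_levels) - 1, len(b_levels) - 1
    df_ab = df_a * df_b
    df_err = len(a_levels) * len(b_levels) * (n - 1)
    if df_err == 0 or ss_err == 0:
        raise ValueError("no residual variance; F is undefined")
    ms_err = ss_err / df_err

    return {
        "F_a": (ss_a / df_a) / ms_err,
        "F_b": (ss_b / df_b) / ms_err,
        "F_ab": (ss_ab / df_ab) / ms_err,
        "df": (df_a, df_b, df_ab, df_err),
    }


if __name__ == "__main__":
    # Keys are (habit applied, appearance applied); the values are toy ratings.
    toy = {(0, 0): [1, 3], (0, 1): [2, 4], (1, 0): [4, 6], (1, 1): [5, 7]}
    print(two_way_anova(toy))
```

Turning an F statistic into a p-value additionally requires the F distribution with the matching degrees of freedom, which a statistics package can supply.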

Data from participants who did not exhibit any facial expressions during the experiment were excluded during the acquisition process. The exclusion criteria are outlined as follows. First, we divided the non-expression interval and the expression interval. The latter was defined based on the average expression data. The intensities of AU 6 (Cheek raiser) and AU 12 (Lip corner puller) during the expression interval were compared to those of the non-expression interval. If the intensity during the expression interval was less than the non-expression interval or non-existent, we excluded the participant's data. The Latin square factors were tested to examine whether the order affected the dependent variable. The Latin square order did not affect data, so all results were collapsed over these variables.
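The exclusion rule above can be sketched as a short function. Two points are assumptions rather than details from the study: the labeling of each frame as expression or non-expression is taken as given input (the paper defines the expression interval only as "based on the average expression data"), and a participant is excluded if the mean intensity of either AU 6 or AU 12 is lower during the expression interval. The name `should_exclude` is hypothetical.

```python
from statistics import mean

# AU 6 (cheek raiser) and AU 12 (lip corner puller), as named in the text.
SMILE_AUS = ("AU06", "AU12")


def should_exclude(frames, is_expression):
    """Decide whether a participant's data should be dropped.

    frames: list of dicts mapping AU name -> intensity for each video frame.
    is_expression: parallel list of booleans marking the expression interval.
    Assumption: exclusion is triggered if either AU fails the comparison.
    """
    expr = [f for f, flag in zip(frames, is_expression) if flag]
    rest = [f for f, flag in zip(frames, is_expression) if not flag]
    if not expr:
        return True  # no expression interval at all ("non-existent")
    for au in SMILE_AUS:
        expr_mean = mean(f[au] for f in expr)
        rest_mean = mean(f[au] for f in rest) if rest else 0.0
        if expr_mean < rest_mean:
            return True
    return False


if __name__ == "__main__":
    frames = [
        {"AU06": 0.1, "AU12": 0.2},
        {"AU06": 1.2, "AU12": 1.8},
        {"AU06": 1.0, "AU12": 1.6},
    ]
    print(should_exclude(frames, [False, True, True]))
```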

#### **3. Results**

#### *3.1. Similarity*

The results of analysis of subjective perception involving similarity are as follows. Figure 7 depicts participants' responses to the different avatars that varied according to two factors (facial habit and facial appearance). The *Y*-axis indicates the average of subjective Likert ratings. The results showed no significant interaction between Facial Habit × Facial Appearance, F(1, 163) = 2.517, *p* > 0.11. Of particular importance, the results showed that Facial Habit had a significant main effect, F(1, 81) = 5.182, *p* < 0.05. On the other hand, Facial Appearance had no significant main effect, F(1, 81) = 0.576, *p* > 0.44.

#### *3.2. Familiarity*

The results of analysis of subjective perception involving familiarity are as follows. Figure 8 depicts participants' responses to the different avatars that varied according to two factors (facial habit and facial appearance). The *Y*-axis indicates the average of subjective Likert ratings. The results showed no significant interaction between Facial Habit × Facial Appearance, F(1, 163) = 0.004, *p* > 0.94. Of particular importance, the results showed that Facial Appearance had a significant main effect, F(1, 81) = 4.182, *p* < 0.05, whereas Facial Habit had no significant effect, F(1, 81) = 0.966, *p* > 0.32.

**Figure 8.** Subjective appraisal of perceived familiarity.

#### *3.3. Attraction*

The results of the analysis of subjective perception involving attraction are as follows. Figure 9 depicts participants' responses to the different avatars that varied according to two factors (Facial Habit and Facial Appearance). The *Y*-axis indicates the average of subjective Likert ratings. The results showed no significant interaction between Facial Habit × Facial Appearance, F(1, 163) = 2.3, *p* > 0.13. Both Facial Appearance, F(1, 81) = 0.047, *p* > 0.82, and Facial Habit, F(1, 81) = 0.631, *p* > 0.42, had no significant main effect.

**Figure 9.** Subjective appraisal of perceived attraction.

#### *3.4. Liking*

The results of analysis of subjective perception involving liking are as follows. Figure 10 depicts participants' responses to the different avatars that varied according to two factors (Facial Habit and Facial Appearance). The *Y*-axis indicates the average of subjective Likert ratings. There was no significant interaction between Facial Habit × Facial Appearance, F(1, 163) = 1.165, *p* > 0.28. Both Facial Appearance, F(1, 81) = 0.004, *p* > 0.94, and Facial Habit, F(1, 81) = 2.133, *p* > 0.14, had no significant main effect.

**Figure 10.** Subjective appraisal of perceived liking.

#### *3.5. Involvement*

The results of analysis of subjective perception related to involvement are as follows. Figure 11 depicts participants' responses to the different avatars that varied according to two factors (Facial Habit and Facial Appearance). The *Y*-axis indicates the average of subjective Likert ratings. The results showed no significant interaction between Facial Habit × Facial Appearance, F(1, 163) = 0.221, *p* > 0.63. Both Facial Appearance, F(1, 81) = 0.055, *p* > 0.81, and Facial Habit, F(1, 81) = 0.221, *p* > 0.63, had no significant main effect.

**Figure 11.** Subjective appraisal of perceived involvement.

#### *3.6. The Correlations between Social Perceptions*

We conducted a bivariate correlation analysis to understand the relationships among the participants' social perceptions of the virtual avatars (see Table 4). The results show a significant correlation in all pairs of the analysis. The correlation between perceived attraction and liking was the highest (r = 0.695, *p* < 0.01) (see Figure 12). The implications of the correlation results will be discussed, integrating results from other analyses.



**Figure 12.** The correlational relationship between social constructs. \*\*\* *p* < 0.001
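A bivariate correlation of this kind can be sketched with the Pearson coefficient computed for every pair of constructs. The sketch assumes Pearson's r, since the paper reports r values without naming the coefficient, and the ratings shown are invented toy numbers, not study data. The names `pearson_r` and `correlation_matrix` are hypothetical.

```python
from math import sqrt


def pearson_r(x, y):
    """Pearson product-moment correlation between two equal-length sequences."""
    if len(x) != len(y) or len(x) < 2:
        raise ValueError("need two sequences of equal length >= 2")
    mx, my = sum(x) / len(x), sum(y) / len(y)
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    if sxx == 0 or syy == 0:
        raise ValueError("correlation is undefined for constant input")
    return sxy / sqrt(sxx * syy)


def correlation_matrix(ratings):
    """Correlate every unordered pair of constructs.

    ratings maps construct name -> list of ratings in the same response order.
    """
    names = list(ratings)
    return {
        (a, b): pearson_r(ratings[a], ratings[b])
        for i, a in enumerate(names)
        for b in names[i + 1:]
    }


if __name__ == "__main__":
    # Toy ratings for illustration only.
    toy = {
        "attraction": [1, 2, 3, 4],
        "liking": [2, 4, 6, 8],
        "involvement": [4, 3, 2, 1],
    }
    for pair, r in correlation_matrix(toy).items():
        print(pair, round(r, 3))
```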

#### *3.7. Data Categorization*

Thus far, we identified that facial habit had a main effect on similarity, while facial appearance had a main effect on familiarity. However, these variables had no effects on attraction. As discussed in the operational definitions, attraction is based on a person's liking for the other, and perceived liking in the initial stage of interaction may lead to feelings of attraction [58]. Our results also show that among the constructs, perceived attraction and liking have the highest correlation (r = 0.695, *p* < 0.01).

However, attraction is a much larger and multifaceted construct [63]. Based on the pioneering work by Byrne [64], both perceived similarity and liking lead to attraction, and many researchers have attempted to understand the exact interplay and different weights of the two on attraction [44]. Therefore, we conducted a two-way ANOVA on the sum of perceived liking and similarity (i.e., data categorization) of the four avatar conditions (see Figure 13). The *Y*-axis indicates the addition of the Likert ratings of perceived liking and similarity.

**Figure 13.** Subjective appraisal of the sum of similarity and liking.

The results showed that Facial Habit had a significant main effect, F(1, 81) = 4.836, *p* < 0.05, whereas Facial Appearance had no significant main effect, F(1, 81) = 0.610, *p* > 0.69. Furthermore, there was no significant interaction between Facial Habit × Facial Appearance, F(1, 163) = 2.467, *p* > 0.12.

The research investigated the participants' social perception (similarity, familiarity, attraction, liking, and involvement) of virtual avatars engineered with the participant's unique facial signature (facial appearance, facial habit). In summary, the participants perceived significantly greater similarity to the avatar with habitual expression applied than to the avatar without it (*p* < 0.05). In addition, habitual expressions significantly affected the sum of perceived similarity and perceived liking (*p* < 0.05). The participants perceived significantly greater familiarity with the avatar with facial appearance applied than with the avatar without it (*p* < 0.05).

#### **4. Discussion and Conclusions**

To our knowledge, this is the first research to reveal that participants can perceive similarity to a virtual human that has their characteristic facial movements (i.e., habitual pattern), which has significant implications for the design of virtual agents. The virtual human community has long researched the effects of virtual agent realism. The consensus is that behavioral realism is more critical than visual realism in eliciting believability [27]. The suspension of disbelief refers to the deliberate avoidance of critical thinking, whereas a reality check involves deciding what is possible or not in the real world [65,66]. Thus, behavioral realism is more socially engaging and believable than visual realism [27].

In the context of this study, the effect of perceived similarity of a virtual agent to oneself is consistent with research findings on believability. Specifically, while participants did not perceive similarity in virtual avatars to which their facial appearance was applied (i.e., visual realism), they perceived similarity in virtual avatars to which their facial habits were applied (i.e., behavioral realism). This implies that designs may go beyond anthropomorphic design. For example, future research may conduct studies using animal-inspired avatars with facial features (e.g., eyes) and see if participants can perceive similarity to these avatars when their facial movements are applied.

There is much empirical evidence that similarity, as a social construct, elicits attraction [44], and this relationship is regarded as "one of the most robust relationships in all of the behavioral sciences (p. 281)" [67]. Researchers found a positive linear relationship between similarity and attraction (i.e., the law of attraction) [68]. However, the various virtual avatars had no significant effect on attraction. This may be due to interaction being limited to one-time mere exposure. We purposely limited interaction to exclude variables (e.g., perception of personality) that may influence the perceived measures, which may have been brought on by prolonged interaction. Perceived similarity is influenced not only by physical appearance [69] but also attitude [70] and personality [71]. Future studies may add a persona to the virtual avatar to test the complexities of perceived similarity.

The study's limitation in understanding the effects of an individuated avatar on attraction is apparent. Since perceived attraction is a multifaceted construct, it typically requires more interaction, building up from initial liking [58]. Future studies may investigate the degree of attraction as a function of time or when participants interact with the individuated virtual avatar. The perceived relationship also influences attraction; thus, future studies need to address the relationship between the avatar (e.g., companion, butler, assistant) and the participant carefully.

Nevertheless, through data categorization, we found that habitual expressions had a main effect on the sum of perceived similarity and perceived liking (*p* < 0.05). Since the interplay between perceived similarity and liking leads to attraction [64], these results suggest that an individuated avatar may elicit attraction with prolonged interaction.

Additionally, the individualized virtual avatars had no significant effect on perceived involvement. Although we provided the context that the virtual agent would be used as part of a profile for a social networking service, we also acknowledge that many users do not use profiles similar to their appearance. Future studies should cluster the participants based on whether they use or intend to use avatars with a similar appearance as an alter ego and assess their perceived responses accordingly.

The perceived familiarity with a virtual avatar to which the participant's facial appearance was applied may be due to the participant's repetitive exposure to their reflections in mirrors or still photos of themselves. Repetitive exposure elicits familiarity [13]. On the other hand, people may not be familiar with their habitual expressions during various emotional states.

Finally, the study is limited in that the virtual avatars were designed based on only positive emotional expressions. Future research on individualized virtual avatars should also include negative or complex emotions.

**Author Contributions:** S.P.: methodology, validation, formal analysis, investigation, writing, review, editing; S.P.K.: conceptualization, methodology, software, validation, formal analysis, investigation, resources, data curation, writing, visualization, project administration; M.W.: conceptualization, methodology, writing, review, supervision, funding acquisition. All authors have read and agreed to the published version of the manuscript.

**Funding:** This work was supported by the National Research Foundation of Korea (NRF) grant funded by the Korean government (MSIT) (NRF-2020R1A2B5B02002770).

**Institutional Review Board Statement:** The study was conducted according to the guidelines of the Declaration of Helsinki and approved by the Institutional Review Board of Sangmyung University (protocol code BE2017-20, approved on 22 September 2017).

**Informed Consent Statement:** Informed consent was obtained from all subjects involved in the study. Written informed consent has been obtained from the subjects to publish this paper.

**Conflicts of Interest:** The authors declare no conflict of interest.

#### **References**



www.mdpi.com ISBN 978-3-0365-4487-8