#### **1. Introduction**

Emotion is an integral part of human behaviour, exerting a powerful influence on mechanisms such as perception, attention, decision making and learning. Indeed, what humans tend to notice and memorise are usually not monotonous, commonplace events but those that evoke feelings of joy, sorrow, pleasure, or pain [1]. Understanding emotional states is therefore crucial to understanding human behaviour, cognition and decision making. The computer science field dedicated to the study of emotions is known as Affective Computing, whose potential modern applications include, among many others: (1) automated driver assistance, e.g., an alert system that monitors the driver and warns them of sleepiness or other unconscious or unhealthy states that may hinder driving; (2) healthcare, e.g., wellness monitoring applications that identify causes of stress, anxiety, depression or chronic diseases; (3) adaptive learning, e.g., a teaching application able to adjust the content delivery rate and number of iterations according to the user's enthusiasm and frustration levels; (4) recommendation systems, e.g., suggesting personalised content according to the user's preferences as perceived from their responses.

Emotions are communicated through external physical manifestations (facial or body expressions, such as a smile or tense shoulders) and internal ones (alterations in heart rate (HR), respiration rate, perspiration, and others). Such manifestations generally occur naturally and subconsciously, and their sentic modulation can be used to infer a subject's current emotional state. If acquired systematically in a daily setting, such data could make it possible to estimate a subject's likely mood for the following day, as well as their health condition.

External physical manifestations (e.g., facial expressions) are easily collected through a camera; however, they offer low reliability, since they depend strongly on the user's environment (whether the subject is alone or in a group) and cultural background (whether the subject grew up in a society promoting the externalisation or internalisation of emotion), and they can easily be faked or manipulated according to the subject's goals, compromising the assessment of the true emotional state [2]. For internal physical manifestations, these constraints are less prominent: the subject has little control over their bodily states, so alterations in physiological signals provide a more authentic insight into the subject's emotional experience.

Given these considerations, our work aims to perform a comprehensive study on automatic emotion recognition (ER) using physiological data, namely from Electrocardiography (ECG), Electrodermal Activity (EDA), Respiration (RESP) and Blood Volume Pulse (BVP) sensors. This choice of modalities is due to three factors: (1) the data can be easily acquired from pervasive, discreet wearable technology, rather than from more intrusive sensors (e.g., Electroencephalography (EEG) or functional near-infrared spectroscopy (fNIRS)); (2) these modalities are widely reported in the recent state of the art; (3) publicly available multimodal datasets containing them have been validated in the literature. We use five public state-of-the-art datasets to evaluate two major techniques on a feature-based representation, Feature Fusion (FF) versus Decision Fusion (DF), also exploring a more extensive feature set than previous work. Furthermore, instead of the discrete model, the users' emotional response is assessed in a two-dimensional space: Valence (measuring how unpleasant or pleasant the emotion is) and Arousal (measuring its intensity level).
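To make the dimensional labelling concrete, the sketch below binarises continuous valence and arousal annotations into low/high classes. It assumes self-assessment ratings on a 1–9 scale with 5 as the threshold; both the scale and the threshold are illustrative assumptions, not values prescribed above.

```python
# Minimal sketch of dimensional (valence/arousal) labelling; the 1-9 scale
# and the threshold of 5 are illustrative assumptions.
import numpy as np

def binarise(ratings: np.ndarray, threshold: float = 5.0) -> np.ndarray:
    """Map continuous valence/arousal ratings to low (0) / high (1) classes."""
    return (ratings > threshold).astype(int)

# Placeholder annotations for four samples (not from any real dataset).
valence = np.array([2.1, 6.8, 7.5, 4.9])
arousal = np.array([8.0, 3.2, 7.1, 5.5])

# Each sample receives two independent binary labels, one per dimension,
# so separate classifiers can be trained for valence and for arousal.
labels = np.stack([binarise(valence), binarise(arousal)], axis=1)
print(labels)  # -> [[0 1], [1 0], [1 1], [0 1]]
```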

The remainder of this paper is organised as follows: Section 2 presents a brief literature review on ER, with special emphasis on the articles describing the datasets used in our work. Section 3 describes the overall machine learning pipeline of the proposed methods. Section 4 evaluates our methodology on five public datasets. Lastly, Section 5 presents the main conclusions of this work along with future work directions.

#### **2. State of the Art**

In the literature, human emotion processing is generally described using two models. The first decomposes emotion into discrete categories, divided into basic/primary emotions (innate, fast, arising in response to "fight-or-flight" situations) and complex/secondary emotions (deriving from cognitive processes) [3,4]. The second quantifies emotions along continuous dimensions; a popular instance, proposed by Lang [5], is the two-dimensional Valence (unpleasant–pleasant level) versus Arousal (activation level) model [6], which we adopt in this work. Concerning affect elicitation, it is generally performed through film snippets [6], virtual reality [7], music [8], recall [9], or stressful environments [6], with no commonly established norm on which methodology is optimal for ER elicitation.

The automated recognition of emotional states is usually performed using one of two methodologies [2,10,11]: (1) traditional Machine Learning (ML) techniques [12–14]; (2) deep learning approaches [15–17]. Due to the limited size of existing datasets, most work focuses on traditional ML algorithms, in particular Supervised Learning (SL), such as Support Vector Machines (SVM) [18–20], k-Nearest Neighbours (kNN) [21–23], Decision Trees (DT) [24,25], and others [26,27], with SVM being the most commonly applied algorithm, showing overall good results and low computational complexity; a minimal comparison of these classifiers is sketched below.
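The following sketch illustrates the typical SL setup with the three classifiers named above; `X` and `y` are synthetic stand-ins for extracted physiological features and binary affect labels, not data from any cited study.

```python
# Minimal comparison of the SL algorithms cited above on placeholder data.
import numpy as np
from sklearn.model_selection import cross_val_score
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC
from sklearn.tree import DecisionTreeClassifier

rng = np.random.default_rng(42)
X = rng.normal(size=(150, 20))    # 150 windows x 20 features (placeholder)
y = rng.integers(0, 2, size=150)  # low/high arousal labels (placeholder)

for name, clf in [
    ("SVM", make_pipeline(StandardScaler(), SVC())),
    ("kNN", make_pipeline(StandardScaler(), KNeighborsClassifier())),
    ("DT", DecisionTreeClassifier(random_state=0)),
]:
    scores = cross_val_score(clf, X, y, cv=5)
    print(f"{name}: mean accuracy {scores.mean():.2f}")
```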

Many physiological modalities and features have been evaluated for ER, namely Electroencephalography (EEG) [28–30], Electrocardiography (ECG) [31–33], Electrodermal Activity (EDA) [34–36], Respiration (RESP) [26], Blood Volume Pulse (BVP) [26,35] and Temperature (TEMP) [26]. Multimodal approaches have prevailed; however, there is still no clear evidence of which feature combinations and physiological signals are the most relevant. The literature has shown that classification performance improves with the simultaneous exploitation of different signal modalities [2,8,10,37], and that modality fusion can be performed at two main levels: FF [24,38,39] and DF [8,26,37,40,41]. In the former, features are extracted from each modality and later concatenated into a single feature vector, which is used as input to the ML model. In DF, by contrast, a feature vector is extracted from each modality and fed to its own classifier; hence, with *k* modalities, *k* classifiers are created, leading to *k* predictions that are combined (e.g., through a voting system) to yield a final result, as sketched below. Both methodologies are found in the state of the art [42], but it is unclear which is best for ER using multimodal physiological data obtained from non-intrusive wearable technology.
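The contrast between the two fusion levels can be made concrete with a minimal sketch. The per-modality feature matrices below are random placeholders standing in for features extracted from ECG, EDA, RESP and BVP windows, and SVM is used only because it is the most common choice in the literature cited above.

```python
# Minimal sketch contrasting Feature Fusion (FF) and Decision Fusion (DF)
# on placeholder per-modality feature matrices.
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

rng = np.random.default_rng(0)
n = 200                               # number of labelled windows (placeholder)
modalities = {
    "ECG": rng.normal(size=(n, 12)),  # 12 hypothetical ECG features
    "EDA": rng.normal(size=(n, 8)),
    "RESP": rng.normal(size=(n, 6)),
    "BVP": rng.normal(size=(n, 10)),
}
y = rng.integers(0, 2, size=n)        # binary low/high arousal labels (placeholder)
idx_train, idx_test = train_test_split(np.arange(n), random_state=0)

# Feature Fusion: concatenate every modality into a single feature vector
# and train one classifier on the joint representation.
X = np.hstack(list(modalities.values()))
ff_clf = make_pipeline(StandardScaler(), SVC())
ff_clf.fit(X[idx_train], y[idx_train])
ff_pred = ff_clf.predict(X[idx_test])

# Decision Fusion: one classifier per modality, combined by majority vote.
votes = []
for name, Xm in modalities.items():
    clf = make_pipeline(StandardScaler(), SVC())
    clf.fit(Xm[idx_train], y[idx_train])
    votes.append(clf.predict(Xm[idx_test]))
# With k = 4 classifiers, 2-2 ties default to the positive class here;
# a confidence-weighted vote or stacking could break ties more carefully.
df_pred = (np.mean(votes, axis=0) >= 0.5).astype(int)
```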

For detailed information on the current state of the art from a more general perspective, we refer the reader to the surveys [2,11,43–47] and references therein, which comprehensively review the latest work on ER using ML and physiological signals, highlighting the main achievements, challenges, take-home messages, and possible future opportunities.

The present work extends the state of the art of ER through: (1) a classification performance analysis, in the arousal/valence space, of ER on five publicly available datasets covering multiple elicitation methods; (2) a summary of the classification accuracy ranges reported across the existing literature for the evaluated datasets; (3) a characterisation of the results for diverse classifiers, sensor modalities and feature set combinations for ER, using accuracy and F1-score as evaluation metrics (the latter not being commonly reported, albeit important for evaluating classification bias); (4) the exploration of an extended feature set for each modality, also analysing feature relevance through feature selection; (5) a systematic analysis of multimodal classification with DF and FF approaches, with results superior or comparable to those reported in the state of the art for the selected datasets.
