Article

Predicting Activity in Brain Areas Associated with Emotion Processing Using Multimodal Behavioral Signals

by Lahoucine Kdouri 1,*, Youssef Hmamouche 1, Amal El Fallah Seghrouchni 1,2 and Thierry Chaminade 3
1 International Artificial Intelligence Center of Morocco, University Mohammed VI Polytechnique, Rabat 11103, Morocco
2 Laboratoire de Recherche en Informatique, UMR 7606, CNRS-Sorbonne University, 75252 Paris, France
3 Institut de Neurosciences de la Timone, UMR 7289, CNRS-Aix-Marseille University, 13385 Marseille, France
* Author to whom correspondence should be addressed.
Multimodal Technol. Interact. 2025, 9(4), 31; https://doi.org/10.3390/mti9040031
Submission received: 6 February 2025 / Revised: 10 March 2025 / Accepted: 18 March 2025 / Published: 31 March 2025

Abstract:
Artificial agents are expected to increasingly interact with humans and to demonstrate multimodal adaptive emotional responses. Such social integration requires both perception and production mechanisms, thus enabling a more realistic approach to emotional alignment than existing systems. Indeed, existing emotion recognition methods rely on behavioral signals, predominantly facial expressions, as well as non-invasive brain recordings, such as Electroencephalograms (EEGs) and functional Magnetic Resonance Imaging (fMRI), to identify humans’ emotions, but accurate labeling remains a challenge. This paper introduces a novel approach examining how behavioral and physiological signals can be used to predict activity in emotion-related regions of the brain. To this end, we propose a multimodal deep learning network that processes two categories of signals recorded alongside brain activity during conversations: two behavioral signals (video and audio) and one physiological signal (blood pulse). Our network enables (1) the prediction of brain activity from these multimodal inputs, and (2) the assessment of our model’s performance depending on the nature of the interlocutor (human or robot) and the brain region of interest. Results demonstrate that the proposed architecture outperforms existing models in the anterior insula and hypothalamus, for interactions with either a human or a robot. An ablation study evaluating subsets of input modalities indicates that the prediction of local brain activity degrades when one or two modalities are omitted. It also reveals that the physiological signal (blood pulse) alone achieves prediction levels similar to the full model, further underscoring the importance of somatic markers in the central nervous system’s processing of social emotions.

1. Introduction

The process of emotion recognition requires the development of models capable of predicting emotional states from various types of signals. Some are directly observable, such as audio and video recordings, while others, such as physiological time series, require dedicated recording devices. Deep learning models have been shown to be effective in many cases and have the potential to improve our understanding of emotions and how they are represented in different types of conversational signals [1]. The ability to detect and recognize human emotions is a crucial aspect of many use cases, especially in the context of human–machine interactions. It represents an important research area of affective computing, which focuses on designing and developing computational systems that can recognize and imitate human affective behavior [2].
Emotion recognition has many applications. For example, in education, the use of emotion-aware robots has been shown to be effective in enhancing student engagement and learning outcomes [3]. Similarly, in the field of mental health and cognitive disorders, social robots have been explored as a potential tool for delivering therapy and providing emotional support to individuals suffering from mental health conditions [4]. Also, emotion recognition can improve the effectiveness of human–robot collaboration [5]. Overall, recognizing human emotions is crucial for developing more effective and efficient human–machine interaction systems. Further research in this field is essential to fully understand the potential applications and benefits of emotion recognition technology.
The process of emotion recognition can be improved by incorporating multiple modalities. Combining different signals, such as images, audio, and text, can provide complementary information that can be used to make more informed decisions. Multimodal learning involves relating features from these different modalities to create a shared representation that an artificial neural network can exploit to make intelligent decisions. In general, multimodal learning approaches are capable of more robust inference compared to unimodal approaches [6].
Existing emotion recognition methods predominantly leverage multimodal behavioral signals, such as facial expressions and speech patterns, or rely on non-invasive brain recordings like EEG and fMRI to predict emotions [7,8,9,10]. Although these approaches have shown promising results, accurate emotion labeling remains a persistent challenge due to subjective variability and the complexity of emotional experiences. It should be noted, though, that there is no “neural signature” for single emotions, but accumulated research has led to identifying a subset of brain regions regularly associated with several emotional processes [11,12]. A key limitation of existing research is the lack of a comprehensive framework that examines the interaction between behavioral and physiological signals and their influence on brain activity within emotion-related regions. To address this gap, we present a novel multimodal deep learning framework that explores how external signals affect brain activity during emotional processing. Specifically, our model processes two categories of signals recorded alongside brain activity during conversations with either a human or a robot: (1) behavioral signals, including video and audio features, and (2) physiological signals, such as blood pulse measurements. Using this multimodal input, our approach aims to achieve two key objectives: predicting brain activity patterns in response to different emotional stimuli and evaluating the impact of the nature of the interlocutor (human or robot). Most emotion detection systems currently depend on audio-visual data and are trained using manually labeled facial expressions as ground truth [7]. However, facial expressions may not always accurately represent real emotions. In this article, we argue that the most reliable approach to approximate real emotions is through brain activity, since objectively measuring emotions presents significant challenges. Recent advancements in brain activity recording technologies can aid in this regard, by utilizing systems based on EEG, fMRI, or Magnetoencephalography (MEG).
Overall, this article presents an investigation of the possibility of predicting activity in emotion-related brain areas as a proxy for emotion recognition. This approach accounts for the complex and dynamic nature of human interaction by combining conversational and physiological signals, as well as considering the temporal aspect of the data. The proposed deep learning architecture is simple yet effective compared to existing multimodal architectures. The main assumption of our work is that brain activation is a more relevant and accurate representation of the emotional states of individuals than classical approaches solely based on visual expressions. Figure 1 illustrates the interaction protocol, the recorded signals, and the approach described in this manuscript. The addition of physiological data, namely, the blood pulse signals recorded with a photoplethysmograph, is particularly important because the autonomic nervous system is considered to be associated with emotional arousal [13]. Therefore, we expect that the use of blood pulse will significantly improve the capacity to predict the response in brain areas associated with emotional processes.
The rest of the paper is organized as follows: Section 2 presents related work. Section 3 describes the proposed methodology. Section 4 focuses on the experiments and results. Section 5 discusses the results obtained and their implications for affective neuroscience. Finally, Section 6 contains concluding remarks, limitations, and future perspectives relevant to the current work.

2. Related Work

Human emotional expression is inherently multimodal, relying on a complex interplay of physiological changes in addition to verbal and non-verbal cues, such as changes in blood pulse, voice prosody, and facial expressions (e.g., raised cheeks). Each modality contributes uniquely to emotional expression, making emotion recognition a challenging task for machine learning [14]. In this section, we review related work on multimodal learning for emotion recognition, focusing on deep learning approaches and fusion techniques.
Multimodal learning has shown its effectiveness in tackling traditional machine learning challenges. In particular, it has been widely used to enhance prediction accuracy in classification and regression tasks [15,16]. In recent years, deep learning has been successfully applied to process multimodal data, as well as learn correlations and joint representations across different modalities [17,18]. However, achieving optimal predictive performance requires large-scale datasets [19,20]. This need is particularly important in multimodal prediction tasks, such as emotion recognition, where the effective integration of different modalities is essential. In [21], for example, a large-scale database named CAS(ME)³, comprising visual, voice, and physiological signals, was introduced and processed through deep learning models for multimodal emotion recognition. Integrating data from multiple modalities, as demonstrated in [8,9,10,22], can provide complementary information and significantly enhance emotion recognition performances. However, this integration process poses challenges due to the unique statistical properties and the complex interplay among low-level features associated with each modality [23].
Previous studies have investigated various strategies for merging multiple modalities, which can be categorized into three main approaches: early fusion, which concatenates features from different modalities as input to a machine learning model [24]; model-level fusion, which combines the hidden representations of each modality during training [23]; and late fusion, a method that applies models to each modality and then aggregates prediction scores according to specific criteria [25]. Temporal learning is also valuable for emotion recognition, which involves analyzing time-varying data such as speech and video [26,27]. Multimodal deep learning models can capture this temporal information by employing Recurrent Neural Networks (RNNs) or Transformers, thereby accounting for the dynamic nature of emotions over time.
Several approaches have been proposed in this context, particularly leveraging the model-level strategy. In [15], cross-modal shared representation learning and multimodal fusion achieved effective emotion classification. Additionally, Srivastava and Salakhutdinov [16] introduced Multimodal Deep Boltzmann Machines to learn a joint density model over a space of multimodal inputs, demonstrating the utility of fusing higher-level representations of disparate modalities (images and text) within a deep learning framework. In a related model, Krishna and Patil [28] used a Convolutional Neural Network (CNN) with cross-modal attention for emotion classification. Notably, the model proposed in [27] predicts multiple emotion labels (Anger, Disgust, Fear, Happiness, Neutral, Sad, and Surprise). This model was trained on short emotional video clips sourced from movies and TV shows, combining acoustic and visual information by applying Mel-Frequency Cepstral Coefficients (MFCCs) and DenseFace [29]. It then utilizes a Long Short-Term Memory (LSTM) network with an early fusion strategy, followed by feed-forward layers to make the predictions. Another approach, proposed in [30], uses a multimodal strategy to integrate audio, visual, and text recorded from video elements: speech, faces, and subtitles, respectively. The performance of various feature combinations is assessed using multiple CNN architectures, including Inception and ResNet, for visual feature extraction. Meanwhile, Bidirectional Encoder Representations from Transformers (BERTs) and spectral analysis are applied to textual and audio data, respectively. A fusion process employs Bidirectional LSTM networks to capture temporal dependencies and combine extracted features. Another similar architecture was proposed in [31] to classify four emotions (sad, excited, neutral, and angry). During the feature extraction step, a combination of models, including CNNs, LSTMs, and attention mechanisms, was utilized. Finally, fully connected layers were applied to learn from the fusion of acoustic and textual features and to perform emotion classification.
Other approaches have employed attention-based neural networks for model-level fusion [32,33]. For instance, in [32], the authors integrated multimodal features by assigning attention mechanisms to each modality, yielding promising results for emotion recognition. A noteworthy study by [33] introduced Transformers for multimodal learning, focusing on unaligned multimodal language sequences, and presented an effective method for processing and combining information from diverse modalities. The use of Transformers for multimodal self-supervised learning from video, audio, and text (VATT) was recently investigated in [34], where the VATT model unifies these three modalities within a single network. Each modality is processed separately through the embedding stage of the Transformer, and self-supervised learning enables training on an extensive dataset, resulting in representations applicable to a wide range of tasks, including emotion recognition.
Building on these advances in multimodal emotion recognition, our work draws inspiration from foundational studies [15,16] that demonstrate effective strategies for multimodal integration and fusion. We leverage the complementary strengths of multiple modalities, specifically conversational signals and physiological data, to enhance predictive accuracy. Furthermore, our proposed model incorporates temporal modeling to integrate information from conversational cues, physiological signals, and neural data, an aspect that has been relatively under-emphasized in existing models [30,31]. In contrast to these existing models, which typically classify emotions from facial expressions, our approach aims to predict a continuous signal corresponding to the activation of brain regions associated with genuine emotional states, derived directly from neural activity. To achieve this, we employ a Transformer encoder in which attention mechanisms effectively identify and weigh the most salient features. The experimental results indicate that our proposed model outperforms existing models. Furthermore, ablation studies evaluating the contributions of each modality underscore the critical role of physiological data in elucidating brain activity.

3. Methods

This research aims to develop a method for predicting brain activity in regions involved in emotion processing, using multimodal conversational behavioral signals complemented by peripheral physiological recordings. This approach enables the inference of a participant’s emotional state during a conversation, whether the interlocutor is human or robotic. Table 1 provides details of the specific brain regions examined in this study. We focus on the stimuli perceived by the participant, including visual and auditory inputs, as well as the participant’s blood pulse, resulting in the following modalities: video (of the interlocutor), audio (of both the participant and the interlocutor), and blood pulse (of the participant). In this section, we first outline the importance of predicting brain activity for emotion recognition before detailing the proposed methodology.

3.1. The Importance of Predicting Brain Activity for Emotion Recognition

Analyzing brain activity helps improve our understanding of human cognitive processes through the application of domain knowledge. Neuroscience has produced numerous studies aimed at identifying brain regions associated with emotions [35]. Therefore, the chosen brain areas are directly linked to real emotions based on well-established neuroscience experiments. Using deep learning to connect with these studies is a logical step in approaching human behavior through artificial intelligence [36]. Many studies rely on facial expressions of emotion to evaluate emotional states; however, it is important to recognize that these expressions are primarily a proxy. Like other mental states, the emotional states that are actually experienced remain hidden, as they cannot be directly measured using objective tools. Instead, we focus on analyzing brain activity, which we postulate offers a more accurate representation of emotions. Finally, the predicted brain activity can itself be used as a latent space from which to infer other representations of emotion; for example, a decoder network could be employed to reconstruct likely emotional states from brain activity. This approach is related to recent methods that aim to reconstruct the original stimuli from brain activity [37,38,39,40,41].
Table 1. Studied brain areas and their corresponding Brainnetome atlas codes, to which we added the hypothalamus (see text).

Abbrev | ROIs | Brainnetome Code
l,r vA | left and right Ventral Agranular insula | 165/166
l,r dA | left and right Dorsal Agranular insula | 167/168
l,r vD | left and right Ventral Dysgranular insula | 169/170
l,r dD | left and right Dorsal Dysgranular insula | 173/174
l,r dG | left and right Dorsal Granular insula | 171/172
l,r H | left and right Hypergranular insula | 163/164
l,r MA | left and right Medial Amygdala | 211/212
l,r LA | left and right Lateral Amygdala | 213/214
Hy | Hypothalamus | [42]

3.2. Proposed Approach

In this section, we present an overview of our proposed multimodal architecture, which aims to predict activity in brain regions associated with emotional states using three input modalities: audio, video, and blood pulse. First, we embed and encode each modality independently using pre-trained models and Transformer encoders. Second, we unify the multimodal features using late fusion. Finally, the encoder outputs are fed into a dense network that performs the prediction task. The architecture of our model is shown in Figure 2.

3.2.1. Embedding

In the following, we provide a detailed description of the embedding layers used in the network for audio and video inputs. It should be noted that the raw blood pulse signal is obtained as a time series and does not require a specific embedding.
  • Audio embedding: Spectral features are extracted from the raw audio signal using Mel-Frequency Cepstral Coefficients (MFCCs). This technique aims to capture the characteristics of the vocal tract that are relevant for distinguishing different sounds, speech, or speakers.
  • Video embedding: We extract visual features using FaceNet, a deep learning-based facial recognition system developed by Schroff et al. (2015) [29]. The architecture consists of a CNN model followed by a few fully connected layers. The CNN is designed to learn a representation of facial images by extracting key features, such as facial landmarks, textures, and distinctive patterns, which are essential to differentiate between faces. The fully connected layers project these features into an embedding that is invariant to variations in lighting, pose, and facial expressions. A minimal extraction sketch for both embeddings follows this list.
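The sketch below (in Python) illustrates how such features could be extracted. The MFCC parameters mirror those reported in Section 4.2; the FaceNet embedder is treated as an assumed pre-trained model exposing an embeddings() method (any equivalent face-embedding network could be substituted), so this is an illustration of the shape of the pipeline rather than the authors’ exact implementation.

```python
import numpy as np
import librosa

def audio_embedding(wav_path, n_mfcc=30, n_fft=2048, hop_length=512):
    """MFCC features for one conversation; parameter values follow Section 4.2."""
    signal, sr = librosa.load(wav_path, sr=None)   # keep the native sampling rate
    mfcc = librosa.feature.mfcc(y=signal, sr=sr, n_mfcc=n_mfcc,
                                n_fft=n_fft, hop_length=hop_length)
    return mfcc.T                                  # shape: (time_frames, n_mfcc)

def video_embedding(frames, facenet_embedder):
    """Per-frame face embeddings; `facenet_embedder` is an assumed pre-trained
    FaceNet model whose `embeddings(images)` method returns one vector per image."""
    return np.stack([facenet_embedder.embeddings(frame[np.newaxis])[0]
                     for frame in frames])         # shape: (time_frames, embed_dim)
```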

3.2.2. Network Formulation

The model operates temporally, at the sampling rate of the fMRI recordings (1.2 s). Since our data are organized into 1 min conversations, the model provides several predictions for each conversation by considering a sequence of past frames from the multimodal signals over a period τ (6 s, i.e., 5 time steps). This period of 6 s corresponds approximately to the maximum length of the activation window of the hemodynamic response function, which characterizes the Blood Oxygen Level-Dependent (BOLD) signal used to measure fMRI brain activity and peaks between 3 and 5 s after a triggering stimulus [43]. Formally, the function of our network can be expressed by Equation (1):
Y_t = F(X_v^T, X_a^T, X_b^T) + ε_t.   (1)
where F is the function of the model and Y_t represents the target variable (brain area activation). Each input X_i^T, i ∈ {v, a, b}, is represented as a 2D temporal tensor in R^(T×d_i), where T is the sequence length and d_i is the embedding dimension of modality i. The term ε_t represents the error vector of the model.
The input variable X_i^T is defined as the sequence {X_i}_(t−τ:t), whose elements correspond to the input values recorded between t − τ and t. Prior to the embedding layer, all input signals cover the same duration but have different sampling frequencies. Next, an embedding layer L_emb,i is applied to each input X_i^T to unify their representations:
[I_v, I_a, I_b] = L_emb[X_v^T, X_a^T, X_b^T] = [L_emb,v(X_v^T), L_emb,a(X_a^T), L_emb,b(X_b^T)].
After the embedding, a sampling block is applied to resample the obtained features along the time axis, ensuring that all features have the same number of time steps T. Separate Transformer encoders are then applied to the sequences of each embedded modality, such that
[H_v^layer, H_a^layer, H_b^layer] = [Encoder(I_v), Encoder(I_a), Encoder(I_b)].
The outputs from the three modalities are then concatenated using feature-level fusion. This process is expressed in Equation (2), where H represents the output of the concatenated layers.
H = Concat[H_v^layer, H_a^layer, H_b^layer].   (2)
Finally, a fully connected layer is applied to the concatenated features, Ŷ_t = FC(H), where the output layer uses a linear activation function and the Mean Squared Error (MSE) is adopted as the loss function.
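To make this temporal formulation concrete, the following sketch pairs each target Y_t with the T preceding multimodal frames, assuming that every modality has already been resampled to the fMRI repetition time (1.2 s), so that τ = 6 s corresponds to T = 5 steps. The array names and shapes are illustrative assumptions.

```python
import numpy as np

T = 5   # past time steps per sample (τ = 6 s ≈ 5 × 1.2 s)

def make_sequences(features, bold, T=T):
    """Pair each BOLD sample Y_t with the T preceding frames of one modality.

    features: array of shape (n_steps, d_i), already resampled to the 1.2 s grid.
    bold:     array of shape (n_steps, 17) with the target ROI activations.
    Returns X with shape (n_samples, T, d_i) and Y with shape (n_samples, 17).
    """
    X, Y = [], []
    for t in range(T, len(bold)):
        X.append(features[t - T:t])   # the window {X_i}_(t-τ:t)
        Y.append(bold[t])             # the target Y_t
    return np.asarray(X), np.asarray(Y)

# Example: a 60 s conversation sampled every 1.2 s gives 50 steps per modality.
audio_feats = np.random.rand(50, 30)   # placeholder audio frames
bold_sig = np.random.rand(50, 17)      # placeholder ROI activations
X_audio, Y = make_sequences(audio_feats, bold_sig)   # shapes (45, 5, 30) and (45, 17)
```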

4. Experiments and Results

In this section, we describe the dataset used and provide details of the experimental setup. Following this, we present the results for the human–human interaction (HHI) and human–robot interaction (HRI) experiments separately, facilitating a comprehensive comparative analysis.

4.1. Dataset Description

The corpus utilized in this study was collected during an fMRI experiment, as described in [12], where the authors detail the acquisition process of the conversational signals. In addition to the conversational signals, this study also analyzed blood pulse, a physiological signal recorded during the experiment with a photoplethysmography device placed on the tip of the participant’s left index finger. The device was wirelessly connected via Bluetooth, and data were continuously acquired at a frequency of 200 Hz. Synchronization of the recording was inherent to the scanner’s procedures, aligning it with the time intervals of the fMRI recordings. The corpus comprises 25 participants, each completing four sessions. In each session, participants engage in six conversations of 60 s each: three with a human and three with a conversational robot, alternately. The brain areas studied are listed in Table 1.
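As one plausible preprocessing step (an assumption, not necessarily the authors’ exact procedure), the 200 Hz photoplethysmograph signal can be aligned with the fMRI sampling grid by averaging the samples falling within each repetition time; at a repetition time of 1.2 s, a 60 s conversation corresponds to 50 volumes.

```python
import numpy as np

def align_blood_pulse(pulse, fs=200.0, tr=1.2, n_volumes=50):
    """Average the blood pulse signal within each fMRI repetition time.

    pulse: 1D array sampled at `fs` Hz, synchronized with the scanner.
    Returns one value per fMRI volume (50 volumes for a 60 s conversation).
    """
    samples_per_tr = int(round(fs * tr))              # 240 samples per volume
    usable = samples_per_tr * n_volumes
    chunks = pulse[:usable].reshape(n_volumes, samples_per_tr)
    return chunks.mean(axis=1)

# Example with a placeholder 60 s recording at 200 Hz.
pulse = np.random.rand(12000)
pulse_per_volume = align_blood_pulse(pulse)           # shape: (50,)
```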
The selected areas are directly associated with positive and negative emotions. These include the insular regions in both the left and right hemispheres, as well as regions within the amygdala and hypothalamus. These regions, with the notable exception of the hypothalamus mask taken from [42], are identified using the Brainnetome atlas [44]. The amygdala consists of medial and lateral parts, while the insular cortex is subdivided into ventral and dorsal agranular, dysgranular, granular, and hypergranular areas. The hypothalamus is an important region for homeostasis, yet it is not included in the Brainnetome atlas.

4.2. Experimental Setup

The experimental design employed to assess the proposed deep learning architecture is outlined below.
Architecture details: The first layer of the proposed network is designed for multimodal embedding. A total of 645 features are extracted from the input video using FaceNet, and 30 Mel filter bank features are extracted from the input audio, with a window length of 2048 and a hop length of 512. The features are then sequenced (with timestep = 5, as described in Section 3.2.2) before being encoded by the second network layer. The Transformer encoders consist of three modality-specific attention blocks, each incorporating a multi-head attention layer with a key dimension of 64. These blocks differ in the number of attention heads and dropout rates: the audio block employs 16 heads with a dropout rate of 0.1, the video block employs 32 heads with a dropout rate of 0.15, and the blood pulse block employs 8 heads with a dropout rate of 0.1. Layer normalization with ε = 10⁻⁶ is applied following each attention block. Once the individual modalities are encoded, a late-fusion step unifies the three output features into a single representation. This fused representation is then fed into three fully connected layers of sizes 512, 256, and 64, each using ReLU activation, with dropout rates of 0.6 and 0.4. Finally, a linear output layer generates 17 values to predict the continuous BOLD signal. Figure 2 illustrates the overall architecture.
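The sketch below reproduces this architecture in tf.keras with the hyperparameters listed above (key dimension 64; 16/32/8 attention heads; dropout rates 0.1/0.15/0.1; layer normalization with ε = 10⁻⁶; dense layers of 512, 256, and 64 units; 17 linear outputs). The temporal pooling before fusion, the assignment of the 0.6 and 0.4 dropout rates to the first two dense layers, the blood pulse feature dimension, and the omission of the feed-forward sub-layer of a full Transformer encoder are simplifying assumptions.

```python
from tensorflow.keras import layers, Model

T = 5  # time steps per sample (Section 3.2.2)

def encoder_branch(seq_len, feat_dim, num_heads, dropout, name):
    """One modality-specific attention block: multi-head self-attention with a
    residual connection and layer normalization, followed by temporal pooling."""
    inp = layers.Input(shape=(seq_len, feat_dim), name=name)
    attn = layers.MultiHeadAttention(num_heads=num_heads, key_dim=64,
                                     dropout=dropout)(inp, inp)
    x = layers.LayerNormalization(epsilon=1e-6)(layers.Add()([inp, attn]))
    x = layers.GlobalAveragePooling1D()(x)   # pooling choice is an assumption
    return inp, x

# Feature dimensions: 30 MFCCs and 645 FaceNet features (as stated above);
# the blood pulse is treated as a 1-D series per time step (assumption).
audio_in, audio_h = encoder_branch(T, 30,  num_heads=16, dropout=0.10, name="audio")
video_in, video_h = encoder_branch(T, 645, num_heads=32, dropout=0.15, name="video")
pulse_in, pulse_h = encoder_branch(T, 1,   num_heads=8,  dropout=0.10, name="blood_pulse")

h = layers.Concatenate()([audio_h, video_h, pulse_h])    # late (feature-level) fusion
h = layers.Dropout(0.6)(layers.Dense(512, activation="relu")(h))
h = layers.Dropout(0.4)(layers.Dense(256, activation="relu")(h))
h = layers.Dense(64, activation="relu")(h)
out = layers.Dense(17, activation="linear")(h)           # one value per brain ROI

model = Model([audio_in, video_in, pulse_in], out)
```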
Hyperparameters: For all evaluated models, the number of training iterations is set to 50, with a batch size of 64. The Adam optimization algorithm is used with β1 = 0.9, β2 = 0.98, and an initial learning rate of α = 10⁻⁴. In our model, the value of the parameter γ in the normalization layer is fixed at 10⁻⁶.
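A corresponding training configuration could look as follows, where model is the network from the previous sketch and the input tensors are the windowed sequences and ROI targets described in Section 3.2.2 (the variable names are illustrative).

```python
from tensorflow.keras import optimizers

# Hyperparameters from this section: Adam with beta_1 = 0.9, beta_2 = 0.98,
# learning rate 1e-4, 50 training iterations, batch size 64; MSE loss (Section 3.2.2).
model.compile(optimizer=optimizers.Adam(learning_rate=1e-4, beta_1=0.9, beta_2=0.98),
              loss="mse")
model.fit([X_audio_train, X_video_train, X_pulse_train], Y_train,
          validation_data=([X_audio_val, X_video_val, X_pulse_val], Y_val),
          epochs=50, batch_size=64)
```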
Implementation: The dataset is split into training, validation, and test sets with proportions of 70%, 10%, and 20%, respectively. The split was performed by participant, so that each participant’s data belonged to one and only one set (training, validation, or test), preventing data leakage across sets. All models were implemented using Google Colab with GPU support, and the preprocessed data and code repository will be made available upon request.
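A participant-wise split of this kind can be obtained, for example, with scikit-learn’s GroupShuffleSplit; this is an illustrative sketch of the splitting logic, not the authors’ released code.

```python
import numpy as np
from sklearn.model_selection import GroupShuffleSplit

def participant_wise_split(groups, seed=0):
    """Split sample indices roughly 70/10/20, keeping each participant's data in
    exactly one of the training, validation, or test sets."""
    idx = np.arange(len(groups))
    outer = GroupShuffleSplit(n_splits=1, test_size=0.20, random_state=seed)
    train_val, test = next(outer.split(idx, groups=groups))
    inner = GroupShuffleSplit(n_splits=1, test_size=0.125, random_state=seed)  # 0.125 * 0.80 = 0.10
    tr, va = next(inner.split(train_val, groups=groups[train_val]))
    return train_val[tr], train_val[va], test

# Illustrative participant labels: 25 participants with several windows each.
groups = np.repeat(np.arange(25), 40)
train_idx, val_idx, test_idx = participant_wise_split(groups)
```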

4.3. Evaluated Models

To evaluate our approach, we conduct a comparative analysis with existing baselines and multimodal architectures from the literature. Our study is the first to utilize this specific combination of input signals, including behavioral and peripheral physiological data, to predict a continuous BOLD signal representing brain activity recorded with fMRI. The following models are selected for comparison due to their relevance to the multimodal prediction task, which involves various signals as well as multimodal feature extraction and fusion techniques. Variants of these models have been applied to other tasks, such as the Emotion Recognition in the Wild challenge (EmotiW) [45,46] and Educational Video Summarization (EDUVSUM) [47]. To compare our proposal with the selected models, we adapted and reconfigured them to suit our specific task. They are described as follows:
  • MULTILAYER PERCEPTRON (MLP): A baseline model consisting of two hidden layers, with dropout and ReLU activation applied after each layer. A third layer with a linear activation function and 17 output neurons produces the model’s targets. The dimensionality of the hidden layers is treated as a hyperparameter to be tuned.
  • MODEL 1 [27]: This model applies MFCCs and DenseFace to extract acoustic and visual features, respectively. It employs LSTM networks with early fusion to represent the hidden multimodal features and feed-forward layers to predict emotions from seven labels (anger, disgust, fear, happiness, neutral, sad, and surprise). Originally, it was trained on a dataset of short emotional video clips taken from movies and TV shows.
  • MODEL 2 [31]: This approach integrates audio, video, and text recorded from video frames with speech and subtitles. It incorporates a CNN network to extract visual features, evaluated using three classical classification backbones (VGGNet, ResNet, and GoogleNet). It also incorporates Bidirectional Encoder Representations from Transformers (BERTs) and spectral analysis to extract features from text and audio, respectively. Finally, a fusion step uses Bidirectional LSTM networks to capture the temporal dependencies and combine the extracted hidden features.
  • MODEL 3 [30]: The proposed model combines CNN and LSTM, along with self-attention, to extract acoustic features from speech signals. Simultaneously, a BiLSTM network is employed to extract textual features from transcripts. The extracted features are then fused and fed into a deep fully connected network to predict the probabilities of the four target emotions (sad, excited, neutral, and angry).

4.4. Statistical Test

In this section, we evaluate the stability of the performance improvements using statistical significance testing, comparing key settings at a significance level of α = 0.05. The analysis was conducted using the Almost Stochastic Order (ASO) test [48,49], as implemented in the deep-significance Python library [50].
This test is commonly used to compare the performance scores of two machine learning models without making any assumptions about the distributions of the scores. Given the performance scores of two models, M1 and M2, over multiple runs, the ASO test computes a test statistic ε_min, which quantifies how far model M1 is from being significantly better than model M2, for a predefined significance level α ∈ (0, 1). When ε_min = 0.0, it can be concluded that model M1 is stochastically dominant over model M2. If ε_min < 0.5, model M1 is said to almost stochastically dominate model M2. Conversely, when ε_min = 1.0, model M2 stochastically dominates model M1. Finally, for ε_min = 0.5, no stochastic order can be established between M1 and M2.
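A minimal usage sketch with the deep-significance library is given below; the per-run scores are synthetic placeholders rather than results from this study, and since lower MSE is better while ASO assumes higher scores are better, the errors are negated before the test. The aso() call follows the library’s documented interface, which may differ across versions.

```python
import numpy as np
from deepsig import aso  # deep-significance library [50]

# Synthetic per-run test MSEs for two hypothetical models (placeholders only).
rng = np.random.default_rng(0)
mse_model_1 = rng.normal(loc=0.031, scale=0.001, size=10)
mse_model_2 = rng.normal(loc=0.033, scale=0.001, size=10)

# Negate the errors so that larger scores mean better performance, as ASO expects.
eps_min = aso(-mse_model_1, -mse_model_2, seed=42)
print(f"eps_min = {eps_min:.3f}; model 1 almost stochastically dominates model 2 if eps_min < 0.5")
```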

4.5. Results

The evaluated models were trained using the dataset described in Section 4.1, with the average MSE as the prediction metric. The performance of the proposed model is compared to the state of the art in Figure 3. Additionally, the predictions of BOLD responses in each region are presented in Table 2 and Table 3 for human–human and human–robot interactions, respectively. Furthermore, Table 4 displays the performance of our model (AVB) compared to models based on one or two modalities, following an evaluation protocol similar to that used in [51].

5. Discussion

We investigate three principal aspects of our findings: the comparison of the proposed model with previously evaluated models, an ablation study detailing the contribution of each modality, and the broader implications for affective neuroscience. In general, the model demonstrates robust predictive performance in emotion-related brain regions, highlighting the benefits of integrating multiple signals. However, certain findings emphasize the complexity of predictive modeling in social contexts, indicating that different analytical methods may offer complementary perspectives. We address these elements in the following subsections.

5.1. Model Comparison

According to Figure 3, which illustrates the average MSE across regions of interest for both human and robot conditions, and as confirmed by the ASO test comparisons in Figure 4a,b, the proposed model AVB (column) almost stochastically dominates all evaluated models, with ε_min < 0.5. This highlights its enhanced predictive performance compared to the other evaluated models, particularly the MLP, which uses a basic fusion approach with fully connected layers. Table 2 and Table 3 demonstrate that, in more than half of the regions investigated, the proposed model achieves the lowest MSE during interactions with both humans and robots, with a higher frequency observed for human interactions than for interactions with robots.
The proposed model provided better predictions for the majority of brain regions selected for their involvement in emotion processing, with particularly strong results for the hypothalamus in both conditions (human–human and human–robot). This is particularly significant given the hypothalamus’s role not only in emotions but also in maintaining homeostasis, which refers to the responses of the autonomic and peripheral nervous systems, often considered markers of emotional arousal. Additionally, the proposed model outperformed other models in both conditions within the more anterior insular regions, which are more strongly associated with homeostatic functions compared to the posterior insular regions, which are generally linked to the cognitive or motor dimensions of behavior.
In contrast, the improvement of the proposed model over others that were tested is not clearly evident in the amygdala. Although speculative, the difference between the model’s clear improvements in the hypothalamus and anterior insula and the mixed results in other regions may be related to their distinct roles in emotional processing. The paradigm indicates that we are investigating emotion processing in the context of social interactions, rather than emotions per se.

5.2. Ablation Study

The ablation study results, presented in Table 4, demonstrate that the proposed approach based on three modalities outperformed models employing only one or two modalities. Remarkably, the model relying solely on blood pulse achieved performance closely comparable to the model incorporating all three modalities: 0.0321 vs. 0.0317 for HHI, and 0.0319 vs. 0.0314 for HRI. In contrast, the other combinations yielded an MSE above 0.0328 (above 0.033 when models including the blood pulse are excluded).
These findings are illustrated in Figure 4, which reports the ASO test statistics. The unimodal blood pulse model (Model B), whose average MSE is very close to that of the proposed model AVB, was compared to it using the ASO test. This evaluation confirms that model AVB (column) almost stochastically dominates model B (row), with ε = 0.037 and ε = 0.01 in Figure 4c and Figure 4d, respectively. Although the proposed model demonstrates superior overall predictive accuracy, model B outperforms all other single- and two-modality models. Indeed, as shown in Figure 4c,d, model B (row) stochastically dominates every other model, with ε = 1.0. This strongly indicates that blood pulse represents a significant modality for understanding brain responses, extending beyond its conventional treatment as a source of artifacts (e.g., pulsations in the cerebrospinal fluid).

5.3. Implications for Affective Neuroscience

Our interpretation of the results follows [52], where the authors show a significant correlation between facial expressions of happiness by the interlocutor during conversation and activity in the insular and hypothalamic regions of the recorded participants’ brains. Figure 5, adapted from [52], illustrates the localization of these brain areas, which were also used in the current exploration. Therefore, it is known in advance which of these brain areas respond to at least one of the features used as a predictor in the current exploration, namely, the facial emotion expressed by the interlocutor (see the results illustrated in Figure 2 of [52]).
Irrespective of the brain area, the obtained MSEs are comparable, around 0.03, implying that, for all models, the approach presented here predicts the brain response in regions involved in emotional processing with similar accuracy. An important remark is the absence of a clear effect of the two experimental variables (type of interaction and region of interest) on prediction accuracy. For example, we could have expected differences in prediction to mirror the differences in BOLD response reported in [52] for all amygdala regions except the left medial amygdala, but we found the opposite: the lowest MSE (and the largest difference between the two types of interaction) was obtained for the human interlocutor in the left medial amygdala. Also, counterintuitively, predictions improved for the robot agent in the hypothalamus, even though the response was reported as higher in previous experiments [12]. Similar improvements were seen in other brain regions (e.g., rdA, left and right vD, or even rLA). Together, these differences between the current approach, which predicts brain activity, and previous approaches, which rely on general linear models to quantify the importance of experimental factors in explaining the brain response, suggest that the two approaches should be considered complementary. One possibility is that emotional processes are blunted in human–robot interactions so that some, but not all, of them become easier to predict. Explaining this complementarity is beyond the reach of the current study.
The ablation investigations also provide material for developing a better understanding of the neurobiology of emotional processing in natural interactions. First, all but one case of ablation yielded an increased MSE, which means that the three types of multimodal information used in the prediction were truly complementary. Also noteworthy is that the unimodal model using the blood pulse (B) yielded the second-best score, after the full model, confirming that internal physiological states reflect emotional processing to a similar extent as observable behaviors [13]. The next step will be to dissect these effects in individual regions of interest instead of the average across regions, given the variability highlighted in the previous paragraph.

6. Conclusions

In this study, we introduced a new methodology to predict fMRI responses in brain areas involved in positive and negative emotions. Neither this approach nor the dataset used had been explored in the previous literature related to the objectives of the present study. To evaluate our method, we adapted existing face-based multimodal emotion recognition architectures from the literature to our task. The proposed multimodal deep learning network is simple in design, but effective, as demonstrated by the comparative results.
In this study, we mainly investigated the predictive power of multimodal conversational signals, including visual information from interlocutors, for predicting emotion processing linked to the brain activity of participants. It is important to note that one should not necessarily expect a link between the interlocutor’s expression and the participants’ own emotional states. Therefore, a null hypothesis could be the following: “The interlocutor’s expression should have no predictive power over the individual’s internal emotional states.” However, predictive power can be defended on the basis of emotional contagion, a generalization to the domain of emotions of the role of mirror neurons, which share the same code for first- and third-person representations of internal states [53].
Finally, another important question concerns the link between the activation of the studied brain areas and emotions. The main limitation here is the ability to objectively label participants’ emotions, which requires a specific experimental protocol. To address this issue, we conducted a study analyzing the predictability of positive and negative emotions from the activation of these areas, compared to a model predicting emotions from audio. We evaluated two architectures: one using audio input [54] and another using fMRI signals to predict emotions. Emotions were classified as neutral, positive, or negative from text, depending on the conversational context, using the same dataset as in the current work. Surprisingly, we found that fMRI signals predicted emotions better than audio, indicating that they better encode internal emotional states. In order to keep each contribution focused, we decided to explore this idea in a future study, investigating this possibility in detail and constructing a deep learning model to decode emotions from fMRI brain signals. As a result, we could integrate it with the proposed model to design an end-to-end encoder–decoder that connects multimodal conversations to fMRI and emotions, serving two objectives: predicting fMRI activity and linking the prediction of regions of interest with emotions.

Author Contributions

Conceptualization, L.K. and Y.H.; methodology, L.K. and Y.H.; validation, L.K., Y.H. and T.C.; formal analysis, L.K.; investigation, T.C.; resources, Y.H. and A.E.F.S.; data curation, T.C. and Y.H.; writing—original draft preparation, L.K.; writing—review and editing, L.K., Y.H., T.C. and A.E.F.S.; visualization, L.K.; supervision, Y.H. and A.E.F.S.; project administration, Y.H. and A.E.F.S. All authors have read and agreed to the published version of the manuscript.

Funding

This research received no external funding.

Institutional Review Board Statement

This analysis was conducted on data recorded as part of a corpus of synchronous fMRI and behavioral correlates of natural conversations. The recordings were collected in accordance with the Helsinki Declaration for Ethics in Medical Research, were promoted by CNRS (National French Research Center, N°16-004), and were approved by the local Ethics Committee (Comité de Protection de la Personne Sud-Méditerranée I, n° 2016-A01008-43).

Informed Consent Statement

Informed consent was obtained from all subjects involved in the study.

Data Availability Statement

The original data presented in the study are openly available in [12].

Conflicts of Interest

The authors declare that they have no known competing financial interests or personal relationships that could have appeared to influence the work reported in this paper.

References

  1. Fu, T.; Gao, S.; Zhao, X.; Wen, J.R.; Yan, R. Learning towards conversational AI: A survey. AI Open 2022, 3, 14–28. [Google Scholar] [CrossRef]
  2. Spezialetti, M.; Placidi, G.; Rossi, S. Emotion Recognition for Human-Robot Interaction: Recent Advances and Future Perspectives. Front. Robot. AI 2020, 7, 532279. [Google Scholar] [CrossRef]
  3. Bouhlal, M.; Aarika, K.; Abdelouahid, R.A.; Elfilali, S.; Benlahmar, E. Emotions recognition as innovative tool for improving students’ performance and learning approaches. Procedia Comput. Sci. 2020, 175, 597–602. [Google Scholar] [CrossRef]
  4. Johansson, R.; Skantze, G.; Jönsson, A. A Psychotherapy Training Environment with Virtual Patients Implemented Using the Furhat Robot Platform. In Intelligent Virtual Agents; Springer International Publishing: Cham, Switzerland, 2017; Volume 10498, pp. 184–187. [Google Scholar] [CrossRef]
  5. Barros, P.; Weber, C.; Wermter, S. Emotional expression recognition with a cross-channel convolutional neural network for human-robot interaction. In Proceedings of the 2015 IEEE-RAS 15th International Conference on Humanoid Robots (Humanoids), Seoul, Republic of Korea, 3–5 November 2015; pp. 582–587. [Google Scholar] [CrossRef]
  6. Liu, W.; Qiu, J.L.; Zheng, W.L.; Lu, B.L. Comparing Recognition Performance and Robustness of Multimodal Deep Learning Models for Multimodal Emotion Recognition. IEEE Trans. Cogn. Dev. Syst. 2022, 14, 715–729. [Google Scholar] [CrossRef]
  7. Zhang, S.; Yang, Y.; Chen, C.; Zhang, X.; Leng, Q.; Zhao, X. Deep learning-based multimodal emotion recognition from audio, visual, and text modalities: A systematic review of recent advancements and future prospects. Expert Syst. Appl. 2024, 237, 121692. [Google Scholar] [CrossRef]
  8. Dixit, C.; Satapathy, S.M. Deep CNN with late fusion for real time multimodal emotion recognition. Expert Syst. Appl. 2024, 240, 122579. [Google Scholar] [CrossRef]
  9. Xu, C.; Du, Y.; Wang, J.; Zheng, W.; Li, T.; Yuan, Z. A joint hierarchical cross attention graph convolutional network for multimodal facial expression recognition. Comput. Intell. 2024, 40, e12607. [Google Scholar] [CrossRef]
  10. Bilotti, U.; Bisogni, C.; De Marsico, M.; Tramonte, S. Multimodal Emotion Recognition via Convolutional Neural Networks: Comparison of different strategies on two multimodal datasets. Eng. Appl. Artif. Intell. 2024, 130, 107708. [Google Scholar] [CrossRef]
  11. Pessoa, L. A Network Model of the Emotional Brain. Trends Cogn. Sci. 2017, 21, 357–371. [Google Scholar] [CrossRef]
  12. Rauchbauer, B.; Nazarian, B.; Bourhis, M.; Ochs, M.; Prévot, L.; Chaminade, T. Brain activity during reciprocal social interaction investigated using conversational robots as control condition. Phil. Trans. R. Soc. B 2019, 374, 20180033. [Google Scholar] [CrossRef]
  13. Damasio, A.R. Emotion in the perspective of an integrated nervous system. Brain Res. Rev. 1998, 26, 83–86. [Google Scholar] [CrossRef]
  14. Liang, P.P.; Zadeh, A.; Morency, L.P. Multimodal Local-Global Ranking Fusion for Emotion Recognition. In Proceedings of the 20th ACM International Conference on Multimodal Interaction, Boulder, CO, USA, 16–20 October 2018; ACM: New York, NY, USA, 2018; pp. 472–476. [Google Scholar] [CrossRef]
  15. Ngiam, J.; Khosla, A.; Kim, M.; Nam, J.; Lee, H.; Ng, A.Y. Multimodal deep learning. In Proceedings of the 28th International Conference on International Conference on Machine Learning, ICML’11, Madison, WI, USA, 28 June 2011; pp. 689–696. [Google Scholar]
  16. Srivastava, N.; Salakhutdinov, R.R. Multimodal Learning with Deep Boltzmann Machines. In Advances in Neural Information Processing Systems; Pereira, F., Burges, C., Bottou, L., Weinberger, K., Eds.; Curran Associates, Inc.: Red Hook, NY, USA, 2012; Volume 25. [Google Scholar]
  17. Gao, J.; Li, P.; Chen, Z.; Zhang, J. A Survey on Deep Learning for Multimodal Data Fusion. Neural Comput. 2020, 32, 829–864. [Google Scholar] [CrossRef] [PubMed]
  18. Akkus, C.; Chu, L.; Djakovic, V.; Jauch-Walser, S.; Koch, P.; Loss, G.; Marquardt, C.; Moldovan, M.; Sauter, N.; Schneider, M.; et al. Multimodal Deep Learning. arXiv 2023, arXiv:2301.04856. [Google Scholar] [CrossRef]
  19. Pan, B.; Hirota, K.; Jia, Z.; Dai, Y. A review of multimodal emotion recognition from datasets, preprocessing, features, and fusion methods. Neurocomputing 2023, 561, 126866. [Google Scholar] [CrossRef]
  20. Sosea, T.; Caragea, C. CancerEmo: A Dataset for Fine-Grained Emotion Detection. In Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing (EMNLP). Association for Computational Linguistics, Online, 16–20 November 2020; pp. 8892–8904. [Google Scholar] [CrossRef]
  21. Li, J.; Dong, Z.; Lu, S.; Wang, S.J.; Yan, W.J.; Ma, Y.; Liu, Y.; Huang, C.; Fu, X. CAS(ME)3: A Third Generation Facial Spontaneous Micro-Expression Database with Depth Information and High Ecological Validity. IEEE Trans. Pattern Anal. Mach. Intell. 2022, 45, 2782–2800. [Google Scholar] [CrossRef]
  22. Middya, A.I.; Nag, B.; Roy, S. Deep learning based multimodal emotion recognition using model-level fusion of audio–visual modalities. Knowl.-Based Syst. 2022, 244, 108580. [Google Scholar] [CrossRef]
  23. Deschamps-Berger, T.; Lamel, L.; Devillers, L. Investigating Transformer Encoders and Fusion Strategies for Speech Emotion Recognition in Emergency Call Center Conversations. In Proceedings of the International Conference on Multimodal Interaction, New York, NY, USA, 7–11 November 2022; ACM: New York, NY, USA, 2022; pp. 144–153. [Google Scholar] [CrossRef]
  24. Ranchordas, A.; Araujo, H.J. VISAPP 2008: Proceedings of the Third International Conference on Vision Theory and Applications; Funchal, Madeira, Portugal, January 22–25, 2008. In VISIGRAPP 2008; Ranchordas, A.N., Ed.; INSTICC Press: Lisboa, Portugal, 2008. [Google Scholar]
  25. Etienne, C.; Fidanza, G.; Petrovskii, A.; Devillers, L.; Schmauch, B. CNN+LSTM Architecture for Speech Emotion Recognition with Data Augmentation. In Proceedings of the Workshop on Speech, Music and Mind (SMM 2018), Hyderabad, India, 1 September 2018; pp. 21–25. [Google Scholar] [CrossRef]
  26. Chen, Z.; Lin, M.; Wang, Z.; Zheng, Q.; Liu, C. Spatio-temporal representation learning enhanced speech emotion recognition with multi-head attention mechanisms. Knowl.-Based Syst. 2023, 281, 111077. [Google Scholar] [CrossRef]
  27. Wang, S.; Wang, W.; Zhao, J.; Chen, S.; Jin, Q.; Zhang, S.; Qin, Y. Emotion recognition with multimodal features and temporal models. In Proceedings of the 19th ACM International Conference on Multimodal Interaction, Glasgow, UK, 13–17 November 2017; ACM: New York, NY, USA, 2017; pp. 598–602. [Google Scholar] [CrossRef]
  28. Krishna, D.N.; Patil, A. Multimodal Emotion Recognition Using Cross-Modal Attention and 1D Convolutional Neural Networks. In Proceedings of the Interspeech 2020, ISCA, Shanghai, China, 25–29 October 2020; pp. 4243–4247. [Google Scholar] [CrossRef]
  29. Schroff, F.; Kalenichenko, D.; Philbin, J. FaceNet: A Unified Embedding for Face Recognition and Clustering. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), Boston, MA, USA, 7–12 June 2015. [Google Scholar]
  30. Cai, L.; Hu, Y.; Dong, J.; Zhou, S. Audio-Textual Emotion Recognition Based on Improved Neural Networks. Math. Probl. Eng. 2019, 2019, 2593036. [Google Scholar] [CrossRef]
  31. Ghauri, J.A.; Hakimov, S.; Ewerth, R. Classification of Important Segments in Educational Videos using Multimodal Features. arXiv 2020, arXiv:2010.13626. [Google Scholar] [CrossRef]
  32. Poria, S.; Cambria, E.; Hazarika, D.; Mazumder, N.; Zadeh, A.; Morency, L.P. Multi-level Multiple Attentions for Contextual Multimodal Sentiment Analysis. In Proceedings of the 2017 IEEE International Conference on Data Mining (ICDM), New Orleans, LA, USA, 18–21 November 2017; pp. 1033–1038. [Google Scholar] [CrossRef]
  33. Tsai, Y.H.H.; Bai, S.; Liang, P.P.; Kolter, J.Z.; Morency, L.P.; Salakhutdinov, R. Multimodal Transformer for Unaligned Multimodal Language Sequences. In Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics. Association for Computational Linguistics, Florence, Italy, 28 July–2 August 2019; pp. 6558–6569. [Google Scholar] [CrossRef]
  34. Akbari, H.; Yuan, L.; Qian, R.; Chuang, W.H.; Chang, S.F.; Cui, Y.; Gong, B. VATT: Transformers for Multimodal Self-Supervised Learning from Raw Video, Audio and Text. In Advances in Neural Information Processing Systems; Ranzato, M., Beygelzimer, A., Dauphin, Y., Liang, P., Vaughan, J.W., Eds.; Curran Associates, Inc.: Red Hook, NY, USA, 2021; Volume 34, pp. 24206–24221. [Google Scholar]
  35. Lindquist, K.A.; Wager, T.D.; Kober, H.; Bliss-Moreau, E.; Barrett, L.F. The brain basis of emotion: A meta-analytic review. Behav. Brain. Sci. 2012, 35, 121–143. [Google Scholar] [CrossRef]
  36. Aybek, S.; Nicholson, T.R.; O’Daly, O.; Zelaya, F.; Kanaan, R.A.; David, A.S. Emotion-Motion Interactions in Conversion Disorder: An fMRI Study. PLoS ONE 2015, 10, e0123273. [Google Scholar] [CrossRef] [PubMed]
  37. Tang, J.; LeBel, A.; Jain, S.; Huth, A.G. Semantic reconstruction of continuous language from non-invasive brain recordings. Nat. Neurosci. 2023, 26, 858–866. [Google Scholar] [CrossRef] [PubMed]
  38. Rogers, J. Non-invasive continuous language decoding. Nat. Rev. Neurosci. 2023, 24, 393. [Google Scholar] [CrossRef] [PubMed]
  39. Takagi, Y.; Nishimoto, S. High-Resolution Image Reconstruction with Latent Diffusion Models From Human Brain Activity. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), Vancouver, BC, Canada, 18–22 June 2023; pp. 14453–14463. [Google Scholar]
  40. Zhang, J.; Li, C.; Liu, G.; Min, M.; Wang, C.; Li, J.; Wang, Y.; Yan, H.; Zuo, Z.; Huang, W.; et al. A CNN-transformer hybrid approach for decoding visual neural activity into text. Comput. Methods Programs Biomed. 2022, 214, 106586. [Google Scholar] [CrossRef] [PubMed]
  41. Luo, S.; Rabbani, Q.; Crone, N.E. Brain-Computer Interface: Applications to Speech Decoding and Synthesis to Augment Communication. Neurotherapeutics 2022, 19, 263–273. [Google Scholar] [CrossRef] [PubMed]
  42. Wolfe, F.H.; Auzias, G.; Deruelle, C.; Chaminade, T. Focal atrophy of the hypothalamus associated with third ventricle enlargement in autism spectrum disorder. NeuroReport 2015, 26, 1017–1022. [Google Scholar] [CrossRef]
  43. Gössl, C.; Fahrmeir, L.; Auer, D.P. Bayesian modeling of the hemodynamic response function in BOLD fMRI. NeuroImage 2001, 14, 140–148. [Google Scholar] [CrossRef]
  44. Fan, L.; Li, H.; Zhuo, J.; Zhang, Y.; Wang, J.; Chen, L.; Yang, Z.; Chu, C.; Xie, S.; Laird, A.R.; et al. The Human Brainnetome Atlas: A New Brain Atlas Based on Connectional Architecture. Cereb. Cortex 2016, 26, 3508–3526. [Google Scholar] [CrossRef]
  45. Sun, M.; Li, J.; Feng, H.; Gou, W.; Shen, H.; Tang, J.; Yang, Y.; Ye, J. Multi-modal Fusion Using Spatio-temporal and Static Features for Group Emotion Recognition. In Proceedings of the 2020 International Conference on Multimodal Interaction, Online, 25–29 October 2020; ACM: New York, NY, USA, 2020; pp. 835–840. [Google Scholar] [CrossRef]
  46. Zhu, B.; Lan, X.; Guo, X.; Barner, K.E.; Boncelet, C. Multi-rate Attention Based GRU Model for Engagement Prediction. In Proceedings of the 2020 International Conference on Multimodal Interaction, Online, 25–29 October 2020; ACM: New York, NY, USA, 2020; pp. 841–848. [Google Scholar] [CrossRef]
  47. Oliveira, L.M.R.; Shuen, L.C.; Da Cruz, A.K.B.S.; Soares, C.D.S. Summarization of Educational Videos with Transformers Networks. In Proceedings of the 29th Brazilian Symposium on Multimedia and the Web, Online, 25–29 October 2023; ACM: New York, NY, USA, 2023; pp. 137–143. [Google Scholar] [CrossRef]
  48. Dror, R.; Shlomov, S.; Reichart, R. Deep Dominance - How to Properly Compare Deep Neural Models. In Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics, Florence, Italy, 28 July–2 August 2019; Association for Computational Linguistics: Stroudsburg, PA, USA, 2019; pp. 2773–2785. [Google Scholar] [CrossRef]
  49. Del Barrio, E.; Cuesta-Albertos, J.A.; Matrán, C. An Optimal Transportation Approach for Assessing Almost Stochastic Order. In The Mathematics of the Uncertain; Gil, E., Gil, E., Gil, J., Gil, M.Á., Eds.; Series Title: Studies in Systems, Decision and Control; Springer International Publishing: Cham, Switzerland, 2018; Volume 142, pp. 33–44. [Google Scholar] [CrossRef]
  50. Ulmer, D.; Hardmeier, C.; Frellsen, J. deep-significance - Easy and Meaningful Statistical Significance Testing in the Age of Neural Networks. arXiv 2022, arXiv:2204.06815. [Google Scholar] [CrossRef]
  51. Hosseini, S.S.; Yamaghani, M.R.; Poorzaker Arabani, S. Multimodal modelling of human emotion using sound, image and text fusion. SIViP Signal Image Video Process. 2024, 18, 71–79. [Google Scholar] [CrossRef]
  52. Chaminade, T.; Spatola, N. Perceived facial happiness during conversation correlates with insular and hypothalamus activity for humans, not robots. Front. Psychol. 2022, 13, 871676. [Google Scholar] [CrossRef] [PubMed]
  53. Decety, J.; Chaminade, T. When the self represents the other: A new cognitive neuroscience view on psychological identification. Conscious. Cogn. 2003, 12, 577–596. [Google Scholar] [CrossRef] [PubMed]
  54. Alsabhan, W. Human–Computer Interaction with a Real-Time Speech Emotion Recognition with Ensembling Techniques 1D Convolution Neural Network and Attention. Sensors 2023, 23, 1386. [Google Scholar] [CrossRef] [PubMed]
Figure 1. Illustrative schema of the proposed process for predicting brain responses associated with participant emotional states in both human–human and human–robot interactions. During each conversation, multiple signals are recorded, including the participant’s and interlocutor’s audio, the interlocutor’s video, and the participant’s blood pulse. These multimodal inputs are then integrated using deep learning algorithms to predict the continuous BOLD response, reflecting activity in brain regions linked to emotional states.
Figure 2. The architecture of the proposed model. First, an embedding layer converts the input multimodal signals into numerical representations using pre-trained and static models (FaceNet and Mel-spectrogram computation). The resulting multimodal hidden features are then encoded with Transformers. A late-fusion step combines the encoded features from all modalities into a single representation, and a fully connected layer produces the final prediction.
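A minimal PyTorch sketch of the general design described in the caption (per-modality Transformer encoders followed by late fusion and a fully connected prediction layer) is given below. The layer sizes, number of heads, pooling strategy, and output dimensionality are illustrative assumptions rather than the paper's actual hyperparameters, and the FaceNet and Mel-spectrogram embeddings are assumed to be precomputed inputs.

```python
# Minimal late-fusion multimodal Transformer sketch (illustrative, not the
# authors' exact implementation). Video and audio streams are assumed to be
# pre-embedded (FaceNet vectors, Mel-spectrogram frames); blood pulse is a
# 1-D time series. All dimensions are placeholder assumptions.
import torch
import torch.nn as nn


class ModalityEncoder(nn.Module):
    """Projects one modality to a shared width and encodes it with a Transformer."""

    def __init__(self, in_dim: int, d_model: int = 64, n_heads: int = 4, n_layers: int = 2):
        super().__init__()
        self.proj = nn.Linear(in_dim, d_model)
        layer = nn.TransformerEncoderLayer(d_model, n_heads, batch_first=True)
        self.encoder = nn.TransformerEncoder(layer, num_layers=n_layers)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (batch, time, in_dim) -> (batch, d_model) after temporal mean pooling
        return self.encoder(self.proj(x)).mean(dim=1)


class LateFusionRegressor(nn.Module):
    """Encodes video, audio, and blood-pulse streams separately, then fuses them
    and predicts one value per time window for a given region of interest."""

    def __init__(self, video_dim: int = 512, audio_dim: int = 128, pulse_dim: int = 1, d_model: int = 64):
        super().__init__()
        self.video_enc = ModalityEncoder(video_dim, d_model)
        self.audio_enc = ModalityEncoder(audio_dim, d_model)
        self.pulse_enc = ModalityEncoder(pulse_dim, d_model)
        self.head = nn.Linear(3 * d_model, 1)  # fully connected prediction layer

    def forward(self, video, audio, pulse):
        # Late fusion: concatenate the per-modality encodings before prediction.
        fused = torch.cat(
            [self.video_enc(video), self.audio_enc(audio), self.pulse_enc(pulse)], dim=-1
        )
        return self.head(fused)


# Toy forward pass: a batch of 8 windows with 20 time steps per modality.
model = LateFusionRegressor()
y_hat = model(torch.randn(8, 20, 512), torch.randn(8, 20, 128), torch.randn(8, 20, 1))
print(y_hat.shape)  # torch.Size([8, 1])
```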
Figure 3. Average MSE for both HHI and HRI experiments. Our proposed model achieves the lowest MSE in both settings, demonstrating greater accuracy and robustness than the other evaluated models. Although the MLP model performs well in HHI, it exhibits significantly higher errors in HRI, illustrating the influence of task-specific variables on model performance.
Figure 4. ASO test statistics results for model comparisons and modality combinations in HHI and HRI experiments. (a,b) show the ASO matrices for all evaluated models (including our proposed AVB model) in HHI and HRI, respectively. (c,d) present the ASO matrices for all modality combinations (audio, video, and blood pulse) in HHI and HRI, respectively. Results are read from row to column; for example, in HHI (c), our proposed model AVB (row) is stochastically dominant over the AV (audio–video) model (column) with an ε_min of 0.01.
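The ε_min statistics in Figure 4 refer to the Almost Stochastic Order (ASO) test (references [48,49]), for which [50] provides the deep-significance implementation. The sketch below is a hedged illustration of such a pairwise comparison: the two score arrays are hypothetical per-run MSE values, and negating them assumes the implementation's usual higher-is-better convention, which may need adjusting for a different version or setup.

```python
# Illustrative ASO comparison with hypothetical per-run MSE scores.
# Assumes the deepsig package from reference [50]; array values are placeholders.
import numpy as np
from deepsig import aso

mse_avb = np.array([0.0317, 0.0320, 0.0315, 0.0319, 0.0322])  # hypothetical runs (full AVB model)
mse_av = np.array([0.0337, 0.0341, 0.0335, 0.0339, 0.0338])   # hypothetical runs (audio-video model)

# MSE is an error (lower is better), so scores are negated before the test.
# An eps_min close to 0 means the first model is (almost) stochastically dominant.
eps_min = aso(-mse_avb, -mse_av)
print(f"eps_min = {eps_min:.3f}")
```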
Figure 5. Illustration of the regions of interest (ROIs) used for prediction and listed in Table 1. The color code indicates the brain structures: amygdala in yellow, hypothalamus in green, and insula in blue.
Table 2. HHI test set results: average MSE for each region of interest. The bold values highlight the lowest (best) average MSE for that specific region.
| Models | lvA | rvA | ldA | rdA | lvD | rvD | ldD | rdD | ldG | rdG | lH | rH | lMA | rMA | lLA | rLA | Hy |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| CURRENT | **0.0321** | **0.0318** | **0.0323** | **0.0322** | **0.0334** | **0.0324** | **0.028** | **0.0252** | 0.0334 | 0.0328 | 0.0322 | **0.0286** | 0.0256 | 0.0304 | 0.0304 | **0.0326** | **0.0324** |
| MLP | 0.0365 | 0.0356 | 0.0337 | 0.0342 | 0.040 | 0.0360 | 0.0294 | 0.0321 | **0.0321** | 0.0338 | 0.0332 | 0.0311 | 0.0306 | 0.0289 | 0.0287 | 0.0337 | 0.0338 |
| MODEL 1 | 0.0328 | 0.0329 | 0.0329 | 0.0332 | 0.0353 | 0.0337 | 0.0289 | 0.0313 | 0.0338 | 0.0340 | 0.0343 | 0.0297 | 0.0270 | 0.0298 | 0.0296 | 0.0335 | 0.0339 |
| MODEL 2 | 0.0322 | 0.0320 | 0.0326 | 0.0324 | 0.0337 | 0.0331 | 0.0284 | 0.0309 | 0.0328 | **0.0320** | **0.0313** | 0.0287 | **0.0254** | 0.0293 | **0.0285** | 0.0340 | 0.0333 |
| MODEL 3 | 0.0331 | 0.0325 | 0.0327 | 0.0326 | 0.0346 | 0.0337 | 0.0286 | 0.0309 | 0.0326 | 0.0325 | 0.0322 | 0.0289 | **0.0254** | **0.0287** | **0.0285** | 0.0347 | 0.0332 |
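A small pandas sketch of the selection rule stated in the caption (the lowest average MSE per ROI column is marked in bold) is given below; only three ROI columns from Table 2 are reproduced, and the same logic applies to Table 3.

```python
# Identify the best model per ROI (the bolded entries): the row with the
# lowest average MSE in each column. Only a subset of Table 2 is shown.
import pandas as pd

mse_hhi = pd.DataFrame(
    {
        "lvA": [0.0321, 0.0365, 0.0328, 0.0322, 0.0331],
        "rdD": [0.0252, 0.0321, 0.0313, 0.0309, 0.0309],
        "Hy": [0.0324, 0.0338, 0.0339, 0.0333, 0.0332],
    },
    index=["CURRENT", "MLP", "MODEL 1", "MODEL 2", "MODEL 3"],
)

best_per_roi = mse_hhi.idxmin(axis=0)  # model with the lowest MSE in each ROI column
print(best_per_roi)
```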
Table 3. HRI test set results: average MSE for each region of interest. The bold values highlight the lowest (best) average MSE for that specific region.
| Models | lvA | rvA | ldA | rdA | lvD | rvD | ldD | rdD | ldG | rdG | lH | rH | lMA | rMA | lLA | rLA | Hy |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| CURRENT | **0.0342** | **0.0313** | **0.0318** | **0.0298** | 0.0335 | **0.0301** | 0.0286 | **0.0302** | 0.0315 | 0.0348 | 0.0317 | **0.0271** | **0.0290** | 0.0306 | 0.0354 | **0.0318** | **0.0314** |
| MLP | 0.0577 | 0.0499 | 0.0458 | 0.0424 | 0.0414 | 0.0387 | 0.0364 | 0.0320 | 0.0341 | 0.0371 | 0.0309 | 0.0519 | 0.0695 | 0.0258 | 0.0304 | 0.0387 | 0.0396 |
| MODEL 1 | 0.0404 | 0.0361 | 0.0414 | 0.0349 | 0.0350 | 0.0317 | 0.0315 | 0.0310 | **0.0306** | 0.0324 | **0.0304** | 0.0362 | 0.0424 | **0.0242** | 0.0283 | 0.0340 | 0.0355 |
| MODEL 2 | 0.0365 | 0.0329 | 0.0332 | 0.0302 | **0.0330** | 0.0304 | 0.0293 | 0.0303 | 0.0311 | **0.0322** | 0.0313 | 0.0293 | 0.0373 | 0.0257 | 0.0289 | 0.0325 | 0.0316 |
| MODEL 3 | 0.0352 | 0.0324 | 0.0322 | 0.0300 | 0.0336 | 0.0305 | **0.0283** | 0.0306 | 0.0318 | 0.0324 | 0.0309 | 0.0291 | 0.0334 | 0.0257 | **0.0282** | 0.0324 | 0.0317 |
Table 4. Average MSE for all modality combinations (unimodal, bimodal, and the full trimodal AVB input) in the HHI and HRI experiments. The bold values indicate the lowest (best) average MSE for each experiment.
| Models | MSE (HHI) | MSE (HRI) |
|---|---|---|
| AVB (all modalities) | **0.0317** | **0.0314** |
| Bimodal | 0.0337 | 0.0352 |
| Bimodal | 0.0370 | 0.0328 |
| Bimodal | 0.0339 | 0.0367 |
| Unimodal | 0.0347 | 0.0338 |
| Unimodal | 0.0426 | 0.0614 |
| Unimodal | 0.0321 | 0.0319 |
Disclaimer/Publisher’s Note: The statements, opinions and data contained in all publications are solely those of the individual author(s) and contributor(s) and not of MDPI and/or the editor(s). MDPI and/or the editor(s) disclaim responsibility for any injury to people or property resulting from any ideas, methods, instructions or products referred to in the content.

