Applied AI in Emotion Recognition

A special issue of Electronics (ISSN 2079-9292). This special issue belongs to the section "Artificial Intelligence".

Deadline for manuscript submissions: closed (15 October 2024) | Viewed by 12210

Special Issue Editors


Dr. Chengwei Huang
Guest Editor
Zhejiang Lab, Hangzhou 311121, China
Interests: affective computing; speech signal processing; machine learning; digital health

Prof. Dr. Yongqiang Bao
Guest Editor
School of Information and Communication Engineering, Nanjing Institute of Technology, Nanjing 211167, China
Interests: signal processing; affective computing; speech signal processing

Prof. Dr. Li Zhao
Guest Editor
School of Information Science and Engineering, Southeast University, Nanjing 210096, China
Interests: speech signal processing; speech emotion recognition; machine learning

Special Issue Information

Dear Colleagues,

In recent years, the use of artificial intelligence in emotion recognition has become increasingly important, and the development of intelligent computing technology has significantly aided the growth of emotion-related research. Thus, application scenarios have expanded beyond human–computer collaboration to include health care, security, business intelligence, and education platforms.

Artificial intelligence has improved machine emotional intelligence, and this improvement is not limited to the classification of emotion types; it also demands higher standards for interaction with emotional information. It is therefore necessary to investigate new methods and mechanisms from the perspectives of emotion perception, cognition, expression, and generation.

In the field of emotion recognition, we currently face a number of intriguing challenges and research topics. For emotional databases, a key challenge is how to collect data in a natural and non-intrusive manner. There are also challenges associated with learning and adaptation: adapting to linguistic, personal, and contextual differences, as well as to facial masks during the COVID-19 pandemic, is essential for comprehending emotional meaning in a variety of circumstances. Meanwhile, the rapid development and widespread application of physiological sensors have given us access to emotional data 24 hours a day, 7 days a week. New technologies for sensors and embedded emotion and health systems are becoming increasingly popular research topics.

The primary purpose of this Special Issue of Electronics is to present emerging research interests in specific emotions and novel AI approaches. In addition, it will present novel developments in intelligent computing methods and novel hardware systems to advance emotion recognition technology for the scientific community and industry.

Topics include, but are not limited to, the following:

  • Emotion recognition, conversion and synthesis;
  • Identification of emotions related to learning and cognitive processes;
  • Studies of health-related emotional states, such as depression, ASD, and fatigue;
  • Analysis of novel emotional features to improve generality and robustness;
  • Micro-expression recognition, vocal burst recognition, and deceptive speech detection;
  • Personalized and contextual adaptation;
  • Embedded systems for emotion recognition, including new methods and new protocols;
  • Practical applications of emotion recognition technology, including call-center, health and medical, and education and online-learning applications.

Dr. Chengwei Huang
Prof. Dr. Yongqiang Bao
Prof. Dr. Li Zhao
Guest Editors

Manuscript Submission Information

Manuscripts should be submitted online at www.mdpi.com by registering and logging in to this website. Once you are registered, go to the submission form. Manuscripts can be submitted until the deadline. All submissions that pass pre-check are peer-reviewed. Accepted papers will be published continuously in the journal (as soon as accepted) and will be listed together on the special issue website. Research articles, review articles, and short communications are invited. For planned papers, a title and short abstract (about 100 words) can be sent to the Editorial Office for announcement on this website.

Submitted manuscripts should not have been published previously, nor be under consideration for publication elsewhere (except conference proceedings papers). All manuscripts are thoroughly refereed through a single-blind peer-review process. A guide for authors and other relevant information for submission of manuscripts is available on the Instructions for Authors page. Electronics is an international peer-reviewed open access semimonthly journal published by MDPI.

Please visit the Instructions for Authors page before submitting a manuscript. The Article Processing Charge (APC) for publication in this open access journal is 2400 CHF (Swiss Francs). Submitted papers should be well formatted and use good English. Authors may use MDPI's English editing service prior to publication or during author revisions.

Keywords

  • emotion recognition
  • health-related emotions
  • learning and cognitive process
  • deceptive speech detection
  • cross-database emotion recognition
  • personalized adaptation
  • embedded systems

Benefits of Publishing in a Special Issue

  • Ease of navigation: Grouping papers by topic helps scholars navigate broad scope journals more efficiently.
  • Greater discoverability: Special Issues support the reach and impact of scientific research. Articles in Special Issues are more discoverable and cited more frequently.
  • Expansion of research network: Special Issues facilitate connections among authors, fostering scientific collaborations.
  • External promotion: Articles in Special Issues are often promoted through the journal's social media, increasing their visibility.
  • e-Book format: Special Issues with more than 10 articles can be published as dedicated e-books, ensuring wide and rapid dissemination.

Further information on MDPI's Special Issue policies can be found here.

Published Papers (11 papers)


Research

28 pages, 7955 KiB  
Article
Level of Agreement between Emotions Generated by Artificial Intelligence and Human Evaluation: A Methodological Proposal
by Miguel Carrasco, César González-Martín, Sonia Navajas-Torrente and Raúl Dastres
Electronics 2024, 13(20), 4014; https://doi.org/10.3390/electronics13204014 - 12 Oct 2024
Viewed by 495
Abstract
Images are capable of conveying emotions, but emotional experience is highly subjective. Advances in artificial intelligence have enabled the generation of images based on emotional descriptions. However, the level of agreement between the generated images and human emotional responses has not yet been evaluated. In order to address this, 20 artistic landscapes were generated using StyleGAN2-ADA. Four variants evoking positive emotions (contentment and amusement) and negative emotions (fear and sadness) were created for each image, resulting in 80 pictures. An online questionnaire was designed using this material, in which 61 observers classified the generated images. Statistical analyses were performed on the collected data to determine the level of agreement among participants and between the observers’ responses and the emotions generated by the AI. A generally good level of agreement was found, with better results for negative emotions. However, the study confirms the subjectivity inherent in emotional evaluation. Full article
(This article belongs to the Special Issue Applied AI in Emotion Recognition)
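The abstract does not name the exact agreement statistic, so the following sketch simply illustrates one standard choice for this kind of analysis: Fleiss' kappa over a hypothetical ratings matrix (80 generated images, 61 observers, four target emotions). The data and variable names are illustrative and not the authors' pipeline.

```python
import numpy as np

def fleiss_kappa(counts: np.ndarray) -> float:
    """Fleiss' kappa for a (n_items x n_categories) matrix of rating counts.

    counts[i, j] = number of raters who assigned item i to category j.
    Every item is assumed to be rated by the same number of raters.
    """
    n_items, _ = counts.shape
    n_raters = counts[0].sum()

    # Proportion of all assignments that went to each category.
    p_cat = counts.sum(axis=0) / (n_items * n_raters)

    # Per-item agreement: fraction of rater pairs that agree on the item.
    p_item = (np.sum(counts ** 2, axis=1) - n_raters) / (n_raters * (n_raters - 1))

    p_bar = p_item.mean()          # observed agreement
    p_exp = np.sum(p_cat ** 2)     # agreement expected by chance
    return (p_bar - p_exp) / (1 - p_exp)

# Hypothetical example: 80 generated images, 61 observers,
# 4 target emotions (contentment, amusement, fear, sadness).
rng = np.random.default_rng(0)
ratings = rng.multinomial(61, [0.4, 0.2, 0.2, 0.2], size=80)
print(f"Fleiss' kappa: {fleiss_kappa(ratings):.3f}")
```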

22 pages, 336 KiB  
Article
Multimodal Emotion Recognition Based on Facial Expressions, Speech, and Body Gestures
by Jingjie Yan, Peiyuan Li, Chengkun Du, Kang Zhu, Xiaoyang Zhou, Ying Liu and Jinsheng Wei
Electronics 2024, 13(18), 3756; https://doi.org/10.3390/electronics13183756 - 21 Sep 2024
Viewed by 479
Abstract
The research of multimodal emotion recognition based on facial expressions, speech, and body gestures is crucial for oncoming intelligent human–computer interfaces. However, it is a very difficult task and has seldom been researched in this combination in the past years. Based on the GEMEP and Polish databases, this contribution focuses on trimodal emotion recognition from facial expressions, speech, and body gestures, including feature extraction, feature fusion, and multimodal classification of the three modalities. In particular, for feature fusion, two novel algorithms including supervised least squares multiset kernel canonical correlation analysis (SLSMKCCA) and sparse supervised least squares multiset kernel canonical correlation analysis (SSLSMKCCA) are presented, respectively, to carry out efficient facial expression, speech, and body gesture feature fusion. Different from the traditional multiset kernel canonical correlation analysis (MKCCA) algorithms, our SLSMKCCA algorithm is a supervised version and is based on the least squares form. The SSLSMKCCA algorithm is implemented by the combination of SLSMKCCA and a sparse item (L1 norm). Moreover, two effective solving algorithms for SLSMKCCA and SSLSMKCCA are also presented, which use the alternated least squares and augmented Lagrangian multiplier methods, respectively. The extensive experimental results on the popular public GEMEP and Polish databases show that the recognition rate of multimodal emotion recognition is superior to bimodal and monomodal emotion recognition on average, and our presented SLSMKCCA and SSLSMKCCA fusion methods both obtain very high recognition rates, especially the SSLSMKCCA fusion method. Full article
(This article belongs to the Special Issue Applied AI in Emotion Recognition)
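The supervised least-squares (SLSMKCCA) and sparse (SSLSMKCCA) fusion algorithms are the paper's contribution and are not reproduced here. As a baseline reference, the sketch below shows plain regularized multiset kernel CCA for three modality feature sets, solved as a generalized eigenvalue problem and fused by concatenation; the kernel type, bandwidth, regularization, and dimensions are assumptions.

```python
import numpy as np
from scipy.linalg import eigh

def rbf_kernel(X, gamma=1e-3):
    sq = np.sum(X ** 2, axis=1)
    d2 = sq[:, None] + sq[None, :] - 2.0 * X @ X.T
    return np.exp(-gamma * d2)

def center_kernel(K):
    n = K.shape[0]
    H = np.eye(n) - np.ones((n, n)) / n
    return H @ K @ H

def mkcca_fuse(feature_sets, n_components=10, reg=1e-2):
    """Plain regularized multiset kernel CCA; returns fused (concatenated) projections."""
    Ks = [center_kernel(rbf_kernel(X)) for X in feature_sets]
    m, n = len(Ks), Ks[0].shape[0]
    A = np.zeros((m * n, m * n))   # cross-set coupling terms
    B = np.zeros((m * n, m * n))   # within-set (regularized) terms
    for i in range(m):
        B[i*n:(i+1)*n, i*n:(i+1)*n] = Ks[i] @ Ks[i] + reg * np.eye(n)
        for j in range(m):
            if i != j:
                A[i*n:(i+1)*n, j*n:(j+1)*n] = Ks[i] @ Ks[j]
    # Generalized symmetric eigenproblem; keep the leading eigenvectors.
    vals, vecs = eigh(A, B)
    alphas = vecs[:, -n_components:]
    projections = [Ks[i] @ alphas[i*n:(i+1)*n, :] for i in range(m)]
    return np.hstack(projections)

# Hypothetical example: face, speech, and gesture features for 120 samples.
rng = np.random.default_rng(0)
face = rng.normal(size=(120, 64))
speech = rng.normal(size=(120, 40))
gesture = rng.normal(size=(120, 32))
fused = mkcca_fuse([face, speech, gesture])
print(fused.shape)  # (120, 30) -> feed to any classifier
```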

19 pages, 5480 KiB  
Article
PH-CBAM: A Parallel Hybrid CBAM Network with Multi-Feature Extraction for Facial Expression Recognition
by Liefa Liao, Shouluan Wu, Chao Song and Jianglong Fu
Electronics 2024, 13(16), 3149; https://doi.org/10.3390/electronics13163149 - 9 Aug 2024
Viewed by 981
Abstract
Convolutional neural networks have made significant progress in human Facial Expression Recognition (FER). However, they still face challenges in effectively focusing on and extracting facial features. Recent research has turned to attention mechanisms to address this issue, focusing primarily on local feature details rather than overall facial features. Building upon the classical Convolutional Block Attention Module (CBAM), this paper introduces a novel Parallel Hybrid Attention Model, termed PH-CBAM. This model employs split-channel attention to enhance the extraction of key features while maintaining a minimal parameter count. The proposed model enables the network to emphasize relevant details during expression classification. Heatmap analysis demonstrates that PH-CBAM effectively highlights key facial information. By employing a multimodal extraction approach in the initial image feature extraction phase, the network structure captures various facial features. The algorithm integrates a residual network and the MISH activation function to create a multi-feature extraction network, addressing issues such as gradient vanishing and negative gradient zero point in residual transmission. This enhances the retention of valuable information and facilitates information flow between key image details and target images. Evaluation on benchmark datasets FER2013, CK+, and Bigfer2013 yielded accuracies of 68.82%, 97.13%, and 72.31%, respectively. Comparison with mainstream network models on FER2013 and CK+ datasets demonstrates the efficiency of the PH-CBAM model, with comparable accuracy to current advanced models, showcasing its effectiveness in emotion detection. Full article
(This article belongs to the Special Issue Applied AI in Emotion Recognition)
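For orientation, the following sketch implements the classical CBAM block (channel attention followed by spatial attention) that PH-CBAM builds on; the split-channel, parallel hybrid arrangement and the multi-feature extraction network are the paper's contributions and are not reproduced, so the channel counts and reduction ratio here are illustrative assumptions.

```python
import torch
import torch.nn as nn

class ChannelAttention(nn.Module):
    def __init__(self, channels, reduction=16):
        super().__init__()
        self.mlp = nn.Sequential(
            nn.Linear(channels, channels // reduction),
            nn.ReLU(inplace=True),
            nn.Linear(channels // reduction, channels),
        )

    def forward(self, x):
        b, c, _, _ = x.shape
        avg = self.mlp(x.mean(dim=(2, 3)))   # global average pooling branch
        mx = self.mlp(x.amax(dim=(2, 3)))    # global max pooling branch
        scale = torch.sigmoid(avg + mx).view(b, c, 1, 1)
        return x * scale

class SpatialAttention(nn.Module):
    def __init__(self, kernel_size=7):
        super().__init__()
        self.conv = nn.Conv2d(2, 1, kernel_size, padding=kernel_size // 2)

    def forward(self, x):
        avg = x.mean(dim=1, keepdim=True)    # channel-wise average map
        mx, _ = x.max(dim=1, keepdim=True)   # channel-wise max map
        scale = torch.sigmoid(self.conv(torch.cat([avg, mx], dim=1)))
        return x * scale

class CBAM(nn.Module):
    def __init__(self, channels):
        super().__init__()
        self.ca = ChannelAttention(channels)
        self.sa = SpatialAttention()

    def forward(self, x):
        return self.sa(self.ca(x))

feats = torch.randn(8, 64, 48, 48)   # hypothetical feature maps from a FER backbone
print(CBAM(64)(feats).shape)         # torch.Size([8, 64, 48, 48])
```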

21 pages, 9542 KiB  
Article
A Research on Emotion Recognition of the Elderly Based on Transformer and Physiological Signals
by Guohong Feng, Hongen Wang, Mengdi Wang, Xiao Zheng and Runze Zhang
Electronics 2024, 13(15), 3019; https://doi.org/10.3390/electronics13153019 - 31 Jul 2024
Viewed by 637
Abstract
To address problems such as the difficulty of recognizing emotions in the elderly and the inability of traditional machine-learning models to effectively capture nonlinear relationships in physiological signal data, a Recursive Map (RM) combined with a Vision Transformer (ViT) is proposed to recognize the emotions of the elderly based on Electroencephalogram (EEG), Electrodermal Activity (EDA), and Heart Rate Variability (HRV) signals. The Dung Beetle Optimizer (DBO) is used to optimize the variational modal decomposition of EEG, EDA, and HRV signals. The optimized decomposed time series signals are converted into two-dimensional images using RM, and then the converted image signals are applied to the ViT for the study of emotion recognition of the elderly. The pre-trained weights of ViT on the ImageNet-22k dataset are loaded into the model and retrained with the two-dimensional image data. The model is validated and compared using the test set. The research results show that the recognition accuracy of the proposed method on EEG, EDA, and HRV signals is 99.35%, 86.96%, and 97.20%, respectively. This indicates that EEG signals can better reflect the emotional states of the elderly, followed by HRV signals, while EDA signals have poorer effects. Compared with Support Vector Machine (SVM), Naive Bayes (NB), and K-Nearest Neighbors (KNN), the recognition accuracy of the proposed method is increased by at least 9.4%, 11.13%, and 12.61%, respectively. Compared with ResNet34, EfficientNet-B0, and VGG16, it is increased by at least 1.14%, 0.54%, and 3.34%, respectively. This proves the superiority of the proposed method in emotion recognition for the elderly. Full article
(This article belongs to the Special Issue Applied AI in Emotion Recognition)
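As a rough illustration of the time-series-to-image step, the sketch below converts a 1-D physiological signal into a recurrence-style image via time-delay embedding and a pairwise distance matrix; the DBO-optimized variational mode decomposition and the paper's exact Recursive Map definition are not reproduced, and the embedding parameters are assumptions.

```python
import numpy as np

def recurrence_image(signal: np.ndarray, dim: int = 3, delay: int = 4) -> np.ndarray:
    """Map a 1-D signal to a 2-D image of pairwise distances between delay vectors."""
    n = len(signal) - (dim - 1) * delay
    # Time-delay embedding: each row is (x[t], x[t+delay], ..., x[t+(dim-1)*delay]).
    emb = np.stack([signal[i * delay : i * delay + n] for i in range(dim)], axis=1)
    dists = np.linalg.norm(emb[:, None, :] - emb[None, :, :], axis=-1)
    # Normalize to [0, 1] so the matrix can be treated as a grayscale image for a ViT.
    return dists / (dists.max() + 1e-12)

# Hypothetical example: a 4-second HRV-like trace sampled at 64 Hz.
t = np.linspace(0, 4, 256)
x = np.sin(2 * np.pi * 1.1 * t) + 0.1 * np.random.default_rng(0).normal(size=t.size)
img = recurrence_image(x)
print(img.shape)  # (248, 248) image, ready to be resized for a pretrained ViT
```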

18 pages, 445 KiB  
Article
Speech Emotion Recognition Using Dual-Stream Representation and Cross-Attention Fusion
by Shaode Yu, Jiajian Meng, Wenqing Fan, Ye Chen, Bing Zhu, Hang Yu, Yaoqin Xie and Qiurui Sun
Electronics 2024, 13(11), 2191; https://doi.org/10.3390/electronics13112191 - 4 Jun 2024
Cited by 2 | Viewed by 905
Abstract
Speech emotion recognition (SER) aims to recognize human emotions through in-depth analysis of audio signals. However, it remains challenging to encode emotional cues and to fuse the encoded cues effectively. In this study, dual-stream representation is developed, and both full training and fine-tuning of different deep networks are employed for encoding emotion patterns. Specifically, a cross-attention fusion (CAF) module is designed to integrate the dual-stream output for emotion recognition. Using different dual-stream encoders (fully training a text processing network and fine-tuning a pre-trained large language network), the CAF module is compared to three other fusion modules on three databases. The SER performance is quantified with weighted accuracy (WA), unweighted accuracy (UA), and F1-score (F1S). The experimental results suggest that the CAF outperforms the other three modules and leads to promising performance on the databases (EmoDB: WA, 97.20%; UA, 97.21%; F1S, 0.8804; IEMOCAP: WA, 69.65%; UA, 70.88%; F1S, 0.7084; RAVDESS: WA, 81.86%; UA, 82.75%; F1S, 0.8284). It is also found that fine-tuning a pre-trained large language network achieves superior representation compared to fully training a text processing network. In a future study, improved SER performance could be achieved through the development of a multi-stream representation of emotional cues and the incorporation of a multi-branch fusion mechanism for emotion recognition. Full article
(This article belongs to the Special Issue Applied AI in Emotion Recognition)
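A minimal cross-attention fusion module in the spirit of the CAF described above might look like the sketch below, where each stream attends to the other and the pooled outputs are concatenated for classification; the actual CAF architecture, dimensions, and pooling strategy are not specified here and are assumptions.

```python
import torch
import torch.nn as nn

class CrossAttentionFusion(nn.Module):
    """Fuse two sequence representations (e.g., acoustic and text streams)."""
    def __init__(self, dim=256, heads=4, n_classes=4):
        super().__init__()
        self.a2b = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.b2a = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.classifier = nn.Linear(2 * dim, n_classes)

    def forward(self, stream_a, stream_b):
        # Each stream queries the other, so emotional cues can flow both ways.
        a_att, _ = self.a2b(query=stream_a, key=stream_b, value=stream_b)
        b_att, _ = self.b2a(query=stream_b, key=stream_a, value=stream_a)
        fused = torch.cat([a_att.mean(dim=1), b_att.mean(dim=1)], dim=-1)
        return self.classifier(fused)

# Hypothetical example: batch of 8 utterances, 100 acoustic frames and 30 text tokens.
acoustic = torch.randn(8, 100, 256)
text = torch.randn(8, 30, 256)
logits = CrossAttentionFusion()(acoustic, text)
print(logits.shape)  # torch.Size([8, 4])
```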

16 pages, 741 KiB  
Article
Speech Emotion Recognition Based on Temporal-Spatial Learnable Graph Convolutional Neural Network
by Jingjie Yan, Haihua Li, Fengfeng Xu, Xiaoyang Zhou, Ying Liu and Yuan Yang
Electronics 2024, 13(11), 2010; https://doi.org/10.3390/electronics13112010 - 21 May 2024
Viewed by 747
Abstract
The Graph Convolutional Neural Network (GCN) method has shown excellent performance in the field of deep learning, and using graphs to represent speech data is a computationally efficient and scalable approach. In order to enhance the adequacy of graph neural networks in extracting speech emotional features, this paper proposes a Temporal-Spatial Learnable Graph Convolutional Neural Network (TLGCNN) for speech emotion recognition. TLGCNN first utilizes the openSMILE toolkit to extract frame-level speech emotion features. Then, a bidirectional long short-term memory (Bi-LSTM) network is used to process the long-term dependencies of speech features, which can further extract deep frame-level emotion features. The extracted frame-level emotion features are then input into the subsequent network through two pathways. Finally, one pathway constructs the extracted frame-level deep emotion feature vectors into a graph structure, applying an adaptive adjacency matrix to capture latent spatial connections, while the other pathway concatenates the emotion feature vectors with the graph-level embedding obtained from the learnable graph convolutional neural network for prediction and classification. Through these two pathways, TLGCNN can simultaneously obtain temporal speech emotional information through the Bi-LSTM and spatial speech emotional information through the Learnable Graph Convolutional Neural (LGCN) network. Experimental results demonstrate that this method achieves weighted accuracies of 66.82% and 58.35% on the IEMOCAP and MSP-IMPROV databases, respectively. Full article
(This article belongs to the Special Issue Applied AI in Emotion Recognition)
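The sketch below gives one plausible reading of the learnable graph convolution idea: frame-level features pass through a Bi-LSTM, and a trainable, softmax-normalized adjacency matrix mixes the resulting nodes before pooling and classification. Node counts, feature sizes, and the single-pathway layout are assumptions rather than the exact TLGCNN design.

```python
import torch
import torch.nn as nn

class LearnableGraphConv(nn.Module):
    def __init__(self, n_nodes, in_dim, out_dim):
        super().__init__()
        self.adj = nn.Parameter(torch.randn(n_nodes, n_nodes) * 0.01)  # learnable adjacency
        self.weight = nn.Linear(in_dim, out_dim)

    def forward(self, x):                     # x: (batch, n_nodes, in_dim)
        a = torch.softmax(self.adj, dim=-1)   # row-normalized adjacency
        return torch.relu(a @ self.weight(x))

class TemporalSpatialSketch(nn.Module):
    def __init__(self, feat_dim=88, hidden=64, n_nodes=100, n_classes=4):
        super().__init__()
        self.bilstm = nn.LSTM(feat_dim, hidden, batch_first=True, bidirectional=True)
        self.gcn = LearnableGraphConv(n_nodes, 2 * hidden, hidden)
        self.classifier = nn.Linear(hidden, n_classes)

    def forward(self, frames):                # frames: (batch, n_frames, feat_dim)
        temporal, _ = self.bilstm(frames)     # temporal emotion cues
        spatial = self.gcn(temporal)          # latent connections between frames
        return self.classifier(spatial.mean(dim=1))

frames = torch.randn(8, 100, 88)              # hypothetical 88-dim openSMILE frame features
print(TemporalSpatialSketch()(frames).shape)  # torch.Size([8, 4])
```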

17 pages, 2467 KiB  
Article
Multi-Representation Joint Dynamic Domain Adaptation Network for Cross-Database Facial Expression Recognition
by Jingjie Yan, Yuebo Yue, Kai Yu, Xiaoyang Zhou, Ying Liu, Jinsheng Wei and Yuan Yang
Electronics 2024, 13(8), 1470; https://doi.org/10.3390/electronics13081470 - 12 Apr 2024
Cited by 1 | Viewed by 715
Abstract
In order to obtain more fine-grained information from multiple sub-feature spaces for domain adaptation, this paper proposes a novel multi-representation joint dynamic domain adaptation network (MJDDAN) and applies it to achieve cross-database facial expression recognition. The MJDDAN uses a hybrid structure to extract multi-representation features and maps the original facial expression features into multiple sub-feature spaces, aligning the expression features of the source domain and target domain in multiple sub-feature spaces from different angles to extract features more comprehensively. Moreover, the MJDDAN proposes the Joint Dynamic Maximum Mean Difference (JD-MMD) model to reduce the difference in feature distribution between different subdomains by simultaneously minimizing the maximum mean difference and local maximum mean difference in each substructure. Three databases, including eNTERFACE, FABO, and RAVDESS, are used to design a large number of cross-database transfer learning facial expression recognition experiments. The emotion recognition accuracies with eNTERFACE, FABO, and RAVDESS as target domains reach 53.64%, 43.66%, and 35.87%, respectively. Compared to the best comparison method considered in the article, these accuracies are improved by 1.79%, 0.85%, and 1.02%, respectively. Full article
(This article belongs to the Special Issue Applied AI in Emotion Recognition)
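The maximum mean discrepancy term at the heart of the JD-MMD model can be sketched with a Gaussian kernel as follows; the joint/local weighting and dynamic balancing described in the abstract are not reproduced, and the kernel bandwidth is an assumption.

```python
import torch

def gaussian_mmd(source: torch.Tensor, target: torch.Tensor, bandwidth: float = 1.0) -> torch.Tensor:
    """Squared MMD between source- and target-domain feature batches (n_s, d), (n_t, d)."""
    def kernel(a, b):
        d2 = torch.cdist(a, b) ** 2
        return torch.exp(-d2 / (2 * bandwidth ** 2))

    k_ss = kernel(source, source).mean()
    k_tt = kernel(target, target).mean()
    k_st = kernel(source, target).mean()
    return k_ss + k_tt - 2 * k_st

# Hypothetical example: 64-dim expression features from source and target databases.
src = torch.randn(32, 64)
tgt = torch.randn(48, 64) + 0.5   # shifted distribution
print(gaussian_mmd(src, tgt))     # larger value -> larger domain gap
# In training, a term like this is added to the classification loss so the
# feature extractor aligns the two domains while staying discriminative.
```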

14 pages, 567 KiB  
Article
Fidgety Speech Emotion Recognition for Learning Process Modeling
by Ming Zhu, Chunchieh Wang and Chengwei Huang
Electronics 2024, 13(1), 146; https://doi.org/10.3390/electronics13010146 - 28 Dec 2023
Cited by 1 | Viewed by 702
Abstract
In this paper, the recognition of fidgety speech emotion is studied, and real-world speech emotions are collected to enhance emotion recognition in practical scenarios, especially for cognitive tasks. We first focused on eliciting fidgety emotions and data acquisition for general math learning. Students practice mathematics by performing operations, solving problems, and orally responding to questions, all of which are recorded as audio data. Subsequently, the teacher evaluates the accuracy of these mathematical exercises by scoring, which reflects the cognitive outcomes of the students. Secondly, we propose an end-to-end speech emotion model based on a multi-scale one-dimensional (1-D) residual convolutional neural network. Finally, we conducted an experiment to recognize fidgety speech emotions by testing various classifiers, including SVM, LSTM, 1-D CNN, and the proposed multi-scale 1-D CNN. The experimental results show that the classifier we constructed can identify fidgety emotion well. A thorough analysis of fidgety emotions and their influence on the learning process revealed a clear relationship between the two. The automatic recognition of fidgety emotions is valuable for assisting online math teaching. Full article
(This article belongs to the Special Issue Applied AI in Emotion Recognition)
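A minimal version of a multi-scale 1-D residual convolutional block of the kind described above is sketched below, with parallel Conv1d branches of different kernel sizes and a residual connection; the number of branches, channel widths, input features, and pooling are assumptions rather than the authors' exact architecture.

```python
import torch
import torch.nn as nn

class MultiScaleResBlock1D(nn.Module):
    def __init__(self, channels, kernel_sizes=(3, 7, 15)):
        super().__init__()
        self.branches = nn.ModuleList(
            nn.Conv1d(channels, channels, k, padding=k // 2) for k in kernel_sizes
        )
        self.merge = nn.Conv1d(channels * len(kernel_sizes), channels, kernel_size=1)

    def forward(self, x):                                  # x: (batch, channels, time)
        multi = torch.cat([torch.relu(b(x)) for b in self.branches], dim=1)
        return torch.relu(x + self.merge(multi))           # residual connection

class FidgetySketch(nn.Module):
    def __init__(self, in_dim=40, channels=64, n_classes=2):
        super().__init__()
        self.stem = nn.Conv1d(in_dim, channels, kernel_size=3, padding=1)
        self.block = MultiScaleResBlock1D(channels)
        self.head = nn.Linear(channels, n_classes)         # e.g., fidgety vs. neutral

    def forward(self, x):                                  # x: (batch, in_dim, time)
        h = self.block(torch.relu(self.stem(x)))
        return self.head(h.mean(dim=-1))                   # global average pooling over time

audio_feats = torch.randn(8, 40, 300)                      # hypothetical 40-band log-mel frames
print(FidgetySketch()(audio_feats).shape)                  # torch.Size([8, 2])
```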

15 pages, 5269 KiB  
Article
Multimodal Emotion Recognition in Conversation Based on Hypergraphs
by Jiaze Li, Hongyan Mei, Liyun Jia and Xing Zhang
Electronics 2023, 12(22), 4703; https://doi.org/10.3390/electronics12224703 - 19 Nov 2023
Cited by 2 | Viewed by 1418
Abstract
In recent years, sentiment analysis in conversation has garnered increasing attention due to its widespread applications in areas such as social media analytics, sentiment mining, and electronic healthcare. Existing research primarily focuses on sequence learning and graph-based approaches, yet it overlooks the high-order interactions between different modalities and the long-term dependencies within each modality. To address these problems, this paper proposes a novel hypergraph-based method for multimodal emotion recognition in conversation (MER-HGraph). MER-HGraph extracts features from three modalities: acoustic, text, and visual. It treats each modality utterance in a conversation as a node and constructs intra-modal hypergraphs (Intra-HGraph) and inter-modal hypergraphs (Inter-HGraph) using hyperedges. The hypergraphs are then updated using hypergraph convolutional networks. Additionally, to mitigate noise in acoustic data and reduce the impact of fixed time scales, we introduce a dynamic time window module to capture local and global information from acoustic signals. Extensive experiments on the IEMOCAP and MELD datasets demonstrate that MER-HGraph outperforms existing models in multimodal emotion recognition tasks, leveraging high-order information from multimodal data to enhance recognition capabilities. Full article
(This article belongs to the Special Issue Applied AI in Emotion Recognition)
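Hypergraph convolution layers of the kind used to update the Intra-/Inter-HGraphs commonly follow the formulation X' = σ(D_v^{-1/2} H W D_e^{-1} H^T D_v^{-1/2} X Θ). The sketch below applies that operator to a toy incidence matrix, with the hyperedge construction, weights, and dimensions left as assumptions.

```python
import numpy as np

def hypergraph_conv(X, H, Theta, edge_w=None):
    """One hypergraph convolution: relu(Dv^-1/2 H W De^-1 H^T Dv^-1/2 X Theta).

    X: (n_nodes, in_dim) node features (e.g., per-modality utterance embeddings)
    H: (n_nodes, n_edges) incidence matrix (H[v, e] = 1 if node v is in hyperedge e)
    """
    n_nodes, n_edges = H.shape
    w = np.ones(n_edges) if edge_w is None else edge_w
    dv = H @ w                       # weighted node degrees
    de = H.sum(axis=0)               # hyperedge degrees
    Dv_inv_sqrt = np.diag(1.0 / np.sqrt(dv + 1e-12))
    De_inv = np.diag(1.0 / (de + 1e-12))
    A = Dv_inv_sqrt @ H @ np.diag(w) @ De_inv @ H.T @ Dv_inv_sqrt
    return np.maximum(A @ X @ Theta, 0.0)

# Hypothetical toy conversation: 6 utterance nodes, 3 hyperedges
# (e.g., one per context window or speaker group).
rng = np.random.default_rng(0)
H = np.array([[1, 0, 0],
              [1, 1, 0],
              [1, 1, 0],
              [0, 1, 1],
              [0, 0, 1],
              [0, 0, 1]], dtype=float)
X = rng.normal(size=(6, 16))         # 16-dim node features
Theta = rng.normal(size=(16, 8))     # learnable projection (random here for illustration)
print(hypergraph_conv(X, H, Theta).shape)  # (6, 8)
```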

15 pages, 2332 KiB  
Article
Enhancing Dimensional Emotion Recognition from Speech through Modulation-Filtered Cochleagram and Parallel Attention Recurrent Network
by Zhichao Peng, Hua Zeng, Yongwei Li, Yegang Du and Jianwu Dang
Electronics 2023, 12(22), 4620; https://doi.org/10.3390/electronics12224620 - 12 Nov 2023
Viewed by 1171
Abstract
Dimensional emotion can better describe rich and fine-grained emotional states than categorical emotion. In the realm of human–robot interaction, the ability to continuously recognize dimensional emotions from speech empowers robots to capture the temporal dynamics of a speaker’s emotional state and adjust their interaction strategies in real-time. In this study, we present an approach to enhance dimensional emotion recognition through modulation-filtered cochleagram and parallel attention recurrent neural network (PA-net). Firstly, the multi-resolution modulation-filtered cochleagram is derived from speech signals through auditory signal processing. Subsequently, the PA-net is employed to establish multi-temporal dependencies from diverse scales of features, enabling the tracking of the dynamic variations in dimensional emotion within auditory modulation sequences. The results obtained from experiments conducted on the RECOLA dataset demonstrate that, at the feature level, the modulation-filtered cochleagram surpasses other assessed features in its efficacy to forecast valence and arousal. Particularly noteworthy is its pronounced superiority in scenarios characterized by a high signal-to-noise ratio. At the model level, the PA-net attains the highest predictive performance for both valence and arousal, clearly outperforming alternative regression models. Furthermore, the experiments carried out on the SEWA dataset demonstrate the substantial enhancements brought about by the proposed method in valence and arousal prediction. These results collectively highlight the potency and effectiveness of our approach in advancing the field of dimensional speech emotion recognition. Full article
(This article belongs to the Special Issue Applied AI in Emotion Recognition)
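Continuous valence/arousal prediction of this kind is typically scored with the concordance correlation coefficient (CCC), although the abstract does not state which metric was used; the short sketch below computes CCC between a predicted and a gold-standard affect trace, with the example signals being synthetic.

```python
import numpy as np

def concordance_cc(pred: np.ndarray, gold: np.ndarray) -> float:
    """Concordance correlation coefficient between two 1-D affect traces."""
    mp, mg = pred.mean(), gold.mean()
    vp, vg = pred.var(), gold.var()
    cov = np.mean((pred - mp) * (gold - mg))
    return 2 * cov / (vp + vg + (mp - mg) ** 2)

# Hypothetical example: a predicted arousal trace vs. the gold-standard annotation.
t = np.linspace(0, 60, 1500)          # 60 s at 25 Hz
gold = 0.3 * np.sin(0.1 * t)
pred = gold + 0.05 * np.random.default_rng(0).normal(size=t.size)
print(f"CCC: {concordance_cc(pred, gold):.3f}")
```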

15 pages, 4532 KiB  
Article
MMATERIC: Multi-Task Learning and Multi-Fusion for AudioText Emotion Recognition in Conversation
by Xingwei Liang, You Zou, Xinnan Zhuang, Jie Yang, Taiyu Niu and Ruifeng Xu
Electronics 2023, 12(7), 1534; https://doi.org/10.3390/electronics12071534 - 24 Mar 2023
Cited by 5 | Viewed by 2199
Abstract
The accurate recognition of emotions in conversations helps understand the speaker’s intentions and facilitates various analyses in artificial intelligence, especially in human–computer interaction systems. However, most previous methods lack the ability to track the different emotional states of each speaker in a dialogue. To alleviate this dilemma, we propose a new approach, Multi-Task Learning and Multi-Fusion AudioText Emotion Recognition in Conversation (MMATERIC), for emotion recognition in conversation. MMATERIC can draw on and combine the benefits of two distinct tasks, emotion recognition in text and emotion recognition in speech, and produces fused multimodal features to recognize the emotions of different speakers in dialogue. At the core of MMATERIC are three modules: an encoder with multimodal attention, a speaker emotion detection unit (SED-Unit), and a decoder with speaker emotion detection Bi-LSTM (SED-Bi-LSTM). Together, these three modules model the changing emotions of a speaker at a given moment in a conversation. Meanwhile, we adopt multiple fusion strategies in different stages, mainly using model fusion and decision-stage fusion to improve the model’s accuracy. Simultaneously, our multimodal framework allows features to interact across modalities and allows potential adaptation flows from one modality to another. Our experimental results on two benchmark datasets show that our proposed method is effective and outperforms the state-of-the-art baseline methods. The performance improvement of our method is mainly attributed to the combination of the three core modules of MMATERIC and the different fusion methods we adopt in each stage. Full article
(This article belongs to the Special Issue Applied AI in Emotion Recognition)
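One way to realize the multi-task idea above is a shared fusion representation with separate heads for the speech, text, and fused predictions, trained with a weighted joint loss; the sketch below is such a toy setup, and all module names, dimensions, and loss weights are assumptions rather than MMATERIC's actual components.

```python
import torch
import torch.nn as nn

class MultiTaskAudioText(nn.Module):
    def __init__(self, dim=256, n_classes=6):
        super().__init__()
        self.audio_head = nn.Linear(dim, n_classes)     # auxiliary speech emotion task
        self.text_head = nn.Linear(dim, n_classes)      # auxiliary text emotion task
        self.fusion = nn.Sequential(nn.Linear(2 * dim, dim), nn.ReLU())
        self.fused_head = nn.Linear(dim, n_classes)     # main conversation-level task

    def forward(self, audio_emb, text_emb):
        fused = self.fusion(torch.cat([audio_emb, text_emb], dim=-1))
        return self.audio_head(audio_emb), self.text_head(text_emb), self.fused_head(fused)

model = MultiTaskAudioText()
audio_emb, text_emb = torch.randn(8, 256), torch.randn(8, 256)
labels = torch.randint(0, 6, (8,))
logits_a, logits_t, logits_f = model(audio_emb, text_emb)

# Weighted joint loss over the auxiliary and main tasks (weights are arbitrary here).
ce = nn.CrossEntropyLoss()
loss = ce(logits_f, labels) + 0.3 * ce(logits_a, labels) + 0.3 * ce(logits_t, labels)
loss.backward()
print(float(loss))
```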
