1. Introduction
Sentiment analysis has emerged as a pivotal technology, playing an increasingly critical role in enhancing human–computer interaction (HCI) experiences through the automated interpretation of human affective states [1]. Currently, the academic community widely adopts multimodal data fusion methods for sentiment analysis research [2,3], with commonly used modalities including text, audio, and video. Distinct from traditional sentiment analysis, emotion recognition in conversations, as a significant branch of the field, requires not only an in-depth exploration of emotional information but also the integration of key elements such as speaker characteristics and dialogue context. This is crucial for accurately capturing the complex and nuanced emotional dynamics inherent in conversational settings. Within the domain of conversational emotion analysis, the emphasis shifts from “sentiment analysis”, which primarily assesses sentiment polarity, to “emotion recognition”. This transition is driven by the need to capture the subtle nuances and dynamic shifts of human emotions during conversations. Consequently, emotion recognition extends beyond polarity detection, encompassing a richer spectrum of emotional categories, such as anger and joy. While existing Emotion Recognition in Conversation (ERC) datasets, such as MELD [4,5] and M3ED [6], have significantly advanced emotion understanding in human–human conversations (HHCs), they fundamentally fail to model human–machine conversational contexts (HMCs), a critical limitation given the proliferation of intelligent dialogue systems. This gap manifests in three dimensions. First, current benchmarks predominantly capture symmetrical emotional exchanges between humans, neglecting the asymmetrical interaction patterns characteristic of human–machine dialogues, where users exhibit task-contingent emotional responses. Second, existing multimodal annotations, which integrate visual information from facial expressions, auditory information from vocal intonation, and textual information from language, fail to account for machine-specific contextual factors, including system latency profiles, task completion trajectories, and interface affordance constraints. Third, traditional emotion taxonomies prove inadequate for capturing the attenuated affective expressions (e.g., suppressed frustration during system failures) prevalent in HMC scenarios. The methodological constraints of HHC-oriented approaches become particularly apparent when comparing Figure 1 with Figure 2. While hierarchical LSTMs [7] and graph neural networks [8] achieve strong performance on HHC benchmarks, their direct application to HMC contexts yields suboptimal results due to divergent emotional causality mechanisms. For instance, user affective responses in HMC exhibit stronger coupling with task progression states than with the social alignment dynamics of HHC. This limitation stems from fundamental dataset deficiencies: current resources neither record micro-temporal emotion transitions during system response delays nor annotate multimodal disfluency patterns unique to human–machine exchanges. While methodologically rigorous, the IEMOCAP corpus [9] employs laboratory-controlled HHC scenarios that generalize poorly to real-world HMC contexts. Furthermore, ecologically valid HHC collections such as M3ED lack essential HMC metadata, including system performance metrics and user adaptation trajectories.
While existing ERC research predominantly confines its investigations to HHC [10,11], as exemplified in Figure 1, the rapid proliferation of intelligent dialogue systems has created a critical research imperative: developing robust emotion recognition frameworks for human–machine conversational contexts, a need illustrated in Figure 2. In contrast to HHC, HMC exhibits distinct disparities in emotional expression and interaction patterns. First, users often manifest more nuanced and subdued emotional expressions when interacting with machines, and these expressions are intrinsically linked to task completion progress, the quality of system feedback, and interactional fluency. Second, under specific circumstances such as task failure, delayed system responses, or insufficient system support, users' affective reactions can diverge from those observed in purely human–human interactions: for example, suppressed frustration and restraint during task failures, apprehension and unease while awaiting responses, and perplexity and frustration stemming from system misunderstandings. Traditional ERC methodologies predicated on human–human interactions face significant challenges in this nascent domain. First, existing datasets do not sufficiently characterize the granular behavioral patterns and contextual variables inherent in human–machine interactions, hindering the capture of task-driven emotional evolution. Second, conventional models often neglect critical situational factors within human–machine interactions, such as conversational history, task execution progress, and system feedback efficacy. Third, current approaches do not comprehensively account for the unique modalities of emotional expression and their evolutionary trajectories when modeling user affect in human–machine interactions.
Therefore, the creation of a high-quality multimodal ERC dataset and corresponding methodologies tailored to human–machine dialogue scenarios has become a pressing research imperative. An ideal dataset should exhibit the following characteristics: (1) comprehensive capture of key contextual factors in human–machine interactions, including dialogue history, task progress, and machine feedback quality; (2) accurate recording of users' multimodal emotional expression patterns within specific interaction contexts; and (3) reflection of the dynamic evolution of emotional states in task-driven dialogues. Concurrently, novel computational models must be developed to better capture the unique emotional expression patterns and contextual dependencies inherent in human–machine interactions. Such development is crucial for providing theoretical support and practical guidance for building more natural and intelligent dialogue systems. The main contributions of this work are summarized as follows:
(1) Overcoming the limitation of traditional emotion datasets being restricted to human–human conversation scenarios, we constructed dialogue scripts that simulate authentic human–machine interactions by combining machine-generated and human-authored content. We collected multimodal data and provided structured annotations, including dialogue topic classification, speech act types, emotional directivity, emotion intensity, and contextual dependencies.
(2) We propose a novel multimodal emotion recognition framework specifically designed for human–machine dialogue scenarios. This framework integrates various multimodal fusion methods and effectively utilizes structured emotional analysis constraints, such as conversation context and speaker information derived from human–machine interactions. This approach leads to more accurate and robust dialogue emotion analysis.
(3) We conducted comprehensive benchmark experiments, ablation studies, and visualization analyses to rigorously evaluate our approach. The experimental results show that our proposed multimodal emotion recognition framework significantly outperforms existing methods in human–machine interaction scenarios, achieving substantial improvements across key performance metrics.
3. Multi-HM Dataset
This work investigates the recognition and analysis of users' complex emotional states in human–machine interaction scenarios. To address this objective, we employ a dual-layer annotation system that facilitates fine-grained corpus annotation across both sentiment and emotion categories. The research specifically focuses on the characteristics of emotional experiences and their evolutionary patterns during interactions between users and intelligent dialogue systems. Recognizing the unique nature and complexity of human–machine interaction scenarios, we developed a systematic three-stage process for constructing a multimodal emotion dataset, comprising: (1) scenario-driven design of dialogue scripts; (2) synchronous collection and recording of multimodal interaction data; and (3) development of a structured annotation system for dialogue emotion inference.
3.1. Data Construction
This study aims to construct a high-quality, diverse dataset specifically focused on human–computer interaction (HCI) scenarios, as indicated in Figure 3. The dialogue content comprehensively covers typical HCI domains such as information seeking, instruction execution, emotional support, problem troubleshooting, chit-chat interaction, and policy information retrieval. To achieve this goal, we propose a dual-track dialogue script generation strategy comprising the LLM–Human Collaboration Paradigm and the Real-World Corpus-Based Paradigm. The two paradigms form a synergistic mechanism and complement each other in the data generation process:
(1) LLM–Human Collaboration Paradigm: This paradigm employs a hybrid approach that combines large language model generation with human optimization. Specifically, we utilize GPT-4.0 (https://openai.com/index/gpt-4/, accessed on 24 December 2024) as the core dialogue generation engine, guiding the model to produce initial dialogue scripts through systematic prompt engineering, as depicted in Figure 4 (a minimal prompt-construction sketch is also provided after this list). Subsequently, the research team rigorously evaluates and optimizes the generated content across multiple dimensions, including fluency, contextual appropriateness, logical coherence, and task relevance. Through this human intervention, we mitigate potential issues such as hallucination, semantic deviation, and task drift, ensuring the reliability and practicality of the generated content.
(2) Real-World Corpus-Based Paradigm: To further enhance the ecological validity of the dataset and capture the spontaneous expressions and complex emotional interactions unique to real-world human–machine interaction, we systematically collected multi-source real interaction data. These data encompass customer service dialogue records and virtual assistant interaction logs. Subsequently, through a rigorous anonymization and data-cleansing process, we extracted representative dialogue structures, user intents, and emotional expression patterns.
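To make the first stage of the LLM–Human Collaboration Paradigm concrete, the sketch below shows one way the prompt-driven script generation could be implemented with the OpenAI Python client. The system prompt, temperature, and helper function are hypothetical illustrations rather than the prompts actually used (those are outlined in Figure 4), and every generated draft still goes through the human review stage described above.

```python
# Illustrative sketch of the first stage of the LLM–Human Collaboration Paradigm:
# prompt-driven generation of an initial dialogue script with the OpenAI client.
# The system prompt, temperature, and turn count are hypothetical placeholders.
from openai import OpenAI

client = OpenAI()  # assumes OPENAI_API_KEY is set in the environment

SYSTEM_PROMPT = (
    "You write realistic Chinese human-machine dialogues for a given scenario. "
    "Alternate turns between 'Human' and 'Machine'; the human user should show "
    "task-contingent emotions (e.g., frustration after a failed request)."
)

def draft_dialogue(topic: str, n_turns: int = 10) -> str:
    """Generate an initial dialogue script for one scenario topic."""
    response = client.chat.completions.create(
        model="gpt-4",
        messages=[
            {"role": "system", "content": SYSTEM_PROMPT},
            {"role": "user", "content": f"Topic: {topic}. Write {n_turns} turns."},
        ],
        temperature=0.9,  # higher temperature for more diverse drafts
    )
    return response.choices[0].message.content

draft = draft_dialogue("problem troubleshooting")  # passed on to human review
```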
3.2. Recording
To ensure the validity and representational richness of the multimodal data, we engaged professional performers to enact the dialogue scripts and implemented a rigorous quality control protocol during data collection. First, we mandated that the actors' facial expressions and vocal characteristics remain clearly discernible in the video footage; in addition, key emotional segments were required to exceed a sufficient duration threshold to fully capture the spatiotemporal characteristics of emotional information. Second, the actors were required to strictly adhere to the predefined emotional categories in their performances, utilizing multimodal cues such as micro-expressions, intonation variations, and body language to ensure the authenticity and discriminability of emotional expressions. The final dataset comprises dialogues recorded by 15 participants, with each contributing multiple conversations to ensure data diversity.
3.3. Annotation
As a paradigm of auxiliary knowledge infusion, the Chain-of-Thought (CoT) reasoning mechanism significantly enhances the predictive efficacy of deep learning architectures in emotion analysis tasks while also providing interpretable insights. This cognitively inspired framework establishes a transparent pipeline that mirrors the hierarchical progression of human reasoning, laying a foundation for model interpretability through fully traceable input–output mappings. The architectural foundation of our model is rooted in the seminal OCC (Ortony, Clore, and Collins) emotion taxonomy [26], a theoretical cornerstone in affective computing that formalizes emotion generation through cognitively grounded appraisal rules. This framework establishes (1) formalized Emotion-Cognitive Reasoning (ECR) heuristics and (2) machine-actionable Emotion-Cognitive Knowledge (EC-Knowledge) representations. Our principal methodological innovation lies in integrating qualitative appraisal mechanisms with the latent-space manipulation capabilities of Pre-trained Language Models (PLMs). This synergistic architecture leverages distributional semantics learned from massive corpora and injects psychologically grounded reasoning constraints, enabling deep cognitive modeling of user-generated emotional texts through theoretically informed neural adaptation. To effectively train and comprehensively evaluate our emotion recognition model, we drew upon both the Dialog Act Markup in Several Layers (DAMSL) annotation scheme [27] and the OCC model to construct a five-tuple annotation framework for emotional cognitive reasoning (as shown in Figure 5).
This framework encompasses Role, Topic, Act, Sentiment, and Emotion. At the emotion annotation level, the system adopts a hierarchical classification approach, initially categorizing emotions into two broad classes, “Satisfaction” and “Dissatisfaction”, which are subsequently refined into a set of more nuanced emotion categories, including “Regret”, “Urgency”, “Confusion”, “Anger”, “Worry”, “Embarrassment”, and “Acceptance”. The annotation process is deeply integrated with theoretical frameworks of emotion-cognitive reasoning, ensuring theoretical consistency of the annotations. The five fields are defined as follows (an illustrative annotation record is sketched after the list):
Role: Indicates the role of the dialogue participant, such as “Human” or “Machine”.
Topic: Describes the subject matter of the dialogue, such as “Chat-Oriented” or “Task-Oriented”.
Act: Represents the type of dialogue act, such as “Question”, “Statement”, or “Command”.
Sentiment: Expresses the polarity of the emotion, such as “Positive”, “Negative”, or “Neutral”.
Emotion: Specifies the particular emotion category, such as “Satisfaction”, “Anger”, or “Worry”.
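To make the five-tuple concrete, the sketch below shows one possible in-memory representation of a single annotated utterance. The field names follow the framework above; the example values and the dataclass layout are illustrative and do not reflect the released file format.

```python
# Minimal sketch of one annotation record under the five-tuple framework
# (Role, Topic, Act, Sentiment, Emotion). The concrete values and this
# dataclass layout are illustrative, not the released Multi-HM schema.
from dataclasses import dataclass

@dataclass
class UtteranceAnnotation:
    role: str       # "Human" or "Machine"
    topic: str      # "Chat-Oriented" or "Task-Oriented"
    act: str        # e.g., "Question", "Statement", "Command"
    sentiment: str  # "Positive", "Negative", or "Neutral"
    emotion: str    # e.g., "Satisfaction", "Anger", "Worry"

example = UtteranceAnnotation(
    role="Human",
    topic="Task-Oriented",
    act="Question",
    sentiment="Negative",
    emotion="Worry",
)
```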
4. Model Framework
When constructing the theoretical framework, we consider a scenario in which a human (denoted as H) interacts with a multimodal conversational system, i.e., a machine (denoted as M). The interaction is represented as a conversational sequence of n utterances, U = {u_1, …, u_n}, alternating between H and M. For a target utterance u_i, we consider the context window {u_(i−p), …, u_i, …, u_(i+f)}, where p denotes the number of past utterances considered and f denotes the number of future utterances considered. The dialogue system provides the necessary textual prompts, while the user provides input in a multimodal setting, including audio (A), video (V), and text (T). In this paradigm, the customer service agent (machine, M) maintains a neutral emotional orientation, while the user's (human, H) emotional state shifts dynamically as the conversation evolves. Thus, our goal extends beyond modeling the interaction environment and speaker dependency; it encompasses integrating multimodal data from users to accurately reflect and express their emotional state.
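As a concrete reading of the notation above, the sketch below gathers the p past and f future utterances around a target turn u_i. The variable names and the simple truncation at dialogue boundaries are our assumptions rather than the paper's exact formulation.

```python
# Sketch of context-window construction around a target utterance u_i:
# the model considers {u_(i-p), ..., u_i, ..., u_(i+f)}. Truncation at the
# dialogue boundaries is an assumed behaviour, not the paper's exact choice.
from typing import Dict, List

def context_window(dialogue: List[Dict], i: int, p: int, f: int) -> List[Dict]:
    """Return the p previous utterances, utterance i, and the f future ones."""
    start = max(0, i - p)
    end = min(len(dialogue), i + f + 1)
    return dialogue[start:end]

# Example: 2 past and 1 future utterance around turn 3 of a toy dialogue.
dialogue = [{"role": "Human" if k % 2 == 0 else "Machine", "text": f"utterance {k}"}
            for k in range(6)]
window = context_window(dialogue, i=3, p=2, f=1)
```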
4.1. Feature Extraction
Text: In our dataset, each video is precisely paired with its corresponding Chinese script. To achieve efficient encoding of the five-tuple annotations, we employ the pre-trained Chinese RoBERTa-cn model [28] and fine-tune the encoder using the SentiLARE approach [29]. Specifically, we extract sentence-level semantic embeddings from the fine-tuned model, transforming each sentence into a 768-dimensional dense vector representation. This vectorization not only captures the deep semantic features of the text but also provides a rich representational foundation for subsequent fine-grained emotion analysis.
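A minimal sketch of the sentence-level embedding step is given below, assuming a publicly available Chinese RoBERTa checkpoint from Hugging Face and mean pooling over token states; the checkpoint name and pooling choice are assumptions, and the SentiLARE-style sentiment-aware fine-tuning is omitted for brevity.

```python
# Sketch of 768-d sentence embedding extraction with a Chinese RoBERTa encoder.
# The checkpoint name and mean pooling are assumptions; in the actual pipeline
# the encoder is further fine-tuned with the SentiLARE approach first.
import torch
from transformers import AutoModel, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("hfl/chinese-roberta-wwm-ext")
encoder = AutoModel.from_pretrained("hfl/chinese-roberta-wwm-ext")

def sentence_embedding(text: str) -> torch.Tensor:
    """Encode one utterance into a 768-dimensional vector."""
    inputs = tokenizer(text, return_tensors="pt", truncation=True, max_length=128)
    with torch.no_grad():
        hidden = encoder(**inputs).last_hidden_state  # (1, seq_len, 768)
    mask = inputs["attention_mask"].unsqueeze(-1)      # (1, seq_len, 1)
    return (hidden * mask).sum(1) / mask.sum(1)        # mean-pooled (1, 768)

vec = sentence_embedding("这个问题一直没有解决，我很着急。")
```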
Video: We utilize the OpenFace 2.0 toolkit [30] to extract visual features, including 68 facial landmarks, 17 facial action units, head pose, gaze direction, and orientation measurements.
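The visual descriptors are produced offline by the OpenFace 2.0 command-line tool; the sketch below shows how the resulting per-frame CSV might be loaded and reduced to the action-unit, head-pose, and gaze columns. The column prefixes follow typical OpenFace output and may differ across versions, and the file name is illustrative.

```python
# Sketch of loading per-frame visual descriptors produced by the OpenFace 2.0
# command-line tool (e.g., FeatureExtraction -f <video> -out_dir <dir>).
# Column prefixes (action units "AU..", head pose "pose_..", gaze "gaze_..")
# follow typical OpenFace output but may vary across versions.
import pandas as pd

def load_openface_features(csv_path: str) -> pd.DataFrame:
    df = pd.read_csv(csv_path)
    df.columns = [c.strip() for c in df.columns]        # OpenFace pads column names
    keep = [c for c in df.columns
            if c.startswith(("AU", "pose_", "gaze_"))]  # AUs, head pose, gaze
    return df[keep]

visual = load_openface_features("session_01.csv")       # one row per video frame
```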
Audio: Using the Librosa toolkit [31], we extract key audio characteristics such as the fundamental frequency (log scale), Mel-Frequency Cepstral Coefficients (MFCCs), and Constant Q Transform (CQT). These features, crucial for identifying emotion and intonation, are then combined to create a 33-dimensional set of acoustic features at the frame level for each video.
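A minimal sketch of the frame-level acoustic feature extraction is given below. The split of the 33 dimensions into 1 log-F0 value, 20 MFCCs, and 12 CQT bins, as well as the hop length, are assumptions made to match the reported feature size.

```python
# Sketch of frame-level acoustic features with Librosa: log fundamental
# frequency, MFCCs, and CQT magnitudes. The 1 + 20 + 12 = 33 dimension split
# and the hop length are assumptions chosen to match the reported size.
import numpy as np
import librosa

def acoustic_features(wav_path: str, hop_length: int = 512) -> np.ndarray:
    y, sr = librosa.load(wav_path, sr=None)
    f0, _, _ = librosa.pyin(y, fmin=librosa.note_to_hz("C2"),
                            fmax=librosa.note_to_hz("C7"),
                            sr=sr, hop_length=hop_length)
    log_f0 = np.log(np.nan_to_num(f0, nan=0.0) + 1e-6)[np.newaxis, :]  # unvoiced -> floor
    mfcc = librosa.feature.mfcc(y=y, sr=sr, n_mfcc=20, hop_length=hop_length)
    cqt = np.abs(librosa.cqt(y, sr=sr, n_bins=12, hop_length=hop_length))
    n = min(log_f0.shape[1], mfcc.shape[1], cqt.shape[1])  # align frame counts
    return np.vstack([log_f0[:, :n], mfcc[:, :n], cqt[:, :n]]).T  # (frames, 33)

feats = acoustic_features("utterance_001.wav")
```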
4.2. Multimodal Fusion
To achieve the multimodal data fusion objective within our proposed human–computer interaction framework, as shown in
Figure 6, we systematically conducted comparative experiments on cross-modal representation learning and fusion methods. In our experimental design, we selected baseline models that are representative and academically cutting-edge in the field of multimodal sentiment analysis. These served as reference systems for fusion strategies [
19,
20,
21,
32,
33], encompassing the main paradigms of modal interaction and feature fusion mechanisms. The construction of this benchmarking system aims to provide the Multi-HM framework with reproducible and comparable performance evaluation benchmarks while also offering an empirical research foundation for the methodological evolution of cross-modal representation learning.
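None of the compared architectures is reproduced here; as a reference point for what feature-level fusion of the three streams looks like, the sketch below concatenates utterance-level text, audio, and video vectors and classifies them with a small network. The hidden size, the 512-dimensional video vector, and the seven-class output are assumptions, not parameters of any of the cited models.

```python
# Minimal late-fusion reference point (not one of the compared baselines):
# utterance-level text (768-d), audio (33-d), and video vectors are
# concatenated and classified. Hidden size, the video dimension, and the
# seven-class output are assumptions for illustration only.
import torch
import torch.nn as nn

class LateFusionClassifier(nn.Module):
    def __init__(self, t_dim=768, a_dim=33, v_dim=512, hidden=256, n_classes=7):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(t_dim + a_dim + v_dim, hidden),
            nn.ReLU(),
            nn.Dropout(0.3),
            nn.Linear(hidden, n_classes),
        )

    def forward(self, text, audio, video):
        # Simple concatenation fusion; the cited baselines replace this step
        # with cross-modal attention, disentanglement, or distillation.
        return self.net(torch.cat([text, audio, video], dim=-1))

model = LateFusionClassifier()
logits = model(torch.randn(4, 768), torch.randn(4, 33), torch.randn(4, 512))
```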
5. Experiment
5.1. Baselines
This section provides a concise overview of the baseline models utilized in our experiments.
Multimodal Transformer (MulT) [32]: MulT employs a Transformer network for cross-modal attention, capturing signals from aligned and unaligned pairs in asynchronous modalities.
MISA [20]: MISA distinguishes between modality-specific and modality-invariant features, using specialized loss functions to capture interactions within and between modalities.
Self-MM [19]: Self-MM utilizes a self-supervised module to generate single-modal labels, thereby enhancing the distinction between representations of the various modalities.
Decoupled Multimodal Distillation (DMD) [21]: DMD introduces a decoupled multimodal distillation approach to reduce heterogeneity between modalities.
FDR-MSA [33]: FDR-MSA utilizes a disentangled, dual feature reconstruction mechanism to extract key information while preserving information integrity.
5.2. Data Split
The Multi-HM dataset was divided into training, validation, and test subsets using an 8:1:1 split ratio. Crucially, to prevent data leakage and maintain natural conversational flow, we performed this split at the dialogue level, ensuring that entire conversations were assigned to a single subset rather than being distributed across different sets. For detailed information on the dataset partitioning, please refer to Table 2.
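A minimal sketch of the dialogue-level 8:1:1 split described above is shown below; the dialogue-identifier format and the random seed are illustrative.

```python
# Sketch of the dialogue-level 8:1:1 split: whole conversations are assigned
# to exactly one subset, so no dialogue is shared between train/val/test.
# The dialogue-ID format and the random seed are illustrative.
import random

def split_dialogues(dialogue_ids, seed=42):
    ids = sorted(set(dialogue_ids))
    random.Random(seed).shuffle(ids)
    n_train = int(0.8 * len(ids))
    n_val = int(0.1 * len(ids))
    return ids[:n_train], ids[n_train:n_train + n_val], ids[n_train + n_val:]

train_ids, val_ids, test_ids = split_dialogues([f"dlg_{k:03d}" for k in range(200)])
```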
5.3. Hyperparameter Selection
To optimize model performance, we employed a search strategy to select the best hyperparameter settings, placing particular emphasis on the parameters that most significantly affect model performance.
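The paper does not enumerate the searched values; the loop below is therefore a purely hypothetical illustration of such a search, with a placeholder training routine and an assumed grid, keeping the configuration with the best validation score.

```python
# Purely hypothetical grid-search loop. The searched values and the
# train_and_validate() stand-in are placeholders, not the settings used
# in the paper; the configuration with the best validation score is kept.
from itertools import product

GRID = {
    "learning_rate": [1e-5, 3e-5, 1e-4],
    "batch_size": [16, 32],
    "dropout": [0.1, 0.3],
}

def train_and_validate(**config) -> float:
    """Placeholder: train the fusion model with `config`, return validation F1."""
    return 0.0  # replace with an actual training and validation run

best_config, best_score = None, float("-inf")
for values in product(*GRID.values()):
    config = dict(zip(GRID.keys(), values))
    score = train_and_validate(**config)
    if score > best_score:
        best_config, best_score = config, score
```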
5.4. Results and Discussion
The results are presented in two key tables: Table 3 reports the overall classification performance, while Table 4 focuses on recognition performance for individual emotion categories.
A detailed analysis of the model predictions reveals a significant class imbalance in the binary emotion classification task. This imbalance is closely tied to the original purpose of our dataset, namely to accurately identify and reassure users experiencing negative emotions; as a consequence, non-negative emotions are under-represented and less well understood. Addressing this imbalance is an important direction for future work, requiring targeted adjustments to achieve a more balanced and comprehensive understanding of emotions. The Self-MM and FDR-MSA models performed exceptionally well in classifying the seven emotions, highlighting the importance of unimodal auxiliary annotation tasks and of model robustness when processing emotions that are difficult to distinguish. These models capture subtle emotional changes more accurately, which significantly improves their discriminative ability in the face of ambiguous or mixed emotional expressions.
5.5. Ablation Study
To investigate how multimodal information fusion affects emotion analysis in human–machine interaction, we performed systematic comparative experiments with different modality configurations on our dataset, with the results shown in Table 5.
The experimental results show that, on the Multi-HM dataset, the video modality provides more discriminative emotional representations and richer emotional semantics than the audio modality. This finding is validated by the accuracy comparison between the T&V (text–video bimodal) and T&A (text–audio bimodal) configurations: the T&V combination significantly outperforms the T&A combination in emotion classification, confirming the advantage of the video modality in capturing non-verbal emotional cues such as facial expressions and body language. Furthermore, when fully integrating the text, video, and audio modalities, the model achieves optimal performance across evaluation metrics, including emotion recognition accuracy and F1-score.
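The modality configurations in Table 5 can be read as selecting subsets of the input streams before fusion. The sketch below realizes this by zero-masking the dropped stream; zero-masking is one common ablation choice and an assumption here, not necessarily how the reported ablation was implemented.

```python
# Sketch of a modality ablation (e.g., T&V vs. T&A) realized by zero-masking
# the dropped stream before fusion. Zero-masking is an assumed choice here,
# not necessarily how the reported ablation was implemented.
import torch

def ablate(text, audio, video, use=("text", "audio", "video")):
    text = text if "text" in use else torch.zeros_like(text)
    audio = audio if "audio" in use else torch.zeros_like(audio)
    video = video if "video" in use else torch.zeros_like(video)
    return text, audio, video

t, a, v = torch.randn(4, 768), torch.randn(4, 33), torch.randn(4, 512)
t_v_inputs = ablate(t, a, v, use=("text", "video"))  # T&V configuration
t_a_inputs = ablate(t, a, v, use=("text", "audio"))  # T&A configuration
```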
5.6. Visualization Analysis
Experimental analysis based on confusion matrices demonstrates that FDR-MSA and Self-MM are both effective for multimodal emotion recognition, yet they differ in performance focus and error patterns; the results are presented in Figure 7.
FDR-MSA performs better in most emotion categories, demonstrating notable advantages in recognizing the “Angry” and “Lost” categories and highlighting its ability to capture intense emotional features. This advantage may stem from its single-label auxiliary mechanism, which models semantic consistency across modalities more effectively. In contrast, Self-MM shows more balanced performance across emotion categories, especially for “Worried”, where it achieves an accuracy of 70%, significantly higher than FDR-MSA's 56%. This suggests that Self-MM employs a more robust feature learning strategy, enabling it to better capture subtle inter-modal interaction features in emotional expressions. It is important to note that the dataset construction in this study was tailored to the practical needs of human–machine interaction scenarios, with particular emphasis on accurately recognizing extremely negative emotions such as anger. This intentional design choice in the data distribution, while potentially affecting the evaluation of model generalization, provides a more realistic benchmark for practical application scenarios. These findings offer valuable guidance for model selection and scenario adaptation in multimodal emotion recognition: FDR-MSA may be more advantageous in scenarios requiring the recognition of extreme emotions, whereas Self-MM may be more suitable for scenarios requiring balanced recognition of multiple emotions.
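A minimal sketch of the per-class confusion-matrix analysis behind Figure 7 is shown below, using scikit-learn; the label set follows the emotion taxonomy of Section 3.3 (the display names in Figure 7 may differ), and the true/predicted labels are placeholders for a model's test-set output.

```python
# Sketch of the confusion-matrix analysis behind Figure 7 with scikit-learn.
# LABELS follows the Section 3.3 taxonomy (Figure 7's display names may
# differ), and y_true/y_pred are placeholders for real test-set predictions.
import matplotlib.pyplot as plt
from sklearn.metrics import ConfusionMatrixDisplay

LABELS = ["Regret", "Urgency", "Confusion", "Anger",
          "Worry", "Embarrassment", "Acceptance"]

y_true = ["Anger", "Worry", "Anger", "Acceptance", "Confusion"]  # placeholder
y_pred = ["Anger", "Worry", "Worry", "Acceptance", "Anger"]      # placeholder

ConfusionMatrixDisplay.from_predictions(
    y_true, y_pred, labels=LABELS, xticks_rotation=45
)
plt.tight_layout()
plt.show()
```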