1. Introduction
Sentiment analysis has emerged as a pivotal technology, playing an increasingly critical role in enhancing human–computer interaction (HCI) experiences through the automated interpretation of human affective states [1]. Currently, the academic community widely adopts multimodal data fusion methods for sentiment analysis research [2,3], with commonly used modalities including text, audio, and video. Distinct from traditional sentiment analysis, emotion recognition in conversations, as a significant branch of the field, requires not only an in-depth exploration of emotional information but also the integration of key elements such as speaker characteristics and dialogue context. This is crucial for accurately capturing the complex and nuanced emotional dynamics inherent in conversational settings. Within the domain of conversational emotion analysis, the emphasis shifts from “sentiment analysis”, which primarily assesses sentiment polarity, to “emotion recognition”. This transition is driven by the need to capture the subtle nuances and dynamic shifts of human emotions during conversations. Consequently, emotion recognition extends beyond polarity detection, encompassing a richer spectrum of emotional categories, such as anger and joy. While existing Emotion Recognition in Conversation (ERC) datasets, such as MELD [4,5] and M3ED [6], have significantly advanced emotion understanding in human–human conversations (HHCs), they fundamentally fail to model human–machine conversational contexts (HMCs), a critical limitation given the proliferation of intelligent dialogue systems. This gap manifests in three dimensions. First, current benchmarks predominantly capture symmetrical emotional exchanges between humans, neglecting the asymmetrical interaction patterns characteristic of human–machine dialogues, where users exhibit task-contingent emotional responses. Second, existing multimodal annotations, which integrate visual information from facial expressions, auditory information from vocal intonation, and textual information from language, fail to account for machine-specific contextual factors, including system latency profiles, task completion trajectories, and interface affordance constraints. Third, traditional emotion taxonomies prove inadequate for capturing the attenuated affective expressions (e.g., suppressed frustration during system failures) prevalent in HMC scenarios. The methodological constraints of HHC-oriented approaches become particularly apparent when comparing Figure 1 with Figure 2. While hierarchical LSTMs [7] and graph neural networks [8] achieve strong performance on HHC benchmarks, their direct application to HMC contexts yields suboptimal results due to divergent emotional causality mechanisms. For instance, user affective responses in HMC exhibit stronger coupling with task progression states than with the social alignment dynamics of HHC. This limitation stems from fundamental dataset deficiencies: current resources neither record micro-temporal emotion transitions during system response delays nor annotate multimodal disfluency patterns unique to human–machine exchanges. While methodologically rigorous, the IEMOCAP corpus [9] employs laboratory-controlled HHC scenarios that generalize poorly to real-world HMC contexts. Furthermore, ecologically valid HHC collections such as M3ED lack essential HMC metadata, including system performance metrics and user adaptation trajectories.
While existing ERC research predominantly confines its investigations to HHC [10,11], as exemplified in Figure 1, the rapid proliferation of intelligent dialogue systems has created a critical research imperative: developing robust emotion recognition frameworks for human–machine conversational contexts, a need illustrated in Figure 2. In contrast to HHC, HMC exhibits distinct disparities in emotional expression and interaction patterns. First, users often manifest more nuanced and subdued emotional expressions when interacting with machines, and these expressions are intrinsically linked to task completion progress, the quality of system feedback, and interactional fluency. Second, under specific circumstances such as task failure, delayed system responses, or insufficient system support, users' affective reactions can diverge from those observed in purely human–human interactions: for example, suppressed frustration and restraint during task failures, apprehension and unease while awaiting responses, and perplexity and frustration stemming from system misunderstandings. Traditional ERC methodologies predicated on human–human interactions face significant challenges in this nascent domain. First, existing datasets do not sufficiently characterize the granular behavioral patterns and contextual variables inherent in human–machine interactions, hindering the capture of task-driven emotional evolution. Second, conventional models often neglect critical situational factors within human–machine interactions, such as conversational history, task execution progress, and system feedback efficacy. Third, current approaches do not comprehensively account for the unique modalities of emotional expression and their evolutionary trajectories when modeling user affect in human–machine interactions.
Therefore, the creation of a high-quality multimodal ERC dataset and corresponding methodologies tailored to human–machine dialogue scenarios has become a pressing research imperative. An ideal dataset should exhibit the following characteristics: (1) comprehensive capture of key contextual factors in human–machine interactions, including dialogue history, task progress, and machine feedback quality; (2) accurate recording of users' multimodal emotional expression patterns within specific interaction contexts; and (3) reflection of the dynamic evolution of emotional states in task-driven dialogues. Concurrently, novel computational models must be developed to better capture the unique emotional expression patterns and contextual dependencies inherent in human–machine interactions. Such development is crucial for providing theoretical support and practical guidance for building more natural and intelligent dialogue systems. The main contributions of this work are summarized as follows:
(1) Overcoming the limitation of traditional emotion datasets being restricted to human–human conversation scenarios, we constructed dialogue scripts that simulate authentic human–machine interactions by combining machine-generated and human-authored content. We collected multimodal data and provided structured annotations, including dialogue topic classification, speech act types, emotional directivity, emotion intensity, and contextual dependencies.
(2) We propose a novel multimodal emotion recognition framework specifically designed for human–machine dialogue scenarios. This framework integrates various multimodal fusion methods and effectively utilizes structured emotional analysis constraints, such as conversation context and speaker information derived from human–machine interactions. This approach leads to more accurate and robust dialogue emotion analysis.
(3) We conducted comprehensive benchmark experiments, ablation studies, and visualization analyses to rigorously evaluate our approach. The experimental results show that our proposed multimodal emotion recognition framework significantly outperforms existing methods in human–machine interaction scenarios, achieving substantial improvements across key performance metrics.
3. Multi-HM Dataset
This work investigates the recognition and analysis of users' complex emotional states in human–machine interaction scenarios. To address this objective, we employ a dual-layer annotation system that facilitates fine-grained corpus annotation across both sentiment and emotion categories. The research specifically focuses on the characteristics of emotional experiences and their evolutionary patterns during interactions between users and intelligent dialogue systems. Recognizing the unique nature and complexity of human–machine interaction scenarios, we developed a systematic three-stage process for constructing a multimodal emotion dataset, comprising: (1) scenario-driven design of dialogue scripts; (2) synchronous collection and recording of multimodal interaction data; and (3) development of a structured annotation system for dialogue emotion inference.
3.1. Data Construction
This study aims to construct a high-quality, diverse dataset specifically focused on human–computer interaction (HCI) scenarios, as indicated in Figure 3. The dialogue content comprehensively covers typical HCI domains such as information seeking, instruction execution, emotional support, problem troubleshooting, chit-chat interaction, and policy information retrieval. To achieve this goal, we propose a dual-track dialogue script generation strategy comprising the LLM–Human Collaboration Paradigm and the Real-World Corpus-Based Paradigm. The two paradigms form a synergistic mechanism and complement each other in the data generation process:
(1) LLM–Human Collaboration Paradigm: This paradigm employs a hybrid approach that combines large language model generation with human optimization. Specifically, we utilize GPT-4.0 (https://openai.com/index/gpt-4/, accessed on 24 December 2024) as the core dialogue generation engine, guiding the model to produce initial dialogue scripts through systematic prompt engineering, as depicted in Figure 4 (a minimal prompt-construction sketch is also provided after this list). Subsequently, the research team rigorously evaluates and optimizes the generated content across multiple dimensions, including fluency, contextual appropriateness, logical coherence, and task relevance. Through this human intervention, we mitigate potential issues such as hallucination, semantic deviation, and task drift, ensuring the reliability and practicality of the generated content.
(2) Real-World Corpus-Based Paradigm: To further enhance the ecological validity of the dataset and capture the spontaneous expressions and complex emotional interactions unique to real-world human–machine interaction, we systematically collected multi-source real interaction data. These data encompass customer service dialogue records and virtual assistant interaction logs. Subsequently, through a rigorous anonymization and data-cleansing process, we extracted representative dialogue structures, user intents, and emotional expression patterns.
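To make the first stage of the LLM–Human Collaboration Paradigm concrete, the sketch below shows one way the prompt-driven script generation could be implemented with the OpenAI Python client. The system prompt, temperature, and helper function are hypothetical illustrations rather than the prompts actually used (those are outlined in Figure 4), and every generated draft still goes through the human review stage described above.

```python
# Illustrative sketch of the first stage of the LLM–Human Collaboration Paradigm:
# prompt-driven generation of an initial dialogue script with the OpenAI client.
# The system prompt, temperature, and turn count are hypothetical placeholders.
from openai import OpenAI

client = OpenAI()  # assumes OPENAI_API_KEY is set in the environment

SYSTEM_PROMPT = (
    "You write realistic Chinese human-machine dialogues for a given scenario. "
    "Alternate turns between 'Human' and 'Machine'; the human user should show "
    "task-contingent emotions (e.g., frustration after a failed request)."
)

def draft_dialogue(topic: str, n_turns: int = 10) -> str:
    """Generate an initial dialogue script for one scenario topic."""
    response = client.chat.completions.create(
        model="gpt-4",
        messages=[
            {"role": "system", "content": SYSTEM_PROMPT},
            {"role": "user", "content": f"Topic: {topic}. Write {n_turns} turns."},
        ],
        temperature=0.9,  # higher temperature for more diverse drafts
    )
    return response.choices[0].message.content

draft = draft_dialogue("problem troubleshooting")  # passed on to human review
```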
3.2. Recording
To ensure the validity and representational richness of the multimodal data, we engaged professional performers to enact the dialogue scripts and implemented a rigorous quality control protocol during data collection. First, we mandated that the actors' facial expressions and vocal characteristics remain clearly discernible in the video footage; in addition, key emotional segments were required to exceed a sufficient duration threshold to fully capture the spatiotemporal characteristics of emotional information. Second, the actors were required to strictly adhere to the predefined emotional categories in their performances, utilizing multimodal cues such as micro-expressions, intonation variations, and body language to ensure the authenticity and discriminability of emotional expressions. The final dataset comprises dialogues recorded by 15 participants, with each contributing multiple conversations to ensure data diversity.
3.3. Annotation
As a paradigm of auxiliary knowledge infusion, the Chain-of-Thought (CoT) reasoning mechanism significantly enhances the predictive efficacy of deep learning architectures in emotion analysis tasks while also providing interpretable insights. This cognitively inspired framework establishes a transparent pipeline that mirrors the hierarchical progression of human reasoning, laying a foundation for model interpretability through fully traceable input–output mappings. The architectural foundation of our model is rooted in the seminal OCC (Ortony, Clore, and Collins) emotion taxonomy [26], a theoretical cornerstone in affective computing that formalizes emotion generation through cognitively grounded appraisal rules. This framework establishes (1) formalized Emotion-Cognitive Reasoning (ECR) heuristics and (2) machine-actionable Emotion-Cognitive Knowledge (EC-Knowledge) representations. Our principal methodological innovation lies in integrating qualitative appraisal mechanisms with the latent-space manipulation capabilities of Pre-trained Language Models (PLMs). This synergistic architecture leverages distributional semantics learned from massive corpora and injects psychologically grounded reasoning constraints, enabling deep cognitive modeling of user-generated emotional texts through theoretically informed neural adaptation. To effectively train and comprehensively evaluate our emotion recognition model, we drew upon both the Dialog Act Markup in Several Layers (DAMSL) annotation scheme [27] and the OCC model to construct a five-tuple annotation framework for emotional cognitive reasoning (as shown in Figure 5).
This framework encompasses Role, Topic, Act, Sentiment, and Emotion. At the emotion annotation level, the system adopts a hierarchical classification approach, initially categorizing emotions into two broad classes, “Satisfaction” and “Dissatisfaction”, which are subsequently refined into a set of more nuanced emotion categories, including “Regret”, “Urgency”, “Confusion”, “Anger”, “Worry”, “Embarrassment”, and “Acceptance”. The annotation process is deeply integrated with theoretical frameworks of emotion-cognitive reasoning, ensuring theoretical consistency of the annotations. The five fields are defined as follows (an illustrative annotation record is sketched after the list):
Role: Indicates the role of the dialogue participant, such as “Human” or “Machine”.
Topic: Describes the subject matter of the dialogue, such as “Chat-Oriented” or “Task-Oriented”.
Act: Represents the type of dialogue act, such as “Question”, “Statement”, or “Command”.
Sentiment: Expresses the polarity of the emotion, such as “Positive”, “Negative”, or “Neutral”.
Emotion: Specifies the particular emotion category, such as “Satisfaction”, “Anger”, or “Worry”.
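To make the five-tuple concrete, the sketch below shows one possible in-memory representation of a single annotated utterance. The field names follow the framework above; the example values and the dataclass layout are illustrative and do not reflect the released file format.

```python
# Minimal sketch of one annotation record under the five-tuple framework
# (Role, Topic, Act, Sentiment, Emotion). The concrete values and this
# dataclass layout are illustrative, not the released Multi-HM schema.
from dataclasses import dataclass

@dataclass
class UtteranceAnnotation:
    role: str       # "Human" or "Machine"
    topic: str      # "Chat-Oriented" or "Task-Oriented"
    act: str        # e.g., "Question", "Statement", "Command"
    sentiment: str  # "Positive", "Negative", or "Neutral"
    emotion: str    # e.g., "Satisfaction", "Anger", "Worry"

example = UtteranceAnnotation(
    role="Human",
    topic="Task-Oriented",
    act="Question",
    sentiment="Negative",
    emotion="Worry",
)
```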
4. Model Framework
When constructing the theoretical framework, we consider a scenario in which a human (denoted as H) interacts with a multimodal conversational system, i.e., a machine (denoted as M). The interaction is represented as a conversational sequence of n utterances, U = {u_1, …, u_n}, alternating between H and M. For a target utterance u_i, we consider the context window {u_(i−p), …, u_i, …, u_(i+f)}, where p denotes the number of past utterances considered and f denotes the number of future utterances considered. The dialogue system provides the necessary textual prompts, while the user provides input in a multimodal setting, including audio (A), video (V), and text (T). In this paradigm, the customer service agent (machine, M) maintains a neutral emotional orientation, while the user's (human, H) emotional state shifts dynamically as the conversation evolves. Thus, our goal extends beyond modeling the interaction environment and speaker dependency; it encompasses integrating multimodal data from users to accurately reflect and express their emotional state.
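As a concrete reading of the notation above, the sketch below gathers the p past and f future utterances around a target turn u_i. The variable names and the simple truncation at dialogue boundaries are our assumptions rather than the paper's exact formulation.

```python
# Sketch of context-window construction around a target utterance u_i:
# the model considers {u_(i-p), ..., u_i, ..., u_(i+f)}. Truncation at the
# dialogue boundaries is an assumed behaviour, not the paper's exact choice.
from typing import Dict, List

def context_window(dialogue: List[Dict], i: int, p: int, f: int) -> List[Dict]:
    """Return the p previous utterances, utterance i, and the f future ones."""
    start = max(0, i - p)
    end = min(len(dialogue), i + f + 1)
    return dialogue[start:end]

# Example: 2 past and 1 future utterance around turn 3 of a toy dialogue.
dialogue = [{"role": "Human" if k % 2 == 0 else "Machine", "text": f"utterance {k}"}
            for k in range(6)]
window = context_window(dialogue, i=3, p=2, f=1)
```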
4.1. Feature Extraction
Text: In our dataset, each video is precisely paired with its corresponding Chinese script. To achieve efficient encoding of the five-tuple annotations, we employ the pre-trained Chinese RoBERTa-cn model [28] and fine-tune the encoder using the SentiLARE approach [29]. Specifically, we extract sentence-level semantic embeddings from the fine-tuned model, transforming each sentence into a 768-dimensional dense vector representation. This vectorization not only captures the deep semantic features of the text but also provides a rich representational foundation for subsequent fine-grained emotion analysis.
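A minimal sketch of the sentence-level embedding step is given below, assuming a publicly available Chinese RoBERTa checkpoint from Hugging Face and mean pooling over token states; the checkpoint name and pooling choice are assumptions, and the SentiLARE-style sentiment-aware fine-tuning is omitted for brevity.

```python
# Sketch of 768-d sentence embedding extraction with a Chinese RoBERTa encoder.
# The checkpoint name and mean pooling are assumptions; in the actual pipeline
# the encoder is further fine-tuned with the SentiLARE approach first.
import torch
from transformers import AutoModel, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("hfl/chinese-roberta-wwm-ext")
encoder = AutoModel.from_pretrained("hfl/chinese-roberta-wwm-ext")

def sentence_embedding(text: str) -> torch.Tensor:
    """Encode one utterance into a 768-dimensional vector."""
    inputs = tokenizer(text, return_tensors="pt", truncation=True, max_length=128)
    with torch.no_grad():
        hidden = encoder(**inputs).last_hidden_state  # (1, seq_len, 768)
    mask = inputs["attention_mask"].unsqueeze(-1)      # (1, seq_len, 1)
    return (hidden * mask).sum(1) / mask.sum(1)        # mean-pooled (1, 768)

vec = sentence_embedding("这个问题一直没有解决，我很着急。")
```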
Video: We utilize the OpenFace 2.0 toolkit [30] to extract visual features, including 68 facial landmarks, 17 facial action units, head pose, gaze direction, and orientation measurements.
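The visual descriptors are produced offline by the OpenFace 2.0 command-line tool; the sketch below shows how the resulting per-frame CSV might be loaded and reduced to the action-unit, head-pose, and gaze columns. The column prefixes follow typical OpenFace output and may differ across versions, and the file name is illustrative.

```python
# Sketch of loading per-frame visual descriptors produced by the OpenFace 2.0
# command-line tool (e.g., FeatureExtraction -f <video> -out_dir <dir>).
# Column prefixes (action units "AU..", head pose "pose_..", gaze "gaze_..")
# follow typical OpenFace output but may vary across versions.
import pandas as pd

def load_openface_features(csv_path: str) -> pd.DataFrame:
    df = pd.read_csv(csv_path)
    df.columns = [c.strip() for c in df.columns]        # OpenFace pads column names
    keep = [c for c in df.columns
            if c.startswith(("AU", "pose_", "gaze_"))]  # AUs, head pose, gaze
    return df[keep]

visual = load_openface_features("session_01.csv")       # one row per video frame
```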
Audio: Using the Librosa toolkit [31], we extract key audio characteristics such as the fundamental frequency (log scale), Mel-Frequency Cepstral Coefficients (MFCCs), and Constant Q Transform (CQT). These features, crucial for identifying emotion and intonation, are then combined to create a 33-dimensional set of acoustic features at the frame level for each video.
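A minimal sketch of the frame-level acoustic feature extraction is given below. The split of the 33 dimensions into 1 log-F0 value, 20 MFCCs, and 12 CQT bins, as well as the hop length, are assumptions made to match the reported feature size.

```python
# Sketch of frame-level acoustic features with Librosa: log fundamental
# frequency, MFCCs, and CQT magnitudes. The 1 + 20 + 12 = 33 dimension split
# and the hop length are assumptions chosen to match the reported size.
import numpy as np
import librosa

def acoustic_features(wav_path: str, hop_length: int = 512) -> np.ndarray:
    y, sr = librosa.load(wav_path, sr=None)
    f0, _, _ = librosa.pyin(y, fmin=librosa.note_to_hz("C2"),
                            fmax=librosa.note_to_hz("C7"),
                            sr=sr, hop_length=hop_length)
    log_f0 = np.log(np.nan_to_num(f0, nan=0.0) + 1e-6)[np.newaxis, :]  # unvoiced -> floor
    mfcc = librosa.feature.mfcc(y=y, sr=sr, n_mfcc=20, hop_length=hop_length)
    cqt = np.abs(librosa.cqt(y, sr=sr, n_bins=12, hop_length=hop_length))
    n = min(log_f0.shape[1], mfcc.shape[1], cqt.shape[1])  # align frame counts
    return np.vstack([log_f0[:, :n], mfcc[:, :n], cqt[:, :n]]).T  # (frames, 33)

feats = acoustic_features("utterance_001.wav")
```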
4.2. Multimodal Fusion
To achieve the multimodal data fusion objective within our proposed human–computer interaction framework, as shown in
Figure 6, we systematically conducted comparative experiments on cross-modal representation learning and fusion methods. In our experimental design, we selected baseline models that are representative and academically cutting-edge in the field of multimodal sentiment analysis. These served as reference systems for fusion strategies [
19,
20,
21,
32,
33], encompassing the main paradigms of modal interaction and feature fusion mechanisms. The construction of this benchmarking system aims to provide the Multi-HM framework with reproducible and comparable performance evaluation benchmarks while also offering an empirical research foundation for the methodological evolution of cross-modal representation learning.
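None of the compared architectures is reproduced here; as a reference point for what feature-level fusion of the three streams looks like, the sketch below concatenates utterance-level text, audio, and video vectors and classifies them with a small network. The hidden size, the 512-dimensional video vector, and the seven-class output are assumptions, not parameters of any of the cited models.

```python
# Minimal late-fusion reference point (not one of the compared baselines):
# utterance-level text (768-d), audio (33-d), and video vectors are
# concatenated and classified. Hidden size, the video dimension, and the
# seven-class output are assumptions for illustration only.
import torch
import torch.nn as nn

class LateFusionClassifier(nn.Module):
    def __init__(self, t_dim=768, a_dim=33, v_dim=512, hidden=256, n_classes=7):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(t_dim + a_dim + v_dim, hidden),
            nn.ReLU(),
            nn.Dropout(0.3),
            nn.Linear(hidden, n_classes),
        )

    def forward(self, text, audio, video):
        # Simple concatenation fusion; the cited baselines replace this step
        # with cross-modal attention, disentanglement, or distillation.
        return self.net(torch.cat([text, audio, video], dim=-1))

model = LateFusionClassifier()
logits = model(torch.randn(4, 768), torch.randn(4, 33), torch.randn(4, 512))
```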
5. Experiment
5.1. Baselines
This section provides a concise overview of the baseline models utilized in our experiments.
Multimodal Transformer (MulT) [32]: MulT employs a Transformer network for cross-modal attention, capturing signals from aligned and unaligned pairs in asynchronous modalities.
MISA [20]: MISA distinguishes between modality-specific and modality-invariant features, using specialized loss functions to capture interactions within and between modalities.
Self-MM [19]: Self-MM utilizes a self-supervised module to generate single-modal labels, thereby enhancing the distinction between representations of the various modalities.
Decoupled Multimodal Distillation (DMD) [21]: DMD introduces a decoupled multimodal distillation approach to reduce heterogeneity between modalities.
FDR-MSA [33]: FDR-MSA utilizes a disentangled, dual feature reconstruction mechanism to extract key information while preserving information integrity.
5.2. Data Split
The Multi-HM dataset was divided into training, validation, and test subsets using an 8:1:1 split ratio. Crucially, to prevent data leakage and maintain natural conversational flow, we performed this split at the dialogue level, ensuring that entire conversations were assigned to a single subset rather than being distributed across different sets. For detailed information on the dataset partitioning, please refer to Table 2.
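A minimal sketch of the dialogue-level 8:1:1 split described above is shown below; the dialogue-identifier format and the random seed are illustrative.

```python
# Sketch of the dialogue-level 8:1:1 split: whole conversations are assigned
# to exactly one subset, so no dialogue is shared between train/val/test.
# The dialogue-ID format and the random seed are illustrative.
import random

def split_dialogues(dialogue_ids, seed=42):
    ids = sorted(set(dialogue_ids))
    random.Random(seed).shuffle(ids)
    n_train = int(0.8 * len(ids))
    n_val = int(0.1 * len(ids))
    return ids[:n_train], ids[n_train:n_train + n_val], ids[n_train + n_val:]

train_ids, val_ids, test_ids = split_dialogues([f"dlg_{k:03d}" for k in range(200)])
```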
5.3. Hyperparameter Selection
To optimize model performance, we employed a search strategy to select the best hyperparameter settings, placing particular emphasis on the parameters that most significantly affect model performance.
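The paper does not enumerate the searched values; the loop below is therefore a purely hypothetical illustration of such a search, with a placeholder training routine and an assumed grid, keeping the configuration with the best validation score.

```python
# Purely hypothetical grid-search loop. The searched values and the
# train_and_validate() stand-in are placeholders, not the settings used
# in the paper; the configuration with the best validation score is kept.
from itertools import product

GRID = {
    "learning_rate": [1e-5, 3e-5, 1e-4],
    "batch_size": [16, 32],
    "dropout": [0.1, 0.3],
}

def train_and_validate(**config) -> float:
    """Placeholder: train the fusion model with `config`, return validation F1."""
    return 0.0  # replace with an actual training and validation run

best_config, best_score = None, float("-inf")
for values in product(*GRID.values()):
    config = dict(zip(GRID.keys(), values))
    score = train_and_validate(**config)
    if score > best_score:
        best_config, best_score = config, score
```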
5.4. Results and Discussion
The results are presented in two key tables: Table 3 reports the overall classification performance, while Table 4 focuses on recognition performance for individual emotion categories.
A detailed analysis of the model predictions reveals a significant class imbalance in the binary emotion classification task. This imbalance is closely tied to the original purpose of our dataset, namely to accurately identify and reassure users experiencing negative emotions; as a consequence, non-negative emotions are under-represented and less well understood. Addressing this imbalance is an important direction for future work, requiring targeted adjustments to achieve a more balanced and comprehensive understanding of emotions. The Self-MM and FDR-MSA models performed exceptionally well in classifying the seven emotions, highlighting the importance of unimodal auxiliary annotation tasks and of model robustness when processing emotions that are difficult to distinguish. These models capture subtle emotional changes more accurately, which significantly improves their discriminative ability in the face of ambiguous or mixed emotional expressions.
5.5. Ablation Study
To investigate how multimodal information fusion affects emotion analysis in human–machine interaction, we performed systematic comparative experiments with different modality configurations on our dataset, with the results shown in Table 5.
The experimental results show that, on the Multi-HM dataset, the video modality provides more discriminative emotional representations and richer emotional semantics than the audio modality. This finding is validated by the accuracy comparison between the T&V (text–video bimodal) and T&A (text–audio bimodal) configurations: the T&V combination significantly outperforms the T&A combination in emotion classification, confirming the advantage of the video modality in capturing non-verbal emotional cues such as facial expressions and body language. Furthermore, when fully integrating the text, video, and audio modalities, the model achieves optimal performance across evaluation metrics, including emotion recognition accuracy and F1-score.
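The modality configurations in Table 5 can be read as selecting subsets of the input streams before fusion. The sketch below realizes this by zero-masking the dropped stream; zero-masking is one common ablation choice and an assumption here, not necessarily how the reported ablation was implemented.

```python
# Sketch of a modality ablation (e.g., T&V vs. T&A) realized by zero-masking
# the dropped stream before fusion. Zero-masking is an assumed choice here,
# not necessarily how the reported ablation was implemented.
import torch

def ablate(text, audio, video, use=("text", "audio", "video")):
    text = text if "text" in use else torch.zeros_like(text)
    audio = audio if "audio" in use else torch.zeros_like(audio)
    video = video if "video" in use else torch.zeros_like(video)
    return text, audio, video

t, a, v = torch.randn(4, 768), torch.randn(4, 33), torch.randn(4, 512)
t_v_inputs = ablate(t, a, v, use=("text", "video"))  # T&V configuration
t_a_inputs = ablate(t, a, v, use=("text", "audio"))  # T&A configuration
```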
5.6. Visualization Analysis
Experimental analysis based on confusion matrices demonstrates that FDR-MSA and Self-MM are both effective for multimodal emotion recognition, yet they differ in performance focus and error patterns; the results are presented in Figure 7.
FDR-MSA performs better in most emotion categories, demonstrating notable advantages in recognizing the “Angry” and “Lost” categories and highlighting its ability to capture intense emotional features. This advantage may stem from its single-label auxiliary mechanism, which models semantic consistency across modalities more effectively. In contrast, Self-MM shows more balanced performance across emotion categories, especially for “Worried”, where it achieves an accuracy of 70%, significantly higher than FDR-MSA's 56%. This suggests that Self-MM employs a more robust feature learning strategy, enabling it to better capture subtle inter-modal interaction features in emotional expressions. It is important to note that the dataset construction in this study was tailored to the practical needs of human–machine interaction scenarios, with particular emphasis on accurately recognizing extremely negative emotions such as anger. This intentional design choice in the data distribution, while potentially affecting the evaluation of model generalization, provides a more realistic benchmark for practical application scenarios. These findings offer valuable guidance for model selection and scenario adaptation in multimodal emotion recognition: FDR-MSA may be more advantageous in scenarios requiring the recognition of extreme emotions, whereas Self-MM may be more suitable for scenarios requiring balanced recognition of multiple emotions.
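A minimal sketch of the per-class confusion-matrix analysis behind Figure 7 is shown below, using scikit-learn; the label set follows the emotion taxonomy of Section 3.3 (the display names in Figure 7 may differ), and the true/predicted labels are placeholders for a model's test-set output.

```python
# Sketch of the confusion-matrix analysis behind Figure 7 with scikit-learn.
# LABELS follows the Section 3.3 taxonomy (Figure 7's display names may
# differ), and y_true/y_pred are placeholders for real test-set predictions.
import matplotlib.pyplot as plt
from sklearn.metrics import ConfusionMatrixDisplay

LABELS = ["Regret", "Urgency", "Confusion", "Anger",
          "Worry", "Embarrassment", "Acceptance"]

y_true = ["Anger", "Worry", "Anger", "Acceptance", "Confusion"]  # placeholder
y_pred = ["Anger", "Worry", "Worry", "Acceptance", "Anger"]      # placeholder

ConfusionMatrixDisplay.from_predictions(
    y_true, y_pred, labels=LABELS, xticks_rotation=45
)
plt.tight_layout()
plt.show()
```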