Search Results (195)

Search Parameters:
Keywords = multi-modal emotion recognition

16 pages, 1227 KB  
Article
Multimodal Behavioral Sensors for Lie Detection: Integrating Visual, Auditory, and Generative Reasoning Cues
by Daniel Grabowski, Kamila Łuczaj and Khalid Saeed
Sensors 2025, 25(19), 6086; https://doi.org/10.3390/s25196086 - 2 Oct 2025
Abstract
Advances in multimodal artificial intelligence enable new sensor-inspired approaches to lie detection by combining behavioral perception with generative reasoning. This study presents a deception detection framework that integrates deep video and audio processing with large language models guided by chain-of-thought (CoT) prompting. We interpret neural architectures such as ViViT (for video) and HuBERT (for speech) as digital behavioral sensors that extract implicit emotional and cognitive cues, including micro-expressions, vocal stress, and timing irregularities. We further incorporate a GPT-5-based prompt-level fusion approach for video–language–emotion alignment and zero-shot inference. This method jointly processes visual frames, textual transcripts, and emotion recognition outputs, enabling the system to generate interpretable deception hypotheses without any task-specific fine-tuning. Facial expressions are treated as high-resolution affective signals captured via visual sensors, while audio encodes prosodic markers of stress. Our experimental setup is based on the DOLOS dataset, which provides high-quality multimodal recordings of deceptive and truthful behavior. We also evaluate a continual learning setup that transfers emotional understanding to deception classification. Results indicate that multimodal fusion and CoT-based reasoning increase classification accuracy and interpretability. The proposed system bridges the gap between raw behavioral data and semantic inference, laying a foundation for AI-driven lie detection with interpretable sensor analogues. Full article
(This article belongs to the Special Issue Sensor-Based Behavioral Biometrics)
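To make the "digital behavioral sensor" idea above concrete, here is a minimal sketch (not the authors' code) that pools HuBERT hidden states into an utterance-level prosody embedding which a downstream deception classifier or LLM prompt could consume; the checkpoint name, file path, and mean pooling are illustrative assumptions.

```python
# Hedged sketch: HuBERT as a "digital behavioral sensor" for vocal cues.
import torch
import torchaudio
from transformers import AutoFeatureExtractor, HubertModel

extractor = AutoFeatureExtractor.from_pretrained("facebook/hubert-base-ls960")
hubert = HubertModel.from_pretrained("facebook/hubert-base-ls960").eval()

waveform, sr = torchaudio.load("interview_clip.wav")              # hypothetical clip
waveform = torchaudio.functional.resample(waveform, sr, 16_000).mean(dim=0)  # mono, 16 kHz

inputs = extractor(waveform.numpy(), sampling_rate=16_000, return_tensors="pt")
with torch.no_grad():
    hidden = hubert(**inputs).last_hidden_state                   # (1, frames, 768)

audio_embedding = hidden.mean(dim=1)                              # utterance-level cue vector
print(audio_embedding.shape)                                      # torch.Size([1, 768])
```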
22 pages, 8300 KB  
Article
Multimodal Emotion Recognition via the Fusion of Mamba and Liquid Neural Networks with Cross-Modal Alignment
by Guoming Chen, Yuting Liao, Dong Zhang, Weikang Yang, Ziying Mai and Chenying Xu
Electronics 2025, 14(18), 3638; https://doi.org/10.3390/electronics14183638 - 14 Sep 2025
Viewed by 566
Abstract
This paper proposes a novel multimodal emotion recognition framework, termed Sparse Alignment and Liquid-Mamba (SALM), which effectively integrates the complementary strengths of Mamba networks and Liquid Neural Networks (LNNs). To capture neural dynamics, high-resolution EEG spectrograms are generated via Short-Time Fourier Transform (STFT), while heatmap features from facial images, videos, speech, and text are extracted and aligned through entropy-regularized Sinkhorn and Greenkhorn optimal transport algorithms. These aligned representations are fused to mitigate semantic disparities across modalities. The proposed SALM model leverages sparse alignment for efficient cross-modal mapping and employs the Liquid-Mamba architecture to construct a robust and generalizable classifier. Extensive experiments on benchmark datasets demonstrate that SALM consistently outperforms state-of-the-art methods in both classification accuracy and generalization ability. Full article
(This article belongs to the Section Artificial Intelligence)
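The entropy-regularized Sinkhorn alignment named in the abstract can be sketched in a few lines of NumPy; the cost matrix, epsilon, and iteration count below are toy assumptions, not the SALM configuration.

```python
# Minimal Sinkhorn iteration for aligning tokens from two modalities.
import numpy as np

def sinkhorn(cost, a, b, eps=0.1, n_iter=200):
    """Return an entropy-regularized optimal-transport plan between two distributions."""
    K = np.exp(-cost / eps)              # Gibbs kernel
    u, v = np.ones_like(a), np.ones_like(b)
    for _ in range(n_iter):
        u = a / (K @ v)                  # row scaling
        v = b / (K.T @ u)                # column scaling
    return u[:, None] * K * v[None, :]   # transport plan (rows: EEG tokens, cols: other-modality tokens)

# Toy example: align 4 EEG-spectrogram tokens with 6 tokens from another modality.
rng = np.random.default_rng(0)
eeg, other = rng.normal(size=(4, 16)), rng.normal(size=(6, 16))
cost = np.linalg.norm(eeg[:, None, :] - other[None, :, :], axis=-1)
cost /= cost.max()                       # normalize for numerical stability
plan = sinkhorn(cost, np.full(4, 1 / 4), np.full(6, 1 / 6))
print(plan.sum())                        # ~1.0, i.e. a valid coupling
```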
21 pages, 3795 KB  
Article
Rural Image Perception and Spatial Optimization Pathways Based on Social Media Data: A Case Study of Baishe Village—A Traditional Village
by Bingshu Zhao, Zhimin Gao, Meng Jiao, Ruiyao Weng, Tongyu Jia, Chenyu Xu, Xuhui Wang and Yuting Jiang
Land 2025, 14(9), 1860; https://doi.org/10.3390/land14091860 - 11 Sep 2025
Viewed by 394
Abstract
The sustainable development of traditional villages faces a core challenge stemming from the disconnect between public perception and spatial planning. To address this issue, this study, taking Baishe Village—a national-level traditional village—as a case study, constructs and applies a “Digital Humanities + Spatial Analysis” research paradigm that integrates text mining, sentiment analysis, visual coding, and spatial analysis based on multimodal social media data (Sina Weibo and Xiaohongshu) from 2013 to 2023. It aims to conduct an in-depth analysis of tourists’ rural image perception structure, emotional tendencies, and their spatial differentiation characteristics, and subsequently propose spatial optimization pathways that promote the revitalization of its cultural landscape and sustainable land use. The main findings reveal the following: (1) In terms of cognitive structure, the rural image presents a ‘settlement-dominated’ four-dimensional structure, with settlement elements such as pit kilns (accounting for more than 70%) as the absolute core. (2) In terms of emotional tendencies, a cognitive tension is formed between the high recognition of architectural heritage value (positive sentiment: 57.44%) and significant dissatisfaction with service facilities. (3) In terms of spatial patterns, a “dual-core-driven” pattern of perceived hotspots emerges, with 83% of tourist activities concentrated in the central–southern main road area, revealing a “revitalization gap” in village spatial utilization. The contribution of this study lies in translating abstract public perceptions into quantifiable spatial insights, thereby constructing and validating a “Digital Humanities + Spatial Analysis” paradigm that fuses multimodal data and links abstract perception with concrete space. This provides a crucial theoretical basis and practical guidance for the living conservation of cultural landscapes, the enhancement of land use efficiency, and refined spatial governance. Full article
(This article belongs to the Special Issue Rural Space: Between Renewal Processes and Preservation)
27 pages, 1845 KB  
Review
Technological Evolution and Research Trends of Intelligent Question-Answering Systems in Healthcare
by Bingyin Lei and Panpan Yin
Healthcare 2025, 13(18), 2269; https://doi.org/10.3390/healthcare13182269 - 11 Sep 2025
Viewed by 452
Abstract
Background/Objective: This study investigates the implementation and evolution of intelligent medical question-answering (QA) systems in healthcare to enhance service efficiency and quality. Methods: Through an integrated literature review and bibliometric analysis using CiteSpace 6.3.R1(64-bit) Basic software, we systematically evaluated core concepts, frameworks, and applications within medical QA systems, analyzing literature from 2018 to 2025 to identify research trends. Results: Significant applications were revealed across clinical decision support, medical knowledge retrieval, traditional Chinese medicine (TCM) formulation development, medical imaging report analysis, medical record quality control, mental health monitoring, and emotion recognition, demonstrating optimized resource allocation and service efficiency. Persistent challenges include system accuracy limitations, multimodal interaction capabilities, user trust barriers, and privacy protection concerns. Conclusion: Future research should prioritize multimodal diagnostic imaging, TCM-specific AI agents, and virtual-reality-assisted surgical exploration. Contributions: This work consolidates current achievements while establishing theoretical–practical foundations for innovation and large-scale implementation, advancing intelligent healthcare transformation. Full article
12 pages, 1419 KB  
Proceeding Paper
A Real-Time Intelligent Surveillance System for Suspicious Behavior and Facial Emotion Analysis Using YOLOv8 and DeepFace
by Uswa Ihsan, Noor Zaman Jhanjhi, Humaira Ashraf, Farzeen Ashfaq and Fikri Arif Wicaksana
Eng. Proc. 2025, 107(1), 59; https://doi.org/10.3390/engproc2025107059 - 4 Sep 2025
Viewed by 3267
Abstract
This study describes the creation of an intelligent surveillance system based on deep learning that aims to improve real-time security monitoring by automatically identifying suspicious activity. By using cutting-edge computer vision techniques, the suggested system overcomes the drawbacks of conventional surveillance that depends on human observation to spot irregularities in public spaces. The system successfully completes motion detection, trajectory analysis, and emotion recognition by using the YOLOv8 model for object detection and DeepFace for facial emotion analysis. Roboflow is used for dataset annotation, model training with optimized parameters, and visualization of object trajectories and detection confidence. The findings show that abnormal behaviors can be accurately identified, with noteworthy observations made about the emotional expressions and movement patterns of those deemed to be threats. Even though the system performs well in real time, issues like misclassification, model explainability, and a lack of diversity in the dataset still exist. Future research will concentrate on integrating multimodal data fusion, deeper models, and temporal sequence analysis to further enhance detection robustness and system intelligence. Full article
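The detection-plus-emotion stage described above can be approximated with the public ultralytics and deepface APIs; the sketch below is an illustration, with the confidence threshold and video source as assumptions rather than the paper's configuration.

```python
# Hedged sketch: YOLOv8 person detection followed by DeepFace emotion analysis.
import cv2
from ultralytics import YOLO
from deepface import DeepFace

detector = YOLO("yolov8n.pt")                       # pretrained COCO weights
cap = cv2.VideoCapture("surveillance_feed.mp4")     # hypothetical video source

while cap.isOpened():
    ok, frame = cap.read()
    if not ok:
        break
    for box in detector(frame, verbose=False)[0].boxes:
        if int(box.cls) != 0 or float(box.conf) < 0.5:   # class 0 = person in COCO
            continue
        x1, y1, x2, y2 = map(int, box.xyxy[0])
        person = frame[y1:y2, x1:x2]
        try:
            analysis = DeepFace.analyze(person, actions=["emotion"], enforce_detection=False)
            print(analysis[0]["dominant_emotion"], (x1, y1, x2, y2))
        except ValueError:
            pass                                          # no usable face in the crop
cap.release()
```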
10 pages, 1081 KB  
Proceeding Paper
Insights into the Emotion Classification of Artificial Intelligence: Evolution, Application, and Obstacles of Emotion Classification
by Marselina Endah Hiswati, Ema Utami, Kusrini Kusrini and Arief Setyanto
Eng. Proc. 2025, 103(1), 24; https://doi.org/10.3390/engproc2025103024 - 3 Sep 2025
Viewed by 342
Abstract
In this systematic literature review, we examined the integration of emotional intelligence into artificial intelligence (AI) systems, focusing on advancements, challenges, and opportunities in emotion classification technologies. Accurate emotion recognition in AI holds immense potential in healthcare, the IoT, and education. However, challenges such as computational demands, limited dataset diversity, and real-time deployment complexity remain significant. In this review, we included research on emerging solutions like multimodal data processing, attention mechanisms, and real-time emotion tracking to address these issues. By overcoming these issues, AI systems enhance human–AI interactions and expand real-world applications. Recommendations for improving accuracy and scalability in emotion-aware AI are provided based on the review results. Full article
(This article belongs to the Proceedings of The 8th Eurasian Conference on Educational Innovation 2025)
25 pages, 4433 KB  
Article
Mathematical Analysis and Performance Evaluation of CBAM-DenseNet121 for Speech Emotion Recognition Using the CREMA-D Dataset
by Zineddine Sarhani Kahhoul, Nadjiba Terki, Ilyes Benaissa, Khaled Aldwoah, E. I. Hassan, Osman Osman and Djamel Eddine Boukhari
Appl. Sci. 2025, 15(17), 9692; https://doi.org/10.3390/app15179692 - 3 Sep 2025
Viewed by 536
Abstract
Emotion recognition from speech is essential for human–computer interaction (HCI) and affective computing, with applications in virtual assistants, healthcare, and education. Although deep learning has driven significant advances in Automatic Speech Emotion Recognition (ASER), the task remains challenging because of speaker variation, subtle emotional expressions, and environmental noise. Practical deployment therefore depends on a robust, fast, and scalable recognition system. This work introduces a new framework that combines DenseNet121, fine-tuned for the crowd-sourced emotional multimodal actors dataset (CREMA-D), with the convolutional block attention module (CBAM). While DenseNet121’s effective feature propagation captures rich, hierarchical patterns in the speech data, CBAM sharpens the model’s focus on emotionally significant elements by applying both spatial and channel-wise attention. An advanced preprocessing pipeline, including log-Mel spectrogram transformation and normalization, further enhances the input spectrograms and strengthens resistance to environmental noise. The proposed model demonstrates superior performance. To keep the evaluation robust under class imbalance, we report an Unweighted Average Recall (UAR) of 71.01% and an F1 score of 71.25%, alongside a test accuracy of 71.26% and a precision of 71.30%. These results establish the model as a promising solution for real-world speech emotion detection, highlighting its strong generalization, computational efficiency, and focus on emotion-specific features compared with recent work. The improvements also demonstrate practical flexibility, enabling the integration of established image recognition techniques and substantial adaptability across application contexts. Full article
(This article belongs to the Section Computing and Artificial Intelligence)
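The log-Mel preprocessing stage named in the abstract can be sketched with librosa; the sample rate, number of Mel bands, and per-utterance normalization are illustrative assumptions, not the exact CBAM-DenseNet121 pipeline.

```python
# Hedged sketch: log-Mel spectrogram "image" preparation for a CNN backbone.
import librosa
import numpy as np

def log_mel_spectrogram(path, sr=16_000, n_mels=128):
    y, _ = librosa.load(path, sr=sr)
    mel = librosa.feature.melspectrogram(y=y, sr=sr, n_mels=n_mels)
    log_mel = librosa.power_to_db(mel, ref=np.max)
    # per-utterance normalization before feeding the 2-D map to DenseNet121
    return (log_mel - log_mel.mean()) / (log_mel.std() + 1e-8)

spec = log_mel_spectrogram("crema_d_clip.wav")   # hypothetical CREMA-D file path
print(spec.shape)                                # (128, time_frames)
```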
22 pages, 47099 KB  
Article
Deciphering Emotions in Children’s Storybooks: A Comparative Analysis of Multimodal LLMs in Educational Applications
by Bushra Asseri, Estabrag Abaker, Maha Al Mogren, Tayef Alhefdhi and Areej Al-Wabil
AI 2025, 6(9), 211; https://doi.org/10.3390/ai6090211 - 2 Sep 2025
Viewed by 729
Abstract
Emotion recognition capabilities in multimodal AI systems are crucial for developing culturally responsive educational technologies yet remain underexplored for Arabic language contexts, where culturally appropriate learning tools are critically needed. This study evaluated the emotion recognition performance of two advanced multimodal large language models, GPT-4o and Gemini 1.5 Pro, when processing Arabic children’s storybook illustrations. We assessed both models across three prompting strategies (zero-shot, few-shot, and chain-of-thought) using 75 images from seven Arabic storybooks, comparing model predictions with human annotations based on Plutchik’s emotional framework. GPT-4o consistently outperformed Gemini across all conditions, achieving the highest macro F1-score of 59% with chain-of-thought prompting compared to Gemini’s best performance of 43%. Error analysis revealed systematic misclassification patterns, with valence inversions accounting for 60.7% of errors, while both models struggled with culturally nuanced emotions and ambiguous narrative contexts. These findings highlight fundamental limitations in current models’ cultural understanding and emphasize the need for culturally sensitive training approaches to develop effective emotion-aware educational technologies for Arabic-speaking learners. Full article
(This article belongs to the Special Issue Exploring the Use of Artificial Intelligence in Education)
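The comparison against human annotations reported above boils down to a macro F1 computation over Plutchik's label set; the sketch below uses scikit-learn with toy predictions standing in for the models' outputs.

```python
# Hedged sketch: scoring model emotion labels against human annotations.
from sklearn.metrics import classification_report, f1_score

PLUTCHIK = ["joy", "trust", "fear", "surprise", "sadness", "disgust", "anger", "anticipation"]

human = ["joy", "fear", "sadness", "anger", "joy", "surprise"]        # toy annotations
model = ["joy", "fear", "joy", "anger", "sadness", "surprise"]        # hypothetical VLM outputs

print("macro F1:", f1_score(human, model, labels=PLUTCHIK, average="macro", zero_division=0))
print(classification_report(human, model, labels=PLUTCHIK, zero_division=0))
```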
20 pages, 3439 KB  
Article
Multimodal Emotion Recognition Based on Graph Neural Networks
by Zhongwen Tu, Raoxin Yan, Sihan Weng, Jiatong Li and Wei Zhao
Appl. Sci. 2025, 15(17), 9622; https://doi.org/10.3390/app15179622 - 1 Sep 2025
Viewed by 732
Abstract
Emotion recognition remains a challenging task in human–computer interaction. With advancements in multimodal computing, multimodal emotion recognition has become increasingly important. To address the existing limitations in multimodal fusion efficiency, emotional–semantic association mining, and long-range context modeling, we propose an innovative graph neural network (GNN)-based framework. Our methodology integrates three key components: (1) a hierarchical sequential fusion (HSF) multimodal integration approach, (2) a sentiment–emotion enhanced joint learning framework, and (3) a context-similarity dual-layer graph architecture (CS-BiGraph). The experimental results demonstrate that our method achieves 69.1% accuracy on the IEMOCAP dataset, setting a new state of the art. For future work, we will explore robust extensions of our framework under real-world scenarios with higher noise levels and investigate the integration of emerging modalities for broader applicability. Full article
(This article belongs to the Special Issue Advanced Technologies and Applications of Emotion Recognition)
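The general idea of classifying utterance nodes in a conversation graph can be sketched with PyTorch Geometric; the two-layer GCN, temporal edge construction, and feature size below are illustrative assumptions, not the paper's HSF or CS-BiGraph architecture.

```python
# Hedged sketch: per-utterance emotion classification over a dialogue graph.
import torch
from torch_geometric.nn import GCNConv

class UtteranceGCN(torch.nn.Module):
    def __init__(self, in_dim=512, hidden=128, n_emotions=6):
        super().__init__()
        self.conv1 = GCNConv(in_dim, hidden)
        self.conv2 = GCNConv(hidden, n_emotions)

    def forward(self, x, edge_index):
        h = torch.relu(self.conv1(x, edge_index))
        return self.conv2(h, edge_index)           # per-utterance emotion logits

# Toy dialogue: 5 utterances with fused multimodal features, edges linking
# temporal neighbors (a stand-in for context and similarity edges alike).
x = torch.randn(5, 512)
edge_index = torch.tensor([[0, 1, 1, 2, 2, 3, 3, 4],
                           [1, 0, 2, 1, 3, 2, 4, 3]])
logits = UtteranceGCN()(x, edge_index)
print(logits.shape)                                # torch.Size([5, 6])
```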
16 pages, 1500 KB  
Article
Emotion Recognition in Autistic Children Through Facial Expressions Using Advanced Deep Learning Architectures
by Petra Radočaj and Goran Martinović
Appl. Sci. 2025, 15(17), 9555; https://doi.org/10.3390/app15179555 - 30 Aug 2025
Viewed by 807
Abstract
Atypical and subtle facial expression patterns in individuals with autism spectrum disorder (ASD) pose a significant challenge for automated emotion recognition. This study evaluates and compares the performance of convolutional neural networks (CNNs) and transformer-based deep learning models for facial emotion recognition in this population. Using a labeled dataset of emotional facial images, we assessed eight models across four emotion categories: natural, anger, fear, and joy. Our results demonstrate that transformer models consistently outperformed CNNs in both overall and emotion-specific metrics. Notably, the Swin Transformer achieved the highest performance, with an accuracy of 0.8000 and an F1-score of 0.7889, significantly surpassing all CNN counterparts. While CNNs failed to detect the fear class, transformer models showed a measurable capability in identifying complex emotions such as anger and fear, suggesting an enhanced ability to capture subtle facial cues. Analysis of the confusion matrix further confirmed the transformers’ superior classification balance and generalization. Despite these promising results, the study has limitations, including class imbalance and its reliance solely on facial imagery. Future work should explore multimodal emotion recognition, model interpretability, and personalization for real-world applications. Research also demonstrates the potential of transformer architectures in advancing inclusive, emotion-aware AI systems tailored for autistic individuals. Full article
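Adapting a pretrained Swin Transformer to the four emotion categories above is a standard fine-tuning exercise; the sketch below uses torchvision with an assumed optimizer, input size, and dummy batch, not the study's training setup.

```python
# Hedged sketch: swapping the Swin-T head for a 4-class facial-emotion task.
import torch
from torch import nn
from torchvision import models

NUM_CLASSES = 4                                    # natural, anger, fear, joy
model = models.swin_t(weights=models.Swin_T_Weights.IMAGENET1K_V1)
model.head = nn.Linear(model.head.in_features, NUM_CLASSES)

optimizer = torch.optim.AdamW(model.parameters(), lr=3e-5)
criterion = nn.CrossEntropyLoss()

# One illustrative training step on a dummy batch of 224x224 face crops.
images = torch.randn(8, 3, 224, 224)
labels = torch.randint(0, NUM_CLASSES, (8,))
loss = criterion(model(images), labels)
loss.backward()
optimizer.step()
print(float(loss))
```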
18 pages, 919 KB  
Article
KD-MSA: A Multimodal Implicit Sentiment Analysis Approach Based on KAN and Asymmetric Contribution-Aware Dynamic Fusion
by Zhiyuan Hou, Qiang Zhang, Ziwei Lei, Zheng Zeng and Ruijun Jia
Symmetry 2025, 17(9), 1401; https://doi.org/10.3390/sym17091401 - 28 Aug 2025
Viewed by 492
Abstract
Because implicit emotions lack explicit emotional feature words, they are often conveyed through subtle, weak cues across modalities, which poses a significant challenge for multimodal sentiment analysis. To improve implicit emotion recognition, this paper proposes a multimodal sentiment analysis method that integrates KAN with a modality-level dynamic fusion mechanism. The method first introduces the KAN structure to build a modal feature encoder that enhances the emotional expressiveness of features. The emotional contribution weight of each modality is then calculated from the difference between the unimodal and multimodal sentiment scores, and a cross-attention mechanism guided by the dominant modality is used for feature fusion. Experiments on four datasets, CH-SIMS, CH-SIMSv2, MOSI, and MOSEI, show that the proposed method significantly outperforms mainstream models across multiple indicators, especially on samples with implicit or ambiguous emotional expressions. The results verify the effectiveness of strengthening feature encoding and exploiting modal asymmetry information in implicit sentiment analysis. Full article
(This article belongs to the Section Computer)
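The contribution-weighting idea can be illustrated by scoring each modality by how far its unimodal sentiment prediction deviates from the fused score; the function, softmax temperature, and toy values below are hypothetical, not the KD-MSA implementation.

```python
# Hedged sketch: deriving modality contribution weights from score differences.
import torch

def contribution_weights(unimodal_scores: dict, multimodal_score: torch.Tensor,
                         temperature: float = 1.0) -> dict:
    names = list(unimodal_scores)
    # smaller deviation from the fused score => larger assumed emotional contribution
    deviations = torch.stack([(unimodal_scores[m] - multimodal_score).abs().mean()
                              for m in names])
    weights = torch.softmax(-deviations / temperature, dim=0)
    return dict(zip(names, weights))

scores = {"text": torch.tensor([0.6]), "audio": torch.tensor([0.1]), "vision": torch.tensor([0.4])}
fused = torch.tensor([0.5])
print(contribution_weights(scores, fused))   # text/vision dominate, audio is down-weighted
```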
23 pages, 3014 KB  
Article
Multimodal Emotion Recognition for Seafarers: A Framework Integrating Improved D-S Theory and Calibration: A Case Study of a Real Navigation Experiment
by Liu Yang, Junzhang Yang, Chengdeng Cao, Mingshuang Li, Peng Fei and Qing Liu
Appl. Sci. 2025, 15(17), 9253; https://doi.org/10.3390/app15179253 - 22 Aug 2025
Viewed by 494
Abstract
The influence of seafarers’ emotions on work performance can lead to severe marine accidents. However, research on emotion recognition (ER) for seafarers remains insufficient: existing studies deploy only single models and disregard model uncertainty, which can make recognition unreliable. In this paper, a novel fusion framework for seafarer ER is proposed. First, feature-level fusion is performed on Electroencephalogram (EEG) and navigation data collected in a real navigation environment, and calibration is employed to mitigate the uncertainty of the outcomes. Second, a weight combination strategy for decision fusion is designed. Finally, we conduct a series of evaluations of the proposed model. The results show that the average recognition performance across the three emotional dimensions, as measured by accuracy, precision, recall, and F1 score, reaches 85.14%, 84.43%, 86.27%, and 85.33%, respectively, demonstrating that physiological and navigation data can effectively identify seafarers’ emotional states. Moreover, the fusion model compensates for the uncertainty of single models and enhances ER performance, providing a feasible path for seafarer ER. The findings can be used to promptly identify seafarers’ emotional states, support early warnings in bridge systems for shipping companies, and inform policy-making on human factors to enhance maritime safety. Full article
(This article belongs to the Section Marine Science and Engineering)
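The classical core of the D-S decision fusion mentioned above is Dempster's rule of combination, sketched below over a toy emotion frame; the paper's improvements (calibration, weighted evidence) are not reproduced here.

```python
# Minimal sketch: Dempster's rule combining two basic probability assignments.
from itertools import product

def dempster_combine(m1: dict, m2: dict) -> dict:
    """Combine two BPAs whose focal elements are frozensets over the same frame."""
    combined, conflict = {}, 0.0
    for (a, x), (b, y) in product(m1.items(), m2.items()):
        inter = a & b
        if inter:
            combined[inter] = combined.get(inter, 0.0) + x * y
        else:
            conflict += x * y                      # mass assigned to the empty set
    return {k: v / (1.0 - conflict) for k, v in combined.items()}

POS, NEU, NEG = frozenset({"pos"}), frozenset({"neu"}), frozenset({"neg"})
eeg_model = {POS: 0.6, NEG: 0.3, POS | NEU | NEG: 0.1}        # EEG-based classifier (toy)
nav_model = {POS: 0.5, NEU: 0.2, POS | NEU | NEG: 0.3}        # navigation-data classifier (toy)
print(dempster_combine(eeg_model, nav_model))
```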
27 pages, 4153 KB  
Article
Mitigating Context Bias in Vision–Language Models via Multimodal Emotion Recognition
by Constantin-Bogdan Popescu, Laura Florea and Corneliu Florea
Electronics 2025, 14(16), 3311; https://doi.org/10.3390/electronics14163311 - 20 Aug 2025
Viewed by 945
Abstract
Vision–Language Models (VLMs) have become key contributors to the state of the art in contextual emotion recognition, demonstrating a superior ability to understand the relationship between context, facial expressions, and interactions in images compared to traditional approaches. However, their reliance on contextual cues can introduce unintended biases, especially when the background does not align with the individual’s true emotional state. This raises concerns for the reliability of such models in real-world applications, where robustness and fairness are critical. In this work, we explore the limitations of current VLMs in emotionally ambiguous scenarios and propose a method to overcome contextual bias. Existing VLM-based captioning solutions tend to overweight background and contextual information when determining emotion, often at the expense of the individual’s actual expression. To study this phenomenon, we created synthetic datasets by automatically extracting people from the original images using YOLOv8 and placing them on randomly selected backgrounds from the Landscape Pictures dataset. This allowed us to reduce the correlation between emotional expression and background context while preserving body pose. Through discriminative analysis of VLM behavior on images with both correct and mismatched backgrounds, we find that in 93% of the cases, the predicted emotions vary based on the background—even when models are explicitly instructed to focus on the person. To address this, we propose a multimodal approach (named BECKI) that incorporates body pose, full image context, and a novel description stream focused exclusively on identifying the emotional discrepancy between the individual and the background. Our primary contribution is not just in identifying the weaknesses of existing VLMs, but in proposing a more robust and context-resilient solution. Our method achieves up to 96% accuracy, highlighting its effectiveness in mitigating contextual bias. Full article
(This article belongs to the Special Issue Feature Papers in Artificial Intelligence)
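The synthetic-data step described above (segment the person, paste the cutout onto an unrelated landscape) can be sketched with the public ultralytics and Pillow APIs; file paths and the compositing details are assumptions rather than the paper's dataset-generation code.

```python
# Hedged sketch: building a mismatched-background image for context-bias probing.
import numpy as np
from PIL import Image
from ultralytics import YOLO

segmenter = YOLO("yolov8n-seg.pt")

def swap_background(person_img_path: str, background_img_path: str) -> Image.Image:
    result = segmenter(person_img_path, classes=[0], verbose=False)[0]   # class 0 = person
    image = Image.open(person_img_path).convert("RGB")
    if result.masks is None:
        return image                                        # no person found; keep original
    mask = result.masks.data[0].cpu().numpy()               # (h, w) binary mask
    mask = Image.fromarray((mask * 255).astype(np.uint8)).resize(image.size)
    background = Image.open(background_img_path).convert("RGB").resize(image.size)
    return Image.composite(image, background, mask)         # person kept, context replaced

composite = swap_background("emotic_person.jpg", "landscape_0001.jpg")   # hypothetical files
composite.save("mismatched_context.jpg")
```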
23 pages, 10088 KB  
Article
Development of an Interactive Digital Human with Context-Sensitive Facial Expressions
by Fan Yang, Lei Fang, Rui Suo, Jing Zhang and Mincheol Whang
Sensors 2025, 25(16), 5117; https://doi.org/10.3390/s25165117 - 18 Aug 2025
Viewed by 772
Abstract
With the increasing complexity of human–computer interaction scenarios, conventional digital human facial expression systems show notable limitations in handling multi-emotion co-occurrence, dynamic expression, and semantic responsiveness. This paper proposes a digital human system framework that integrates multimodal emotion recognition and compound facial expression generation. The system establishes a complete pipeline for real-time interaction and compound emotional expression, following a sequence of “speech semantic parsing—multimodal emotion recognition—Action Unit (AU)-level 3D facial expression control.” First, a ResNet18-based model is employed for robust emotion classification using the AffectNet dataset. Then, an AU motion curve driving module is constructed on the Unreal Engine platform, where dynamic synthesis of basic emotions is achieved via a state-machine mechanism. Finally, Generative Pre-trained Transformer (GPT) is utilized for semantic analysis, generating structured emotional weight vectors that are mapped to the AU layer to enable language-driven facial responses. Experimental results demonstrate that the proposed system significantly improves facial animation quality, with naturalness increasing from 3.54 to 3.94 and semantic congruence from 3.44 to 3.80. These results validate the system’s capability to generate realistic and emotionally coherent expressions in real time. This research provides a complete technical framework and practical foundation for high-fidelity digital humans with affective interaction capabilities. Full article
(This article belongs to the Special Issue Emotion Recognition Based on Sensors (3rd Edition))
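The final mapping stage described above, turning a structured emotion-weight vector into Action Unit targets for the face rig, can be illustrated as follows; the AU table is a simplified FACS-style assumption and the blending rule is hypothetical, not the system's actual mapping or its Unreal Engine state machine.

```python
# Hedged sketch: blending LLM emotion weights into AU intensity targets.
EMOTION_TO_AUS = {
    "joy":      {"AU6": 0.9, "AU12": 1.0},             # cheek raiser, lip corner puller
    "sadness":  {"AU1": 0.8, "AU4": 0.6, "AU15": 0.9},
    "anger":    {"AU4": 1.0, "AU5": 0.6, "AU23": 0.8},
    "surprise": {"AU1": 0.9, "AU2": 0.9, "AU26": 0.7},
}

def emotion_weights_to_au_targets(weights: dict) -> dict:
    """Blend AU targets of co-occurring emotions, scaled by their weights."""
    au_targets: dict = {}
    for emotion, w in weights.items():
        for au, intensity in EMOTION_TO_AUS.get(emotion, {}).items():
            au_targets[au] = max(au_targets.get(au, 0.0), w * intensity)
    return au_targets

# e.g. a compound "pleasantly surprised" response from the semantic parser
print(emotion_weights_to_au_targets({"joy": 0.6, "surprise": 0.4}))
```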
26 pages, 663 KB  
Article
Multi-Scale Temporal Fusion Network for Real-Time Multimodal Emotion Recognition in IoT Environments
by Sungwook Yoon and Byungmun Kim
Sensors 2025, 25(16), 5066; https://doi.org/10.3390/s25165066 - 14 Aug 2025
Viewed by 908
Abstract
This paper introduces EmotionTFN (Emotion-Multi-Scale Temporal Fusion Network), a novel hierarchical temporal fusion architecture that addresses key challenges in IoT emotion recognition by processing diverse sensor data while maintaining accuracy across multiple temporal scales. The architecture integrates physiological signals (EEG, PPG, and GSR), visual, and audio data using hierarchical temporal attention across short-term (0.5–2 s), medium-term (2–10 s), and long-term (10–60 s) windows. Edge computing optimizations, including model compression, quantization, and adaptive sampling, enable deployment on resource-constrained devices. Extensive experiments on MELD, DEAP, and G-REx datasets demonstrate 94.2% accuracy on discrete emotion classification and 0.087 mean absolute error on dimensional prediction, outperforming the best baseline (87.4%). The system maintains sub-200 ms latency on IoT hardware while achieving a 40% improvement in energy efficiency. Real-world deployment validation over four weeks achieved 97.2% uptime and user satisfaction scores of 4.1/5.0 while ensuring privacy through local processing. Full article
(This article belongs to the Section Internet of Things)
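The multi-scale idea above (summarize a sensor stream over short, medium, and long windows and fuse the summaries) can be sketched in PyTorch; the window lengths follow the abstract at an assumed 10 Hz feature rate, and the mean pooling and attention layer are illustrative assumptions, not the EmotionTFN architecture.

```python
# Hedged sketch: multi-scale temporal summarization with attention fusion.
import torch
from torch import nn

class MultiScaleTemporalFusion(nn.Module):
    def __init__(self, feat_dim=64, windows=(20, 100, 600)):   # ~2 s, 10 s, 60 s at 10 Hz
        super().__init__()
        self.windows = windows
        self.attn = nn.MultiheadAttention(feat_dim, num_heads=4, batch_first=True)

    def forward(self, x):                      # x: (batch, time, feat_dim) fused sensor features
        summaries = []
        for w in self.windows:
            tail = x[:, -min(w, x.size(1)):]   # most recent window at this scale
            summaries.append(tail.mean(dim=1))
        scales = torch.stack(summaries, dim=1)            # (batch, 3, feat_dim)
        fused, _ = self.attn(scales, scales, scales)      # let the scales attend to each other
        return fused.mean(dim=1)                          # (batch, feat_dim)

stream = torch.randn(2, 600, 64)               # 60 s of fused sensor features
print(MultiScaleTemporalFusion()(stream).shape)   # torch.Size([2, 64])
```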