Search Results (44)

Search Parameters:
Keywords = contextual emotion recognition

23 pages, 874 KB  
Article
School Belonging and STEM Career Interest in Chinese Adolescents: The Mediating Role of Science Identity
by Yuling Li and Yan Kong
Behav. Sci. 2025, 15(10), 1365; https://doi.org/10.3390/bs15101365 - 7 Oct 2025
Viewed by 40
Abstract
Adolescents’ sustained engagement in STEM fields is critical for cultivating future scientific talent. While school belonging—a key form of emotional support perceived by students within the school environment—has been widely studied, its specific influence on STEM career interest, particularly within non-Western educational systems, remains insufficiently understood. Drawing on Social Cognitive Career Theory (SCCT), this study examines how school belonging, as a contextual affordance, shapes STEM career interest among Chinese high school students, and whether science identity, as a person input, mediates this relationship. Utilizing data from 451 students in a science-focused Chinese high school, multiple regression analyses demonstrated that school belonging significantly predicts higher STEM career interest. Science identity partially mediated this relationship, with science interest emerging as the strongest mediating component, followed by competence/performance beliefs; external recognition had a comparatively weaker effect. These findings suggest that fostering school belonging in science-oriented environments may support adolescents’ interest in STEM careers, both directly and indirectly through the development of science identity. From a cultural perspective, the study further sheds light on the mechanisms underlying students’ interest in STEM careers, and highlights the potential of inclusive environments that support the development of students’ sense of belonging and identity in promoting their long-term engagement in STEM fields. Full article
(This article belongs to the Topic Educational and Health Development of Children and Youths)
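The partial-mediation result reported above can be outlined in code. Below is a minimal sketch of a regression-based mediation check using statsmodels; the data frame and column names (belonging, science_identity, stem_interest) are hypothetical stand-ins, not the authors' variables or data.

```python
# Sketch of a simple mediation test (not the authors' code or data).
# Assumes a pandas DataFrame `df` with hypothetical columns:
# 'belonging', 'science_identity', 'stem_interest'.
import pandas as pd
import statsmodels.formula.api as smf

def mediation_summary(df: pd.DataFrame) -> dict:
    # Total effect: predictor -> outcome
    total = smf.ols("stem_interest ~ belonging", data=df).fit()
    # Path a: predictor -> mediator
    path_a = smf.ols("science_identity ~ belonging", data=df).fit()
    # Paths b and c': mediator and predictor together -> outcome
    full = smf.ols("stem_interest ~ belonging + science_identity", data=df).fit()

    c = total.params["belonging"]          # total effect
    c_prime = full.params["belonging"]     # direct effect with mediator included
    a = path_a.params["belonging"]
    b = full.params["science_identity"]
    return {
        "total_effect": c,
        "direct_effect": c_prime,
        "indirect_effect_ab": a * b,       # non-zero a*b alongside non-zero c' suggests partial mediation
    }
```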

31 pages, 1452 KB  
Article
A User-Centric Context-Aware Framework for Real-Time Optimisation of Multimedia Data Privacy Protection, and Information Retention Within Multimodal AI Systems
by Ndricim Topalli and Atta Badii
Sensors 2025, 25(19), 6105; https://doi.org/10.3390/s25196105 - 3 Oct 2025
Viewed by 226
Abstract
The increasing use of AI systems for face, object, action, scene, and emotion recognition raises significant privacy risks, particularly when processing Personally Identifiable Information (PII). Current privacy-preserving methods lack adaptability to users’ preferences and contextual requirements, and obfuscate user faces uniformly. This research proposes a user-centric, context-aware, and ontology-driven privacy protection framework that dynamically adjusts privacy decisions based on user-defined preferences, entity sensitivity, and contextual information. The framework integrates state-of-the-art models for recognising faces, objects, scenes, actions, and emotions in real time on data acquired from vision sensors (e.g., cameras). Privacy decisions are directed by a contextual ontology based on Contextual Integrity theory, which classifies entities into private, semi-private, or public categories. Adaptive privacy levels are enforced through obfuscation techniques and a multi-level privacy model that supports user-defined red lines (e.g., “always hide logos”). The framework also proposes a Re-Identifiability Index (RII) using soft biometric features such as gait, hairstyle, clothing, skin tone, age, and gender, to mitigate identity leakage and to support fallback protection when face recognition fails. The experimental evaluation relied on sensor-captured datasets, which replicate real-world image sensors such as surveillance cameras. User studies confirmed that the framework was effective, with 85.2% of participants rating the obfuscation operations as highly effective and the remaining 14.8% rating them as adequately effective. Amongst these, 71.4% considered the balance between privacy protection and usability very satisfactory and 28% found it satisfactory. GPU acceleration was deployed to enable real-time performance of these models by reducing frame processing time from 1200 ms (CPU) to 198 ms. This ontology-driven framework employs user-defined red lines, contextual reasoning, and dual metrics (RII/IVI) to dynamically balance privacy protection with scene intelligibility. Unlike current anonymisation methods, the framework provides a real-time, user-centric, and GDPR-compliant method that operationalises privacy-by-design while preserving scene intelligibility. These features make the framework appropriate for a variety of real-world applications, including healthcare, surveillance, and social media. Full article
(This article belongs to the Section Intelligent Sensors)
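The abstract describes the RII as a score over soft-biometric cues but does not give a formula. The sketch below is purely illustrative: a weighted sum over hypothetical per-cue confidences, with invented weights and an invented threshold for escalating obfuscation.

```python
# Illustrative sketch of a soft-biometric re-identifiability score.
# Weights, cue names, and the threshold are hypothetical, not the paper's values.
SOFT_BIOMETRIC_WEIGHTS = {
    "gait": 0.25, "hairstyle": 0.15, "clothing": 0.20,
    "skin_tone": 0.10, "age": 0.15, "gender": 0.15,
}

def re_identifiability_index(cue_confidences: dict[str, float]) -> float:
    """Weighted sum of per-cue detection confidences in [0, 1]."""
    return sum(w * cue_confidences.get(cue, 0.0)
               for cue, w in SOFT_BIOMETRIC_WEIGHTS.items())

def privacy_level(entity_category: str, rii: float) -> str:
    """Map an entity's context category and RII to an obfuscation level."""
    if entity_category == "private" or rii > 0.7:   # invented threshold
        return "full_obfuscation"
    if entity_category == "semi-private":
        return "partial_obfuscation"
    return "no_obfuscation"

# Example: fallback protection when face recognition fails but gait and clothing are visible.
print(privacy_level("semi-private", re_identifiability_index({"gait": 0.9, "clothing": 0.8})))
```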

19 pages, 5381 KB  
Article
Context-Driven Emotion Recognition: Integrating Multi-Cue Fusion and Attention Mechanisms for Enhanced Accuracy on the NCAER-S Dataset
by Merieme Elkorchi, Boutaina Hdioud, Rachid Oulad Haj Thami and Safae Merzouk
Information 2025, 16(10), 834; https://doi.org/10.3390/info16100834 - 26 Sep 2025
Viewed by 329
Abstract
In recent years, most conventional emotion recognition approaches have concentrated primarily on facial cues, often overlooking complementary sources of information such as body posture and contextual background. This limitation reduces their effectiveness in complex, real-world environments. In this work, we present a multi-branch emotion recognition framework that separately processes facial, bodily, and contextual information using three dedicated neural networks. To better capture contextual cues, we intentionally mask the face and body of the main subject within the scene, prompting the model to explore alternative visual elements that may convey emotional states. To further enhance the quality of the extracted features, we integrate both channel and spatial attention mechanisms into the network architecture. Evaluated on the challenging NCAER-S dataset, our model achieves an accuracy of 56.42%, surpassing the state-of-the-art GLAMOUR-Net. These results highlight the effectiveness of combining multi-cue representation and attention-guided feature extraction for robust emotion recognition in unconstrained settings. The findings also highlight the importance of accurate emotion recognition for human–computer interaction, where affect detection enables systems to adapt to users and deliver more effective experiences. Full article
(This article belongs to the Special Issue Multimodal Human-Computer Interaction)
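The channel and spatial attention mentioned in the abstract can be illustrated with a generic CBAM-style block. This is a sketch in PyTorch under common conventions, not the authors' architecture, and the reduction ratio and kernel size are placeholders.

```python
# Generic channel + spatial attention block (CBAM-style); a sketch, not the paper's model.
import torch
import torch.nn as nn

class ChannelSpatialAttention(nn.Module):
    def __init__(self, channels: int, reduction: int = 16):
        super().__init__()
        # Channel attention: squeeze spatial dims, produce per-channel weights
        self.channel_mlp = nn.Sequential(
            nn.Linear(channels, channels // reduction), nn.ReLU(),
            nn.Linear(channels // reduction, channels),
        )
        # Spatial attention: 7x7 conv over pooled channel maps
        self.spatial_conv = nn.Conv2d(2, 1, kernel_size=7, padding=3)

    def forward(self, x: torch.Tensor) -> torch.Tensor:  # x: (B, C, H, W)
        b, c, _, _ = x.shape
        avg = self.channel_mlp(x.mean(dim=(2, 3)))
        mx = self.channel_mlp(x.amax(dim=(2, 3)))
        x = x * torch.sigmoid(avg + mx).view(b, c, 1, 1)   # channel reweighting

        pooled = torch.cat([x.mean(dim=1, keepdim=True),
                            x.amax(dim=1, keepdim=True)], dim=1)
        return x * torch.sigmoid(self.spatial_conv(pooled))  # spatial reweighting
```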

25 pages, 663 KB  
Article
Exploring the Multifaceted Nature of Work Happiness: A Mixed-Method Study
by Rune Bjerke
Adm. Sci. 2025, 15(9), 351; https://doi.org/10.3390/admsci15090351 - 5 Sep 2025
Viewed by 601
Abstract
Work happiness is commonly described as an umbrella concept encompassing job satisfaction, engagement, and emotional attachment to the workplace. However, few studies have explored its underlying sources and emotional experiences, raising questions about its conceptual clarity and measurement. This exploratory inductive mixed-methods study investigates whether work happiness can be better understood by distinguishing between its sources (antecedents) and emotional expressions (outcomes). In the qualitative phase, 23 part-time adult students from Norway’s public and private sectors reflected on moments of work happiness and the emotions involved. Thematic analysis identified five source-related themes, which informed the development of 49 items. These items were tested in a quantitative survey distributed to 4000 employees, yielding 615 usable responses. Exploratory factor analysis (EFA) revealed six conceptually coherent source dimensions—such as autonomy, recognition, and togetherness—and one emotional dimension. Regression analysis demonstrated statistically significant associations between source factors and emotional experiences, offering initial support for a dual-structure model of work happiness. Notably, the findings revealed a dialectical interplay between individual (“I”) and collective (“We”) sources, suggesting that work happiness emerges from both personal agency and social belonging. While promising, these findings are preliminary and require further validation. The study contributes to theory by proposing a grounded, multidimensional framework for work happiness and invites future research to examine its psychometric robustness and cross-contextual applicability. Full article
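The exploratory factor analysis step can be outlined in code. Below is a minimal sketch with scikit-learn, not the authors' procedure: the respondent-by-item matrix is randomly generated placeholder data, and the factor count simply mirrors the reported six source dimensions plus one emotional dimension.

```python
# Sketch of an EFA step only (not the authors' analysis pipeline or data).
# `items` is a hypothetical (n_respondents, 49) matrix of Likert-scale responses.
import numpy as np
from sklearn.decomposition import FactorAnalysis
from sklearn.preprocessing import StandardScaler

items = np.random.default_rng(0).integers(1, 6, size=(615, 49)).astype(float)  # placeholder data
z = StandardScaler().fit_transform(items)

fa = FactorAnalysis(n_components=7, rotation="varimax")  # 6 source factors + 1 emotional factor
scores = fa.fit_transform(z)       # (n_respondents, 7) factor scores
loadings = fa.components_.T        # (49, 7) item-factor loadings to inspect for interpretability
```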

27 pages, 4153 KB  
Article
Mitigating Context Bias in Vision–Language Models via Multimodal Emotion Recognition
by Constantin-Bogdan Popescu, Laura Florea and Corneliu Florea
Electronics 2025, 14(16), 3311; https://doi.org/10.3390/electronics14163311 - 20 Aug 2025
Viewed by 1147
Abstract
Vision–Language Models (VLMs) have become key contributors to the state of the art in contextual emotion recognition, demonstrating a superior ability to understand the relationship between context, facial expressions, and interactions in images compared to traditional approaches. However, their reliance on contextual cues can introduce unintended biases, especially when the background does not align with the individual’s true emotional state. This raises concerns for the reliability of such models in real-world applications, where robustness and fairness are critical. In this work, we explore the limitations of current VLMs in emotionally ambiguous scenarios and propose a method to overcome contextual bias. Existing VLM-based captioning solutions tend to overweight background and contextual information when determining emotion, often at the expense of the individual’s actual expression. To study this phenomenon, we created synthetic datasets by automatically extracting people from the original images using YOLOv8 and placing them on randomly selected backgrounds from the Landscape Pictures dataset. This allowed us to reduce the correlation between emotional expression and background context while preserving body pose. Through discriminative analysis of VLM behavior on images with both correct and mismatched backgrounds, we find that in 93% of the cases, the predicted emotions vary based on the background—even when models are explicitly instructed to focus on the person. To address this, we propose a multimodal approach (named BECKI) that incorporates body pose, full image context, and a novel description stream focused exclusively on identifying the emotional discrepancy between the individual and the background. Our primary contribution is not just in identifying the weaknesses of existing VLMs, but in proposing a more robust and context-resilient solution. Our method achieves up to 96% accuracy, highlighting its effectiveness in mitigating contextual bias. Full article
(This article belongs to the Special Issue Feature Papers in Artificial Intelligence)
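The background-swapping step used to build the synthetic datasets amounts to a compositing operation. The sketch below assumes a binary person mask has already been obtained (for example, from a YOLOv8 segmentation model) and picks a background at random; the directory layout and helper name are hypothetical, and this is not the authors' pipeline.

```python
# Sketch of compositing a segmented person onto a random background
# to decorrelate emotion from context (not the authors' pipeline).
import random
from pathlib import Path
import numpy as np
from PIL import Image

def swap_background(person_img: Image.Image,
                    person_mask: np.ndarray,        # HxW binary mask from a segmentation model
                    background_dir: Path) -> Image.Image:
    backgrounds = list(background_dir.glob("*.jpg"))
    bg = Image.open(random.choice(backgrounds)).convert("RGB").resize(person_img.size)

    fg = np.asarray(person_img.convert("RGB"))
    bg_arr = np.asarray(bg)
    mask = person_mask[..., None].astype(bool)       # broadcast the mask over RGB channels
    composite = np.where(mask, fg, bg_arr)           # keep person pixels, replace the context
    return Image.fromarray(composite.astype(np.uint8))
```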

17 pages, 554 KB  
Review
Post-Concussion Syndrome and Functional Neurological Disorder: Diagnostic Interfaces, Risk Mechanisms, and the Functional Overlay Model
by Ioannis Mavroudis, Foivos Petridis, Eleni Karantali, Alin Ciobica, Sotirios Papagiannopoulos and Dimitrios Kazis
Brain Sci. 2025, 15(7), 755; https://doi.org/10.3390/brainsci15070755 - 16 Jul 2025
Cited by 1 | Viewed by 2049
Abstract
Background: Post-concussion syndrome (PCS) and Functional Neurological Disorder (FND), including Functional Cognitive Disorder (FCD), are two frequently encountered but diagnostically complex conditions. While PCS is conceptualized as a sequela of mild traumatic brain injury (mTBI), FND/FCD encompasses symptoms incompatible with recognized neurological disease, often arising in the absence of structural brain damage. Yet, both conditions exhibit considerable clinical overlap—particularly in the domains of cognitive dysfunction, emotional dysregulation, and symptom persistence despite negative investigations. Objective: This review critically examines the shared and divergent features of PCS and FND/FCD. We explore their respective epidemiology, diagnostic criteria, and risk factors—including personality traits and trauma exposure—as well as emerging insights from neuroimaging and biomarkers. We propose the “Functional Overlay Model” as a clinical tool for navigating diagnostic ambiguity in patients with persistent post-injury symptoms. Results: PCS and FND/FCD frequently share features such as subjective cognitive complaints, fatigue, anxiety, and heightened somatic vigilance. High neuroticism, maladaptive coping, prior psychiatric history, and trauma exposure emerge as common risk factors. Neuroimaging studies show persistent network dysfunction in both PCS and FND, with overlapping disruption in fronto-limbic and default mode systems. The Functional Overlay Model helps to identify cases where functional symptomatology coexists with or replaces an initial organic insult—particularly in patients with incongruent symptoms and normal objective testing. Conclusions: PCS and FND/FCD should be conceptualized along a continuum of brain dysfunction, shaped by injury, psychology, and contextual factors. Early recognition of functional overlays and stratified psychological interventions may improve outcomes for patients with persistent, medically unexplained symptoms after head trauma. This review introduces the Functional Overlay Model as a novel framework to enhance diagnostic clarity and therapeutic planning in patients presenting with persistent post-injury symptoms. Full article

21 pages, 497 KB  
Article
Small Language Models for Speech Emotion Recognition in Text and Audio Modalities
by José L. Gómez-Sirvent, Francisco López de la Rosa, Daniel Sánchez-Reolid, Roberto Sánchez-Reolid and Antonio Fernández-Caballero
Appl. Sci. 2025, 15(14), 7730; https://doi.org/10.3390/app15147730 - 10 Jul 2025
Cited by 1 | Viewed by 1872
Abstract
Speech emotion recognition has become increasingly important in a wide range of applications, driven by the development of large transformer-based natural language processing models. However, the large size of these architectures limits their usability, which has led to a growing interest in smaller models. In this paper, we evaluate nineteen of the most popular small language models for the text and audio modalities of speech emotion recognition on the IEMOCAP dataset. Based on their cross-validation accuracy, the best architectures were selected to create ensemble models to evaluate the effect of combining audio and text, as well as the effect of incorporating contextual information on model performance. The experiments conducted showed a significant increase in accuracy with the inclusion of contextual information and the combination of modalities. The results were highly competitive: the proposed ensemble model achieved an accuracy of 82.12% on the IEMOCAP dataset, outperforming numerous recent approaches. These results demonstrate the effectiveness of ensemble methods for improving speech emotion recognition performance, and highlight the feasibility of training multiple small language models on consumer-grade computers. Full article
(This article belongs to the Section Computing and Artificial Intelligence)
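Combining the text and audio models can be illustrated with a simple late-fusion ensemble over class probabilities. This is a generic sketch, not the paper's exact ensemble; the modality weights are placeholders, and the contextual variant noted in the comment is one common way to prepend dialogue history.

```python
# Generic late-fusion ensemble of per-modality class probabilities (a sketch, not the paper's model).
import numpy as np

def ensemble_probs(text_probs: np.ndarray,
                   audio_probs: np.ndarray,
                   w_text: float = 0.6,           # placeholder weights
                   w_audio: float = 0.4) -> np.ndarray:
    """Weighted average of two (n_utterances, n_classes) probability matrices."""
    fused = w_text * text_probs + w_audio * audio_probs
    return fused / fused.sum(axis=1, keepdims=True)

# Contextual variant: prepend previous utterances to the text input before scoring,
# e.g. " [SEP] ".join(dialogue[max(0, i - k): i + 1]) for a context window of k turns.
```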

24 pages, 5346 KB  
Article
Scene-Speaker Emotion Aware Network: Dual Network Strategy for Conversational Emotion Recognition
by Bingni Li, Yu Gu, Chenyu Li, He Zhang, Linsong Liu, Haixiang Lin and Shuang Wang
Electronics 2025, 14(13), 2660; https://doi.org/10.3390/electronics14132660 - 30 Jun 2025
Viewed by 475
Abstract
Incorporating external knowledge has been shown to improve emotion understanding in dialogues by enriching contextual information, such as character motivations, psychological states, and causal relations between events. Filtering and categorizing this information can significantly enhance model performance. In this paper, we present an innovative Emotion Recognition in Conversation (ERC) framework, called the Scene-Speaker Emotion Awareness Network (SSEAN), which employs a dual-strategy modeling approach. SSEAN uniquely incorporates external commonsense knowledge describing speaker states into multimodal inputs. Using parallel recurrent networks to separately capture scene-level and speaker-level emotions, the model effectively reduces the accumulation of redundant information within the speaker’s emotional space. Additionally, we introduce an attention-based dynamic screening module to enhance the quality of integrated external commonsense knowledge through three levels: (1) speaker-listener-aware input structuring, (2) role-based segmentation, and (3) context-guided attention refinement. Experiments show that SSEAN outperforms existing state-of-the-art models on two well-adopted benchmark datasets in both single-text modality and multimodal settings. Full article
(This article belongs to the Special Issue Image and Signal Processing Techniques and Applications)
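The dual-strategy idea of tracking scene-level and speaker-level emotion separately can be sketched with two parallel recurrent encoders. This is a simplified illustration, not SSEAN itself; the feature dimensions, class count, and fusion by concatenation are all placeholders.

```python
# Simplified parallel scene/speaker recurrent encoders (a sketch, not SSEAN itself).
import torch
import torch.nn as nn

class SceneSpeakerEncoder(nn.Module):
    def __init__(self, d_in: int = 768, d_hidden: int = 256, n_classes: int = 7):
        super().__init__()
        self.scene_gru = nn.GRU(d_in, d_hidden, batch_first=True)    # all utterances in the scene
        self.speaker_gru = nn.GRU(d_in, d_hidden, batch_first=True)  # only the target speaker's turns
        self.classifier = nn.Linear(2 * d_hidden, n_classes)

    def forward(self, scene_feats: torch.Tensor, speaker_feats: torch.Tensor) -> torch.Tensor:
        # scene_feats: (B, T_scene, d_in); speaker_feats: (B, T_speaker, d_in)
        scene_out, _ = self.scene_gru(scene_feats)
        speaker_out, _ = self.speaker_gru(speaker_feats)
        fused = torch.cat([scene_out[:, -1], speaker_out[:, -1]], dim=-1)
        return self.classifier(fused)      # emotion logits for the current utterance
```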

28 pages, 1634 KB  
Review
AI-Powered Vocalization Analysis in Poultry: Systematic Review of Health, Behavior, and Welfare Monitoring
by Venkatraman Manikandan and Suresh Neethirajan
Sensors 2025, 25(13), 4058; https://doi.org/10.3390/s25134058 - 29 Jun 2025
Cited by 2 | Viewed by 2496
Abstract
Artificial intelligence and bioacoustics represent a paradigm shift in non-invasive poultry welfare monitoring through advanced vocalization analysis. This comprehensive systematic review critically examines the transformative evolution from traditional acoustic feature extraction—including Mel-Frequency Cepstral Coefficients (MFCCs), spectral entropy, and spectrograms—to cutting-edge deep learning architectures encompassing Convolutional Neural Networks (CNNs), Long Short-Term Memory (LSTM) networks, attention mechanisms, and groundbreaking self-supervised models such as wav2vec2 and Whisper. The investigation reveals compelling evidence for edge computing deployment via TinyML frameworks, addressing critical scalability challenges in commercial poultry environments characterized by acoustic complexity and computational constraints. Advanced applications spanning emotion recognition, disease detection, and behavioral phenotyping demonstrate unprecedented potential for real-time welfare assessment. Through rigorous bibliometric co-occurrence mapping and thematic clustering analysis, this review exposes persistent methodological bottlenecks: dataset standardization deficits, evaluation protocol inconsistencies, and algorithmic interpretability limitations. Critical knowledge gaps emerge in cross-species domain generalization and contextual acoustic adaptation, demanding urgent research prioritization. The findings underscore explainable AI integration as essential for establishing stakeholder trust and regulatory compliance in automated welfare monitoring systems. This synthesis positions acoustic AI as a cornerstone technology enabling ethical, transparent, and scientifically robust precision livestock farming, bridging computational innovation with biological relevance for sustainable poultry production systems. Future research directions emphasize multi-modal sensor integration, standardized evaluation frameworks, and domain-adaptive models capable of generalizing across diverse poultry breeds, housing conditions, and environmental contexts while maintaining interpretability for practical farm deployment. Full article
(This article belongs to the Special Issue Feature Papers in Smart Agriculture 2025)
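The traditional acoustic features named above (MFCCs, spectrograms) are straightforward to extract. A minimal sketch with librosa follows; the file path and parameter choices are hypothetical, not drawn from any study in the review.

```python
# Minimal acoustic feature extraction for a vocalization clip (illustrative only).
import librosa
import numpy as np

y, sr = librosa.load("barn_recording.wav", sr=16000)       # hypothetical file path
mfcc = librosa.feature.mfcc(y=y, sr=sr, n_mfcc=13)          # (13, n_frames) MFCC matrix
mel = librosa.feature.melspectrogram(y=y, sr=sr, n_mels=64)
log_mel = librosa.power_to_db(mel, ref=np.max)              # log-mel spectrogram for CNN/LSTM input
```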

37 pages, 2359 KB  
Article
CAG-MoE: Multimodal Emotion Recognition with Cross-Attention Gated Mixture of Experts
by Axel Gedeon Mengara Mengara and Yeon-kug Moon
Mathematics 2025, 13(12), 1907; https://doi.org/10.3390/math13121907 - 7 Jun 2025
Cited by 2 | Viewed by 2763
Abstract
Multimodal emotion recognition faces substantial challenges due to the inherent heterogeneity of data sources, each with its own temporal resolution, noise characteristics, and potential for incompleteness. For example, physiological signals, audio features, and textual data capture complementary yet distinct aspects of emotion, requiring specialized processing to extract meaningful cues. These challenges include aligning disparate modalities, handling varying levels of noise and missing data, and effectively fusing features without diluting critical contextual information. In this work, we propose a novel Mixture of Experts (MoE) framework that addresses these challenges by integrating specialized transformer-based sub-expert networks, a dynamic gating mechanism with sparse Top-k activation, and a cross-modal attention module. Each modality is processed by multiple dedicated sub-experts designed to capture intricate temporal and contextual patterns, while the dynamic gating network selectively weights the contributions of the most relevant experts. Our cross-modal attention module further enhances the integration by facilitating precise exchange of information among modalities, thereby reinforcing robustness in the presence of noisy or incomplete data. Additionally, an auxiliary diversity loss encourages expert specialization, ensuring the fused representation remains highly discriminative. Extensive theoretical analysis and rigorous experiments on benchmark datasets—the Korean Emotion Multimodal Database (KEMDy20) and the ASCERTAIN dataset—demonstrate that our approach significantly outperforms state-of-the-art methods in emotion recognition, setting new performance baselines in affective computing. Full article
(This article belongs to the Section E1: Mathematics and Computer Science)
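The sparse Top-k gating over sub-experts can be illustrated generically. The sketch below is not the CAG-MoE implementation; the expert count, k, and the fact that all experts are evaluated before selection (fine for a sketch, wasteful in practice) are simplifications.

```python
# Generic sparse Top-k mixture-of-experts gating (a sketch, not CAG-MoE itself).
import torch
import torch.nn.functional as F

def topk_moe(x: torch.Tensor, experts: torch.nn.ModuleList,
             gate: torch.nn.Linear, k: int = 2) -> torch.Tensor:
    # x: (B, d); `gate` maps features to one logit per expert
    logits = gate(x)                                    # (B, n_experts)
    topk_vals, topk_idx = logits.topk(k, dim=-1)
    weights = F.softmax(topk_vals, dim=-1)              # renormalise over the selected experts

    # For brevity every expert is run; efficient implementations only run the selected ones.
    expert_outs = torch.stack([e(x) for e in experts], dim=1)   # (B, n_experts, d_out)
    selected = torch.gather(
        expert_outs, 1,
        topk_idx.unsqueeze(-1).expand(-1, -1, expert_outs.size(-1)))
    return (weights.unsqueeze(-1) * selected).sum(dim=1)        # (B, d_out) fused output
```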

29 pages, 2899 KB  
Article
Research on Tourist Satisfaction Evaluation of Macau’s Built Heritage Space Under the Genius Loci
by Pohsun Wang, Chenxi Li and Jing Liu
Buildings 2025, 15(10), 1701; https://doi.org/10.3390/buildings15101701 - 17 May 2025
Cited by 2 | Viewed by 1094
Abstract
As a typical World Cultural Heritage city, Macau has a distinctive regional identity and outstanding cultural value embodied in its built heritage and urban spaces. As contemporary cultural industries rapidly transform architectural heritage spaces into showcases of heritage significance, an adaptive transformation strategy cannot be ignored. Although current transformations demonstrate functional efficacy, they often neglect the cultural environment and the Gestalt of the Genius Loci, and offer a limited visitor experience. For this research, we use Genius Loci theory to identify constitutive spatial elements and derive theory-based evaluation criteria for the Mandarin’s House, which serves as the case study. The research provides a comprehensive evaluation framework across four dimensions: spatial perception, cultural identity, emotional engagement, and functional attributes, each one comprising 20 specific indicators. It examines the factors that affect the recognition of cultural identity through quantitative analysis using Importance–Performance Analysis (IPA), evaluating the importance–performance relationships of these indicators. Critical gaps between visitor expectations and current spatial performance are found, and four optimization strategies are proposed accordingly: (1) physical experience is enriched through reconstruction of the spatial narrative; (2) spiritual experience is reinforced through cultural memory activation; (3) regional characteristics are strengthened through the contextualization of heritage values; and (4) sustainable development mechanisms for adaptive reuse are established. This systematic approach offers both theoretical and practical insight into the regeneration of architectural heritage spaces in World Cultural Heritage cities. Full article
(This article belongs to the Special Issue Built Heritage Conservation in the Twenty-First Century: 2nd Edition)
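Importance–Performance Analysis classifies each indicator by comparing its mean importance and mean performance against the grand means. A minimal sketch with pandas follows; the column names are hypothetical and the quadrant labels are the conventional IPA ones, not necessarily the study's wording.

```python
# Sketch of IPA quadrant classification (illustrative; column names are hypothetical).
import pandas as pd

def ipa_quadrants(df: pd.DataFrame) -> pd.DataFrame:
    """df has one row per indicator with mean 'importance' and 'performance' scores."""
    imp_mean, perf_mean = df["importance"].mean(), df["performance"].mean()

    def quadrant(row) -> str:
        if row["importance"] >= imp_mean and row["performance"] < perf_mean:
            return "Concentrate here"        # high importance, low performance: critical gap
        if row["importance"] >= imp_mean:
            return "Keep up the good work"
        if row["performance"] < perf_mean:
            return "Low priority"
        return "Possible overkill"

    return df.assign(quadrant=df.apply(quadrant, axis=1))
```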

25 pages, 4884 KB  
Article
The Effect of Emotional Intelligence on the Accuracy of Facial Expression Recognition in the Valence–Arousal Space
by Yubin Kim, Ayoung Cho, Hyunwoo Lee and Mincheol Whang
Electronics 2025, 14(8), 1525; https://doi.org/10.3390/electronics14081525 - 9 Apr 2025
Cited by 2 | Viewed by 2372
Abstract
Facial expression recognition (FER) plays a pivotal role in affective computing and human–computer interaction by enabling machines to interpret human emotions. However, conventional FER models often overlook individual differences in emotional intelligence (EI), which may significantly influence how emotions are perceived and expressed. This study investigates the effect of EI on facial expression recognition accuracy within the valence–arousal space. Participants were divided into high and low EI groups based on a composite score derived from the Tromsø Social Intelligence Scale and performance-based emotion tasks. Five deep learning models (EfficientNetV2-L/S, MaxViT-B/T, and VGG16) were trained on the AffectNet dataset and evaluated using facial expression data collected from participants. Emotional states were predicted as continuous valence and arousal values, which were then mapped onto discrete emotion categories for interpretability. The results indicated that individuals with higher EI achieved significantly greater recognition accuracy, particularly for emotions requiring contextual understanding (e.g., anger, sadness, and happiness), while fear was better recognized by individuals with lower EI. These findings highlight the role of emotional intelligence in modulating FER performance and suggest that integrating EI-related features into valence–arousal-based models could enhance the adaptiveness of affective computing systems. Full article
(This article belongs to the Special Issue AI for Human Collaboration)
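Mapping continuous valence–arousal predictions onto discrete labels for interpretability can be done, for example, by nearest-prototype assignment. The prototype coordinates below are rough placeholders, not the study's calibration, and the label set is abbreviated.

```python
# Illustrative nearest-prototype mapping from (valence, arousal) to a discrete label.
# Prototype coordinates are rough placeholders, not the study's mapping.
import numpy as np

PROTOTYPES = {
    "happiness": (0.8, 0.5), "anger": (-0.6, 0.7), "fear": (-0.7, 0.6),
    "sadness": (-0.7, -0.4), "neutral": (0.0, 0.0),
}

def to_discrete(valence: float, arousal: float) -> str:
    point = np.array([valence, arousal])
    return min(PROTOTYPES,
               key=lambda name: np.linalg.norm(point - np.array(PROTOTYPES[name])))

print(to_discrete(0.7, 0.4))   # -> "happiness"
```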

30 pages, 2781 KB  
Article
Hybrid Multi-Attention Network for Audio–Visual Emotion Recognition Through Multimodal Feature Fusion
by Sathishkumar Moorthy and Yeon-Kug Moon
Mathematics 2025, 13(7), 1100; https://doi.org/10.3390/math13071100 - 27 Mar 2025
Cited by 5 | Viewed by 2982
Abstract
Multimodal emotion recognition involves leveraging complementary relationships across modalities to enhance the assessment of human emotions. Networks that integrate diverse information sources outperform single-modal approaches while offering greater robustness against noisy or missing data. Current emotion recognition approaches often rely on cross-modal attention mechanisms, particularly across the audio and visual modalities, and typically assume the complementary nature of the data. Despite this assumption, non-complementary relationships frequently arise in real-world data, reducing the effectiveness of feature integration that relies on consistent complementarity. While audio–visual co-learning provides a broader understanding of contextual information for practical implementation, discrepancies between audio and visual data, such as semantic inconsistencies, pose challenges and can lead to inaccurate predictions. As a result, such methods are limited in modeling intramodal and cross-modal interactions. In order to address these problems, we propose a multimodal learning framework for emotion recognition, called the Hybrid Multi-ATtention Network (HMATN). Specifically, we introduce a collaborative cross-attentional paradigm for audio–visual amalgamation, intended to effectively capture salient features across modalities while preserving both intermodal and intramodal relationships. The model calculates cross-attention weights by analyzing the relationship between the combined feature representations and the distinct modalities. Meanwhile, the network employs the Hybrid Attention of Single and Parallel Cross-Modal (HASPCM) mechanism, comprising a single-modal attention component and a parallel cross-modal attention component, which exploits complementary and hidden multimodal information to enrich the feature representation. Finally, the efficiency of the proposed method is demonstrated through experiments on complex videos from the AffWild2 and AFEW-VA datasets. The findings of these tests show that the developed attentional audio–visual fusion model offers a cost-efficient solution that surpasses state-of-the-art techniques, even when the input data are noisy or modalities are missing. Full article
(This article belongs to the Section E1: Mathematics and Computer Science)
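The audio-visual cross-attention at the core of such fusion can be sketched with standard scaled dot-product attention, letting one modality query the other. This is a generic single-head illustration, not the HMATN implementation; dimensions are placeholders.

```python
# Generic audio-queries-visual cross-attention (a sketch, not the HMATN implementation).
import math
import torch
import torch.nn as nn

class CrossModalAttention(nn.Module):
    def __init__(self, d_model: int = 256):
        super().__init__()
        self.q = nn.Linear(d_model, d_model)   # queries from audio tokens
        self.k = nn.Linear(d_model, d_model)   # keys from visual frames
        self.v = nn.Linear(d_model, d_model)

    def forward(self, audio: torch.Tensor, visual: torch.Tensor) -> torch.Tensor:
        # audio: (B, T_a, d); visual: (B, T_v, d)
        q, k, v = self.q(audio), self.k(visual), self.v(visual)
        scores = q @ k.transpose(-2, -1) / math.sqrt(q.size(-1))   # (B, T_a, T_v) attention weights
        return torch.softmax(scores, dim=-1) @ v                   # audio tokens attended over visual
```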

20 pages, 6941 KB  
Article
EmoSDS: Unified Emotionally Adaptive Spoken Dialogue System Using Self-Supervised Speech Representations
by Jaehwan Lee, Youngjun Sim, Jinyou Kim and Young-Joo Suh
Future Internet 2025, 17(4), 143; https://doi.org/10.3390/fi17040143 - 25 Mar 2025
Cited by 2 | Viewed by 1105
Abstract
In recent years, advancements in artificial intelligence, speech, and natural language processing technology have enhanced spoken dialogue systems (SDSs), enabling natural, voice-based human–computer interaction. However, discrete, token-based LLMs in emotionally adaptive SDSs focus on lexical content while overlooking essential paralinguistic cues for emotion expression. Existing methods use external emotion predictors to compensate for this but introduce computational overhead and fail to fully integrate paralinguistic features with linguistic context. Moreover, the lack of high-quality emotional speech datasets limits models’ ability to learn expressive emotional cues. To address these challenges, we propose EmoSDS, a unified SDS framework that integrates speech and emotion recognition by leveraging self-supervised learning (SSL) features. Our three-stage training pipeline enables the LLM to learn both discrete linguistic content and continuous paralinguistic features, improving emotional expressiveness and response naturalness. Additionally, we construct EmoSC, a dataset combining GPT-generated dialogues with emotional voice conversion data, ensuring greater emotional diversity and a balanced sample distribution across emotion categories. The experimental results show that EmoSDS outperforms existing models in emotional alignment and response generation, achieving a minimum 2.9% increase in text generation metrics, enhancing the LLM’s ability to interpret emotional and textual cues for more expressive and contextually appropriate responses. Full article
(This article belongs to the Special Issue Generative Artificial Intelligence in Smart Societies)

22 pages, 3887 KB  
Article
The Impact of Linguistic Variations on Emotion Detection: A Study of Regionally Specific Synthetic Datasets
by Fernando Henrique Calderón Alvarado
Appl. Sci. 2025, 15(7), 3490; https://doi.org/10.3390/app15073490 - 22 Mar 2025
Viewed by 938
Abstract
This study examines the role of linguistic regional variations in synthetic dataset generation and their impact on emotion detection performance. Emotion detection is essential for natural language processing (NLP) applications such as social media analysis, customer service, and mental health monitoring. To explore this, synthetic datasets were generated using a state-of-the-art language model, incorporating English variations from the United States, United Kingdom, and India, alongside a general baseline dataset. Two levels of prompt specificity were employed to assess the influence of regional linguistic nuances. Statistical analyses—including frequency distribution, term frequency–inverse document frequency (TF-IDF), type–token ratio (TTR), hapax legomena, pointwise mutual information (PMI) scores, and key-phrase extraction—revealed significant linguistic diversity and regional distinctions in the generated datasets. To evaluate their effectiveness, classification experiments were conducted with two models: bidirectional encoder representations from transformers (BERT) and its denoising sequence-to-sequence variant (BART). These began with zero-shot classification on the contextualized affect representations for emotion recognition (CARER) dataset, followed by fine-tuning with both baseline and region-specific datasets. Results demonstrated that region-specific datasets, particularly those generated with detailed prompts, significantly improved classification accuracy compared to the baseline. These findings underscore the importance of incorporating global linguistic variations in synthetic dataset generation, offering insights into how regional adaptations can enhance emotion detection models for diverse NLP applications. Full article
(This article belongs to the Special Issue Application of Affective Computing)
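Two of the diversity statistics mentioned, type–token ratio and bigram PMI, are simple to compute. The sketch below uses a toy whitespace-tokenized corpus; it is illustrative only and not tied to the study's datasets.

```python
# Minimal type-token ratio and bigram PMI over a toy corpus (illustrative only).
import math
from collections import Counter

tokens = "i am so happy today , honestly so happy".split()   # placeholder corpus

ttr = len(set(tokens)) / len(tokens)                          # type-token ratio

unigrams = Counter(tokens)
bigrams = Counter(zip(tokens, tokens[1:]))
n_uni, n_bi = sum(unigrams.values()), sum(bigrams.values())

def pmi(w1: str, w2: str) -> float:
    """Pointwise mutual information of an observed bigram."""
    p_joint = bigrams[(w1, w2)] / n_bi
    p_w1, p_w2 = unigrams[w1] / n_uni, unigrams[w2] / n_uni
    return math.log2(p_joint / (p_w1 * p_w2))

print(round(ttr, 2), round(pmi("so", "happy"), 2))
```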
