Article

Emotion Recognition from Videos Using Multimodal Large Language Models

Dipartimento di Automatica e Informatica, Politecnico di Torino, Corso Duca degli Abruzzi, 24, 10129 Torino, Italy
* Author to whom correspondence should be addressed.
Future Internet 2024, 16(7), 247; https://doi.org/10.3390/fi16070247
Submission received: 1 June 2024 / Revised: 9 July 2024 / Accepted: 11 July 2024 / Published: 13 July 2024
(This article belongs to the Special Issue Generative Artificial Intelligence in Smart Societies)

Abstract

The rapid spread of Multimodal Large Language Models (MLLMs) has opened new research directions in video content understanding and classification. Emotion recognition from videos aims to automatically detect human emotions such as anxiety and fear. It requires jointly analyzing multiple data modalities, including acoustic and visual streams. State-of-the-art approaches leverage transformer-based architectures to combine multimodal sources. However, the impressive performance of MLLMs in content retrieval and generation offers new opportunities to extend the capabilities of existing emotion recognizers. This paper explores the performance of MLLMs on the emotion recognition task in a zero-shot learning setting. Furthermore, it presents a state-of-the-art architecture extension based on MLLM content reformulation. The performance achieved on the Hume-Reaction benchmark shows that MLLMs are still unable to outperform the state-of-the-art average performance but, notably, are more effective than traditional transformers in recognizing emotions with an intensity that deviates from the average of the samples.
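The zero-shot setting described in the abstract can be sketched as follows. This is an illustrative outline only, not the authors' implementation: `build_prompt` and `parse_scores` are hypothetical helpers, the MLLM call itself is omitted, and the seven emotion classes assumed here are those of the Hume-Reaction benchmark. In a zero-shot setup the model receives only a task description, with no labeled examples.

```python
# Hedged sketch of zero-shot emotion-intensity estimation with an MLLM.
# The actual video-capable MLLM call is intentionally omitted; only the
# prompt construction and reply parsing are shown.

# Emotion classes assumed from the Hume-Reaction benchmark.
EMOTIONS = ["Adoration", "Amusement", "Anxiety", "Disgust",
            "Empathic Pain", "Fear", "Surprise"]

def build_prompt(emotions):
    """Zero-shot prompt: a task description with no labeled examples."""
    return (
        "Watch the video and rate the intensity of each emotion "
        "on a scale from 0 to 1. Reply with one 'emotion: score' "
        "line per emotion.\n" + "\n".join(emotions)
    )

def parse_scores(reply, emotions):
    """Extract per-emotion intensities from the model's free-text reply,
    clamping each score to [0, 1] and ignoring malformed lines."""
    scores = {e: 0.0 for e in emotions}
    for line in reply.splitlines():
        if ":" not in line:
            continue
        name, _, value = line.partition(":")
        name = name.strip()
        if name in scores:
            try:
                scores[name] = min(max(float(value), 0.0), 1.0)
            except ValueError:
                pass  # non-numeric score; keep the default
    return scores
```

Parsing defensively matters here: in a zero-shot setting the model's free-text reply is not guaranteed to follow the requested format, so unknown emotion names, out-of-range values, and malformed lines are tolerated rather than raising errors.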
Keywords: video–language large language models; emotion recognition; emotional reaction intensity estimation; multimodal learning

Share and Cite

MDPI and ACS Style

Vaiani, L.; Cagliero, L.; Garza, P. Emotion Recognition from Videos Using Multimodal Large Language Models. Future Internet 2024, 16, 247. https://doi.org/10.3390/fi16070247


Note that from the first issue of 2016, this journal uses article numbers instead of page numbers.
