Open Access Article
Emotion Recognition from Videos Using Multimodal Large Language Models
by Lorenzo Vaiani, Luca Cagliero and Paolo Garza *
Dipartimento di Automatica e Informatica, Politecnico di Torino, Corso Duca degli Abruzzi, 24, 10129 Torino, Italy
* Author to whom correspondence should be addressed.
Future Internet 2024, 16(7), 247; https://doi.org/10.3390/fi16070247
Submission received: 1 June 2024 / Revised: 9 July 2024 / Accepted: 11 July 2024 / Published: 13 July 2024
Abstract
The diffusion of Multimodal Large Language Models (MLLMs) has opened new research directions in the context of video content understanding and classification. Emotion recognition from videos aims to automatically detect human emotions such as anxiety and fear. It requires jointly analyzing multiple data modalities, including the acoustic and visual streams. State-of-the-art approaches leverage transformer-based architectures to combine multimodal sources. However, the impressive performance of MLLMs in content retrieval and generation offers new opportunities to extend the capabilities of existing emotion recognizers. This paper explores the performance of MLLMs on the emotion recognition task in a zero-shot learning setting. Furthermore, it presents an extension of a state-of-the-art architecture based on MLLM content reformulation. The performance achieved on the Hume-Reaction benchmark shows that MLLMs are still unable to outperform the state-of-the-art average performance but, notably, are more effective than traditional transformers in recognizing emotions whose intensity deviates from the sample average.
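The zero-shot setting described in the abstract can be illustrated with a minimal sketch. The prompt construction and response parsing below are purely illustrative assumptions (the paper's actual prompts and the MLLM call are not shown here; the model reply is mocked); the emotion classes listed are those annotated in the Hume-Reaction benchmark.

```python
# Hedged sketch: zero-shot emotion-intensity prompting for an MLLM.
# The model invocation is stubbed; only prompt construction and
# free-text response parsing are demonstrated.
import re

# Emotion classes annotated in the Hume-Reaction benchmark.
EMOTIONS = ["Adoration", "Amusement", "Anxiety", "Disgust",
            "Empathic Pain", "Fear", "Surprise"]

def build_prompt(emotions=EMOTIONS):
    """Zero-shot instruction asking the model to rate each emotion in [0, 1]."""
    listing = ", ".join(emotions)
    return (
        "Watch the video and rate the intensity of each emotion "
        f"({listing}) expressed by the person, as a number between 0 and 1. "
        "Answer with one 'Emotion: score' pair per line."
    )

def parse_scores(reply, emotions=EMOTIONS):
    """Extract per-emotion intensities from the model's textual reply."""
    scores = {}
    for emo in emotions:
        m = re.search(rf"{re.escape(emo)}\s*:\s*([01](?:\.\d+)?)", reply)
        scores[emo] = float(m.group(1)) if m else 0.0  # default when missing
    return scores

# Example with a mocked reply; a real pipeline would send build_prompt()
# together with the video frames/audio to the MLLM.
mock_reply = ("Adoration: 0.1\nAmusement: 0.7\nAnxiety: 0.0\nDisgust: 0.05\n"
              "Empathic Pain: 0.0\nFear: 0.0\nSurprise: 0.4")
print(parse_scores(mock_reply)["Amusement"])  # 0.7
```

Parsing defensively (defaulting a missing emotion to 0.0) matters in practice because MLLM replies in a zero-shot setting do not always follow the requested format.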
Share and Cite
MDPI and ACS Style
Vaiani, L.; Cagliero, L.; Garza, P.
Emotion Recognition from Videos Using Multimodal Large Language Models. Future Internet 2024, 16, 247.
https://doi.org/10.3390/fi16070247
Note that from the first issue of 2016, this journal uses article numbers instead of page numbers.