Article

Emotion Recognition from Videos Using Multimodal Large Language Models

Dipartimento di Automatica e Informatica, Politecnico di Torino, Corso Duca degli Abruzzi, 24, 10129 Torino, Italy
* Author to whom correspondence should be addressed.
Future Internet 2024, 16(7), 247; https://doi.org/10.3390/fi16070247
Submission received: 1 June 2024 / Revised: 9 July 2024 / Accepted: 11 July 2024 / Published: 13 July 2024
(This article belongs to the Special Issue Generative Artificial Intelligence in Smart Societies)

Abstract

The rapid spread of Multimodal Large Language Models (MLLMs) has opened new research directions in video content understanding and classification. Emotion recognition from videos aims to automatically detect human emotions such as anxiety and fear. It requires jointly analyzing multiple data modalities, including acoustic and visual streams. State-of-the-art approaches leverage transformer-based architectures to combine multimodal sources. However, the impressive performance of MLLMs in content retrieval and generation offers new opportunities to extend the capabilities of existing emotion recognizers. This paper explores the performance of MLLMs on the emotion recognition task in a zero-shot learning setting. Furthermore, it presents a state-of-the-art architecture extension based on MLLM content reformulation. The performance achieved on the Hume-Reaction benchmark shows that MLLMs are still unable to outperform the state-of-the-art average performance but, notably, are more effective than traditional transformers in recognizing emotions with an intensity that deviates from the average of the samples.
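The zero-shot setting described in the abstract can be sketched as follows. This is an illustrative outline only, not the authors' implementation: `build_prompt` and `parse_scores` are hypothetical helpers, the MLLM call itself is omitted, and the seven emotion classes assumed here are those of the Hume-Reaction benchmark. In a zero-shot setup the model receives only a task description, with no labeled examples.

```python
# Hedged sketch of zero-shot emotion-intensity estimation with an MLLM.
# The actual video-capable MLLM call is intentionally omitted; only the
# prompt construction and reply parsing are shown.

# Emotion classes assumed from the Hume-Reaction benchmark.
EMOTIONS = ["Adoration", "Amusement", "Anxiety", "Disgust",
            "Empathic Pain", "Fear", "Surprise"]

def build_prompt(emotions):
    """Zero-shot prompt: a task description with no labeled examples."""
    return (
        "Watch the video and rate the intensity of each emotion "
        "on a scale from 0 to 1. Reply with one 'emotion: score' "
        "line per emotion.\n" + "\n".join(emotions)
    )

def parse_scores(reply, emotions):
    """Extract per-emotion intensities from the model's free-text reply,
    clamping each score to [0, 1] and ignoring malformed lines."""
    scores = {e: 0.0 for e in emotions}
    for line in reply.splitlines():
        if ":" not in line:
            continue
        name, _, value = line.partition(":")
        name = name.strip()
        if name in scores:
            try:
                scores[name] = min(max(float(value), 0.0), 1.0)
            except ValueError:
                pass  # non-numeric score; keep the default
    return scores
```

Parsing defensively matters here: in a zero-shot setting the model's free-text reply is not guaranteed to follow the requested format, so unknown emotion names, out-of-range values, and malformed lines are tolerated rather than raising errors.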
Keywords: video–language large language models; emotion recognition; emotional reaction intensity estimation; multimodal learning

Share and Cite

MDPI and ACS Style

Vaiani, L.; Cagliero, L.; Garza, P. Emotion Recognition from Videos Using Multimodal Large Language Models. Future Internet 2024, 16, 247. https://doi.org/10.3390/fi16070247


Note that from the first issue of 2016, this journal uses article numbers instead of page numbers.
