Article

Advancing Spanish Speech Emotion Recognition: A Comprehensive Benchmark of Pre-Trained Models

by Alex Mares 1, Gerardo Diaz-Arango 1, Jorge Perez-Jacome-Friscione 1, Hector Vazquez-Leal 1,*, Luis Hernandez-Martinez 2, Jesus Huerta-Chua 3, Andres Felipe Jaramillo-Alvarado 2 and Alfonso Dominguez-Chavez 1

1 Facultad de Instrumentacion Electronica, Universidad Veracruzana, Cto. Gonzalo Aguirre Beltran S/N, Xalapa 91000, Mexico
2 Instituto Tecnologico Superior de Poza Rica, Tecnologico Nacional de Mexico, Luis Donaldo Colosio Murrieta S/N, Arroyo del Maiz, Poza Rica 93230, Mexico
3 Electronics Department, National Institute for Astrophysics, Optics and Electronics, Sta. María Tonantzintla, Puebla 72840, Mexico
* Author to whom correspondence should be addressed.
Appl. Sci. 2025, 15(8), 4340; https://doi.org/10.3390/app15084340
Submission received: 1 March 2025 / Revised: 3 April 2025 / Accepted: 5 April 2025 / Published: 14 April 2025
(This article belongs to the Section Computing and Artificial Intelligence)

Abstract

Feature extraction for speech emotion recognition (SER) has evolved from handcrafted techniques through deep learning methods to embeddings derived from pre-trained models (PTMs). This study presents the first comparative analysis focused on using PTMs for Spanish SER, evaluating six models—Whisper, Wav2Vec 2.0, WavLM, HuBERT, TRILLsson, and CLAP—across six emotional speech databases: EmoMatchSpanishDB, MESD, MEACorpus, EmoWisconsin, INTER1SP, and EmoFilm. We propose a robust framework combining layer-wise feature extraction with Leave-One-Speaker-Out validation to ensure interpretable model comparisons. Our method significantly outperforms existing state-of-the-art benchmarks, notably achieving F1-scores of 88.32% on EmoMatchSpanishDB, 99.83% on INTER1SP, and 92.53% on MEACorpus. Layer-wise analyses reveal optimal emotional representation extraction at early layers in 24-layer models and middle layers in larger architectures. Additionally, TRILLsson exhibits remarkable generalization in speaker-independent evaluations, highlighting the necessity of strategic model selection, fine-tuning, and language-specific adaptations to maximize SER performance for Spanish.

1. Introduction

Emotions are integral to human communication, profoundly influencing our interactions and behaviors. Among the various forms of expression, speech stands out as a crucial medium in which emotions are conveyed, making it a fundamental source for emotion recognition [1]. Speech emotion recognition (SER) aims to detect embedded emotions by processing and analyzing speech signals [2].
Traditional SER systems rely on acoustic features, such as intonation, rhythm, intensity, and duration, to identify patterns associated with specific emotional states [3]. These systems follow a process with a structure that includes signal acquisition, preprocessing [4], feature extraction (focusing on acoustic properties like pitch, energy, and spectral features) [1], feature selection, classification, and model evaluation [5].
The applications of SER are broad and cover many areas. In human–computer interaction, SER enables conversational agents and robots to respond empathetically, enhancing user experiences [6,7]. In healthcare, it supports the diagnosis of mental health conditions like depression and also aids in monitoring emotional well-being [8,9,10]. SER systems are also applied in call centers to improve customer service, in the automotive industry for stress monitoring, and in educational technologies for personalized learning experiences [1,11,12].
SER models are built on two emotion representation paradigms: categorical models and dimensional models [13,14,15]. The categorical model identifies basic emotions universally recognized across all cultures. Paul Ekman initially proposed six basic emotions—sadness, happiness, disgust, anger, fear, and surprise—emphasizing their expression through facial cues and their cross-cultural universality [16]. This concept converges with the idea of Darwinian evolution, which says that emotions are primitive responses shared among humans and animals [17].
In contrast, the dimensional model represents emotions on a continuous scale, mapping affective states onto dimensions like valence and arousal [13,18]. Valence reflects the positive or negative evaluation of an emotion, while arousal denotes its intensity or activation level [19].
Over the past two decades, SER has advanced from traditional machine learning techniques relying on handcrafted features to complex deep learning models capable of automatically extracting information-rich representations from raw data [18,20,21]. The most common metrics for evaluating these systems are accuracy, precision, recall, and F1-score, with the F1-score being particularly useful for imbalanced datasets [14,15]. Techniques like leave-one-speaker-out (LOSO) provide robust benchmarking across different speakers and datasets [22].
Early approaches made use of features such as pitch, energy, and Mel-frequency cepstral coefficients (MFCCs), combined with classifiers like support vector machines (SVMs) [23,24,25], Hidden Markov Models (HMMs) [26,27,28], Gaussian Mixture Models (GMMs) [29,30], and Naïve Bayes classifiers [31,32]. While these methods laid the groundwork for SER, they often struggled to capture the intricate emotional nuances in speech, particularly in large-scale and high-dimensional datasets [13,33].
The emergence of deep learning transformed SER by introducing models like Convolutional Neural Networks (CNNs) [34,35,36], Recurrent Neural Networks (RNNs), and their variants such as Long Short-Term Memory networks (LSTMs) and Gated Recurrent Units (GRUs) [37,38,39]. These models demonstrated superior performance by automatically learning relevant features and capturing temporal dependencies [20,40]. Hybrid models combining CNNs with RNNs or incorporating attention mechanisms further enhanced SER capabilities by effectively modeling both spatial and temporal aspects of speech signals [40,41,42,43].
Recently, the use of pre-trained models (PTMs) and self-supervised learning (SSL) has marked a significant breakthrough in SER. Models like wav2vec 2.0 [44], HuBERT [45], WavLM [46], and Whisper [47] employ large volumes of unlabeled data to learn speech representations. These models can be fine-tuned for specific tasks, reducing the need for extensive labeled datasets [48,49,50]. By making use of such models, SER has achieved state-of-the-art (SOTA) performance, particularly enhancing recognition accuracy in cross-lingual and out-of-domain scenarios [51,52,53]. The integration of PTMs like wav2vec 2.0 with fine-tuning approaches, such as task-adaptive pre-training and multi-task learning, has consistently outperformed traditional methods, highlighting the potential of transfer learning to effectively capture emotional signals [54,55,56]. Furthermore, models like Whisper have challenged the conventional notion that automatic speech recognition models are suboptimal for SER, demonstrating that ASR models can perform exceptionally well in emotion recognition tasks [51].
Despite these achievements, SER still faces significant challenges, particularly in underrepresented languages like Spanish. Variability in emotional expression across speakers, languages, and cultures complicates model generalization and robustness [4,20,33]. The scarcity of large, diverse, and emotionally annotated speech corpora hinders the training of deep learning models, often resulting in overfitting and poor performance in real-world scenarios [20,51,57,58].
Cultural and linguistic differences further require the development of language-specific resources, as features like native language, accent, and pronunciation significantly influence model performance, complicating cross-lingual generalization [59]. Additionally, capturing the subtle nuances of emotions and addressing the ambiguity and subjectivity inherent in emotional expression remain complex tasks [60]. The lack of consensus on optimal acoustic features and the significant differences in performance between acted and natural emotional speech databases complicate the development of universal SER systems [21,61]. Research on the performance and generalization of SER models and methodologies using databases underrepresented in the current SOTA is therefore necessary.
While recent advances have been made in Spanish SER, several methodological gaps remain. Many systems still rely on handcrafted features such as MFCCs and spectral descriptors, limiting their ability to generalize across speakers and contexts [62,63,64,65,66]. Others integrate deep or multimodal models but conduct limited evaluations, often omitting speaker-independent protocols or cross-corpus validation [67,68]. Furthermore, while some proposals incorporate pretrained models, they do not fully exploit their representational capacity, overlooking layer-wise analysis and targeted fine-tuning [67,69,70,71]. These limitations highlight the need for broader benchmarking across Spanish emotional speech corpora and a deeper investigation of PTMs to build more robust and transferable SER systems.
This work focuses on six databases: EmoMatchSpanishDB [64], Mexican Emotional Speech Database [72], Spanish MEACorpus [68], EmoWisconsin [66], INTER1SP [73], and EmoFilm [74]. These databases were selected for their emphasis on categorical annotations spanning the full range of emotional representation elicitation, including acted, induced, and natural expressions collected from various demographics such as adults and children, professional actors, and non-actors. The diversity in data sources allows for a thorough benchmarking of model generalizability across varied real-world scenarios.
A wide range of advanced PTMs in the field of audio processing and SER were chosen, such as Wav2Vec 2.0 [44], Whisper [47], HuBERT [45], WavLM [46], TRILLsson [75], and CLAP [76]. These models include self-supervised architectures, sequence-to-sequence frameworks, and multimodal contrastive learning approaches. By using them as feature extractors, we generate embeddings that capture high-level representations of speech, allowing the identification of acoustic and paralinguistic features essential for emotion recognition. The objective is to systematically benchmark the performance of these PTMs, using their hidden layers as feature extractors, specifically for emotion recognition in Spanish speech. To this end, layer-wise and LOSO evaluations are introduced, contributing to a detailed understanding of how the various PTMs perform, how this relates to fine-tuning on the target language, and how the results compare with the SOTA of Spanish SER.
The main contributions of this work are as follows:
  • We present the first comprehensive benchmarking of PTMs for Spanish SER, covering six diverse databases and addressing the lack of evaluations in underrepresented languages.
  • We devise a robust experimental framework combining LOSO validation with layer-wise feature extraction, enabling accurate and interpretable comparisons across models.
  • Our proposed approach achieves superior results on multiple Spanish datasets, outperforming existing state-of-the-art baselines while highlighting the benefits of Spanish-focused fine-tuning.
  • We reveal how architectural depth and pretraining strategies influence emotional representations, guiding model selection in Spanish SER tasks through a detailed layer-wise and LOSO analysis.
  • We establish a reproducible benchmarking pipeline that advances research on feature extraction, generalization, and fine-tuning in Spanish SER, offering consistent metrics across diverse scenarios.
To guide the reader through our methods and results, this paper is structured as follows: Section 2 describes related work, emphasizing previous studies in the Spanish language and recent approaches in SER with PTMs. Section 3 presents the databases used and the methodology applied, including feature extraction from different layers of PTMs, the classifiers used, and the validation schemes, both cross-validation and LOSO. Section 4 shows and analyzes the results obtained for each model and database. Section 5 then compares these findings with the SOTA, discussing strengths, limitations, and notable trends. Finally, Section 6 presents the main conclusions, limitations, and points out possible directions for future work.

2. Related Works

2.1. State of the Art in Spanish SER

Recent progress in Spanish SER has leveraged diverse datasets and methodologies with notable success. Kerkeni et al. [62] used MFCC and MS features with classifiers like RNN, SVC, and MLR on the ELRA-S0329 dataset, reaching 90.05% accuracy with RNN and MFCC+MS. García-Cuesta et al. [64] applied ComParE and eGeMAPS features with SVC/XGBOOST, achieving 64.2% precision. Begazo et al. [65] combined CNN-1D, CNN-2D, and MLP architectures with spectral and spectrogram features, obtaining 96% accuracy, recall, and F1-score on MESD. Ortega-Beltrán et al. [63] employed DeepSpectrum with attention (DS-AM), achieving 98.4% on ELRA-S0329 and 68.3% on EmoMatchSpanishDB. Pan and García-Díaz [68] reported a 90.06% weighted F1-score on MEACorpus using Late Fusion and feature concatenation. In contrast, Pérez-Espinosa et al. [66] obtained only a 40.7% F1-score on the older EmoWisconsin dataset.
In the multimodal domain, IberLEF 2024 showcased the strength of combining audio and text. The best-performing model, proposed by BSC–UPC [67], reached an 86.69% F1-score using RoBERTa and XLSR-Wav2Vec 2.0, outperforming text-only approaches by nearly 10 percentage points. Table 1 provides a comparative overview of these studies.
Despite this progress, several limitations hinder the generalization and robustness of current approaches. Some studies still rely on handcrafted features such as MFCCs and spectral descriptors [62,63,64,65,66]. Others employ deep models but limit evaluation to few datasets, without considering speaker-independence or cross-corpus scenarios [67,68]. Additionally, while some use PTMs, they often overlook in-depth analyses, such as layer-wise exploration or selective fine-tuning [67,69,70,71]. These gaps highlight the need for broader benchmarking across acted, natural, and elicited Spanish corpora, and deeper PTM-based exploration to ensure consistent performance in real-world conditions.

2.2. PTMs for SER

The exploration of PTMs for SER has demonstrated significant achievements, particularly with self-supervised architectures like Wav2Vec 2.0, HuBERT, and Whisper. Triantafyllopoulos et al. [53] highlighted that fine-tuned PTMs excel in valence prediction by leveraging linguistic features, but they remain less effective for arousal and dominance, a limitation also noted by Wagner et al. [50]. Similarly, Osman et al. [51] emphasized the generalization capabilities of PTMs like Whisper in cross-lingual and out-of-domain scenarios, outperforming other models on diverse multilingual datasets. Phukan et al. [52] and Atmaja and Sasou [77] expanded this by demonstrating the robustness of paralinguistic PTMs such as TRILLsson in capturing non-semantic features critical for SER tasks across languages. Studies by Pepino et al. [54] and Chen and Rudnicky [55] explored layer-wise feature fusion and advanced fine-tuning strategies, showcasing improvements in accuracy, particularly with novel approaches like pseudo-label task adaptive pretraining (P-TAPT). The integration of multimodal approaches combining acoustic and linguistic features, as reported by Macary et al. [78], further supports the utility of PTMs in enhancing SER performance. Collectively, these works underline the versatility and adaptability of PTMs in addressing the complexities of SER tasks while identifying challenges such as robustness to domain shifts, imbalanced datasets, and the need for fine-grained feature tuning.

3. Materials and Methods

This section presents a detailed description of the methodologies employed to evaluate PTMs on Spanish emotional speech databases. The databases used cover a wide range of emotional representations in order to ensure a complete analysis of the emotional variability inherent in speech. A thorough layer-by-layer analysis of several SOTA PTMs is carried out, in which embeddings are extracted from each hidden layer, and mean pooling is applied to obtain fixed-dimensional representations suitable for classification tasks with general models, such as KNN, SVM, and MLP. The above-mentioned classifiers are trained and evaluated using cross-validation to determine their efficacy. Furthermore, to evaluate the generalization of the models to unseen speakers, LOSO cross-validation is implemented.

3.1. Databases

As part of this study, the focus was placed on six Spanish-language databases, EmoMatchSpanishDB, MESD, Spanish MEACorpus, EmoWisconsin, INTER1SP, and EmoFilm, selected for their emphasis on categorical emotion classifications, which differ from continuous models. Acted, induced, and natural emotions are included in these databases, and a comprehensive basis for analyzing emotional phenomena is provided. The details and specific contributions of each database are explained below, and a summary of them is presented in Table 2.
  • EmoMatchSpanishDB
    The EmoMatchSpanishDB (EM-SpDB) [64] database is part of EmoSpanishDB, which was created from recordings of 50 non-actor participants (30 men and 20 women) simulating Ekman’s six basic emotions plus a neutral category: surprise, disgust, fear, anger, happiness, sadness, and neutral. EmoSpanishDB initially contains 3550 audio files, free of intrinsic emotional load. Through a crowdsourcing process, perceived emotions were validated and inconsistent samples were removed, leading to the development of EM-SpDB, with the set being reduced to 2020 audios with more reliable emotional labels.
  • Mexican Emotional Speech Database.
    The Mexican Emotional Speech Database (MESD) [72] was created in 2021 and consists of 864 recordings in Mexican Spanish. Six emotions—anger, disgust, fear, happiness, sadness, and neutral—were simulated by sixteen individuals (4 men, 4 women, and 8 children). The recordings were conducted in a professional studio, and the audio was sampled at 48,000 Hz with 16-bit resolution. Each adult participant recorded 48 words per emotion, while children recorded 24 words per emotion. The database was validated in [72] using an SVM model, and an accuracy of 93.9% for men, 89.4% for women, and 83.3% for children was achieved.
  • Spanish MEACorpus.
    The Spanish MEACorpus [68] is a multimodal database created in 2022, containing 5129 audio segments extracted from 13.16 h of Spanish speech. The emotional distribution is based on Ekman’s basic emotions. The segments are derived from YouTube videos recorded spontaneously in natural settings such as political speeches, sports events, and entertainment shows. The database includes 46% female voices and 54% male voices, though specific details on speaker ages are not provided. The audios are in WAV format at a sampling rate of 44,100 Hz, with an average length of 9.24 s, segmented using a noise and silence threshold-based algorithm. Manual annotation was performed by three annotators using Ekman’s taxonomy.
  • EmoWisconsin.
    EmoWisconsin [66] was created in 2011 in Mexican Spanish and contains 3098 segments of children’s speech, including 28 participants (17 boys and 11 girls) aged 7 to 13. The labeled emotions include six categories: Annoyed, Motivated, Nervous, Neutral, Doubtful, and Confident, along with the three continuous primitives of valence, arousal, and dominance. The recordings were made at 44,100 Hz in 16-bit PCM WAV format, totaling 11 h and 39 min of recordings across 56 sessions. Emotions were elicited using a modified version of the Wisconsin Card Sorting Test (WCST), with sessions divided into positive and negative interactions.
  • INTERFACE (INTER1SP).
    The INTERFACE [73] database was created in 2002 and contains 5520 samples in Spanish (INTER1SP), along with recordings in English, Slovenian, and French. Six emotions are included: anger, sadness, joy, fear, disgust, and surprise. Additionally, neutral speech styles were recorded in Spanish, including variations like soft, loud, slow, and fast. Recordings were made by one professional actor and one actress in each language. For validation, 18 non-professional listeners evaluated 56 statements through subjective tests. Recognition accuracy in the first choice exceeded 80%, rising to 90% when a second option was included. This database is available in the ELRA repository [79].
  • EmoFilm.
    EmoFilm [74] is a multilingual database created in 2018, designed to enrich underrepresented languages such as Spanish, Italian, and others. It consists of 1115 clips with an average length of 3.5 s, distributed as 360 clips in English, 413 in Italian, and 342 in Spanish. This dataset includes five emotions: anger, sadness, happiness, fear, and contempt. For this study, only the Spanish portion of EmoFilm was used (EmoFilm ES). Movie and TV series clips were extracted, capturing emotional expressions in acted contexts. Emotion annotation was performed through evaluations by multiple annotators, achieving high perceptual reliability.

3.2. Pre-Trained Models

In this work, we make use of a selection of PTMs, including self-supervised architectures, sequence-to-sequence frameworks, and multimodal contrastive learning approaches. Each hidden layer of these models is applied as a feature extractor to generate embeddings that capture speech representations relevant to the identification of both acoustic and paralinguistic features. The following sections detail the PTMs selected for this work, highlighting their architectures and pretraining methodologies.
  • Wav2Vec 2.0.
    Wav2Vec 2.0 [44] is a self-supervised framework that learns representations from raw audio by solving a contrastive task on latent masked speech samples. The architecture is based on a convolutional feature encoder that processes waveform inputs into latent speech representations, followed by a transformer context network that captures contextual information. A quantization module discretizes the latent representations for contrastive learning. The training objective maximizes the agreement between the real latent representations and the quantized versions for the masked time steps, effectively capturing local and global acoustic features without the need for transcribed data.
    We used several variants of Wav2Vec 2.0 in our experiments:
    • facebook/wav2vec2-large-robust-ft-libri-960h (W2V2-L-R-Libri960h): Pre-trained on 60,000 h of audio, including noise and telephony data, and fine-tuned on 960 h of the LibriSpeech corpus. This model consists of 24 layers and contains 317 million parameters.
    • jonatasgrosman/wav2vec2-xls-r-1b-spanish (W2V2-XLSR-ES): Based on the XLS-R architecture, this model was pre-trained on 436,000 h of multilingual data and fine-tuned on Spanish ASR tasks. It has 1 billion parameters and 48 transformer layers.
    • facebook/wav2vec2-large-xlsr-53-spanish (W2V2-L-XLSR53-ES): Pre-trained with 56,000 h of speech data from 53 languages including Spanish, and fine-tuned with Spanish ASR data. This model has 317 million parameters and 24 hidden layers.
  • Whisper.
    Whisper [47] is an encoder–decoder transformer model designed for ASR and speech translation tasks. Its training approach, using supervised data, allows the model to effectively generalize across languages and tasks, capturing both linguistic and paralinguistic features that can be essential for emotion recognition.
    • openai/whisper-large-v3 (Whisper-L-v3): This model contains 1.55 billion parameters, 32 encoder and decoder layers, and is trained on 680,000 h of multilingual data.
    • zuazo/whisper-large-v3-es (Whisper-L-v3-ES): This is an optimized version specifically tailored for Spanish ASR tasks, featuring 32 hidden layers.
  • HuBERT.
    HuBERT [45] is a PTM developed by Facebook, with a fully transformer-based architecture. It has been trained on 60,000 h of unlabeled audio data (the Libri-Light corpus) using a self-supervised learning paradigm similar to Wav2Vec 2.0, but optimized for masked prediction tasks within audio signals. In this study, a 24-layer, 316 million-parameter version, facebook/hubert-large-ll60k (HuBERT-L-1160k), has been employed.
  • WavLM.
    WavLM [46] incorporates a controlled relative position bias to model long-range dependencies in the speech signal, which improves the model’s ability to capture global contextual information. Pre-training uses a masked speech prediction task with continuous inputs, similar to the HuBERT and Wav2Vec 2.0 methodologies. For our work, we use the microsoft/wavlm-large (WavLM-L) model, which consists of 24 transformer layers and 300 million parameters.
  • TRILLsson.
    TRILLsson [75] is a compact and efficient model derived through the distillation of CAP12, a high-performance Conformer model trained with self-supervised objectives on roughly 900,000 h of audio from the YT-U dataset. Tailored for non-semantic tasks, TRILLsson captures paralinguistic features critical for emotion recognition, such as tone, pitch, and rhythm. Although it is 6x to 100x smaller than CAP12, it retains 90–96% of its performance. Its efficient architecture leverages local matching strategies to align smaller inputs with robust embeddings, enabling deployment on resource-constrained devices. Only the final embeddings can be obtained from this model, so, unlike the others, it is not analyzed layer-wise.
  • CLAP.
    CLAP [76] makes use of a dual-encoder architecture for audio and text modalities, trained through a contrastive loss that aligns audio and text embeddings in a shared latent space. The audio encoder processes audio inputs using the HTSAT architecture, while the text encoder processes textual descriptions. In our work, we use the laion/clap-htsat-unfused (L-CLAP-G) model, which, like TRILLsson, only yields final embeddings and not intermediate ones, so its analysis will not be layer-based either.
A summary of the PTMs evaluated in this work is provided in Table 3. This table highlights key details, including the training datasets, parameter counts, and language coverage.

3.3. Layer-Wise Evaluation of PTMs

A systematic layer-wise evaluation has been performed in this work in order to identify the most effective representations for SER. The approach examines each layer of the PTMs sequentially, extracting embeddings, starting from the earliest layers and moving toward the deepest layers. The complete flow of the layer-wise evaluation is presented in Figure 1.

3.4. Embedding Extraction and Mean Pooling

For each PTM and speech database, embeddings are extracted sequentially and then mean pooled. This transformation condenses the temporal dynamics into a single vector while preserving the essential statistical properties of the features. Formally, let $\mathbf{E} \in \mathbb{R}^{T \times D}$ be the embedding matrix of an audio sample, where $T$ is the number of time steps and $D$ the dimensionality of the features. The mean-pooled embedding $\mathbf{e} \in \mathbb{R}^{D}$ is computed as:

$$\mathbf{e} = \frac{1}{T} \sum_{t=1}^{T} \mathbf{E}[t],$$

where $\mathbf{E}[t] \in \mathbb{R}^{D}$ corresponds to the feature vector at time step $t$. This operation compresses the $T \times D$ sequence into a single $1 \times D$ vector, facilitating compatibility with downstream classifiers. Mean pooling has been consistently employed in SER tasks involving PTM embeddings [52,54,80], offering a simple yet effective way to aggregate temporal information. Models like CLAP and TRILLsson are excluded from this step, as they already integrate internal mechanisms for temporal summarization or lack a sequential structure altogether.
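As an illustration of this extraction and pooling step, the following minimal Python sketch loads one of the Wav2Vec 2.0 checkpoints listed above through the Hugging Face transformers library and returns one mean-pooled vector per hidden layer. The tooling, the helper name layerwise_mean_pooled_embeddings, and the 16 kHz resampling are assumptions made for illustration, not the exact pipeline used in this work.

```python
# Minimal sketch (assumed tooling): extract hidden-layer embeddings from a PTM
# and mean pool them over time into one fixed-dimensional vector per layer.
import torch
import librosa
from transformers import AutoFeatureExtractor, AutoModel

MODEL_ID = "facebook/wav2vec2-large-xlsr-53-spanish"  # one of the PTMs listed above

feature_extractor = AutoFeatureExtractor.from_pretrained(MODEL_ID)
model = AutoModel.from_pretrained(MODEL_ID, output_hidden_states=True).eval()

def layerwise_mean_pooled_embeddings(wav_path: str) -> list:
    """Return one mean-pooled (D-dimensional) embedding per hidden layer."""
    waveform, sr = librosa.load(wav_path, sr=16_000)  # Wav2Vec 2.0 expects 16 kHz audio
    inputs = feature_extractor(waveform, sampling_rate=sr, return_tensors="pt")
    with torch.no_grad():
        outputs = model(**inputs)
    # outputs.hidden_states is a tuple of (1, T, D) tensors, one per layer
    return [h.squeeze(0).mean(dim=0) for h in outputs.hidden_states]
```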

3.5. Data Partitioning

The dataset is partitioned into training and test sets using 5-fold stratified cross-validation, ensuring that the distribution of emotion classes is preserved within each fold to address class imbalance issues [81]. Given a dataset $\mathcal{D} = \{(x_i, y_i)\}_{i=1}^{N}$, where $x_i$ is the feature vector and $y_i$ its label, the data is partitioned into $K$ folds $\{\mathcal{D}_1, \ldots, \mathcal{D}_K\}$ such that the class distribution within each fold matches the overall class distribution of $\mathcal{D}$. For each iteration $k \in \{1, \ldots, K\}$, the training and test sets are defined as:

$$\mathcal{D}_{\text{train}}^{(k)} = \bigcup_{i \neq k} \mathcal{D}_i, \qquad \mathcal{D}_{\text{test}}^{(k)} = \mathcal{D}_k.$$

Partitioning is achieved using Scikit-learn’s ‘StratifiedKFold’ method [82], which splits the data such that for each fold $k$, the class proportions in $\mathcal{D}_{\text{test}}^{(k)}$ and $\mathcal{D}_{\text{train}}^{(k)}$ are consistent with the original dataset.
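A minimal sketch of this partitioning, assuming the mean-pooled embeddings and emotion labels are already stored in NumPy arrays (the array names and shapes below are placeholders, not the actual data):

```python
import numpy as np
from sklearn.model_selection import StratifiedKFold

X = np.random.rand(200, 1024)      # placeholder mean-pooled embeddings (N x D)
y = np.random.randint(0, 6, 200)   # placeholder emotion labels

# 5-fold stratified split: each fold preserves the overall class proportions
skf = StratifiedKFold(n_splits=5, shuffle=True, random_state=42)
for fold, (train_idx, test_idx) in enumerate(skf.split(X, y)):
    X_train, X_test = X[train_idx], X[test_idx]
    y_train, y_test = y[train_idx], y[test_idx]
    # a classifier would be trained and evaluated on this fold here
```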

3.6. LOSO Validation

LOSO validation is employed to assess model generalization to unseen speakers. In this approach, data from a single speaker is reserved as the testing set while the model is trained on data from all remaining speakers. Let $\mathcal{D} = \{\mathcal{D}_1, \mathcal{D}_2, \ldots, \mathcal{D}_S\}$ represent the dataset, where $\mathcal{D}_i$ contains all samples from speaker $i$, and $S$ is the total number of speakers. For the $i$-th iteration, the training and testing sets are defined as:

$$\mathcal{D}_{\text{train}}^{(i)} = \bigcup_{j \neq i} \mathcal{D}_j, \qquad \mathcal{D}_{\text{test}}^{(i)} = \mathcal{D}_i.$$

Performance metrics are computed separately for each speaker and subsequently averaged to obtain an overall evaluation. To ensure reliable evaluation, speakers with fewer than 15 samples are excluded. The workflow of the LOSO evaluation is represented in Figure 2.
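A sketch of this protocol using Scikit-learn’s LeaveOneGroupOut with speaker IDs as group labels; the helper name loso_splits, its arguments, and the filtering step are illustrative assumptions, on the premise that a per-sample speaker array is available:

```python
import numpy as np
from sklearn.model_selection import LeaveOneGroupOut

def loso_splits(X, y, speakers, min_samples=15):
    """Yield (train, test) splits where each test set holds out one speaker."""
    X, y, speakers = np.asarray(X), np.asarray(y), np.asarray(speakers)
    # drop speakers with fewer than `min_samples` samples, as described above
    ids, counts = np.unique(speakers, return_counts=True)
    keep = np.isin(speakers, ids[counts >= min_samples])
    X, y, speakers = X[keep], y[keep], speakers[keep]
    for train_idx, test_idx in LeaveOneGroupOut().split(X, y, groups=speakers):
        yield (X[train_idx], y[train_idx]), (X[test_idx], y[test_idx])
```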

3.7. Classifier Training

The embeddings extracted from the PTMs serve as inputs for three classifiers: KNN, SVM, and MLP. These models are simple and well-established in the SER literature [21,23,25], allowing for interpretable results and isolating the contribution of embeddings without the added complexity of end-to-end systems. Additionally, although more complex models could be used, the focus of this study is not on maximizing computational power but rather on evaluating the quality of the embeddings. Therefore, using lightweight models reduces computational requirements and allows for a more focused analysis, while also ensuring reproducibility and mitigating overfitting risks.
Hyperparameter optimization is performed using a grid or randomized search with cross-validation. The classification algorithms used are described below, followed by a brief sketch of the training pipeline.
  • K-Nearest Neighbors.
    The KNN algorithm is a non-parametric and lazy learning method used for classification and regression. KNN classification is performed by identifying the K nearest neighbors to a query point using a distance metric, such as the Euclidean distance. The optimal value of K and the nearest neighbors are determined through a search process that can involve techniques like Ktree and K*tree, where neighbor calculations are optimized using tree-like structures. Advanced methods, such as the one-step KNN, reduce neighbor calculation to a single matrix operation that integrates both K adjustment and neighbor search, using least squares loss functions and sparse regularization (like group lasso) to generate a relationship matrix containing the optimal neighbors and their corresponding weights [83].
  • Support Vector Machine.
    SVMs are supervised learning algorithms used for classification and regression, designed to identify the optimal separating hyperplane in a high-dimensional space that maximizes the margin between points from different classes. For non-linear problems, SVM employs kernel functions to project data into higher-dimensional spaces where linear separation becomes feasible. The maximum margin is expressed as $\frac{1}{\|\mathbf{w}\|}$, where $\mathbf{w}$ is the weight vector. The optimization problem is formulated as minimizing $\|\mathbf{w}\|^2$ under correct classification constraints, and in cases where the margin cannot be strictly maintained, slack variables are introduced to allow some classification flexibility [84,85].
  • Multi-Layer Perceptron.
    MLPs are artificial neural networks consisting of interconnected layers of nodes, where each node in the intermediate and output layers applies a non-linear activation function (such as the sigmoid or hyperbolic tangent). MLP training is typically performed using backpropagation, which adjusts the weights via gradient descent to minimize the mean squared error (MSE) between predicted and expected outputs. The mathematical model of a two-layer MLP can be expressed as:

    $$y = f(\mathbf{W}_2 \cdot f(\mathbf{W}_1 \cdot \mathbf{x} + \mathbf{b}_1) + \mathbf{b}_2),$$

    where $\mathbf{W}_1$ and $\mathbf{W}_2$ are weight matrices, $\mathbf{b}_1$ and $\mathbf{b}_2$ are bias vectors, and $f$ is the non-linear activation function [86].
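As a minimal sketch of the training pipeline referenced above, the snippet below wraps each of the three classifiers in a cross-validated grid search over frozen PTM embeddings; the hyperparameter grids, feature scaling, and macro-F1 scoring are illustrative assumptions, not the exact search spaces used in this work.

```python
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.neighbors import KNeighborsClassifier
from sklearn.svm import SVC
from sklearn.neural_network import MLPClassifier

# illustrative search spaces for the three classifiers used in this study
classifiers = {
    "knn": (KNeighborsClassifier(), {"kneighborsclassifier__n_neighbors": [3, 5, 7, 11]}),
    "svm": (SVC(), {"svc__C": [0.1, 1, 10], "svc__kernel": ["rbf", "linear"]}),
    "mlp": (MLPClassifier(max_iter=500), {"mlpclassifier__hidden_layer_sizes": [(128,), (256, 128)]}),
}

def fit_best(X_train, y_train, name):
    """Grid-search one classifier on frozen PTM embeddings and return the best model."""
    estimator, grid = classifiers[name]
    pipe = make_pipeline(StandardScaler(), estimator)
    search = GridSearchCV(pipe, grid, scoring="f1_macro", cv=5)
    search.fit(X_train, y_train)
    return search.best_estimator_
```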

3.8. Calculation of Performance Metrics

Three metrics—accuracy, F1-score, and unweighted average recall (UAR)—are used to evaluate the performance of the models. These metrics follow the definitions provided in Scikit-learn [82] and are defined as follows:
$$\text{Accuracy} = \frac{\text{Number of Correct Predictions}}{\text{Total Number of Predictions}} = \frac{1}{N} \sum_{i=1}^{N} \mathbb{1}(y_i = \hat{y}_i),$$

where $N$ is the total number of samples, $y_i$ is the true label, and $\hat{y}_i$ is the predicted label. The function $\mathbb{1}(\cdot)$ is the indicator function, which returns 1 if the condition inside is true, and 0 otherwise.

$$\text{F1-score} = \frac{2 \cdot \text{Precision} \cdot \text{Recall}}{\text{Precision} + \text{Recall}}, \quad \text{where} \quad \text{Precision} = \frac{TP}{TP + FP}, \quad \text{Recall} = \frac{TP}{TP + FN}.$$

$$\text{UAR (Unweighted Average Recall)} = \frac{1}{C} \sum_{c=1}^{C} \frac{TP_c}{TP_c + FN_c},$$

where $C$ is the number of classes, $TP_c$ is the number of true positives for class $c$, and $FN_c$ is the number of false negatives for class $c$. UAR is equivalent to the macro-averaged recall in Scikit-learn.
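These definitions map directly onto Scikit-learn’s accuracy_score, f1_score, and recall_score, as in the short sketch below; the macro averaging for the F1-score is an assumption chosen here for consistency with UAR, not necessarily the exact configuration used in this work.

```python
from sklearn.metrics import accuracy_score, f1_score, recall_score

def ser_metrics(y_true, y_pred):
    """Accuracy, F1-score, and UAR (macro-averaged recall) for one evaluation."""
    return {
        "accuracy": accuracy_score(y_true, y_pred),
        "f1": f1_score(y_true, y_pred, average="macro"),      # averaging choice is an assumption
        "uar": recall_score(y_true, y_pred, average="macro"),  # UAR equals macro recall
    }
```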

4. Results

4.1. Layer-Wise Evaluation Results

Table 4 presents a summary of the performance results of the PTMs evaluated on each of the databases. This table shows the best score achieved (accuracy, F1-score, and UAR) by each classifier employed, where the subscript indicates the hidden layer corresponding to the obtained score.
Based on the highest values (highlighted in bold in Table 4), we observe that the INTER1SP database achieves the highest scores using the HuBERT-L-1160k PTM and the SVM classifier, with an accuracy of 99.83% (layer 6), an F1-score of 99.83% (layer 6), and a UAR of 99.88% (layer 6). The MESD database also performs strongly with the W2V2-XLSR-ES PTM and the MLP classifier, achieving 95.38% accuracy, 95.40% F1-score, and 95.38% UAR (all at layer 6). The EmoFilm ES database achieves 91.67% accuracy, 91.52% F1-score, and 91.83% UAR (layer 22) using the Whisper-L-v3-ES PTM and the MLP classifier. In contrast, the EmoWisconsin database, due to its complexity and limited emotional range, shows lower and less stable performance. Its best accuracy is 62.50% (layer 15) with W2V2-XLSR-ES and SVM, the highest F1-score is 58.92% (layer 31) with Whisper-L-v3-ES and MLP, and the highest UAR is 40.0% (layer 14) with W2V2-XLSR-ES and MLP.
Table 5 provides a concise summary of the best-performing configurations for each database. It highlights the specific PTM, the layer used, and the corresponding classifier that achieved the highest performance. The table serves as a clear reference, showing the respective accuracy, F1-score, and UAR values on the datasets.

4.1.1. Average Metric Values by Layer per Database and PTM

The results in Figure 3 and Figure 4 reveal distinct layer-wise performance trends among the evaluated models. Models like WavLM-L, W2V2-L-XLSR53-ES, and W2V2-L-R-Libri960h, each with 24 layers, achieve their highest F1-scores and UAR in the earlier layers (e.g., layers 3–6), demonstrating early-layer dominance. In contrast, Whisper-L-v3 and Whisper-L-v3-ES (32 layers) exhibit peak performance in intermediate layers, with optimal F1-scores observed in layers 13–15 and UAR in layers 12–14. Similarly, the 48-layer W2V2-XLSR-ES achieves its best F1-scores in layers 9–11, while UAR performance shifts slightly to layers 5, 10, and 11, indicating metric-specific nuances. Models like TRILLsson and L-CLAP-G deviate from this trend, providing fixed embeddings rather than layer-specific outputs, represented as constant values in the figures.
For all models, the F1-score decreases significantly in the final layers. This suggests that as the network depth increases, the representations become more specialized and might be less effective for emotion classification tasks. Also, the fact that certain specific layers optimize classification indicates a sweet spot where the representation balances both abstraction and discrimination of emotional features.

4.1.2. Database Performance Comparison

The visualizations in Figure 5 and Figure 6 illustrate the maximum F1-score and UAR achieved by each PTM across the six databases, complementing the detailed numerical results in Table 4. These results provide a clear overview of the PTM-specific strengths and their adaptability to different datasets.
The three databases with the highest average performance, both in F1-score and UAR, are INTER1SP, MESD, and MEACorpus, while EmoFilm ES and EM-SpDB show lower performance with greater variability between the results of different PTMs. On the other hand, EmoWisconsin is the database with the lowest performance, in addition to showing instability depending on the chosen feature extractor and greater variability between the F1-score and UAR results.

4.2. LOSO Validation Results

To assess the generalization capacity of the models, a LOSO validation was performed using the optimal layer determined by the F1-score analysis. The results, summarized in Table 6, provide accuracy, UAR, and F1-score metrics for each classifier across all databases and PTMs. The bold values in the table indicate the highest accuracy, F1-score, and UAR achieved for each database in the LOSO setup.

4.2.1. Performance Insights in LOSO Validation

The LOSO validation results, summarized in Figure 7 and Figure 8, reveal that the best F1-scores differ substantially across databases. For instance, while EM-SpDB reaches its highest F1 of 78.48% with TRILLsson, EmoFilm ES achieves 83.86% using W2V2-XLSR-ES. These two datasets, generally larger and more diverse, appear to foster better generalization to unseen speakers. In contrast, other databases show notably lower peaks; EmoWisconsin, for example, reaches a maximum F1 of just 50.75% (W2V2-XLSR-ES), and MESD tops out at 50.59% (W2V2-L-XLSR53-ES).
A similar picture emerges for UAR. EM-SpDB obtains its best UAR of 73.24% with W2V2-L-XLSR53-ES, and EmoFilm ES is close at 72.68% using W2V2-XLSR-ES. Meanwhile, EmoWisconsin shows a maximum UAR of only 34.71% (W2V2-XLSR-ES), highlighting its higher difficulty for speaker-independent modeling. In INTER1SP, the top F1 (67.44%) and UAR (65.99%) both stem from W2V2-L-XLSR53-ES, suggesting that certain PTMs can still adapt successfully to more constrained contexts.
These findings underscore the importance of dataset size and diversity in PTM generalization. EM-SpDB and EmoFilm ES , featuring a broader emotional range and greater speaker variety, achieve higher F1-score and UAR ceilings. Meanwhile, smaller or highly focused corpora like MESD (peak F1 below 51%) or INTER1SP (below 68%) pose greater challenges. Notably, TRILLsson competes strongly (for example, 78.48% F1 in EM-SpDB) despite lacking explicit Spanish fine-tuning, highlighting the power of robust paralinguistic representations. W2V2-L-XLSR53-ES and W2V2-XLSR-ES also stand out, reinforcing the benefits of language-specific training for Spanish SER tasks.

4.2.2. Average Performance Metrics

Figure 9 and Figure 10 depict the average F1-score and UAR in LOSO validation. Consistent with the maximum metrics, W2V2-L-XLSR53-ES stands out among the top across databases, while TRILLsson displays robust cross-linguistic adaptability. W2V2-XLSR-ES also demonstrates high average performance, further emphasizing the benefits of Spanish fine-tuning. By contrast, L-CLAP-G and Whisper-based models exhibit lower overall averages, although they can remain competitive in specific scenarios.
In general, the consistent performance of W2V2-L-XLSR53-ES suggests that Spanish-oriented training helps capture crucial emotional cues, while TRILLsson’s paralinguistic focus maintains remarkable versatility for Spanish datasets. L-CLAP-G and Whisper-based approaches, though potentially strong for ASR or multimodal tasks, appear less optimal for Spanish SER, especially under LOSO’s stringent speaker-independent conditions.

5. Discussion and Comparative Analysis

5.1. PTMs Against SOTA for SER in Spanish

The layer-wise results, summarized in Table 7 (the bold values in the table indicate the highest accuracy, F1-score, and UAR in the SOTA for each database), reveal substantial improvements over SOTA benchmarks across multiple datasets, driven by the use of Spanish fine-tuned PTMs. In INTER1SP, HuBERT-L-1160k surpasses the SOTA accuracy by 1.43%, and is joined by W2V2-L-XLSR53-ES, W2V2-XLSR-ES, WavLM-L, W2V2-L-R-Libri960h, Whisper-L-v3, and Whisper-L-v3-ES, all of which also exceed the SOTA benchmark. In EM-SpDB, W2V2-L-XLSR53-ES achieves a remarkable 19.98% accuracy gain over the SOTA, highlighting the benefits of fine-tuning for Spanish. Whisper-L-v3-ES delivers competitive results in EmoFilm ES, excelling in datasets with natural emotional expressions, while MESD is the only dataset where the proposed method falls slightly short, with a minimal gap of 0.62% in F1-score. These findings demonstrate that feature extraction techniques with PTMs, in particular those fine-tuned to the target language, consistently outperform traditional approaches on diverse emotional datasets.
The LOSO validation results, summarized in Table 8, highlight the strong generalization capabilities of TRILLsson and the consistent superiority of Spanish fine-tuned PTMs in challenging evaluation settings. TRILLsson achieves the best accuracy (79.66%) and F1-score (78.48%) in EM-SpDB, outperforming the SOTA benchmark of 68.30%, even from non-LOSO evaluations, a remarkable feat given the stringent LOSO conditions. Similarly, W2V2-L-XLSR53-ES surpasses the SOTA in EmoWisconsin, achieving 54.85% accuracy and demonstrating the effectiveness of fine-tuned embeddings in datasets with induced emotions and high variability. Moreover, TRILLsson also leads in MEACorpus with 75.85% accuracy and 71.64% F1-score, demonstrating its good adaptation in diverse datasets, while W2V2-L-XLSR53-ES and W2V2-XLSR-ES maintain strong performance across other databases, including INTER1SP and EmoFilm ES.

5.2. Performance Comparison Between PTMs

The 24-layer models (W2V2-L-XLSR53-ES, HuBERT-L-1160k, WavLM-L, W2V2-L-R-Libri960h) perform optimally at the early layers (4–6), effectively capturing crucial linguistic and prosodic features. This finding aligns with previous observations that lower transformer layers excel at extracting short-range acoustic cues, pitch, and formant structure [54,55], while deeper layers increasingly emphasize broader contextual abstractions [53,87]. Similarly, larger models (Whisper-L-v3, Whisper-L-v3-ES, W2V2-XLSR-ES) show their peak performance around mid-level layers (9–15), consistent with [51]. In none of these cases do final layers consistently outperform earlier or middle layers, indicating that the richest emotional representations often emerge before the last layers.
TRILLsson shows remarkable adaptability on speaker-independent tasks, such as LOSO validation, achieving SOTA results on datasets such as MEACorpus and EM-SpDB. This is in agreement with [52], who highlighted the effectiveness of TRILLsson in capturing critical paralinguistic features for SER. Furthermore, the improvements observed through fine-tuning, in particular for W2V2-L-XLSR53-ES, confirmed the importance of domain-specific adaptations, as emphasized by [55].

5.3. Impact of Database Nature on Emotion Recognition Performance

The nature of emotional databases—acted, natural, and induced—strongly affects within-database performance and generalization in speaker-independent scenarios. Acted databases, like INTER1SP, EM-SpDB, and MESD, often show high accuracy in layer-wise cross-validation due to expressive recordings. However, the LOSO protocol significantly lowers performance, as seen in MESD, where accuracy drops from 95.38% to 53.12%. This highlights the challenge of generalizing exaggerated emotions to unseen speakers.
Natural (MEACorpus) and induced (EmoWisconsin) databases show lower layer-wise scores but maintain more stable performance under LOSO. MEACorpus retains around 70–75% F1-score in LOSO despite reaching 92% in the layer-wise model, suggesting that genuine expressions generalize better. In contrast, EmoWisconsin struggles with LOSO, usually staying below a 60% F1-score, indicating that elicited emotions are more variable and speaker specific. These patterns confirm that acted data achieve high peak accuracy that drops in speaker-independent contexts, while natural or induced data have lower peaks but steadier generalization.
The results also indicate that the number of speakers alone is not the only variable impacting performance. Databases with few speakers, like INTER1SP (2) and MESD (3), perform well in the layer-wise model but decline in LOSO. EM-SpDB (50 speakers) shows consistent results (88.28% layer-wise, 79.66% LOSO), suggesting that diversity improves generalization. Intermediate datasets, like EmoWisconsin (27 speakers), vary more, with a 58.92% F1-score in the layer-wise model, dropping to 50.75% in LOSO. Thus, while speaker count matters, database nature and emotion variability are more critical for generalization.

5.4. Fine-Tuning and Practical Applications

Recent research indicates that fully fine-tuning large PTMs is not always essential to achieve competitive performance in classification tasks, even beyond speech-related domains [88]. In our experiments, the direct approach of freezing Spanish PTMs and using them solely as feature extractors already surpasses or matches existing SER benchmarks across multiple datasets. This observation aligns with the findings of [88], which indicate that freezing BERT can outperform fine-tuning it for classification, and that the best-performing models can be trained by freezing the main model, substantially reducing training time. Given the considerable resources required for large-scale fine-tuning, our results suggest that, particularly for Spanish SER tasks, the additional cost of exhaustive model adaptation is not strictly necessary to achieve state-of-the-art performance.
Furthermore, our study highlights the remarkable performance of TRILLsson in speaker-independent contexts (LOSO), consistent with previous conclusions about its robustness in capturing paralinguistic cues [52]. TRILLsson’s versatility has also been demonstrated in various clinical and diagnostic applications, such as cognitive impairment detection [89], clinical speech AI [90], and Parkinson’s disease detection [91]. This capability to extract nuanced emotional information is equally beneficial for real-world SER applications. For instance, emotion recognition can be integrated into call centers to automate customer support or implemented in healthcare systems to monitor the psychological state of patients over time and facilitate the early detection of mood disorders.

6. Conclusions

The findings of this study confirm that frozen PTMs used solely as feature extractors can achieve high performance in Spanish SER, matching and even surpassing SOTA work, without the need for extensive task-specific fine-tuning. In particular, 24-layer architectures tend to capture crucial acoustic and linguistic features in earlier layers (4–6), while deeper architectures peak at intermediate layers (9–15). Moreover, our approach outperforms previously reported techniques in Spanish SER, attaining F1-scores of 99.83% on INTER1SP, 88.32% on EM-SpDB, 92.53% on MEACorpus, and 58.92% on EmoWisconsin.
Spanish-specific fine-tuned models such as W2V2-L-XLSR53-ES (24 layers) and W2V2-XLSR-ES (48 layers) yielded the best overall scores, indicating the suitability of Wav2Vec 2.0-based architectures for SER regardless of model depth. Furthermore, TRILLsson’s notable generalization performance highlights its strong applicability for speaker-independent and real-world scenarios.
Equally important, our results underscore that performance heavily depends on dataset characteristics, including the number of speakers, emotional variability, and whether emotions are natural, induced, or acted, all of which substantially impact the generalization of SER systems in practical environments. Finally, through a multi-dataset, layer-wise, and LOSO-based evaluation, we demonstrated how specific layers provide the richest emotional representations based on architecture size, opening new avenues for efficient, scalable SER solutions in Spanish.
Despite the strong performance reported, several limitations remain. First, we did not systematically evaluate computational cost or energy consumption, both of which are critical for real-time or on-device SER. Second, our analysis focused on Ekman’s discrete emotional categories, potentially oversimplifying the spectrum of human affects. Third, we relied on the mean-pooling of embeddings, which might overlook salient temporal cues. Moreover, we did not apply formal significance tests (e.g., ANOVA or t-tests) to determine whether performance differences between PTMs are statistically meaningful.
Based on the above observations, we outline several key recommendations. First, carefully evaluate the trade-off between freezing and fully fine-tuning large PTMs, as well as the option of adopting more compact architectures (e.g., TRILLsson), which can ease computational overhead without critically sacrificing accuracy. Second, integrate statistical significance tests into system evaluations, thus enabling data-driven decisions about model configurations and ensuring that observed performance differences are both numerically and statistically meaningful. Third, go beyond discrete emotion taxonomies (e.g., Ekman’s) by considering more nuanced emotional constructs for a richer perspective on affective states. Finally, carefully plan or select SER databases by balancing the number of speakers, emotional variability, and the nature of the data (acted, induced, or natural), as these aspects could substantially influence performance in real-world scenarios.
For future work, we identify three promising directions. One is to extend these insights for truly multilingual SER, integrating or pre-training in languages beyond Spanish and refining cross-lingual generalization. Another direction is to adapt and validate SER systems in real-world scenarios—such as call centers or clinical settings—for practical, scalable solutions. Finally, the development of multi-output models capable of detecting multiple emotional states in the same audio file is needed to address the simultaneous nature of human emotions. Such advancements will propel SER research and broaden its impact across diverse application domains.

Author Contributions

Investigation: A.M. and G.D.-A.; supervision: A.D.-C., H.V.-L. and L.H.-M.; methodology: A.M.; validation: A.M., A.D.-C. and L.H.-M.; visualization: J.H.-C., A.F.J.-A. and J.P.-J.-F.; resources: G.D.-A., H.V.-L., A.F.J.-A. and J.P.-J.-F.; writing—original draft preparation: A.M.; writing—review and editing: A.M., G.D.-A., H.V.-L., J.H.-C. and L.H.-M. All authors have read and agreed to the published version of the manuscript.

Funding

Alex Mares and Gerardo Diaz-Arango gratefully acknowledge the financial support provided by the Secretariat of Science, Humanities, Technology, and Innovation (SECIHTI) through academic scholarships under contracts 1282871 and 480421, respectively.

Institutional Review Board Statement

Not applicable.

Informed Consent Statement

Not applicable.

Data Availability Statement

Restrictions apply to the availability of these data. Data were obtained from multiple sources, including the EmoWisconsin dataset [66], EmoMatchSpanishDB [64], The Mexican Emotional Speech Database [72], Spanish MEACorpus [68], Interface Databases [73], and the Spanish subset of EmoFilm [74]. Access to these datasets is restricted and may require permission from the respective authors or institutions.

Acknowledgments

We would like to thank Humberto Pérez-Espinosa, Carlos Alberto Reyes-García, and Luis Villaseñor-Pineda, authors of “EmoWisconsin: An Emotional Children’s Speech Database in Mexican Spanish” (Affective Computing and Intelligent Interaction, pp. 62–71, Springer Berlin Heidelberg, 2011, ISSN 0302-9743). Additionally, we extend our gratitude to the authors and institutions responsible for the databases that made this research possible: EmoMatchSpanishDB [64], The Mexican Emotional Speech Database (MESD) [72], Spanish MEACorpus [68], Interface Databases [73], and the Spanish subset of EmoFilm [74].

Conflicts of Interest

The authors declare no conflicts of interest.

Abbreviations

The following abbreviations are used in this manuscript:
CNNs  Convolutional Neural Networks
DS-AM  DeepSpectrum model with Attention Mechanism
EM-SpDB  EmoMatchSpanishDB
GMMs  Gaussian Mixture Models
GRUs  Gated Recurrent Units
HMMs  Hidden Markov Models
KNN  K-Nearest Neighbor
LOSO  Leave-One-Speaker-Out
LSTM  Long Short-Term Memory
MESD  Mexican Emotional Speech Database
MFCCs  Mel-Frequency Cepstral Coefficients
MLP  Multi-Layer Perceptron
P-TAPT  Pseudo-label Task Adaptive Pretraining
PTMs  Pre-Trained Models
RNNs  Recurrent Neural Networks
SER  Speech Emotion Recognition
SSL  Self-Supervised Learning
SOTA  State-Of-The-Art
SVM  Support Vector Machine
UAR  Unweighted Average Recall
WCST  Wisconsin Card Sorting Test

References

  1. Geethu, V.; Vrindha, M.K.; Anurenjan, P.R.; Deepak, S.; Sreeni, K.G. Speech Emotion Recognition, Datasets, Features and Models: A Review. In Proceedings of the 2023 International Conference on Control, Communication and Computing (ICCC), Thiruvananthapuram, India, 19–21 May 2023; pp. 1–6. [Google Scholar]
  2. Lee, C.M.; Narayanan, S. Toward detecting emotions in spoken dialogs. IEEE Trans. Speech Audio Process. 2005, 13, 293–303. [Google Scholar] [CrossRef]
  3. Shah Fahad, M.; Ranjan, A.; Yadav, J.; Deepak, A. A survey of speech emotion recognition in natural environment. Digit. Signal Process. 2021, 110, 102951. [Google Scholar] [CrossRef]
  4. Jahangir, R.; Teh, Y.W.; Hanif, F.; Mujtaba, G. Deep learning approaches for speech emotion recognition: State of the art and research challenges. Multimed. Tools Appl. 2021, 80, 23745–23812. [Google Scholar] [CrossRef]
  5. Khare, S.K.; Blanes-Vidal, V.; Nadimi, E.S.; Acharya, U.R. Emotion recognition and artificial intelligence: A systematic review (2014–2023) and research recommendations. Inf. Fusion 2024, 102, 102019. [Google Scholar] [CrossRef]
  6. Luo, B.; Lau, R.Y.K.; Li, C.; Si, Y.W. A critical review of state-of-the-art chatbot designs and applications. WIREs Data Min. Knowl. Discov. 2022, 12, e1434. [Google Scholar] [CrossRef]
  7. Chen, L.; Su, W.; Feng, Y.; Wu, M.; She, J.; Hirota, K. Two-layer fuzzy multiple random forest for speech emotion recognition in human-robot interaction. Inf. Sci. 2020, 509, 150–163. [Google Scholar] [CrossRef]
  8. Liu, Z.; Hu, B.; Li, X.; Liu, F.; Wang, G.; Yang, J. Detecting Depression in Speech Under Different Speaking Styles and Emotional Valences. In Proceedings of the Brain Informatics, Beijing, China, 16–18 November 2017. [Google Scholar]
  9. France, D.; Shiavi, R.; Silverman, S.; Silverman, M.; Wilkes, M. Acoustical properties of speech as indicators of depression and suicidal risk. IEEE Trans. Biomed. Eng. 2000, 47, 829–837. [Google Scholar] [CrossRef]
  10. Li, H.C.; Pan, T.; Lee, M.H.; Chiu, H.W. Make Patient Consultation Warmer: A Clinical Application for Speech Emotion Recognition. Appl. Sci. 2021, 11, 4782. [Google Scholar] [CrossRef]
  11. Schuller, B.W. Speech emotion recognition: Two decades in a nutshell, benchmarks, and ongoing trends. Commun. ACM 2018, 61, 90–99. [Google Scholar] [CrossRef]
  12. Hansen, J.H.; Cairns, D.A. ICARUS: Source generator based real-time recognition of speech in noisy stressful and Lombard effect environments. Speech Commun. 1995, 16, 391–422. [Google Scholar] [CrossRef]
  13. Cai, Y.; Li, X.; Li, J. Emotion Recognition Using Different Sensors, Emotion Models, Methods and Datasets: A Comprehensive Review. Sensors 2023, 23, 2455. [Google Scholar] [CrossRef] [PubMed]
  14. Yu, J.; Zhang, C.; Song, Y.; Cai, W. ICE-GAN: Identity-aware and Capsule-Enhanced GAN with Graph-based Reasoning for Micro-Expression Recognition and Synthesis. arXiv 2021, arXiv:2005.04370. [Google Scholar]
  15. Wang, Y.; Song, W.; Tao, W.; Liotta, A.; Yang, D.; Li, X.; Gao, S.; Sun, Y.; Ge, W.; Zhang, W.; et al. A systematic review on affective computing: Emotion models, databases, and recent advances. Inf. Fusion 2022, 83–84, 19–52. [Google Scholar] [CrossRef]
  16. Ekman, P.; Sorenson, E.R.; Friesen, W.V. Pan-Cultural Elements in Facial Displays of Emotion. Science 1969, 164, 86–88. [Google Scholar] [CrossRef]
  17. Darwin, C. The Expression Of The Emotions In Man And Animals; Oxford University Press: Oxford, UK, 1998. [Google Scholar] [CrossRef]
  18. Bălan, O.; Moise, G.; Petrescu, L.; Moldoveanu, A.; Leordeanu, M.; Moldoveanu, F. Emotion Classification Based on Biophysical Signals and Machine Learning Techniques. Symmetry 2020, 12, 21. [Google Scholar] [CrossRef]
  19. Zhou, E.; Zhang, Y.; Duan, Z. Learning Arousal-Valence Representation from Categorical Emotion Labels of Speech. In Proceedings of the ICASSP 2024—2024 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), Seoul, Republic of Korea, 14–19 April 2024; pp. 12126–12130. [Google Scholar]
  20. Wani, T.M.; Gunawan, T.S.; Qadri, S.A.A.; Kartiwi, M.; Ambikairajah, E. A Comprehensive Review of Speech Emotion Recognition Systems. IEEE Access 2021, 9, 47795–47814. [Google Scholar] [CrossRef]
  21. Doğdu, C.; Kessler, T.; Schneider, D.; Shadaydeh, M.; Schweinberger, S.R. A Comparison of Machine Learning Algorithms and Feature Sets for Automatic Vocal Emotion Recognition in Speech. Sensors 2022, 22, 7561. [Google Scholar] [CrossRef]
  22. Stuhlsatz, A.; Meyer, C.; Eyben, F.; Zielke, T.; Meier, H.G.; Schuller, B. Deep neural networks for acoustic emotion recognition: Raising the benchmarks. In Proceedings of the 2011 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), Prague, Czech Republic, 22–27 May 2011; pp. 5688–5691. [Google Scholar]
  23. Ke, X.; Zhu, Y.; Wen, L.; Zhang, W. Speech Emotion Recognition Based on SVM and ANN. Int. J. Mach. Learn. Comput. 2018, 8, 198–202. [Google Scholar] [CrossRef]
  24. Bhavan, A.; Chauhan, P.; Hitkul; Shah, R.R. Bagged support vector machines for emotion recognition from speech. Knowl.-Based Syst. 2019, 184, 104886. [Google Scholar] [CrossRef]
  25. Jain, M.; Narayan, S.; Balaji, P.; P, B.K.; Bhowmick, A.; R, K.; Muthu, R.K. Speech Emotion Recognition using Support Vector Machine. arXiv 2020, arXiv:2002.07590. [Google Scholar]
  26. Nwe, T.L.; Foo, S.W.; De Silva, L.C. Speech emotion recognition using hidden Markov models. Speech Commun. 2003, 41, 603–623. [Google Scholar] [CrossRef]
  27. Nwe, T.L.; Foo, S.W.; Silva, L.C.D. Detection of stress and emotion in speech using traditional and FFT based log energy features. In Proceedings of the Fourth International Conference on Information, Communications and Signal Processing, 2003 and the Fourth Pacific Rim Conference on Multimedia. Proceedings of the 2003 Joint, Singapore, 15–18 December 2003; Volume 3, pp. 1619–1623. [Google Scholar]
  28. Mao, S.; Tao, D.; Zhang, G.; Ching, P.C.; Lee, T. Revisiting Hidden Markov Models for Speech Emotion Recognition. In Proceedings of the ICASSP 2019—2019 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), Brighton, UK, 12–17 May 2019; pp. 6715–6719. [Google Scholar] [CrossRef]
  29. Neiberg, D.; Elenius, K.; Laskowski, K. Emotion recognition in spontaneous speech using GMMs. In Proceedings of the Interspeech, Brighton, UK, 6–10 September 2009. [Google Scholar]
  30. Koolagudi, S.G.; Devliyal, S.; Chawla, B.; Barthwal, A.; Rao, K.S. Recognition of Emotions from Speech using Excitation Source Features. Procedia Eng. 2012, 38, 3409–3417. [Google Scholar] [CrossRef]
  31. Bhakre, S.K.; Bang, A. Emotion recognition on the basis of audio signal using Naive Bayes classifier. In Proceedings of the 2016 International Conference on Advances in Computing, Communications and Informatics (ICACCI), Jaipur, India, 21–24 September 2016; pp. 2363–2367. [Google Scholar] [CrossRef]
  32. Khan, A.; Roy, U.K. Emotion recognition using prosodie and spectral features of speech and Naïve Bayes Classifier. In Proceedings of the 2017 International Conference on Wireless Communications, Signal Processing and Networking (WiSPNET), Chennai, India, 22–24 March 2017; pp. 1017–1021. [Google Scholar] [CrossRef]
  33. Younis, E.M.G.; Mohsen, S.; Houssein, E.H.; Ibrahim, O.A.S. Machine learning for human emotion recognition: A comprehensive review. Neural Comput. Appl. 2024, 36, 8901–8947. [Google Scholar] [CrossRef]
  34. Badshah, A.M.; Ahmad, J.; Rahim, N.; Baik, S.W. Speech Emotion Recognition from Spectrograms with Deep Convolutional Neural Network. In Proceedings of the 2017 International Conference on Platform Technology and Service (PlatCon), Busan, Republic of Korea, 13–15 February 2017; pp. 1–5. [Google Scholar]
  35. Mustaqeem; Kwon, S. A CNN-Assisted Enhanced Audio Signal Processing for Speech Emotion Recognition. Sensors 2020, 20, 183. [Google Scholar]
  36. Latif, S.; Rana, R.; Khalifa, S.; Jurdak, R.; Epps, J. Direct Modelling of Speech Emotion from Raw Speech. arXiv 2020, arXiv:1904.03833. [Google Scholar]
  37. Xie, Y.; Liang, R.; Liang, Z.; Huang, C.; Zou, C.; Schuller, B. Speech Emotion Classification Using Attention-Based LSTM. IEEE/ACM Trans. Audio Speech Lang. Process. 2019, 27, 1675–1685. [Google Scholar] [CrossRef]
  38. Mustaqeem; Sajjad, M.; Kwon, S. Clustering-Based Speech Emotion Recognition by Incorporating Learned Features and Deep BiLSTM. IEEE Access 2020, 8, 79861–79875. [Google Scholar] [CrossRef]
  39. Lee, J.; Tashev, I.J. High-level feature representation using recurrent neural network for speech emotion recognition. In Proceedings of the Interspeech, Dresden, Germany, 6–10 September 2015. [Google Scholar]
  40. Mao, Q.; Dong, M.; Huang, Z.; Zhan, Y. Learning Salient Features for Speech Emotion Recognition Using Convolutional Neural Networks. IEEE Trans. Multimed. 2014, 16, 2203–2213. [Google Scholar] [CrossRef]
  41. Lim, W.; Jang, D.; Lee, T. Speech emotion recognition using convolutional and Recurrent Neural Networks. In Proceedings of the 2016 Asia-Pacific Signal and Information Processing Association Annual Summit and Conference (APSIPA), Jeju, Republic of Korea, 13–15 December 2016; pp. 1–4. [Google Scholar] [CrossRef]
  42. Zhao, Z.; Zheng, Y.; Zhang, Z.; Wang, H.; Zhao, Y.; Li, C. Exploring Spatio-Temporal Representations by Integrating Attention-based Bidirectional-LSTM-RNNs and FCNs for Speech Emotion Recognition. In Proceedings of the Interspeech, Hyderabad, India, 2–6 September 2018. [Google Scholar]
  43. Qin, C.; Schlemper, J.; Caballero, J.; Price, A.; Hajnal, J.V.; Rueckert, D. Convolutional Recurrent Neural Networks for Dynamic MR Image Reconstruction. arXiv 2018, arXiv:1712.01751. [Google Scholar] [CrossRef]
  44. Baevski, A.; Zhou, H.; Mohamed, A.; Auli, M. wav2vec 2.0: A Framework for Self-Supervised Learning of Speech Representations. arXiv 2020, arXiv:2006.11477. [Google Scholar]
  45. Hsu, W.N.; Bolte, B.; Tsai, Y.H.H.; Lakhotia, K.; Salakhutdinov, R.; Mohamed, A. HuBERT: Self-Supervised Speech Representation Learning by Masked Prediction of Hidden Units. arXiv 2021, arXiv:2106.07447. [Google Scholar] [CrossRef]
  46. Chen, S.; Wang, C.; Chen, Z.; Wu, Y.; Liu, S.; Chen, Z.; Li, J.; Kanda, N.; Yoshioka, T.; Xiao, X.; et al. WavLM: Large-Scale Self-Supervised Pre-Training for Full Stack Speech Processing. IEEE J. Sel. Top. Signal Process. 2022, 16, 1505–1518. [Google Scholar] [CrossRef]
  47. Radford, A.; Kim, J.W.; Xu, T.; Brockman, G.; McLeavey, C.; Sutskever, I. Robust Speech Recognition via Large-Scale Weak Supervision. arXiv 2022, arXiv:2212.04356. [Google Scholar]
  48. Amiriparian, S.; Packań, F.; Gerczuk, M.; Schuller, B.W. ExHuBERT: Enhancing HuBERT Through Block Extension and Fine-Tuning on 37 Emotion Datasets. arXiv 2024, arXiv:2406.10275. [Google Scholar]
  49. Schneider, S.; Baevski, A.; Collobert, R.; Auli, M. wav2vec: Unsupervised Pre-training for Speech Recognition. arXiv 2019, arXiv:1904.05862. [Google Scholar]
  50. Wagner, J.; Triantafyllopoulos, A.; Wierstorf, H.; Schmitt, M.; Burkhardt, F.; Eyben, F.; Schuller, B.W. Dawn of the Transformer Era in Speech Emotion Recognition: Closing the Valence Gap. IEEE Trans. Pattern Anal. Mach. Intell. 2023, 45, 10745–10759. [Google Scholar] [CrossRef]
  51. Osman, M.; Kaplan, D.Z.; Nadeem, T. SER Evals: In-domain and Out-of-domain Benchmarking for Speech Emotion Recognition. arXiv 2024, arXiv:2408.07851. [Google Scholar]
  52. Phukan, O.C.; Kashyap, G.S.; Buduru, A.B.; Sharma, R. Are Paralinguistic Representations all that is needed for Speech Emotion Recognition? In Proceedings of the Interspeech, Kos, Greece, 1–5 September 2024; pp. 4698–4702. [Google Scholar] [CrossRef]
  53. Triantafyllopoulos, A.; Batliner, A.; Rampp, S.; Milling, M.; Schuller, B. INTERSPEECH 2009 Emotion Challenge Revisited: Benchmarking 15 Years of Progress in Speech Emotion Recognition. In Proceedings of the Interspeech, Kos, Greece, 1–5 September 2024; pp. 1585–1589. [Google Scholar] [CrossRef]
  54. Pepino, L.; Riera, P.; Ferrer, L. Emotion Recognition from Speech Using Wav2vec 2.0 Embeddings. arXiv 2021, arXiv:2104.03502. [Google Scholar]
  55. Chen, L.W.; Rudnicky, A. Exploring Wav2vec 2.0 Fine Tuning for Improved Speech Emotion Recognition. In Proceedings of the ICASSP 2023—2023 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), Rhodes Island, Greece, 4–10 June 2023; pp. 1–5. [Google Scholar] [CrossRef]
  56. Gao, Y.; Chu, C.; Kawahara, T. Two-stage Finetuning of Wav2vec 2.0 for Speech Emotion Recognition with ASR and Gender Pretraining. In Proceedings of the Interspeech, Dublin, Ireland, 20–24 August 2023; pp. 3637–3641. [Google Scholar] [CrossRef]
  57. Jaiswal, M.; Provost, E.M. Best Practices for Noise-Based Augmentation to Improve the Performance of Deployable Speech-Based Emotion Recognition Systems. arXiv 2023, arXiv:2104.08806. [Google Scholar]
  58. Schrüfer, O.; Milling, M.; Burkhardt, F.; Eyben, F.; Schuller, B. Are you sure? Analysing Uncertainty Quantification Approaches for Real-world Speech Emotion Recognition. arXiv 2024, arXiv:2407.01143. [Google Scholar]
  59. Ülgen Sönmez, Y.; Varol, A. In-depth investigation of speech emotion recognition studies from past to present—The importance of emotion recognition from speech signal for AI–. Intell. Syst. Appl. 2024, 22, 200351. [Google Scholar] [CrossRef]
  60. Sahu, G. Multimodal Speech Emotion Recognition and Ambiguity Resolution. arXiv 2019, arXiv:1904.06022. [Google Scholar]
  61. Kamble, K.; Sengupta, J. A comprehensive survey on emotion recognition based on electroencephalograph (EEG) signals. Multimed. Tools Appl. 2023, 82, 27269–27304. [Google Scholar] [CrossRef]
  62. Kerkeni, L.; Serrestou, Y.; Mbarki, M.; Raoof, K.; Mahjoub, M.A. Speech Emotion Recognition: Methods and Cases Study. In Proceedings of the 10th International Conference on Agents and Artificial Intelligence, ICAART 2018, Madeira, Portugal, 16–18 January 2018; Volume 1, pp. 175–182. [Google Scholar] [CrossRef]
  63. Ortega-Beltrán, E.; Cabacas-Maso, J.; Benito-Altamirano, I.; Ventura, C. Better Spanish Emotion Recognition In-the-wild: Bringing Attention to Deep Spectrum Voice Analysis. arXiv 2024, arXiv:2409.05148. [Google Scholar]
  64. Garcia-Cuesta, E.; Salvador, A.B.; Pãez, D.G. EmoMatchSpanishDB: Study of speech emotion recognition machine learning models in a new Spanish elicited database. Multimed. Tools Appl. 2023, 83, 13093–13112. [Google Scholar] [CrossRef]
  65. Begazo, R.; Aguilera, A.; Dongo, I.; Cardinale, Y. A Combined CNN Architecture for Speech Emotion Recognition. Sensors 2024, 24, 5797. [Google Scholar] [CrossRef]
  66. Pérez-Espinosa, H.; Reyes-García, C.A.; Villaseñor-Pineda, L. EmoWisconsin: An Emotional Children Speech Database in Mexican Spanish. In Affective Computing and Intelligent Interaction, Proceedings of the Fourth International Conference, ACII 2011, Memphis, TN, USA, 9–12 October 2011; D’Mello, S., Graesser, A., Schuller, B., Martin, J.C., Eds.; Springer: Berlin/Heidelberg, Germany, 2011; pp. 62–71. [Google Scholar]
  67. Casals-Salvador, M.; Costa, F.; India, M.; Hernando, J. BSC-UPC at EmoSPeech-IberLEF2024: Attention Pooling for Emotion Recognition. arXiv 2024, arXiv:2407.12467. [Google Scholar]
  68. Pan, R.; García-Díaz, J.A.; Ángel Rodríguez-García, M.; Valencia-García, R. Spanish MEACorpus 2023: A multimodal speech–text corpus for emotion analysis in Spanish from natural environments. Comput. Stand. Interfaces 2024, 90, 103856. [Google Scholar] [CrossRef]
  69. Paredes-Valverde, M.A.; del Pilar Salas-Zárate, M. Team ITST at EmoSPeech-IberLEF2024: Multimodal Speech-text Emotion Recognition in Spanish Forum. In Proceedings of the Iberian Languages Evaluation Forum (IberLEF 2024) at SEPLN 2024, Salamanca, Spain, 24 September 2024. [Google Scholar]
  70. Esteban-Romero, S.; Bellver-Soler, J.; Martín-Fernández, I.; Gil-Martín, M.; D’Haro, L.F.; Fernández-Martínez, F. THAU-UPM at EmoSPeech-IberLEF2024: Efficient Adaptation of Mono-modal and Multi-modal Large Language Models for Automatic Speech Emotion Recognition. In Proceedings of the Iberian Languages Evaluation Forum (IberLEF 2024) at SEPLN 2024, Salamanca, Spain, 24 September 2024. [Google Scholar]
  71. Cedeño-Moreno, D.; Vargas-Lombardo, M.; Delgado-Herrera, A.; Caparrós-Láiz, C.; Bernal-Beltrán, T. UTP at EmoSPeech-IberLEF2024: Using Random Forest with FastText and Wav2Vec 2.0 for Emotion Detection. In Proceedings of the Iberian Languages Evaluation Forum (IberLEF 2024) at SEPLN 2024, Salamanca, Spain, 24 September 2024. [Google Scholar]
  72. Duville, M.M.; Alonso-Valerdi, L.M.; Ibarra-Zárate, D.I. The Mexican Emotional Speech Database (MESD): Elaboration and assessment based on machine learning. In Proceedings of the 2021 43rd Annual International Conference of the IEEE Engineering in Medicine & Biology Society (EMBC), Mexico City, Mexico, 1–5 November 2021; pp. 1644–1647. [Google Scholar]
  73. Hozjan, V.; Kacic, Z.; Moreno, A.; Bonafonte, A.; Nogueiras, A. Interface Databases: Design and Collection of a Multilingual Emotional Speech Database. In Proceedings of the International Conference on Language Resources and Evaluation, Las Palmas, Spain, 29–31 May 2002. [Google Scholar]
  74. Parada-Cabaleiro, E.; Costantini, G.; Batliner, A.; Baird, A.; Schuller, B. Categorical vs Dimensional Perception of Italian Emotional Speech. In Proceedings of the Interspeech, Hyderabad, India, 2–6 September 2018. [Google Scholar]
  75. Shor, J.; Venugopalan, S. TRILLsson: Distilled Universal Paralinguistic Speech Representations. arXiv 2022, arXiv:2203.00236. [Google Scholar]
  76. Wu, Y.; Chen, K.; Zhang, T.; Hui, Y.; Nezhurina, M.; Berg-Kirkpatrick, T.; Dubnov, S. Large-scale Contrastive Language-Audio Pretraining with Feature Fusion and Keyword-to-Caption Augmentation. arXiv 2024, arXiv:2211.06687. [Google Scholar]
  77. Atmaja, B.T.; Sasou, A. Evaluating Self-Supervised Speech Representations for Speech Emotion Recognition. IEEE Access 2022, 10, 124396–124407. [Google Scholar] [CrossRef]
  78. Macary, M.; Tahon, M.; Estève, Y.; Rousseau, A. On the use of Self-supervised Pre-trained Acoustic and Linguistic Features for Continuous Speech Emotion Recognition. arXiv 2020, arXiv:2011.09212. [Google Scholar]
  79. Hozjan, V.; Kacic, Z.; Moreno, A.; Bonafonte, A.; Nogueiras, A. Emotional Speech Synthesis Database, ELRA catalogue. Available online: https://catalog.elra.info/en-us/repository/browse/ELRA-S0329/ (accessed on 12 December 2024).
  80. Chetia Phukan, O.; Balaji Buduru, A.; Sharma, R. Transforming the Embeddings: A Lightweight Technique for Speech Emotion Recognition Tasks. In Proceedings of the Interspeech, Dublin, Ireland, 20–24 August 2023; pp. 1903–1907. [Google Scholar] [CrossRef]
  81. Szeghalmy, S.; Fazekas, A. A Comparative Study of the Use of Stratified Cross-Validation and Distribution-Balanced Stratified Cross-Validation in Imbalanced Learning. Sensors 2023, 23, 2333. [Google Scholar] [CrossRef]
  82. Pedregosa, F.; Varoquaux, G.; Gramfort, A.; Michel, V.; Thirion, B.; Grisel, O.; Blondel, M.; Prettenhofer, P.; Weiss, R.; Dubourg, V.; et al. Scikit-learn: Machine Learning in Python. J. Mach. Learn. Res. 2011, 12, 2825–2830. [Google Scholar]
  83. Zhang, S.; Li, J. KNN Classification With One-Step Computation. IEEE Trans. Knowl. Data Eng. 2023, 35, 2711–2723. [Google Scholar] [CrossRef]
  84. Somvanshi, M.; Chavan, P.; Tambade, S.; Shinde, S.V. A review of machine learning techniques using decision tree and support vector machine. In Proceedings of the 2016 International Conference on Computing Communication Control and automation (ICCUBEA), Pune, India, 12–13 August 2016; pp. 1–7. [Google Scholar] [CrossRef]
  85. Gholami, R.; Fakhari, N. Chapter 27—Support Vector Machine: Principles, Parameters, and Applications. In Handbook of Neural Computation; Samui, P., Sekhar, S., Balas, V.E., Eds.; Academic Press: Cambridge, MA, USA, 2017; pp. 515–535. [Google Scholar] [CrossRef]
  86. Du, K.L.; Leung, C.S.; Mow, W.H.; Swamy, M.N.S. Perceptron: Learning, Generalization, Model Selection, Fault Tolerance, and Role in the Deep Learning Era. Mathematics 2022, 10, 4730. [Google Scholar] [CrossRef]
  87. Pasad, A.; Chou, J.C.; Livescu, K. Layer-Wise Analysis of a Self-Supervised Speech Representation Model. In Proceedings of the 2021 IEEE Automatic Speech Recognition and Understanding Workshop (ASRU), Cartagena, Colombia, 13–17 December 2021; pp. 914–921. [Google Scholar] [CrossRef]
  88. Galal, O.; Abdel-Gawad, A.H.; Farouk, M. Rethinking of BERT sentence embedding for text classification. Neural Comput. Appl. 2024, 36, 20245–20258. [Google Scholar] [CrossRef]
  89. Botelho, C.; Gimeno-Gómez, D.; Teixeira, F.; Mendonça, J.; Pereira, P.; Nunes, D.A.P.; Rolland, T.; Pompili, A.; Solera-Ureña, R.; Ponte, M.; et al. Tackling Cognitive Impairment Detection from Speech: A submission to the PROCESS Challenge. arXiv 2024, arXiv:2501.00145. [Google Scholar]
  90. Ng, S.I.; Xu, L.; Siegert, I.; Cummins, N.; Benway, N.R.; Liss, J.; Berisha, V. A Tutorial on Clinical Speech AI Development: From Data Collection to Model Validation. arXiv 2024, arXiv:2410.21640. [Google Scholar]
  91. Gonçalves, T.; Reis, J.; Gonçalves, G.; Calejo, M.; Seco, M. Predictive Models in the Diagnosis of Parkinson’s Disease Through Voice Analysis. In Proceedings of the Intelligent Systems and Applications, 2024; Arai, K., Ed.; Springer: Cham, Switzerland, 2024; pp. 591–610. [Google Scholar]
Figure 1. Layer-wise evaluation flow.
Figure 2. LOSO evaluation flow.
Figure 3. Average F1-score per layer.
Figure 4. Average UAR per layer.
Figure 5. Maximum F1-score achieved by each PTM for every database in layer-wise evaluation.
Figure 6. Maximum UAR achieved by each PTM for every database in layer-wise evaluation.
Figure 7. Max F1-score by databases and PTMs in LOSO validation.
Figure 8. Max UAR by databases and PTMs in LOSO validation.
Figure 9. Average F1 by PTMs in LOSO validation.
Figure 10. Average UAR by PTMs in LOSO validation.
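The two flow diagrams (Figures 1 and 2) describe a feature-extraction-plus-classification loop: embeddings are pulled from every hidden layer of a PTM, a classifier is trained on each layer, and the same features are additionally evaluated with Leave-One-Speaker-Out folds. The sketch below is a minimal illustration of such a pipeline using Hugging Face Transformers and scikit-learn; the checkpoint name, mean pooling over time, the SVM-only classifier, and the 5-fold stratified evaluation are assumptions made for the example and are not taken from the paper.

```python
# Minimal sketch of the layer-wise (Figure 1) and LOSO (Figure 2) evaluation flows.
# Assumptions not taken from the paper: placeholder checkpoint, mean pooling over time,
# SVM-only classification, and 5-fold stratified CV for the layer-wise part.
import numpy as np
import torch
from transformers import AutoFeatureExtractor, AutoModel
from sklearn.model_selection import cross_val_score, LeaveOneGroupOut
from sklearn.svm import SVC
from sklearn.metrics import recall_score

CKPT = "facebook/wav2vec2-large-xlsr-53"  # placeholder PTM checkpoint
extractor = AutoFeatureExtractor.from_pretrained(CKPT)
model = AutoModel.from_pretrained(CKPT, output_hidden_states=True).eval()

def layer_embeddings(waveform_16k):
    """Return one mean-pooled embedding per hidden layer for a 16 kHz mono waveform."""
    inputs = extractor(waveform_16k, sampling_rate=16000, return_tensors="pt")
    with torch.no_grad():
        hidden = model(**inputs).hidden_states  # tuple of (num_layers + 1) x [1, T, D]
    return [h.mean(dim=1).squeeze(0).numpy() for h in hidden]

def layerwise_f1(waveforms, labels):
    """Figure 1: evaluate a classifier on every layer and return one macro-F1 per layer."""
    feats = [layer_embeddings(w) for w in waveforms]
    labels = np.asarray(labels)
    scores = []
    for layer in range(len(feats[0])):
        X = np.stack([f[layer] for f in feats])
        scores.append(cross_val_score(SVC(), X, labels, cv=5, scoring="f1_macro").mean())
    return scores  # the best-performing layer is the one reported per PTM and database

def loso_uar(X, y, speakers):
    """Figure 2: Leave-One-Speaker-Out; each fold holds out all utterances of one speaker."""
    fold_uars = []
    for tr, te in LeaveOneGroupOut().split(X, y, groups=speakers):
        pred = SVC().fit(X[tr], y[tr]).predict(X[te])
        fold_uars.append(recall_score(y[te], pred, average="macro"))  # UAR per fold
    return float(np.mean(fold_uars))
```

In the study itself the same loop is run for KNN, MLP, and SVM and for every PTM and database (Tables 4 and 6); the sketch fixes a single classifier only to keep the example short.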
Table 1. Summary of recent SER studies in Spanish.
Authors | Dataset(s) | Techniques/Models | Best Result
Kerkeni et al. [62] | ELRA-S0329 | MFCC, MS + RNN, SVC, MLR | 90.05% Acc.
Ortega-Beltrán et al. [63] | ELRA-S0329, EmoMatchSpanishDB | DeepSpectrum with attention (DS-AM) | 98.4% (ELRA) Acc., 68.3% (EmoMatch) Acc.
García-Cuesta et al. [64] | EmoMatchSpanishDB | ComParE, eGeMAPS + SVC/XGBoost | 64.2% Precision
Pan & García-Díaz [68] | MEACorpus | Late Fusion + Feature Concatenation | 90.06% F1-Score
Begazo et al. [65] | MESD | CNN-1D, CNN-2D, MLP + Spectral features | 96% F1-Score
Pérez-Espinosa et al. [66] | EmoWisconsin | SVM + handcrafted features | 40.7% F1-Score
Casals-Salvador et al. [67] | MEACorpus | RoBERTa + XLSR-Wav2Vec 2.0 + Attention Pooling | 86.69% F1-Score
Table 2. Databases used in this work.
Database | Samples | Emotions | Type
EM-SpDB | 2020 | Surprise, Disgust, Fear, Anger, Happiness, Sadness, Neutrality | Acted
MESD | 864 | Anger, Disgust, Fear, Happiness, Sadness, Neutrality | Acted
MEACorpus | 5129 | Disgust, Anger, Joy, Sadness, Fear, Neutrality | Natural
EmoWisconsin | 3098 | Annoyed, Motivated, Nervous, Neutral, Doubtful, Confident | Induced
INTER1SP | 5520 | Anger, Sadness, Joy, Fear, Disgust, Surprise | Acted
EmoFilm-ES | 342 | Anger, Sadness, Happiness, Fear, Contempt | Acted
Table 3. Summary of PTMs used in the study.
PTM (Acronym) | Training Data/h | Parameters | Languages
W2V2-L-R-Libri960h | 60 K h + fine-tuning (960 h) | 317 M | English
W2V2-XLSR-ES | 436 K h + fine-tuning (Spanish) | 1 B | Multilingual (Spanish-focused)
W2V2-L-XLSR53-ES | 56 K h + fine-tuning (Spanish) | 317 M | 53 languages (Spanish-focused)
Whisper-L-v3 | 680 K h | 1.55 B | 96 languages
Whisper-L-v3-ES | Fine-tuned from Whisper-L-v3 | 1.55 B | Spanish
HuBERT-L-1160k | 1160 K h | 316 M | English
WavLM-L | 94 K h | 300 M | English
TRILLsson | Distilled from 900 M+ h (YT-U dataset) | N/A | Multilingual
L-CLAP-G | Multimodal audio-text pairs | N/A | Multilingual
Table 4. Performance comparison across PTMs, databases, and classifiers.
PTM | Database | KNN (Accuracy, F1, UAR) | MLP (Accuracy, F1, UAR) | SVM (Accuracy, F1, UAR)
HuBERT-L-1160k EmoFilm E S 69.44867.76865.64879.17578.43976.53584.72984.33982.279
EM-SpDB66.33764.87760.91386.78586.75585.2686.03585.93584.33
EmoWisconsin54.91248.831227.171556.131054.371038.191057.841354.07832.97
INTER1SP97.6397.59398.07399.75599.75599.81599.83699.83699.886
MESD86.71386.65386.76391.33791.34791.36791.91391.96791.953
MEACorpus90.26490.24487.27789.09389.06386.01291.63691.56688.796
W2V2-L-XLSR53-ES EmoFilm E S 76.39675.26675.28686.11485.68884.97486.11685.67684.396
EM-SpDB65.84364.92360.51388.28688.32688.02686.03686.08685.786
EmoWisconsin58.581252.771233.011257.35655.91635.691560.54756.61734.2316
INTER1SP96.86496.84497.5499.5499.5699.63499.75599.75599.815
MESD87.28486.63487.34493.64493.51493.68494.22394.25494.253
MEACorpus90.35490.23489.08589.96389.93387.43691.72391.56390.83
W2V2-L-R-Libri960h EmoFilm E S 70.83369.97469.18481.94881.352081.88883.331982.791981.7219
EM-SpDB62.09060.77056.69086.78586.9585.93585.79685.81684.956
EmoWisconsin55.64849.93829.93456.37653.54636.08257.61152.881132.665
INTER1SP95.78395.76396.57399.34399.34399.32199.34299.34499.54
MESD85.55385.4385.57391.33291.33491.38491.91391.83391.953
MEACorpus76.24277.31280.18687.15287.07286.85590.46390.38386.223
WavLM-L EmoFilm E S 80.56379.96381.33386.11685.85685.2787.5987.471186.969
EM-SpDB68.33267.19262.97286.78486.61485.34487.28387.24386.043
EmoWisconsin54.411148.291126.541156.131053.791035.57359.561055.331033.986
INTER1SP97.68397.67398.26399.42599.42599.57699.67599.67599.755
MESD89.02388.83389.08393.06493.04493.08494.22694.25694.256
MEACorpus90.75690.65689.92690.17490.1489.95492.02391.96389.833
Whisper-L-v3 EmoFilm E S 75.01274.191273.441987.51587.171986.531587.51687.251686.0716
EM-SpDB61.6759.6754.76785.292085.292083.871483.792083.722082.1615
EmoWisconsin55.391850.361828.381858.331856.441835.59760.051855.52031.55
INTER1SP91.151491.141492.271499.171399.171699.381699.341499.341499.514
MESD83.241183.121183.311194.8794.79794.83793.061393.051393.113
MEACorpus86.351286.151283.271488.79988.73987.51190.061289.891288.8312
W2V2-XLSR-ES EmoFilm E S 80.56579.88580.58587.5686.91686.31688.891188.881188.2711
EM-SpDB66.08564.71559.79386.781586.591584.661585.041184.881183.535
EmoWisconsin58.091551.811628.56160.781458.411440.01462.51554.752531.155
INTER1SP96.031196.01197.021199.67899.67899.75899.751299.751299.8112
MESD87.28786.91787.34795.38695.4695.38693.64893.63893.688
MEACorpus90.841190.741189.47590.64690.59690.44692.69592.53590.823
Whisper-L-v3-ES EmoFilm E S 80.562080.02078.552091.672291.522291.832287.51687.32386.0716
EM-SpDB61.1659.39755.05785.542485.542683.742684.292684.222682.9630
EmoWisconsin55.881750.531728.631760.783158.923136.97860.781956.02731.9216
INTER1SP92.391492.381492.941499.09999.091599.32999.171499.171799.3816
MESD83.82983.771483.891494.22994.21994.25994.22994.221094.259
MEACorpus85.881285.751285.121288.71488.611487.871489.681489.591485.0219
TRILLsson EmoFilm E S 63.8963.0161.3179.1778.577.8873.6172.369.88
EM-SpDB57.1154.8650.1480.880.7579.1181.0580.978.12
EmoWisconsin56.1348.9823.4156.3754.2735.3160.2952.9626.09
INTER1SP88.3488.3188.2797.7797.7798.1997.9397.9398.31
MESD70.5270.370.6587.2887.1187.3484.3984.4284.48
L-CLAP-G EmoFilm E S 61.1158.5956.8762.559.4457.2163.8962.9360.35
EM-SpDB53.3749.7744.7663.5962.8958.7366.0865.3160.72
EmoWisconsin50.4942.4319.6853.1949.9425.2650.2541.0218.96
INTER1SP80.480.2982.5891.0791.0691.489.6689.6590.75
MESD68.7968.2968.8878.0377.7678.0879.1979.0879.21
MEACorpus75.3475.0872.6778.1778.1369.3380.2180.2175.29
Note: The subscript indicates the layer from which the corresponding metric was obtained. Bold values represent the highest Accuracy, UAR, and F1-Score achieved by each PTM for each database.
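For reference, the three metrics reported in Tables 4–8 can be computed with scikit-learn [82]: Accuracy is the overall hit rate, UAR is recall averaged over classes without weighting by class frequency (macro recall), and the F1-Score is assumed here to be macro-averaged. The snippet below is only an illustration with placeholder labels, not the authors' evaluation code.

```python
# Placeholder labels; shown only to make the metric definitions concrete.
from sklearn.metrics import accuracy_score, f1_score, recall_score

y_true = ["anger", "joy", "anger", "sadness", "joy"]
y_pred = ["anger", "joy", "sadness", "sadness", "anger"]

acc = accuracy_score(y_true, y_pred)                 # Accuracy: overall hit rate
f1 = f1_score(y_true, y_pred, average="macro")       # F1-Score (macro average assumed)
uar = recall_score(y_true, y_pred, average="macro")  # UAR: unweighted (macro) average recall
print(f"Accuracy={acc:.2f}  F1={f1:.2f}  UAR={uar:.2f}")
```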
Table 5. Summary of best results by database, PTM, and classifier.
Database | Metric | PTM | Classifier | Score (Layer)
INTER1SP | Accuracy | HuBERT-L-1160k | SVM | 99.83% (6)
INTER1SP | F1-Score | HuBERT-L-1160k | SVM | 99.83% (6)
INTER1SP | UAR | HuBERT-L-1160k | SVM | 99.88% (6)
EmoFilm-ES | Accuracy | Whisper-L-v3-ES | MLP | 91.67% (22)
EmoFilm-ES | F1-Score | Whisper-L-v3-ES | MLP | 91.52% (22)
EmoFilm-ES | UAR | Whisper-L-v3-ES | MLP | 91.83% (22)
EM-SpDB | Accuracy | W2V2-L-XLSR53-ES | MLP | 88.28% (6)
EM-SpDB | F1-Score | W2V2-L-XLSR53-ES | MLP | 88.32% (6)
EM-SpDB | UAR | W2V2-L-XLSR53-ES | MLP | 88.02% (6)
EmoWisconsin | Accuracy | W2V2-XLSR-ES | MLP | 62.50% (15)
EmoWisconsin | F1-Score | Whisper-L-v3-ES | MLP | 58.92% (31)
EmoWisconsin | UAR | W2V2-XLSR-ES | SVM | 40.0% (14)
MESD | Accuracy | W2V2-XLSR-ES | MLP | 95.38% (6)
MESD | F1-Score | W2V2-XLSR-ES | MLP | 95.40% (6)
MESD | UAR | W2V2-XLSR-ES | MLP | 95.38% (6)
MEACorpus | Accuracy | W2V2-XLSR-ES | MLP | 92.69% (5)
MEACorpus | F1-Score | W2V2-XLSR-ES | MLP | 92.53% (5)
MEACorpus | UAR | W2V2-XLSR-ES | SVM | 90.82% (3)
Table 6. Performance comparison across databases and PTMs with LOSO validation.
Database | PTM (Layer) | KNN (UAR / Accuracy / F1, %) | MLP (UAR / Accuracy / F1, %) | SVM (UAR / Accuracy / F1, %)
EM-SpDB | HuBERT-L-1160k (9) | 51.42 / 61.45 / 58.11 | 67.88 / 75.92 / 74.38 | 68.46 / 75.80 / 74.34
EM-SpDB | L-CLAP-G | 41.32 / 51.79 / 48.07 | 50.59 / 58.94 / 56.99 | 49.87 / 59.58 / 57.59
EM-SpDB | TRILLsson | 49.76 / 60.03 / 56.26 | 70.82 / 78.17 / 76.99 | 71.42 / 79.66 / 78.48
EM-SpDB | W2V2-L-R-Libri960h (3) | 48.99 / 57.72 / 54.16 | 68.02 / 74.98 / 73.36 | 64.33 / 71.38 / 69.62
EM-SpDB | W2V2-L-XLSR53-ES (6) | 52.25 / 62.96 / 59.87 | 73.24 / 79.56 / 78.15 | 70.63 / 77.66 / 76.35
EM-SpDB | W2V2-XLSR-ES (11) | 51.68 / 60.66 / 56.96 | 70.94 / 78.21 / 76.68 | 70.37 / 77.40 / 75.89
EM-SpDB | WavLM-L (6) | 49.08 / 59.20 / 55.35 | 66.64 / 73.84 / 72.06 | 69.05 / 75.30 / 73.82
EM-SpDB | Whisper-L-v3 (14) | 41.57 / 50.71 / 46.47 | 65.23 / 73.41 / 71.96 | 65.53 / 72.34 / 70.81
EM-SpDB | Whisper-L-v3-ES (14) | 44.28 / 54.28 / 50.08 | 64.68 / 72.91 / 71.48 | 63.98 / 70.33 / 68.84
EmoFilm-ES | HuBERT-L-1160k (9) | 50.94 / 61.72 / 63.39 | 69.87 / 79.93 / 81.60 | 63.79 / 82.95 / 83.24
EmoFilm-ES | L-CLAP-G | 48.77 / 57.69 / 60.37 | 51.44 / 54.44 / 56.14 | 42.16 / 50.06 / 53.21
EmoFilm-ES | TRILLsson | 51.19 / 64.58 / 69.32 | 67.68 / 80.09 / 82.08 | 66.76 / 70.69 / 73.27
EmoFilm-ES | W2V2-L-R-Libri960h (3) | 56.34 / 73.74 / 74.71 | 65.56 / 74.54 / 74.80 | 57.36 / 68.88 / 70.45
EmoFilm-ES | W2V2-L-XLSR53-ES (6) | 62.45 / 76.05 / 76.96 | 69.09 / 83.68 / 83.01 | 65.13 / 79.13 / 80.66
EmoFilm-ES | W2V2-XLSR-ES (11) | 53.47 / 71.22 / 72.21 | 72.68 / 81.93 / 83.86 | 67.15 / 82.27 / 83.78
EmoFilm-ES | WavLM-L (6) | 57.29 / 70.85 / 72.33 | 66.78 / 81.15 / 81.86 | 62.01 / 74.41 / 76.13
EmoFilm-ES | Whisper-L-v3 (14) | 53.12 / 64.16 / 66.99 | 58.56 / 75.40 / 78.82 | 63.34 / 79.48 / 82.08
EmoFilm-ES | Whisper-L-v3-ES (14) | 59.35 / 68.96 / 71.15 | 57.67 / 73.37 / 77.24 | 71.72 / 77.60 / 79.61
EmoWisconsin | HuBERT-L-1160k (9) | 25.22 / 43.07 / 40.53 | 30.19 / 47.46 / 45.47 | 29.35 / 43.97 / 43.21
EmoWisconsin | L-CLAP-G | 22.89 / 41.38 / 38.93 | 28.37 / 48.23 / 44.66 | 25.68 / 43.39 / 42.54
EmoWisconsin | TRILLsson | 28.40 / 47.60 / 43.81 | 34.48 / 52.65 / 50.33 | 32.66 / 54.03 / 49.72
EmoWisconsin | W2V2-L-R-Libri960h (3) | 25.21 / 46.13 / 40.23 | 30.55 / 46.97 / 44.94 | 28.11 / 49.38 / 42.23
EmoWisconsin | W2V2-L-XLSR53-ES (6) | 25.61 / 47.37 / 43.10 | 32.72 / 50.74 / 48.37 | 32.14 / 54.85 / 48.89
EmoWisconsin | W2V2-XLSR-ES (11) | 25.19 / 44.61 / 42.36 | 34.71 / 53.08 / 50.75 | 30.47 / 43.67 / 43.66
EmoWisconsin | WavLM-L (6) | 22.98 / 40.92 / 38.76 | 29.81 / 47.80 / 45.35 | 28.21 / 41.10 / 41.56
EmoWisconsin | Whisper-L-v3 (14) | 25.60 / 43.76 / 41.09 | 30.22 / 50.92 / 49.15 | 29.01 / 43.51 / 43.70
EmoWisconsin | Whisper-L-v3-ES (14) | 25.51 / 43.36 / 40.04 | 28.79 / 48.92 / 46.95 | 28.41 / 42.65 / 42.22
INTER1SP | HuBERT-L-1160k (9) | 50.20 / 52.96 / 50.32 | 61.08 / 63.83 / 63.47 | 55.84 / 64.73 / 62.04
INTER1SP | L-CLAP-G | 33.68 / 33.80 / 30.40 | 39.99 / 37.95 / 34.41 | 36.40 / 38.78 / 34.43
INTER1SP | TRILLsson | 53.40 / 57.92 / 56.99 | 52.46 / 57.07 / 55.32 | 60.50 / 65.61 / 64.52
INTER1SP | W2V2-L-R-Libri960h (3) | 49.66 / 49.38 / 45.54 | 55.32 / 55.45 / 51.88 | 52.25 / 59.67 / 55.57
INTER1SP | W2V2-L-XLSR53-ES (6) | 58.55 / 59.89 / 57.45 | 65.99 / 67.34 / 65.45 | 60.98 / 70.04 / 67.44
INTER1SP | W2V2-XLSR-ES (11) | 54.94 / 57.15 / 55.32 | 61.25 / 66.37 / 64.30 | 60.45 / 68.63 / 66.52
INTER1SP | WavLM-L (6) | 51.61 / 53.83 / 50.41 | 50.94 / 54.43 / 49.97 | 50.56 / 57.83 / 53.35
INTER1SP | Whisper-L-v3 (14) | 42.10 / 43.98 / 41.08 | 49.75 / 51.79 / 49.53 | 54.91 / 61.01 / 57.68
INTER1SP | Whisper-L-v3-ES (14) | 46.46 / 48.18 / 43.97 | 42.43 / 44.67 / 41.23 | 52.69 / 59.75 / 55.97
MEACorpus | HuBERT-L-1160k (9) | 16.33 / 41.25 / 47.63 | 44.82 / 68.37 / 59.36 | 53.22 / 70.13 / 62.83
MEACorpus | L-CLAP-G | 13.80 / 31.36 / 37.54 | 24.37 / 49.59 / 46.35 | 18.83 / 46.17 / 49.64
MEACorpus | TRILLsson | 33.28 / 59.67 / 58.38 | 54.92 / 75.85 / 71.64 | 50.19 / 66.62 / 59.82
MEACorpus | W2V2-L-R-Libri960h (3) | 38.69 / 61.79 / 67.55 | 51.70 / 68.37 / 61.34 | 55.87 / 71.88 / 65.26
MEACorpus | W2V2-L-XLSR53-ES (6) | 27.95 / 61.33 / 60.43 | 36.74 / 63.47 / 56.17 | 49.05 / 66.62 / 57.87
MEACorpus | W2V2-XLSR-ES (11) | 21.56 / 51.80 / 54.20 | 50.38 / 70.49 / 64.61 | 36.77 / 66.62 / 61.62
MEACorpus | WavLM-L (6) | 27.70 / 53.00 / 56.85 | 57.95 / 73.64 / 67.07 | 59.47 / 75.39 / 68.77
MEACorpus | Whisper-L-v3 (14) | 23.54 / 57.85 / 63.76 | 53.79 / 70.13 / 63.36 | 42.55 / 63.11 / 57.06
MEACorpus | Whisper-L-v3-ES (14) | 28.03 / 57.38 / 62.62 | 46.59 / 70.13 / 63.41 | 50.76 / 66.72 / 62.03
MESD | HuBERT-L-1160k (9) | 33.73 / 33.75 / 32.13 | 45.20 / 45.24 / 41.60 | 47.17 / 47.21 / 42.86
MESD | L-CLAP-G | 33.59 / 33.65 / 28.67 | 37.75 / 37.82 / 33.29 | 39.60 / 39.68 / 35.38
MESD | TRILLsson | 35.70 / 35.72 / 32.73 | 47.30 / 47.31 / 44.71 | 48.20 / 48.23 / 46.30
MESD | W2V2-L-R-Libri960h (3) | 33.13 / 33.17 / 30.18 | 41.80 / 41.87 / 38.10 | 44.59 / 44.65 / 40.42
MESD | W2V2-L-XLSR53-ES (6) | 40.45 / 40.49 / 38.22 | 49.83 / 49.88 / 48.19 | 52.48 / 52.54 / 50.59
MESD | W2V2-XLSR-ES (11) | 36.86 / 36.88 / 34.18 | 48.21 / 48.26 / 44.59 | 53.09 / 53.12 / 50.25
MESD | WavLM-L (6) | 36.02 / 36.08 / 32.35 | 43.11 / 43.15 / 41.31 | 41.60 / 41.64 / 38.33
MESD | Whisper-L-v3 (14) | 30.24 / 30.28 / 26.49 | 39.51 / 39.56 / 36.63 | 37.42 / 37.47 / 35.64
MESD | Whisper-L-v3-ES (14) | 30.11 / 30.16 / 25.23 | 39.48 / 39.55 / 35.71 | 36.83 / 36.89 / 34.67
Bold values represent the highest Accuracy, UAR, and F1-Score achieved by each PTM for each database.
Table 7. Summary of best results (layer-wise) and SOTA comparison by database and metric.
Database | Metric | Our Result | Our PTM (Layer) | SOTA Result | SOTA Reference
INTER1SP | Accuracy | 99.83% | HuBERT-L-1160k (6) | 98.40% | [63]
INTER1SP | F1-Score | 99.83% | HuBERT-L-1160k (6) | - | -
INTER1SP | UAR | 99.88% | HuBERT-L-1160k (6) | - | -
EmoFilm-ES | Accuracy | 91.67% | Whisper-L-v3-ES (6) | - | -
EmoFilm-ES | F1-Score | 91.52% | Whisper-L-v3-ES (6) | - | -
EmoFilm-ES | UAR | 91.83% | Whisper-L-v3-ES (6) | - | -
EM-SpDB | Accuracy | 88.28% | W2V2-L-XLSR53-ES (6) | 68.30% | [63]
EM-SpDB | F1-Score | 88.32% | W2V2-L-XLSR53-ES (6) | - | -
EM-SpDB | UAR | 88.02% | W2V2-L-XLSR53-ES (6) | - | -
EmoWisconsin | Accuracy | 62.50% | W2V2-XLSR-ES (15) | - | -
EmoWisconsin | F1-Score | 58.92% | Whisper-L-v3-ES (31) | 40.70% | [66]
EmoWisconsin | UAR | 40.0% | W2V2-XLSR-ES (14) | - | -
MESD | Accuracy | 95.38% | W2V2-XLSR-ES (6) | 96.00% | [65]
MESD | F1-Score | 95.40% | W2V2-XLSR-ES (6) | 96.00% | [65]
MESD | UAR | 95.38% | W2V2-XLSR-ES (6) | - | -
MEACorpus | Accuracy | 92.69% | W2V2-XLSR-ES (5) | - | -
MEACorpus | F1-Score | 92.53% | W2V2-XLSR-ES (5) | 90.06% | [68]
MEACorpus | UAR | 90.82% | W2V2-XLSR-ES (3) | - | -
Note: Bold values represent the highest obtained values.
Table 8. Summary of best results of LOSO validation by database and metric.
Database | Metric | Result | PTM (Layer)
EM-SpDB | Accuracy | 79.66% | TRILLsson
EM-SpDB | F1-Score | 78.48% | TRILLsson
EM-SpDB | UAR | 73.24% | W2V2-L-XLSR53-ES (6)
EmoFilm-ES | Accuracy | 83.68% | W2V2-L-XLSR53-ES (6)
EmoFilm-ES | F1-Score | 83.86% | W2V2-XLSR-ES (11)
EmoFilm-ES | UAR | 72.68% | W2V2-XLSR-ES (11)
INTER1SP | Accuracy | 70.04% | W2V2-L-XLSR53-ES (6)
INTER1SP | F1-Score | 67.44% | W2V2-L-XLSR53-ES (6)
INTER1SP | UAR | 65.99% | W2V2-L-XLSR53-ES (6)
EmoWisconsin | Accuracy | 54.85% | W2V2-L-XLSR53-ES (6)
EmoWisconsin | F1-Score | 50.75% | W2V2-XLSR-ES (11)
EmoWisconsin | UAR | 34.71% | W2V2-XLSR-ES (11)
MEACorpus | Accuracy | 75.85% | TRILLsson
MEACorpus | F1-Score | 71.64% | TRILLsson
MEACorpus | UAR | 59.47% | WavLM-L (6)
MESD | Accuracy | 53.12% | W2V2-XLSR-ES (11)
MESD | F1-Score | 50.59% | W2V2-L-XLSR53-ES (6)
MESD | UAR | 53.09% | W2V2-XLSR-ES (11)
Disclaimer/Publisher’s Note: The statements, opinions and data contained in all publications are solely those of the individual author(s) and contributor(s) and not of MDPI and/or the editor(s). MDPI and/or the editor(s) disclaim responsibility for any injury to people or property resulting from any ideas, methods, instructions or products referred to in the content.
