Article

Advancing Spanish Speech Emotion Recognition: A Comprehensive Benchmark of Pre-Trained Models

by Alex Mares 1, Gerardo Diaz-Arango 1, Jorge Perez-Jacome-Friscione 1, Hector Vazquez-Leal 1,*, Luis Hernandez-Martinez 2, Jesus Huerta-Chua 3, Andres Felipe Jaramillo-Alvarado 2 and Alfonso Dominguez-Chavez 1

1 Facultad de Instrumentacion Electronica, Universidad Veracruzana, Cto. Gonzalo Aguirre Beltran S/N, Xalapa 91000, Mexico
2 Instituto Tecnologico Superior de Poza Rica, Tecnologico Nacional de Mexico, Luis Donaldo Colosio Murrieta S/N, Arroyo del Maiz, Poza Rica 93230, Mexico
3 Electronics Department, National Institute for Astrophysics, Optics and Electronics, Sta. María Tonantzintla, Puebla 72840, Mexico
* Author to whom correspondence should be addressed.
Appl. Sci. 2025, 15(8), 4340; https://doi.org/10.3390/app15084340
Submission received: 1 March 2025 / Revised: 3 April 2025 / Accepted: 5 April 2025 / Published: 14 April 2025
(This article belongs to the Section Computing and Artificial Intelligence)

Abstract

Feature extraction for speech emotion recognition (SER) has evolved from handcrafted techniques through deep learning methods to embeddings derived from pre-trained models (PTMs). This study presents the first comparative analysis focused on using PTMs for Spanish SER, evaluating six models—Whisper, Wav2Vec 2.0, WavLM, HuBERT, TRILLsson, and CLAP—across six emotional speech databases: EmoMatchSpanishDB, MESD, MEACorpus, EmoWisconsin, INTER1SP, and EmoFilm. We propose a robust framework combining layer-wise feature extraction with Leave-One-Speaker-Out validation to ensure interpretable model comparisons. Our method significantly outperforms existing state-of-the-art benchmarks, notably achieving F1-scores of 88.32% on EmoMatchSpanishDB, 99.83% on INTER1SP, and 92.53% on MEACorpus. Layer-wise analyses reveal optimal emotional representation extraction at early layers in 24-layer models and middle layers in larger architectures. Additionally, TRILLsson exhibits remarkable generalization in speaker-independent evaluations, highlighting the necessity of strategic model selection, fine-tuning, and language-specific adaptations to maximize SER performance for Spanish.

1. Introduction

Emotions are integral to human communication, profoundly influencing our interactions and behaviors. Among the various forms of expression, speech stands out as a crucial medium in which emotions are conveyed, making it a fundamental source for emotion recognition [1]. Speech emotion recognition (SER) aims to detect embedded emotions by processing and analyzing speech signals [2].
Traditional SER systems rely on acoustic features, such as intonation, rhythm, intensity, and duration, to identify patterns associated with specific emotional states [3]. These systems follow a process with a structure that includes signal acquisition, preprocessing [4], feature extraction (focusing on acoustic properties like pitch, energy, and spectral features) [1], feature selection, classification, and model evaluation [5].
The applications of SER are broad and cover many areas. In human–computer interaction, SER enables conversational agents and robots to respond empathetically, enhancing user experiences [6,7]. In healthcare, it supports the diagnosis of mental health conditions like depression and also aids in monitoring emotional well-being [8,9,10]. SER systems are also applied in call centers to improve customer service, in the automotive industry for stress monitoring, and in educational technologies for personalized learning experiences [1,11,12].
SER models are built on two emotion representation paradigms: categorical models and dimensional models [13,14,15]. The categorical model identifies basic emotions universally recognized across all cultures. Paul Ekman initially proposed six basic emotions—sadness, happiness, disgust, anger, fear, and surprise—emphasizing their expression through facial cues and their cross-cultural universality [16]. This concept converges with the idea of Darwinian evolution, which says that emotions are primitive responses shared among humans and animals [17].
In contrast, the dimensional model represents emotions on a continuous scale, mapping affective states onto dimensions like valence and arousal [13,18]. Valence reflects the positive or negative evaluation of an emotion, while arousal denotes its intensity or activation level [19].
Over the past two decades, SER has advanced from traditional machine learning techniques relying on handcrafted features to complex deep learning models capable of automatically extracting information-rich representations from raw data [18,20,21]. The most common metrics for evaluating these systems are accuracy, precision, recall, and F1-score, with the F1-score being particularly useful for imbalanced datasets [14,15]. Techniques like leave-one-speaker-out (LOSO) provide robust benchmarking across different speakers and datasets [22].
Early approaches made use of features such as pitch, energy, and Mel-frequency cepstral coefficients (MFCCs), combined with classifiers like support vector machines (SVMs) [23,24,25], Hidden Markov Models (HMMs) [26,27,28], Gaussian Mixture Models (GMMs) [29,30], and Naïve Bayes classifiers [31,32]. While these methods laid the groundwork for SER, they often struggled to capture the intricate emotional nuances in speech, particularly in large-scale and high-dimensional datasets [13,33].
The emergence of deep learning transformed SER by introducing models like Convolutional Neural Networks (CNNs) [34,35,36], Recurrent Neural Networks (RNNs), and their variants such as Long Short-Term Memory networks (LSTMs) and Gated Recurrent Units (GRUs) [37,38,39]. These models demonstrated superior performance by automatically learning relevant features and capturing temporal dependencies [20,40]. Hybrid models combining CNNs with RNNs or incorporating attention mechanisms further enhanced SER capabilities by effectively modeling both spatial and temporal aspects of speech signals [40,41,42,43].
Recently, the use of pre-trained models (PTMs) and self-supervised learning (SSL) has marked a significant breakthrough in SER. Models like wav2vec 2.0 [44], HuBERT [45], WavLM [46], and Whisper [47] employ large volumes of unlabeled data to learn speech representations. These models can be fine-tuned for specific tasks, reducing the need for extensive labeled datasets [48,49,50]. By making use of such models, SER has achieved state-of-the-art (SOTA) performance, particularly enhancing recognition accuracy in cross-lingual and out-of-domain scenarios [51,52,53]. The integration of PTMs like wav2vec 2.0 with fine-tuning approaches, such as task-adaptive pre-training and multi-task learning, has consistently outperformed traditional methods, highlighting the potential of transfer learning to effectively capture emotional signals [54,55,56]. Furthermore, models like Whisper have challenged the conventional notion that automatic speech recognition models are suboptimal for SER, demonstrating that ASR models can perform exceptionally well in emotion recognition tasks [51].
Despite these achievements, SER still faces significant challenges, particularly in underrepresented languages like Spanish. Variability in emotional expression across speakers, languages, and cultures complicates model generalization and robustness [4,20,33]. The scarcity of large, diverse, and emotionally annotated speech corpora hinders the training of deep learning models, often resulting in overfitting and poor performance in real-world scenarios [20,51,57,58].
Cultural and linguistic differences further require the development of language-specific resources, as features like native language, accent, and pronunciation significantly influence model performance, complicating cross-lingual generalization [59]. Additionally, capturing the subtle nuances of emotions and addressing the ambiguity and subjectivity inherent in emotional expression remain complex tasks [60]. The lack of consensus on optimal acoustic features and the significant differences in performance between acted and natural emotional speech databases complicate the development of universal SER systems [21,61]. Research on the performance and generalization of SER models and methodologies using databases underrepresented in the current SOTA is therefore necessary.
While recent advances have been made in Spanish SER, several methodological gaps remain. Many systems still rely on handcrafted features such as MFCCs and spectral descriptors, limiting their ability to generalize across speakers and contexts [62,63,64,65,66]. Others integrate deep or multimodal models but conduct limited evaluations, often omitting speaker-independent protocols or cross-corpus validation [67,68]. Furthermore, while some proposals incorporate pretrained models, they do not fully exploit their representational capacity, overlooking layer-wise analysis and targeted fine-tuning [67,69,70,71]. These limitations highlight the need for broader benchmarking across Spanish emotional speech corpora and a deeper investigation of PTMs to build more robust and transferable SER systems.
This work focuses on six databases: EmoMatchSpanishDB [64], Mexican Emotional Speech Database [72], Spanish MEACorpus [68], EmoWisconsin [66], INTER1SP [73], and EmoFilm [74]. These databases were selected for their emphasis on categorical annotations spanning the full range of emotional representation elicitation, including acted, induced, and natural expressions collected from various demographics such as adults and children, professional actors, and non-actors. The diversity in data sources allows for a thorough benchmarking of model generalizability across varied real-world scenarios.
A wide range of advanced PTMs in the field of audio processing and SER were chosen, such as Wav2Vec 2.0 [44], Whisper [47], HuBERT [45], WavLM [46], TRILLsson [75], and CLAP [76]. These models include self-supervised architectures, sequence-to-sequence frameworks, and multimodal contrastive learning approaches. By using them as feature extractors, we generate embeddings that capture high-level representations of speech, allowing the identification of acoustic and paralinguistic features essential for emotion recognition. The objective is to systematically benchmark the performance of these PTMs, using their hidden layers as feature extractors, specifically for emotion recognition in Spanish speech. To this end, layer-wise and LOSO evaluations are introduced, contributing to a detailed understanding of how the various PTMs perform, how this relates to fine-tuning on the target language, and how the results compare with the SOTA of Spanish SER.
The main contributions of this work are as follows:
  • We present the first comprehensive benchmarking of PTMs for Spanish SER, covering six diverse databases and addressing the lack of evaluations in underrepresented languages.
  • We devise a robust experimental framework combining LOSO validation with layer-wise feature extraction, enabling accurate and interpretable comparisons across models.
  • Our proposed approach achieves superior results on multiple Spanish datasets, outperforming existing state-of-the-art baselines while highlighting the benefits of Spanish-focused fine-tuning.
  • We reveal how architectural depth and pretraining strategies influence emotional representations, guiding model selection in Spanish SER tasks through a detailed layer-wise and LOSO analysis.
  • We establish a reproducible benchmarking pipeline that advances research on feature extraction, generalization, and fine-tuning in Spanish SER, offering consistent metrics across diverse scenarios.
To guide the reader through our methods and results, this paper is structured as follows: Section 2 describes related work, emphasizing previous studies in the Spanish language and recent approaches in SER with PTMs. Section 3 presents the databases used and the methodology applied, including feature extraction from different layers of PTMs, the classifiers used, and the validation schemes, both cross-validation and LOSO. Section 4 shows and analyzes the results obtained for each model and database. Section 5 then compares these findings with the SOTA, discussing strengths, limitations, and notable trends. Finally, Section 6 presents the main conclusions, limitations, and points out possible directions for future work.

2. Related Works

2.1. State of the Art in Spanish SER

Recent progress in Spanish SER has leveraged diverse datasets and methodologies with notable success. Kerkeni et al. [62] used MFCC and MS features with classifiers like RNN, SVC, and MLR on the ELRA-S0329 dataset, reaching 90.05% accuracy with RNN and MFCC+MS. García-Cuesta et al. [64] applied ComParE and eGeMAPS features with SVC/XGBOOST, achieving 64.2% precision. Begazo et al. [65] combined CNN-1D, CNN-2D, and MLP architectures with spectral and spectrogram features, obtaining 96% accuracy, recall, and F1-score on MESD. Ortega-Beltrán et al. [63] employed DeepSpectrum with attention (DS-AM), achieving 98.4% on ELRA-S0329 and 68.3% on EmoMatchSpanishDB. Pan and García-Díaz [68] reported a 90.06% weighted F1-score on MEACorpus using Late Fusion and feature concatenation. In contrast, Pérez-Espinosa et al. [66] obtained only a 40.7% F1-score on the older EmoWisconsin dataset.
In the multimodal domain, IberLEF 2024 showcased the strength of combining audio and text. The best-performing model, proposed by BSC–UPC [67], reached an 86.69% F1-score using RoBERTa and XLSR-Wav2Vec 2.0, outperforming text-only approaches by nearly 10 percentage points. Table 1 provides a comparative overview of these studies.
Despite this progress, several limitations hinder the generalization and robustness of current approaches. Some studies still rely on handcrafted features such as MFCCs and spectral descriptors [62,63,64,65,66]. Others employ deep models but limit evaluation to few datasets, without considering speaker-independence or cross-corpus scenarios [67,68]. Additionally, while some use PTMs, they often overlook in-depth analyses, such as layer-wise exploration or selective fine-tuning [67,69,70,71]. These gaps highlight the need for broader benchmarking across acted, natural, and elicited Spanish corpora, and deeper PTM-based exploration to ensure consistent performance in real-world conditions.

2.2. PTMs for SER

The exploration of PTMs for SER has demonstrated significant achievements, particularly with self-supervised architectures like Wav2Vec 2.0, HuBERT, and Whisper. Triantafyllopoulos et al. [53] highlighted that fine-tuned PTMs excel in valence prediction by leveraging linguistic features, but they remain less effective for arousal and dominance, a limitation also noted by Wagner et al. [50]. Similarly, Osman et al. [51] emphasized the generalization capabilities of PTMs like Whisper in cross-lingual and out-of-domain scenarios, outperforming other models on diverse multilingual datasets. Phukan et al. [52] and Atmaja and Sasou [77] expanded this by demonstrating the robustness of paralinguistic PTMs such as TRILLsson in capturing non-semantic features critical for SER tasks across languages. Studies by Pepino et al. [54] and Chen and Rudnicky [55] explored layer-wise feature fusion and advanced fine-tuning strategies, showcasing improvements in accuracy, particularly with novel approaches like pseudo-label task adaptive pretraining (P-TAPT). The integration of multimodal approaches combining acoustic and linguistic features, as reported by Macary et al. [78], further supports the utility of PTMs in enhancing SER performance. Collectively, these works underline the versatility and adaptability of PTMs in addressing the complexities of SER tasks while identifying challenges such as robustness to domain shifts, imbalanced datasets, and the need for fine-grained feature tuning.

3. Materials and Methods

This section presents a detailed description of the methodologies employed to evaluate PTMs on Spanish emotional speech databases. The databases used cover a wide range of emotional representations in order to ensure a complete analysis of the emotional variability inherent in speech. A thorough layer-by-layer analysis of several SOTA PTMs is carried out, in which embeddings are extracted from each hidden layer, and mean pooling is applied to obtain fixed-dimensional representations suitable for classification tasks with general models, such as KNN, SVM, and MLP. The above-mentioned classifiers are trained and evaluated using cross-validation to determine their efficacy. Furthermore, to evaluate the generalization of the models to unseen speakers, LOSO cross-validation is implemented.

3.1. Databases

As part of this study, the focus was placed on six Spanish-language databases, EmoMatchSpanishDB, MESD, Spanish MEACorpus, EmoWisconsin, INTER1SP, and EmoFilm, selected for their emphasis on categorical emotion classifications, which differ from continuous models. Acted, induced, and natural emotions are included in these databases, and a comprehensive basis for analyzing emotional phenomena is provided. The details and specific contributions of each database are explained below, and a summary of them is presented in Table 2.
  • EmoMatchSpanishDB
    The EmoMatchSpanishDB (EM-SpDB) [64] database is part of EmoSpanishDB, which was created from recordings of 50 non-actor participants (30 men and 20 women) simulating Ekman’s six basic emotions plus a neutral category: surprise, disgust, fear, anger, happiness, sadness, and neutral. EmoSpanishDB initially contains 3550 audio files, free of intrinsic emotional load. Through a crowdsourcing process, perceived emotions were validated and inconsistent samples were removed, leading to the development of EM-SpDB, with the set being reduced to 2020 audios with more reliable emotional labels.
  • Mexican Emotional Speech Database.
    The Mexican Emotional Speech Database (MESD) [72] was created in 2021 and consists of 864 recordings in Mexican Spanish. Six emotions—anger, disgust, fear, happiness, sadness, and neutral—were simulated by sixteen individuals (4 men, 4 women, and 8 children). The recordings were conducted in a professional studio, and the audio was sampled at 48,000 Hz with 16-bit resolution. Each adult participant recorded 48 words per emotion, while children recorded 24 words per emotion. The database was validated in [72] using an SVM model, and an accuracy of 93.9% for men, 89.4% for women, and 83.3% for children was achieved.
  • Spanish MEACorpus.
    The Spanish MEACorpus [68] is a multimodal database created in 2022, containing 5129 audio segments extracted from 13.16 h of Spanish speech. The emotional distribution is based on Ekman’s basic emotions. The segments are derived from YouTube videos recorded spontaneously in natural settings such as political speeches, sports events, and entertainment shows. The database includes 46% female voices and 54% male voices, though specific details on speaker ages are not provided. The audios are in WAV format at a sampling rate of 44,100 Hz, with an average length of 9.24 s, segmented using a noise and silence threshold-based algorithm. Manual annotation was performed by three annotators using Ekman’s taxonomy.
  • EmoWisconsin.
    EmoWisconsin [66] was created in 2011 in Mexican Spanish and contains 3098 segments of children’s speech, including 28 participants (17 boys and 11 girls) aged 7 to 13. The labeled emotions include six categories: Annoyed, Motivated, Nervous, Neutral, Doubtful, and Confident, along with the three continuous primitives of valence, arousal, and dominance. The recordings were made at 44,100 Hz in 16-bit PCM WAV format, totaling 11 h and 39 min of recordings across 56 sessions. Emotions were elicited using a modified version of the Wisconsin Card Sorting Test (WCST), with sessions divided into positive and negative interactions.
  • INTERFACE (INTER1SP).
    The INTERFACE [73] database was created in 2002 and contains 5520 samples in Spanish (INTER1SP), along with recordings in English, Slovenian, and French. Six emotions are included: anger, sadness, joy, fear, disgust, and surprise. Additionally, neutral speech styles were recorded in Spanish, including variations like soft, loud, slow, and fast. Recordings were made by one professional actor and one actress in each language. For validation, 18 non-professional listeners evaluated 56 statements through subjective tests. Recognition accuracy in the first choice exceeded 80%, rising to 90% when a second option was included. This database is available in the ELRA repository [79].
  • EmoFilm.
    EmoFilm [74] is a multilingual database created in 2018, designed to enrich underrepresented languages such as Spanish, Italian, and others. It consists of 1115 clips with an average length of 3.5 s, distributed as 360 clips in English, 413 in Italian, and 342 in Spanish. This dataset includes five emotions: anger, sadness, happiness, fear, and contempt. For this study, only the Spanish portion of EmoFilm was used (EmoFilm ES). Movie and TV series clips were extracted, capturing emotional expressions in acted contexts. Emotion annotation was performed through evaluations by multiple annotators, achieving high perceptual reliability.

3.2. Pre-Trained Models

In this work, we make use of a selection of PTMs, including self-supervised architectures, sequence-to-sequence frameworks, and multimodal contrastive learning approaches. Each hidden layer of these models is applied as a feature extractor to generate embeddings that capture speech representations relevant to the identification of both acoustic and paralinguistic features. The following sections detail the PTMs selected for this work, highlighting their architectures and pretraining methodologies.
  • Wav2Vec 2.0.
    Wav2Vec 2.0 [44] is a self-supervised framework that learns representations from raw audio by solving a contrastive task on latent masked speech samples. The architecture is based on a convolutional feature encoder that processes waveform inputs into latent speech representations, followed by a transformer context network that captures contextual information. A quantization module discretizes the latent representations for contrastive learning. The training objective maximizes the agreement between the real latent representations and the quantized versions for the masked time steps, effectively capturing local and global acoustic features without the need for transcribed data.
    We used several variants of Wav2Vec 2.0 in our experiments:
    • facebook/wav2vec2-large-robust-ft-libri-960h (W2V2-L-R-Libri960h): Pre-trained on 60,000 h of audio, including noise and telephony data, and fine-tuned on 960 h of the LibriSpeech corpus. This model consists of 24 layers and contains 317 million parameters.
    • jonatasgrosman/wav2vec2-xls-r-1b-spanish (W2V2-XLSR-ES): Based on the XLS-R architecture, this model was pre-trained on 436,000 h of multilingual data and fine-tuned on Spanish ASR tasks. It has 1 billion parameters and 48 transformer layers.
    • facebook/wav2vec2-large-xlsr-53-spanish (W2V2-L-XLSR53-ES): Pre-trained with 56,000 h of speech data from 53 languages including Spanish, and fine-tuned with Spanish ASR data. This model has 317 million parameters and 24 hidden layers.
  • Whisper.
    Whisper [47] is an encoder–decoder transformer model designed for ASR and speech translation tasks. Its training approach, using supervised data, allows the model to effectively generalize across languages and tasks, capturing both linguistic and paralinguistic features that can be essential for emotion recognition.
    • openai/whisper-large-v3 (Whisper-L-v3): This model contains 1.55 billion parameters, 32 encoder and decoder layers, and is trained on 680,000 h of multilingual data.
    • zuazo/whisper-large-v3-es (Whisper-L-v3-ES): This is an optimized version specifically tailored for Spanish ASR tasks, featuring 32 hidden layers.
  • HuBERT.
    HuBERT [45] is a PTM developed by Facebook, with a fully transformer-based architecture. It has been trained on 60,000 h of unlabeled audio data (the Libri-Light corpus) using a self-supervised learning paradigm similar to Wav2Vec 2.0, but optimized for masked prediction tasks within audio signals. In this study, a 24-layer, 316 million-parameter version, facebook/hubert-large-ll60k (HuBERT-L-1160k), has been employed.
  • WavLM.
    WavLM [46] incorporates a controlled relative position bias to model long-range dependencies in the speech signal, which improves the model’s ability to capture global contextual information. Pre-training uses a masked speech prediction task with continuous inputs, similar to the HuBERT and Wav2Vec 2.0 methodologies. For our work, we use the microsoft/wavlm-large (WavLM-L) model, which consists of 24 transformer layers and 300 million parameters.
  • TRILLsson.
    TRILLsson [75] is a compact and efficient model derived through the distillation of CAP12, a high-performance Conformer model trained with self-supervised objectives on roughly 900,000 h of audio from the YT-U dataset. Tailored for non-semantic tasks, TRILLsson captures paralinguistic features critical for emotion recognition, such as tone, pitch, and rhythm. Although it is 6x to 100x smaller than CAP12, it retains 90–96% of its performance. Its efficient architecture leverages local matching strategies to align smaller inputs with robust embeddings, enabling deployment on resource-constrained devices. Only the final embeddings can be obtained from this model, so, unlike the others, it is not analyzed layer-wise.
  • CLAP.
    CLAP [76] makes use of a dual-encoder architecture for audio and text modalities, trained through a contrastive loss that aligns audio and text embeddings in a shared latent space. The audio encoder processes audio inputs using the HTSAT architecture, while the text encoder processes textual descriptions. In our work, we use the laion/clap-htsat-unfused (L-CLAP-G) model, which, like TRILLsson, only yields final embeddings and not intermediate ones, so its analysis will not be layer-based either.
A summary of the PTMs evaluated in this work is provided in Table 3. This table highlights key details, including the training datasets, parameter counts, and language coverage.

3.3. Layer-Wise Evaluation of PTMs

A systematic layer-wise evaluation has been performed in this work in order to identify the most effective representations for SER. The approach examines each layer of the PTMs sequentially, extracting embeddings, starting from the earliest layers and moving toward the deepest layers. The complete flow of the layer-wise evaluation is presented in Figure 1.

3.4. Embedding Extraction and Mean Pooling

For each PTM and speech database, embeddings are extracted sequentially and then mean pooled. This transformation condenses the temporal dynamics into a single vector while preserving the essential statistical properties of the features. Formally, let $\mathbf{E} \in \mathbb{R}^{T \times D}$ be the embedding matrix of an audio sample, where $T$ is the number of time steps and $D$ the dimensionality of the features. The mean-pooled embedding $\mathbf{e} \in \mathbb{R}^{D}$ is computed as:

$$\mathbf{e} = \frac{1}{T} \sum_{t=1}^{T} \mathbf{E}[t],$$

where $\mathbf{E}[t] \in \mathbb{R}^{D}$ corresponds to the feature vector at time step $t$. This operation compresses the $T \times D$ sequence into a single $1 \times D$ vector, facilitating compatibility with downstream classifiers. Mean pooling has been consistently employed in SER tasks involving PTM embeddings [52,54,80], offering a simple yet effective way to aggregate temporal information. Models like CLAP and TRILLsson are excluded from this step, as they already integrate internal mechanisms for temporal summarization or lack a sequential structure altogether.
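As an illustration of this extraction and pooling step, the following minimal Python sketch loads one of the Wav2Vec 2.0 checkpoints listed above through the Hugging Face transformers library and returns one mean-pooled vector per hidden layer. The tooling, the helper name layerwise_mean_pooled_embeddings, and the 16 kHz resampling are assumptions made for illustration, not the exact pipeline used in this work.

```python
# Minimal sketch (assumed tooling): extract hidden-layer embeddings from a PTM
# and mean pool them over time into one fixed-dimensional vector per layer.
import torch
import librosa
from transformers import AutoFeatureExtractor, AutoModel

MODEL_ID = "facebook/wav2vec2-large-xlsr-53-spanish"  # one of the PTMs listed above

feature_extractor = AutoFeatureExtractor.from_pretrained(MODEL_ID)
model = AutoModel.from_pretrained(MODEL_ID, output_hidden_states=True).eval()

def layerwise_mean_pooled_embeddings(wav_path: str) -> list:
    """Return one mean-pooled (D-dimensional) embedding per hidden layer."""
    waveform, sr = librosa.load(wav_path, sr=16_000)  # Wav2Vec 2.0 expects 16 kHz audio
    inputs = feature_extractor(waveform, sampling_rate=sr, return_tensors="pt")
    with torch.no_grad():
        outputs = model(**inputs)
    # outputs.hidden_states is a tuple of (1, T, D) tensors, one per layer
    return [h.squeeze(0).mean(dim=0) for h in outputs.hidden_states]
```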

3.5. Data Partitioning

The dataset is partitioned into training and test sets using 5-fold stratified cross-validation, ensuring that the distribution of emotion classes is preserved within each fold to address class imbalance issues [81]. Given a dataset $\mathcal{D} = \{(x_i, y_i)\}_{i=1}^{N}$, where $x_i$ is the feature vector and $y_i$ its label, the data is partitioned into $K$ folds $\{\mathcal{D}_1, \ldots, \mathcal{D}_K\}$ such that the class distribution within each fold matches the overall class distribution of $\mathcal{D}$. For each iteration $k \in \{1, \ldots, K\}$, the training and test sets are defined as:

$$\mathcal{D}_{\text{train}}^{(k)} = \bigcup_{i \neq k} \mathcal{D}_i, \qquad \mathcal{D}_{\text{test}}^{(k)} = \mathcal{D}_k.$$

Partitioning is achieved using Scikit-learn’s ‘StratifiedKFold’ method [82], which splits the data such that for each fold $k$, the class proportions in $\mathcal{D}_{\text{test}}^{(k)}$ and $\mathcal{D}_{\text{train}}^{(k)}$ are consistent with the original dataset.
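A minimal sketch of this partitioning, assuming the mean-pooled embeddings and emotion labels are already stored in NumPy arrays (the array names and shapes below are placeholders, not the actual data):

```python
import numpy as np
from sklearn.model_selection import StratifiedKFold

X = np.random.rand(200, 1024)      # placeholder mean-pooled embeddings (N x D)
y = np.random.randint(0, 6, 200)   # placeholder emotion labels

# 5-fold stratified split: each fold preserves the overall class proportions
skf = StratifiedKFold(n_splits=5, shuffle=True, random_state=42)
for fold, (train_idx, test_idx) in enumerate(skf.split(X, y)):
    X_train, X_test = X[train_idx], X[test_idx]
    y_train, y_test = y[train_idx], y[test_idx]
    # a classifier would be trained and evaluated on this fold here
```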

3.6. LOSO Validation

LOSO validation is employed to assess model generalization to unseen speakers. In this approach, data from a single speaker is reserved as the testing set while the model is trained on data from all remaining speakers. Let $\mathcal{D} = \{\mathcal{D}_1, \mathcal{D}_2, \ldots, \mathcal{D}_S\}$ represent the dataset, where $\mathcal{D}_i$ contains all samples from speaker $i$, and $S$ is the total number of speakers. For the $i$-th iteration, the training and testing sets are defined as:

$$\mathcal{D}_{\text{train}}^{(i)} = \bigcup_{j \neq i} \mathcal{D}_j, \qquad \mathcal{D}_{\text{test}}^{(i)} = \mathcal{D}_i.$$

Performance metrics are computed separately for each speaker and subsequently averaged to obtain an overall evaluation. To ensure reliable evaluation, speakers with fewer than 15 samples are excluded. The workflow of the LOSO evaluation is represented in Figure 2.
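A sketch of this protocol using Scikit-learn’s LeaveOneGroupOut with speaker IDs as group labels; the helper name loso_splits, its arguments, and the filtering step are illustrative assumptions, on the premise that a per-sample speaker array is available:

```python
import numpy as np
from sklearn.model_selection import LeaveOneGroupOut

def loso_splits(X, y, speakers, min_samples=15):
    """Yield (train, test) splits where each test set holds out one speaker."""
    X, y, speakers = np.asarray(X), np.asarray(y), np.asarray(speakers)
    # drop speakers with fewer than `min_samples` samples, as described above
    ids, counts = np.unique(speakers, return_counts=True)
    keep = np.isin(speakers, ids[counts >= min_samples])
    X, y, speakers = X[keep], y[keep], speakers[keep]
    for train_idx, test_idx in LeaveOneGroupOut().split(X, y, groups=speakers):
        yield (X[train_idx], y[train_idx]), (X[test_idx], y[test_idx])
```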

3.7. Classifier Training

The embeddings extracted from the PTMs serve as inputs for three classifiers: KNN, SVM, and MLP. These models are simple and well-established in the SER literature [21,23,25], allowing for interpretable results and isolating the contribution of embeddings without the added complexity of end-to-end systems. Additionally, although more complex models could be used, the focus of this study is not on maximizing computational power but rather on evaluating the quality of the embeddings. Therefore, using lightweight models reduces computational requirements and allows for a more focused analysis, while also ensuring reproducibility and mitigating overfitting risks.
Hyperparameter optimization is performed using a grid or randomized search with cross-validation. The classification algorithms used are described below, followed by a brief sketch of the training pipeline.
  • K-Nearest Neighbors.
    The KNN algorithm is a non-parametric and lazy learning method used for classification and regression. KNN classification is performed by identifying the K nearest neighbors to a query point using a distance metric, such as the Euclidean distance. The optimal value of K and the nearest neighbors are determined through a search process that can involve techniques like Ktree and K*tree, where neighbor calculations are optimized using tree-like structures. Advanced methods, such as the one-step KNN, reduce neighbor calculation to a single matrix operation that integrates both K adjustment and neighbor search, using least squares loss functions and sparse regularization (like group lasso) to generate a relationship matrix containing the optimal neighbors and their corresponding weights [83].
  • Support Vector Machine.
    SVMs are supervised learning algorithms used for classification and regression, designed to identify the optimal separating hyperplane in a high-dimensional space that maximizes the margin between points from different classes. For non-linear problems, SVM employs kernel functions to project data into higher-dimensional spaces where linear separation becomes feasible. The maximum margin is expressed as $\frac{1}{\|\mathbf{w}\|}$, where $\mathbf{w}$ is the weight vector. The optimization problem is formulated as minimizing $\|\mathbf{w}\|^2$ under correct classification constraints, and in cases where the margin cannot be strictly maintained, slack variables are introduced to allow some classification flexibility [84,85].
  • Multi-Layer Perceptron.
    MLPs are artificial neural networks consisting of interconnected layers of nodes, where each node in the intermediate and output layers applies a non-linear activation function (such as the sigmoid or hyperbolic tangent). MLP training is typically performed using backpropagation, which adjusts the weights via gradient descent to minimize the mean squared error (MSE) between predicted and expected outputs. The mathematical model of a two-layer MLP can be expressed as:

    $$y = f(\mathbf{W}_2 \cdot f(\mathbf{W}_1 \cdot \mathbf{x} + \mathbf{b}_1) + \mathbf{b}_2),$$

    where $\mathbf{W}_1$ and $\mathbf{W}_2$ are weight matrices, $\mathbf{b}_1$ and $\mathbf{b}_2$ are bias vectors, and $f$ is the non-linear activation function [86].
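As a minimal sketch of the training pipeline referenced above, the snippet below wraps each of the three classifiers in a cross-validated grid search over frozen PTM embeddings; the hyperparameter grids, feature scaling, and macro-F1 scoring are illustrative assumptions, not the exact search spaces used in this work.

```python
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.neighbors import KNeighborsClassifier
from sklearn.svm import SVC
from sklearn.neural_network import MLPClassifier

# illustrative search spaces for the three classifiers used in this study
classifiers = {
    "knn": (KNeighborsClassifier(), {"kneighborsclassifier__n_neighbors": [3, 5, 7, 11]}),
    "svm": (SVC(), {"svc__C": [0.1, 1, 10], "svc__kernel": ["rbf", "linear"]}),
    "mlp": (MLPClassifier(max_iter=500), {"mlpclassifier__hidden_layer_sizes": [(128,), (256, 128)]}),
}

def fit_best(X_train, y_train, name):
    """Grid-search one classifier on frozen PTM embeddings and return the best model."""
    estimator, grid = classifiers[name]
    pipe = make_pipeline(StandardScaler(), estimator)
    search = GridSearchCV(pipe, grid, scoring="f1_macro", cv=5)
    search.fit(X_train, y_train)
    return search.best_estimator_
```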

3.8. Calculation of Performance Metrics

Three metrics—accuracy, F1-score, and unweighted average recall (UAR)—are used to evaluate the performance of the models. These metrics follow the definitions provided in Scikit-learn [82] and are defined as follows:
$$\text{Accuracy} = \frac{\text{Number of Correct Predictions}}{\text{Total Number of Predictions}} = \frac{1}{N} \sum_{i=1}^{N} \mathbb{1}(y_i = \hat{y}_i),$$

where $N$ is the total number of samples, $y_i$ is the true label, and $\hat{y}_i$ is the predicted label. The function $\mathbb{1}(\cdot)$ is the indicator function, which returns 1 if the condition inside is true, and 0 otherwise.

$$\text{F1-score} = \frac{2 \cdot \text{Precision} \cdot \text{Recall}}{\text{Precision} + \text{Recall}}, \quad \text{where} \quad \text{Precision} = \frac{TP}{TP + FP}, \quad \text{Recall} = \frac{TP}{TP + FN}.$$

$$\text{UAR (Unweighted Average Recall)} = \frac{1}{C} \sum_{c=1}^{C} \frac{TP_c}{TP_c + FN_c},$$

where $C$ is the number of classes, $TP_c$ is the number of true positives for class $c$, and $FN_c$ is the number of false negatives for class $c$. UAR is equivalent to the macro-averaged recall in Scikit-learn.
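These definitions map directly onto Scikit-learn’s accuracy_score, f1_score, and recall_score, as in the short sketch below; the macro averaging for the F1-score is an assumption chosen here for consistency with UAR, not necessarily the exact configuration used in this work.

```python
from sklearn.metrics import accuracy_score, f1_score, recall_score

def ser_metrics(y_true, y_pred):
    """Accuracy, F1-score, and UAR (macro-averaged recall) for one evaluation."""
    return {
        "accuracy": accuracy_score(y_true, y_pred),
        "f1": f1_score(y_true, y_pred, average="macro"),      # averaging choice is an assumption
        "uar": recall_score(y_true, y_pred, average="macro"),  # UAR equals macro recall
    }
```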

4. Results

4.1. Layer-Wise Evaluation Results

Table 4 presents a summary of the performance results of the PTMs evaluated on each of the databases. This table shows the best score achieved (accuracy, F1-score, and UAR) by each classifier employed, where the subscript indicates the hidden layer corresponding to the obtained score.
Based on the highest values (highlighted in bold in Table 4), we observe that the INTER1SP database achieves the highest scores using the HuBERT-L-1160k PTM and the SVM classifier, with an accuracy of 99.83% (layer 6), an F1-score of 99.83% (layer 6), and a UAR of 99.88% (layer 6). The MESD database also performs strongly with the W2V2-XLSR-ES PTM and the MLP classifier, achieving 95.38% accuracy, 95.40% F1-score, and 95.38% UAR (all at layer 6). The EmoFilm ES database achieves 91.67% accuracy, 91.52% F1-score, and 91.83% UAR (layer 22) using the Whisper-L-v3-ES PTM and the MLP classifier. In contrast, the EmoWisconsin database, due to its complexity and limited emotional range, shows lower and less stable performance. Its best accuracy is 62.50% (layer 15) with W2V2-XLSR-ES and SVM, the highest F1-score is 58.92% (layer 31) with Whisper-L-v3-ES and MLP, and the highest UAR is 40.0% (layer 14) with W2V2-XLSR-ES and MLP.
Table 5 provides a concise summary of the best-performing configurations for each database. It highlights the specific PTM, the layer used, and the corresponding classifier that achieved the highest performance. The table serves as a clear reference, showing the respective accuracy, F1-score, and UAR values on the datasets.

4.1.1. Average Metric Values by Layer per Database and PTM

The results in Figure 3 and Figure 4 reveal distinct layer-wise performance trends among the evaluated models. Models like WavLM-L, W2V2-L-XLSR53-ES, and W2V2-L-R-Libri960h, each with 24 layers, achieve their highest F1-scores and UAR in the earlier layers (e.g., layers 3–6), demonstrating early-layer dominance. In contrast, Whisper-L-v3 and Whisper-L-v3-ES (32 layers) exhibit peak performance in intermediate layers, with optimal F1-scores observed in layers 13–15 and UAR in layers 12–14. Similarly, the 48-layer W2V2-XLSR-ES achieves its best F1-scores in layers 9–11, while UAR performance shifts slightly to layers 5, 10, and 11, indicating metric-specific nuances. Models like TRILLsson and L-CLAP-G deviate from this trend, providing fixed embeddings rather than layer-specific outputs, represented as constant values in the figures.
For all models, the F1-score decreases significantly in the final layers. This suggests that as the network depth increases, the representations become more specialized and might be less effective for emotion classification tasks. Also, the fact that certain specific layers optimize classification indicates a sweet spot where the representation balances both abstraction and discrimination of emotional features.

4.1.2. Database Performance Comparison

The visualizations in Figure 5 and Figure 6 illustrate the maximum F1-score and UAR achieved by each PTM across the six databases, complementing the detailed numerical results in Table 4. These results provide a clear overview of the PTM-specific strengths and their adaptability to different datasets.
The three databases with the highest average performance, both in F1-score and UAR, are INTER1SP, MESD, and MEACorpus, while EmoFilm ES and EM-SpDB show lower performance with greater variability between the results of different PTMs. On the other hand, EmoWisconsin is the database with the lowest performance, in addition to showing instability depending on the chosen feature extractor and greater variability between the F1-score and UAR results.

4.2. LOSO Validation Results

To assess the generalization capacity of the models, a LOSO validation was performed using the optimal layer determined by the F1-score analysis. The results, summarized in Table 6, provide accuracy, UAR, and F1-score metrics for each classifier across all databases and PTMs. The bold values in the table indicate the highest accuracy, F1-score, and UAR achieved for each database in the LOSO setup.

4.2.1. Performance Insights in LOSO Validation

The LOSO validation results, summarized in Figure 7 and Figure 8, reveal that the best F1-scores differ substantially across databases. For instance, while EM-SpDB reaches its highest F1 of 78.48% with TRILLsson, EmoFilm ES achieves 83.86% using W2V2-XLSR-ES. These two datasets, generally larger and more diverse, appear to foster better generalization to unseen speakers. In contrast, other databases show notably lower peaks; EmoWisconsin, for example, reaches a maximum F1 of just 50.75% (W2V2-XLSR-ES), and MESD tops out at 50.59% (W2V2-L-XLSR53-ES).
A similar picture emerges for UAR. EM-SpDB obtains its best UAR of 73.24% with W2V2-L-XLSR53-ES, and EmoFilm ES is close at 72.68% using W2V2-XLSR-ES. Meanwhile, EmoWisconsin shows a maximum UAR of only 34.71% (W2V2-XLSR-ES), highlighting its higher difficulty for speaker-independent modeling. In INTER1SP, the top F1 (67.44%) and UAR (65.99%) both stem from W2V2-L-XLSR53-ES, suggesting that certain PTMs can still adapt successfully to more constrained contexts.
These findings underscore the importance of dataset size and diversity in PTM generalization. EM-SpDB and EmoFilm ES , featuring a broader emotional range and greater speaker variety, achieve higher F1-score and UAR ceilings. Meanwhile, smaller or highly focused corpora like MESD (peak F1 below 51%) or INTER1SP (below 68%) pose greater challenges. Notably, TRILLsson competes strongly (for example, 78.48% F1 in EM-SpDB) despite lacking explicit Spanish fine-tuning, highlighting the power of robust paralinguistic representations. W2V2-L-XLSR53-ES and W2V2-XLSR-ES also stand out, reinforcing the benefits of language-specific training for Spanish SER tasks.

4.2.2. Average Performance Metrics

Figure 9 and Figure 10 depict the average F1-score and UAR in LOSO validation. Consistent with the maximum metrics, W2V2-L-XLSR53-ES stands out among the top across databases, while TRILLsson displays robust cross-linguistic adaptability. W2V2-XLSR-ES also demonstrates high average performance, further emphasizing the benefits of Spanish fine-tuning. By contrast, L-CLAP-G and Whisper-based models exhibit lower overall averages, although they can remain competitive in specific scenarios.
In general, the consistent performance of W2V2-L-XLSR53-ES suggests that Spanish-oriented training helps capture crucial emotional cues, while TRILLsson’s paralinguistic focus maintains remarkable versatility for Spanish datasets. L-CLAP-G and Whisper-based approaches, though potentially strong for ASR or multimodal tasks, appear less optimal for Spanish SER, especially under LOSO’s stringent speaker-independent conditions.

5. Discussion and Comparative Analysis

5.1. PTMs Against SOTA for SER in Spanish

The layer-wise results, summarized in Table 7 (the bold values in the table indicate the highest accuracy, F1-score, and UAR in the SOTA for each database), reveal substantial improvements over SOTA benchmarks across multiple datasets, driven by the use of Spanish fine-tuned PTMs. In INTER1SP, HuBERT-L-1160k surpasses the SOTA accuracy by 1.43%, and is joined by W2V2-L-XLSR53-ES, W2V2-XLSR-ES, WavLM-L, W2V2-L-R-Libri960h, Whisper-L-v3, and Whisper-L-v3-ES, all of which also exceed the SOTA benchmark. In EM-SpDB, W2V2-L-XLSR53-ES achieves a remarkable 19.98% accuracy gain over the SOTA, highlighting the benefits of fine-tuning for Spanish. Whisper-L-v3-ES delivers competitive results in EmoFilm ES, excelling in datasets with natural emotional expressions, while MESD is the only dataset where the proposed method falls slightly short, with a minimal gap of 0.62% in F1-score. These findings demonstrate that feature extraction techniques with PTMs, in particular those fine-tuned to the target language, consistently outperform traditional approaches on diverse emotional datasets.
The LOSO validation results, summarized in Table 8, highlight the strong generalization capabilities of TRILLsson and the consistent superiority of Spanish fine-tuned PTMs in challenging evaluation settings. TRILLsson achieves the best accuracy (79.66%) and F1-score (78.48%) in EM-SpDB, outperforming the SOTA benchmark of 68.30%, even from non-LOSO evaluations, a remarkable feat given the stringent LOSO conditions. Similarly, W2V2-L-XLSR53-ES surpasses the SOTA in EmoWisconsin, achieving 54.85% accuracy and demonstrating the effectiveness of fine-tuned embeddings in datasets with induced emotions and high variability. Moreover, TRILLsson also leads in MEACorpus with 75.85% accuracy and 71.64% F1-score, demonstrating its good adaptation in diverse datasets, while W2V2-L-XLSR53-ES and W2V2-XLSR-ES maintain strong performance across other databases, including INTER1SP and EmoFilm ES.

5.2. Performance Comparison Between PTMs

The 24-layer models (W2V2-L-XLSR53-ES, HuBERT-L-1160k, WavLM-L, W2V2-L-R-Libri960h) perform optimally at the early layers (4–6), effectively capturing crucial linguistic and prosodic features. This finding aligns with previous observations that lower transformer layers excel at extracting short-range acoustic cues, pitch, and formant structure [54,55], while deeper layers increasingly emphasize broader contextual abstractions [53,87]. Similarly, larger models (Whisper-L-v3, Whisper-L-v3-ES, W2V2-XLSR-ES) show their peak performance around mid-level layers (9–15), consistent with [51]. In none of these cases do final layers consistently outperform earlier or middle layers, indicating that the richest emotional representations often emerge before the last layers.
TRILLsson shows remarkable adaptability on speaker-independent tasks, such as LOSO validation, achieving SOTA results on datasets such as MEACorpus and EM-SpDB. This is in agreement with [52], who highlighted the effectiveness of TRILLsson in capturing critical paralinguistic features for SER. Furthermore, the improvements observed through fine-tuning, in particular for W2V2-L-XLSR53-ES, confirmed the importance of domain-specific adaptations, as emphasized by [55].

5.3. Impact of Database Nature on Emotion Recognition Performance

The nature of emotional databases—acted, natural, and induced—strongly affects within-database performance and generalization in speaker-independent scenarios. Acted databases, like INTER1SP, EM-SpDB, and MESD, often show high accuracy in layer-wise cross-validation due to expressive recordings. However, the LOSO protocol significantly lowers performance, as seen in MESD, where accuracy drops from 95.38% to 53.12%. This highlights the challenge of generalizing exaggerated emotions to unseen speakers.
Natural (MEACorpus) and induced (EmoWisconsin) databases show lower layer-wise scores but maintain more stable performance under LOSO. MEACorpus retains around 70–75% F1-score in LOSO despite reaching 92% in the layer-wise model, suggesting that genuine expressions generalize better. In contrast, EmoWisconsin struggles with LOSO, usually staying below a 60% F1-score, indicating that elicited emotions are more variable and speaker specific. These patterns confirm that acted data achieve high peak accuracy that drops in speaker-independent contexts, while natural or induced data have lower peaks but steadier generalization.
The results also indicate that the number of speakers alone is not the only variable impacting performance. Databases with few speakers, like INTER1SP (2) and MESD (3), perform well in the layer-wise model but decline in LOSO. EM-SpDB (50 speakers) shows consistent results (88.28% layer-wise, 79.66% LOSO), suggesting that diversity improves generalization. Intermediate datasets, like EmoWisconsin (27 speakers), vary more, with a 58.92% F1-score in the layer-wise model, dropping to 50.75% in LOSO. Thus, while speaker count matters, database nature and emotion variability are more critical for generalization.

5.4. Fine-Tuning and Practical Applications

Recent research indicates that fully fine-tuning large PTMs is not always essential to achieve competitive performance in classification tasks, even beyond speech-related domains [88]. In our experiments, the direct approach of freezing Spanish PTMs and using them solely as feature extractors already surpasses or matches existing SER benchmarks across multiple datasets. This observation aligns with the findings of [88], which indicate that freezing BERT can outperform fine-tuning it for classification, and that the best-performing models can be trained by freezing the main model, substantially reducing training time. Given the considerable resources required for large-scale fine-tuning, our results suggest that, particularly for Spanish SER tasks, the additional cost of exhaustive model adaptation is not strictly necessary to achieve state-of-the-art performance.
Furthermore, our study highlights the remarkable performance of TRILLsson in speaker-independent contexts (LOSO), consistent with previous conclusions about its robustness in capturing paralinguistic cues [52]. TRILLsson’s versatility has also been demonstrated in various clinical and diagnostic applications, such as cognitive impairment detection [89], clinical speech AI [90], and Parkinson’s disease detection [91]. This capability to extract nuanced emotional information is equally beneficial for real-world SER applications. For instance, emotion recognition can be integrated into call centers to automate customer support or implemented in healthcare systems to monitor the psychological state of patients over time and facilitate the early detection of mood disorders.

6. Conclusions

The findings of this study confirm that frozen PTMs used solely as feature extractors can achieve high performance in Spanish SER, matching and even surpassing SOTA work, without the need for extensive task-specific fine-tuning. In particular, 24-layer architectures tend to capture crucial acoustic and linguistic features in earlier layers (4–6), while deeper architectures peak at intermediate layers (9–15). Moreover, our approach outperforms previously reported techniques in Spanish SER, attaining F1-scores of 99.83% on INTER1SP, 88.32% on EM-SpDB, 92.53% on MEACorpus, and 58.92% on EmoWisconsin.
Spanish-specific fine-tuned models such as W2V2-L-XLSR53-ES (24 layers) and W2V2-XLSR-ES (48 layers) yielded the best overall scores, indicating the suitability of Wav2Vec 2.0-based architectures for SER regardless of model depth. Furthermore, TRILLsson’s notable generalization performance highlights its strong applicability for speaker-independent and real-world scenarios.
Equally important, our results underscore that performance heavily depends on dataset characteristics, including the number of speakers, emotional variability, and whether emotions are natural, induced, or acted, all of which substantially impact the generalization of SER systems in practical environments. Finally, through a multi-dataset, layer-wise, and LOSO-based evaluation, we demonstrated how specific layers provide the richest emotional representations based on architecture size, opening new avenues for efficient, scalable SER solutions in Spanish.
Despite the strong performance reported, several limitations remain. First, we did not systematically evaluate computational cost or energy consumption, both of which are critical for real-time or on-device SER. Second, our analysis focused on Ekman’s discrete emotional categories, potentially oversimplifying the spectrum of human affects. Third, we relied on the mean-pooling of embeddings, which might overlook salient temporal cues. Moreover, we did not apply formal significance tests (e.g., ANOVA or t-tests) to determine whether performance differences between PTMs are statistically meaningful.
Based on the above observations, we outline several key recommendations. First, carefully evaluate the trade-off between freezing and fully fine-tuning large PTMs, as well as the option of adopting more compact architectures (e.g., TRILLsson), which can ease computational overhead without critically sacrificing accuracy. Second, integrate statistical significance tests into system evaluations, thus enabling data-driven decisions about model configurations and ensuring that observed performance differences are both numerically and statistically meaningful. Third, go beyond discrete emotion taxonomies (e.g., Ekman’s) by considering more nuanced emotional constructs for a richer perspective on affective states. Finally, carefully plan or select SER databases by balancing the number of speakers, emotional variability, and the nature of the data (acted, induced, or natural), as these aspects could substantially influence performance in real-world scenarios.
For future work, we identify three promising directions. One is to extend these insights for truly multilingual SER, integrating or pre-training in languages beyond Spanish and refining cross-lingual generalization. Another direction is to adapt and validate SER systems in real-world scenarios—such as call centers or clinical settings—for practical, scalable solutions. Finally, the development of multi-output models capable of detecting multiple emotional states in the same audio file is needed to address the simultaneous nature of human emotions. Such advancements will propel SER research and broaden its impact across diverse application domains.

Author Contributions

Investigation: A.M. and G.D.-A.; supervision: A.D.-C., H.V.-L. and L.H.-M.; methodology: A.M.; validation: A.M., A.D.-C. and L.H.-M.; visualization: J.H.-C., A.F.J.-A. and J.P.-J.-F.; resources: G.D.-A., H.V.-L., A.F.J.-A. and J.P.-J.-F.; writing—original draft preparation: A.M.; writing—review and editing: A.M., G.D.-A., H.V.-L., J.H.-C. and L.H.-M. All authors have read and agreed to the published version of the manuscript.

Funding

Alex Mares and Gerardo Diaz-Arango gratefully acknowledge the financial support provided by the Secretariat of Science, Humanities, Technology, and Innovation (SECIHTI) through academic scholarships under contracts 1282871 and 480421, respectively.

Institutional Review Board Statement

Not applicable.

Informed Consent Statement

Not applicable.

Data Availability Statement

Restrictions apply to the availability of these data. Data were obtained from multiple sources, including the EmoWisconsin dataset [66], EmoMatchSpanishDB [64], The Mexican Emotional Speech Database [72], Spanish MEACorpus [68], Interface Databases [73], and the Spanish subset of EmoFilm [74]. Access to these datasets is restricted and may require permission from the respective authors or institutions.

Acknowledgments

We would like to thank Humberto Pérez-Espinosa, Carlos Alberto Reyes-García, and Luis Villaseñor-Pineda, authors of “EmoWisconsin: An Emotional Children’s Speech Database in Mexican Spanish” (Affective Computing and Intelligent Interaction, pp. 62–71, Springer Berlin Heidelberg, 2011, ISSN 0302-9743). Additionally, we extend our gratitude to the authors and institutions responsible for the databases that made this research possible: EmoMatchSpanishDB [64], The Mexican Emotional Speech Database (MESD) [72], Spanish MEACorpus [68], Interface Databases [73], and the Spanish subset of EmoFilm [74].

Conflicts of Interest

The authors declare no conflicts of interest.

Abbreviations

The following abbreviations are used in this manuscript:
CNNs  Convolutional Neural Networks
DS-AM  DeepSpectrum model with Attention Mechanism
EM-SpDB  EmoMatchSpanishDB
GMMs  Gaussian Mixture Models
GRUs  Gated Recurrent Units
HMMs  Hidden Markov Models
KNN  K-Nearest Neighbor
LOSO  Leave-One-Speaker-Out
LSTM  Long Short-Term Memory
MESD  Mexican Emotional Speech Database
MFCCs  Mel-Frequency Cepstral Coefficients
MLP  Multi-Layer Perceptron
P-TAPT  Pseudo-label Task Adaptive Pretraining
PTMs  Pre-Trained Models
RNNs  Recurrent Neural Networks
SER  Speech Emotion Recognition
SSL  Self-Supervised Learning
SOTA  State-Of-The-Art
SVM  Support Vector Machine
UAR  Unweighted Average Recall
WCST  Wisconsin Card Sorting Test

References

  1. Geethu, V.; Vrindha, M.K.; Anurenjan, P.R.; Deepak, S.; Sreeni, K.G. Speech Emotion Recognition, Datasets, Features and Models: A Review. In Proceedings of the 2023 International Conference on Control, Communication and Computing (ICCC), Thiruvananthapuram, India, 19–21 May 2023; pp. 1–6. [Google Scholar]
  2. Lee, C.M.; Narayanan, S. Toward detecting emotions in spoken dialogs. IEEE Trans. Speech Audio Process. 2005, 13, 293–303. [Google Scholar] [CrossRef]
  3. Shah Fahad, M.; Ranjan, A.; Yadav, J.; Deepak, A. A survey of speech emotion recognition in natural environment. Digit. Signal Process. 2021, 110, 102951. [Google Scholar] [CrossRef]
  4. Jahangir, R.; Teh, Y.W.; Hanif, F.; Mujtaba, G. Deep learning approaches for speech emotion recognition: State of the art and research challenges. Multimed. Tools Appl. 2021, 80, 23745–23812. [Google Scholar] [CrossRef]
  5. Khare, S.K.; Blanes-Vidal, V.; Nadimi, E.S.; Acharya, U.R. Emotion recognition and artificial intelligence: A systematic review (2014–2023) and research recommendations. Inf. Fusion 2024, 102, 102019. [Google Scholar] [CrossRef]
  6. Luo, B.; Lau, R.Y.K.; Li, C.; Si, Y.W. A critical review of state-of-the-art chatbot designs and applications. WIREs Data Min. Knowl. Discov. 2022, 12, e1434. [Google Scholar] [CrossRef]
  7. Chen, L.; Su, W.; Feng, Y.; Wu, M.; She, J.; Hirota, K. Two-layer fuzzy multiple random forest for speech emotion recognition in human-robot interaction. Inf. Sci. 2020, 509, 150–163. [Google Scholar] [CrossRef]
  8. Liu, Z.; Hu, B.; Li, X.; Liu, F.; Wang, G.; Yang, J. Detecting Depression in Speech Under Different Speaking Styles and Emotional Valences. In Proceedings of the Brain Informatics, Beijing, China, 16–18 November 2017. [Google Scholar]
  9. France, D.; Shiavi, R.; Silverman, S.; Silverman, M.; Wilkes, M. Acoustical properties of speech as indicators of depression and suicidal risk. IEEE Trans. Biomed. Eng. 2000, 47, 829–837. [Google Scholar] [CrossRef]
  10. Li, H.C.; Pan, T.; Lee, M.H.; Chiu, H.W. Make Patient Consultation Warmer: A Clinical Application for Speech Emotion Recognition. Appl. Sci. 2021, 11, 4782. [Google Scholar] [CrossRef]
  11. Schuller, B.W. Speech emotion recognition: Two decades in a nutshell, benchmarks, and ongoing trends. Commun. ACM 2018, 61, 90–99. [Google Scholar] [CrossRef]
  12. Hansen, J.H.; Cairns, D.A. ICARUS: Source generator based real-time recognition of speech in noisy stressful and Lombard effect environments. Speech Commun. 1995, 16, 391–422. [Google Scholar] [CrossRef]
  13. Cai, Y.; Li, X.; Li, J. Emotion Recognition Using Different Sensors, Emotion Models, Methods and Datasets: A Comprehensive Review. Sensors 2023, 23, 2455. [Google Scholar] [CrossRef] [PubMed]
  14. Yu, J.; Zhang, C.; Song, Y.; Cai, W. ICE-GAN: Identity-aware and Capsule-Enhanced GAN with Graph-based Reasoning for Micro-Expression Recognition and Synthesis. arXiv 2021, arXiv:2005.04370. [Google Scholar]
  15. Wang, Y.; Song, W.; Tao, W.; Liotta, A.; Yang, D.; Li, X.; Gao, S.; Sun, Y.; Ge, W.; Zhang, W.; et al. A systematic review on affective computing: Emotion models, databases, and recent advances. Inf. Fusion 2022, 83–84, 19–52. [Google Scholar] [CrossRef]
  16. Ekman, P.; Sorenson, E.R.; Friesen, W.V. Pan-Cultural Elements in Facial Displays of Emotion. Science 1969, 164, 86–88. [Google Scholar] [CrossRef]
  17. Darwin, C. The Expression Of The Emotions In Man And Animals; Oxford University Press: Oxford, UK, 1998. [Google Scholar] [CrossRef]
  18. Bălan, O.; Moise, G.; Petrescu, L.; Moldoveanu, A.; Leordeanu, M.; Moldoveanu, F. Emotion Classification Based on Biophysical Signals and Machine Learning Techniques. Symmetry 2020, 12, 21. [Google Scholar] [CrossRef]
  19. Zhou, E.; Zhang, Y.; Duan, Z. Learning Arousal-Valence Representation from Categorical Emotion Labels of Speech. In Proceedings of the ICASSP 2024—2024 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), Seoul, Republic of Korea, 14–19 April 2024; pp. 12126–12130. [Google Scholar]
  20. Wani, T.M.; Gunawan, T.S.; Qadri, S.A.A.; Kartiwi, M.; Ambikairajah, E. A Comprehensive Review of Speech Emotion Recognition Systems. IEEE Access 2021, 9, 47795–47814. [Google Scholar] [CrossRef]
  21. Doğdu, C.; Kessler, T.; Schneider, D.; Shadaydeh, M.; Schweinberger, S.R. A Comparison of Machine Learning Algorithms and Feature Sets for Automatic Vocal Emotion Recognition in Speech. Sensors 2022, 22, 7561. [Google Scholar] [CrossRef]
  22. Stuhlsatz, A.; Meyer, C.; Eyben, F.; Zielke, T.; Meier, H.G.; Schuller, B. Deep neural networks for acoustic emotion recognition: Raising the benchmarks. In Proceedings of the 2011 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), Prague, Czech Republic, 22–27 May 2011; pp. 5688–5691. [Google Scholar]
  23. Ke, X.; Zhu, Y.; Wen, L.; Zhang, W. Speech Emotion Recognition Based on SVM and ANN. Int. J. Mach. Learn. Comput. 2018, 8, 198–202. [Google Scholar] [CrossRef]
  24. Bhavan, A.; Chauhan, P.; Hitkul; Shah, R.R. Bagged support vector machines for emotion recognition from speech. Knowl.-Based Syst. 2019, 184, 104886. [Google Scholar] [CrossRef]
  25. Jain, M.; Narayan, S.; Balaji, P.; P, B.K.; Bhowmick, A.; R, K.; Muthu, R.K. Speech Emotion Recognition using Support Vector Machine. arXiv 2020, arXiv:2002.07590. [Google Scholar]
  26. Nwe, T.L.; Foo, S.W.; De Silva, L.C. Speech emotion recognition using hidden Markov models. Speech Commun. 2003, 41, 603–623. [Google Scholar] [CrossRef]
  27. Nwe, T.L.; Foo, S.W.; Silva, L.C.D. Detection of stress and emotion in speech using traditional and FFT based log energy features. In Proceedings of the Fourth International Conference on Information, Communications and Signal Processing, 2003 and the Fourth Pacific Rim Conference on Multimedia. Proceedings of the 2003 Joint, Singapore, 15–18 December 2003; Volume 3, pp. 1619–1623. [Google Scholar]
  28. Mao, S.; Tao, D.; Zhang, G.; Ching, P.C.; Lee, T. Revisiting Hidden Markov Models for Speech Emotion Recognition. In Proceedings of the ICASSP 2019—2019 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), Brighton, UK, 12–17 May 2019; pp. 6715–6719. [Google Scholar] [CrossRef]
  29. Neiberg, D.; Elenius, K.; Laskowski, K. Emotion recognition in spontaneous speech using GMMs. In Proceedings of the Interspeech, Brighton, UK, 6–10 September 2009. [Google Scholar]
  30. Koolagudi, S.G.; Devliyal, S.; Chawla, B.; Barthwal, A.; Rao, K.S. Recognition of Emotions from Speech using Excitation Source Features. Procedia Eng. 2012, 38, 3409–3417. [Google Scholar] [CrossRef]
  31. Bhakre, S.K.; Bang, A. Emotion recognition on the basis of audio signal using Naive Bayes classifier. In Proceedings of the 2016 International Conference on Advances in Computing, Communications and Informatics (ICACCI), Jaipur, India, 21–24 September 2016; pp. 2363–2367. [Google Scholar] [CrossRef]
  32. Khan, A.; Roy, U.K. Emotion recognition using prosodie and spectral features of speech and Naïve Bayes Classifier. In Proceedings of the 2017 International Conference on Wireless Communications, Signal Processing and Networking (WiSPNET), Chennai, India, 22–24 March 2017; pp. 1017–1021. [Google Scholar] [CrossRef]
  33. Younis, E.M.G.; Mohsen, S.; Houssein, E.H.; Ibrahim, O.A.S. Machine learning for human emotion recognition: A comprehensive review. Neural Comput. Appl. 2024, 36, 8901–8947. [Google Scholar] [CrossRef]
  34. Badshah, A.M.; Ahmad, J.; Rahim, N.; Baik, S.W. Speech Emotion Recognition from Spectrograms with Deep Convolutional Neural Network. In Proceedings of the 2017 International Conference on Platform Technology and Service (PlatCon), Busan, Republic of Korea, 13–15 February 2017; pp. 1–5. [Google Scholar]
  35. Mustaqeem; Kwon, S. A CNN-Assisted Enhanced Audio Signal Processing for Speech Emotion Recognition. Sensors 2020, 20, 183. [Google Scholar]
  36. Latif, S.; Rana, R.; Khalifa, S.; Jurdak, R.; Epps, J. Direct Modelling of Speech Emotion from Raw Speech. arXiv 2020, arXiv:1904.03833. [Google Scholar]
  37. Xie, Y.; Liang, R.; Liang, Z.; Huang, C.; Zou, C.; Schuller, B. Speech Emotion Classification Using Attention-Based LSTM. IEEE/ACM Trans. Audio Speech Lang. Process. 2019, 27, 1675–1685. [Google Scholar] [CrossRef]
  38. Mustaqeem; Sajjad, M.; Kwon, S. Clustering-Based Speech Emotion Recognition by Incorporating Learned Features and Deep BiLSTM. IEEE Access 2020, 8, 79861–79875. [Google Scholar] [CrossRef]
  39. Lee, J.; Tashev, I.J. High-level feature representation using recurrent neural network for speech emotion recognition. In Proceedings of the Interspeech, Dresden, Germany, 6–10 September 2015. [Google Scholar]
  40. Mao, Q.; Dong, M.; Huang, Z.; Zhan, Y. Learning Salient Features for Speech Emotion Recognition Using Convolutional Neural Networks. IEEE Trans. Multimed. 2014, 16, 2203–2213. [Google Scholar] [CrossRef]
  41. Lim, W.; Jang, D.; Lee, T. Speech emotion recognition using convolutional and Recurrent Neural Networks. In Proceedings of the 2016 Asia-Pacific Signal and Information Processing Association Annual Summit and Conference (APSIPA), Jeju, Republic of Korea, 13–15 December 2016; pp. 1–4. [Google Scholar] [CrossRef]
  42. Zhao, Z.; Zheng, Y.; Zhang, Z.; Wang, H.; Zhao, Y.; Li, C. Exploring Spatio-Temporal Representations by Integrating Attention-based Bidirectional-LSTM-RNNs and FCNs for Speech Emotion Recognition. In Proceedings of the Interspeech, Hyderabad, India, 2–6 September 2018. [Google Scholar]
  43. Qin, C.; Schlemper, J.; Caballero, J.; Price, A.; Hajnal, J.V.; Rueckert, D. Convolutional Recurrent Neural Networks for Dynamic MR Image Reconstruction. arXiv 2018, arXiv:1712.01751. [Google Scholar] [CrossRef]
  44. Baevski, A.; Zhou, H.; Mohamed, A.; Auli, M. wav2vec 2.0: A Framework for Self-Supervised Learning of Speech Representations. arXiv 2020, arXiv:2006.11477. [Google Scholar]
  45. Hsu, W.N.; Bolte, B.; Tsai, Y.H.H.; Lakhotia, K.; Salakhutdinov, R.; Mohamed, A. HuBERT: Self-Supervised Speech Representation Learning by Masked Prediction of Hidden Units. arXiv 2021, arXiv:2106.07447. [Google Scholar] [CrossRef]
  46. Chen, S.; Wang, C.; Chen, Z.; Wu, Y.; Liu, S.; Chen, Z.; Li, J.; Kanda, N.; Yoshioka, T.; Xiao, X.; et al. WavLM: Large-Scale Self-Supervised Pre-Training for Full Stack Speech Processing. IEEE J. Sel. Top. Signal Process. 2022, 16, 1505–1518. [Google Scholar] [CrossRef]
  47. Radford, A.; Kim, J.W.; Xu, T.; Brockman, G.; McLeavey, C.; Sutskever, I. Robust Speech Recognition via Large-Scale Weak Supervision. arXiv 2022, arXiv:2212.04356. [Google Scholar]
  48. Amiriparian, S.; Packań, F.; Gerczuk, M.; Schuller, B.W. ExHuBERT: Enhancing HuBERT Through Block Extension and Fine-Tuning on 37 Emotion Datasets. arXiv 2024, arXiv:2406.10275. [Google Scholar]
  49. Schneider, S.; Baevski, A.; Collobert, R.; Auli, M. wav2vec: Unsupervised Pre-training for Speech Recognition. arXiv 2019, arXiv:1904.05862. [Google Scholar]
  50. Wagner, J.; Triantafyllopoulos, A.; Wierstorf, H.; Schmitt, M.; Burkhardt, F.; Eyben, F.; Schuller, B.W. Dawn of the Transformer Era in Speech Emotion Recognition: Closing the Valence Gap. IEEE Trans. Pattern Anal. Mach. Intell. 2023, 45, 10745–10759. [Google Scholar] [CrossRef]
  51. Osman, M.; Kaplan, D.Z.; Nadeem, T. SER Evals: In-domain and Out-of-domain Benchmarking for Speech Emotion Recognition. arXiv 2024, arXiv:2408.07851. [Google Scholar]
  52. Phukan, O.C.; Kashyap, G.S.; Buduru, A.B.; Sharma, R. Are Paralinguistic Representations all that is needed for Speech Emotion Recognition? In Proceedings of the Interspeech, Kos, Greece, 1–5 September 2024; pp. 4698–4702. [Google Scholar] [CrossRef]
  53. Triantafyllopoulos, A.; Batliner, A.; Rampp, S.; Milling, M.; Schuller, B. INTERSPEECH 2009 Emotion Challenge Revisited: Benchmarking 15 Years of Progress in Speech Emotion Recognition. In Proceedings of the Interspeech, Kos, Greece, 1–5 September 2024; pp. 1585–1589. [Google Scholar] [CrossRef]
  54. Pepino, L.; Riera, P.; Ferrer, L. Emotion Recognition from Speech Using Wav2vec 2.0 Embeddings. arXiv 2021, arXiv:2104.03502. [Google Scholar]
  55. Chen, L.W.; Rudnicky, A. Exploring Wav2vec 2.0 Fine Tuning for Improved Speech Emotion Recognition. In Proceedings of the ICASSP 2023—2023 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), Rhodes Island, Greece, 4–10 June 2023; pp. 1–5. [Google Scholar] [CrossRef]
  56. Gao, Y.; Chu, C.; Kawahara, T. Two-stage Finetuning of Wav2vec 2.0 for Speech Emotion Recognition with ASR and Gender Pretraining. In Proceedings of the Interspeech, Dublin, Ireland, 20–24 August 2023; pp. 3637–3641. [Google Scholar] [CrossRef]
  57. Jaiswal, M.; Provost, E.M. Best Practices for Noise-Based Augmentation to Improve the Performance of Deployable Speech-Based Emotion Recognition Systems. arXiv 2023, arXiv:2104.08806. [Google Scholar]
  58. Schrüfer, O.; Milling, M.; Burkhardt, F.; Eyben, F.; Schuller, B. Are you sure? Analysing Uncertainty Quantification Approaches for Real-world Speech Emotion Recognition. arXiv 2024, arXiv:2407.01143. [Google Scholar]
  59. Ülgen Sönmez, Y.; Varol, A. In-depth investigation of speech emotion recognition studies from past to present—The importance of emotion recognition from speech signal for AI–. Intell. Syst. Appl. 2024, 22, 200351. [Google Scholar] [CrossRef]
  60. Sahu, G. Multimodal Speech Emotion Recognition and Ambiguity Resolution. arXiv 2019, arXiv:1904.06022. [Google Scholar]
  61. Kamble, K.; Sengupta, J. A comprehensive survey on emotion recognition based on electroencephalograph (EEG) signals. Multimed. Tools Appl. 2023, 82, 27269–27304. [Google Scholar] [CrossRef]
  62. Kerkeni, L.; Serrestou, Y.; Mbarki, M.; Raoof, K.; Mahjoub, M.A. Speech Emotion Recognition: Methods and Cases Study. In Proceedings of the 10th International Conference on Agents and Artificial Intelligence, ICAART 2018, Madeira, Portugal, 16–18 January 2018; Volume 1, pp. 175–182. [Google Scholar] [CrossRef]
  63. Ortega-Beltrán, E.; Cabacas-Maso, J.; Benito-Altamirano, I.; Ventura, C. Better Spanish Emotion Recognition In-the-wild: Bringing Attention to Deep Spectrum Voice Analysis. arXiv 2024, arXiv:2409.05148. [Google Scholar]
  64. Garcia-Cuesta, E.; Salvador, A.B.; Pãez, D.G. EmoMatchSpanishDB: Study of speech emotion recognition machine learning models in a new Spanish elicited database. Multimed. Tools Appl. 2023, 83, 13093–13112. [Google Scholar] [CrossRef]
  65. Begazo, R.; Aguilera, A.; Dongo, I.; Cardinale, Y. A Combined CNN Architecture for Speech Emotion Recognition. Sensors 2024, 24, 5797. [Google Scholar] [CrossRef]
  66. Pérez-Espinosa, H.; Reyes-García, C.A.; Villaseñor-Pineda, L. EmoWisconsin: An Emotional Children Speech Database in Mexican Spanish. In Affective Computing and Intelligent Interaction, Proceedings of the Fourth International Conference, ACII 2011, Memphis, TN, USA, 9–12 October 2011; D’Mello, S., Graesser, A., Schuller, B., Martin, J.C., Eds.; Springer: Berlin/Heidelberg, Germany, 2011; pp. 62–71. [Google Scholar]
  67. Casals-Salvador, M.; Costa, F.; India, M.; Hernando, J. BSC-UPC at EmoSPeech-IberLEF2024: Attention Pooling for Emotion Recognition. arXiv 2024, arXiv:2407.12467. [Google Scholar]
  68. Pan, R.; García-Díaz, J.A.; Ángel Rodríguez-García, M.; Valencia-García, R. Spanish MEACorpus 2023: A multimodal speech–text corpus for emotion analysis in Spanish from natural environments. Comput. Stand. Interfaces 2024, 90, 103856. [Google Scholar] [CrossRef]
  69. Paredes-Valverde, M.A.; del Pilar Salas-Zárate, M. Team ITST at EmoSPeech-IberLEF2024: Multimodal Speech-text Emotion Recognition in Spanish Forum. In Proceedings of the Iberian Languages Evaluation Forum (IberLEF 2024) at SEPLN 2024, Salamanca, Spain, 24 September 2024. [Google Scholar]
  70. Esteban-Romero, S.; Bellver-Soler, J.; Martín-Fernández, I.; Gil-Martín, M.; D’Haro, L.F.; Fernández-Martínez, F. THAU-UPM at EmoSPeech-IberLEF2024: Efficient Adaptation of Mono-modal and Multi-modal Large Language Models for Automatic Speech Emotion Recognition. In Proceedings of the Iberian Languages Evaluation Forum (IberLEF 2024) at SEPLN 2024, Salamanca, Spain, 24 September 2024. [Google Scholar]
  71. Cedeño-Moreno, D.; Vargas-Lombardo, M.; Delgado-Herrera, A.; Caparrós-Láiz, C.; Bernal-Beltrán, T. UTP at EmoSPeech-IberLEF2024: Using Random Forest with FastText and Wav2Vec 2.0 for Emotion Detection. In Proceedings of the Iberian Languages Evaluation Forum (IberLEF 2024) at SEPLN 2024, Salamanca, Spain, 24 September 2024. [Google Scholar]
  72. Duville, M.M.; Alonso-Valerdi, L.M.; Ibarra-Zárate, D.I. The Mexican Emotional Speech Database (MESD): Elaboration and assessment based on machine learning. In Proceedings of the 2021 43rd Annual International Conference of the IEEE Engineering in Medicine & Biology Society (EMBC), Mexico City, Mexico, 1–5 November 2021; pp. 1644–1647. [Google Scholar]
  73. Hozjan, V.; Kacic, Z.; Moreno, A.; Bonafonte, A.; Nogueiras, A. Interface Databases: Design and Collection of a Multilingual Emotional Speech Database. In Proceedings of the International Conference on Language Resources and Evaluation, Las Palmas, Spain, 29–31 May 2002. [Google Scholar]
  74. Parada-Cabaleiro, E.; Costantini, G.; Batliner, A.; Baird, A.; Schuller, B. Categorical vs Dimensional Perception of Italian Emotional Speech. In Proceedings of the Interspeech, Hyderabad, India, 2–6 September 2018. [Google Scholar]
  75. Shor, J.; Venugopalan, S. TRILLsson: Distilled Universal Paralinguistic Speech Representations. arXiv 2022, arXiv:2203.00236. [Google Scholar]
  76. Wu, Y.; Chen, K.; Zhang, T.; Hui, Y.; Nezhurina, M.; Berg-Kirkpatrick, T.; Dubnov, S. Large-scale Contrastive Language-Audio Pretraining with Feature Fusion and Keyword-to-Caption Augmentation. arXiv 2024, arXiv:2211.06687. [Google Scholar]
  77. Atmaja, B.T.; Sasou, A. Evaluating Self-Supervised Speech Representations for Speech Emotion Recognition. IEEE Access 2022, 10, 124396–124407. [Google Scholar] [CrossRef]
  78. Macary, M.; Tahon, M.; Estève, Y.; Rousseau, A. On the use of Self-supervised Pre-trained Acoustic and Linguistic Features for Continuous Speech Emotion Recognition. arXiv 2020, arXiv:2011.09212. [Google Scholar]
  79. Hozjan, V.; Kacic, Z.; Moreno, A.; Bonafonte, A.; Nogueiras, A. Emotional Speech Synthesis Database, ELRA catalogue. Available online: https://catalog.elra.info/en-us/repository/browse/ELRA-S0329/ (accessed on 12 December 2024).
  80. Chetia Phukan, O.; Balaji Buduru, A.; Sharma, R. Transforming the Embeddings: A Lightweight Technique for Speech Emotion Recognition Tasks. In Proceedings of the Interspeech, Dublin, Ireland, 20–24 August 2023; pp. 1903–1907. [Google Scholar] [CrossRef]
  81. Szeghalmy, S.; Fazekas, A. A Comparative Study of the Use of Stratified Cross-Validation and Distribution-Balanced Stratified Cross-Validation in Imbalanced Learning. Sensors 2023, 23, 2333. [Google Scholar] [CrossRef]
  82. Pedregosa, F.; Varoquaux, G.; Gramfort, A.; Michel, V.; Thirion, B.; Grisel, O.; Blondel, M.; Prettenhofer, P.; Weiss, R.; Dubourg, V.; et al. Scikit-learn: Machine Learning in Python. J. Mach. Learn. Res. 2011, 12, 2825–2830. [Google Scholar]
  83. Zhang, S.; Li, J. KNN Classification With One-Step Computation. IEEE Trans. Knowl. Data Eng. 2023, 35, 2711–2723. [Google Scholar] [CrossRef]
  84. Somvanshi, M.; Chavan, P.; Tambade, S.; Shinde, S.V. A review of machine learning techniques using decision tree and support vector machine. In Proceedings of the 2016 International Conference on Computing Communication Control and automation (ICCUBEA), Pune, India, 12–13 August 2016; pp. 1–7. [Google Scholar] [CrossRef]
  85. Gholami, R.; Fakhari, N. Chapter 27—Support Vector Machine: Principles, Parameters, and Applications. In Handbook of Neural Computation; Samui, P., Sekhar, S., Balas, V.E., Eds.; Academic Press: Cambridge, MA, USA, 2017; pp. 515–535. [Google Scholar] [CrossRef]
  86. Du, K.L.; Leung, C.S.; Mow, W.H.; Swamy, M.N.S. Perceptron: Learning, Generalization, Model Selection, Fault Tolerance, and Role in the Deep Learning Era. Mathematics 2022, 10, 4730. [Google Scholar] [CrossRef]
  87. Pasad, A.; Chou, J.C.; Livescu, K. Layer-Wise Analysis of a Self-Supervised Speech Representation Model. In Proceedings of the 2021 IEEE Automatic Speech Recognition and Understanding Workshop (ASRU), Cartagena, Colombia, 13–17 December 2021; pp. 914–921. [Google Scholar] [CrossRef]
  88. Galal, O.; Abdel-Gawad, A.H.; Farouk, M. Rethinking of BERT sentence embedding for text classification. Neural Comput. Appl. 2024, 36, 20245–20258. [Google Scholar] [CrossRef]
  89. Botelho, C.; Gimeno-Gómez, D.; Teixeira, F.; Mendonça, J.; Pereira, P.; Nunes, D.A.P.; Rolland, T.; Pompili, A.; Solera-Ureña, R.; Ponte, M.; et al. Tackling Cognitive Impairment Detection from Speech: A submission to the PROCESS Challenge. arXiv 2024, arXiv:2501.00145. [Google Scholar]
  90. Ng, S.I.; Xu, L.; Siegert, I.; Cummins, N.; Benway, N.R.; Liss, J.; Berisha, V. A Tutorial on Clinical Speech AI Development: From Data Collection to Model Validation. arXiv 2024, arXiv:2410.21640. [Google Scholar]
  91. Gonçalves, T.; Reis, J.; Gonçalves, G.; Calejo, M.; Seco, M. Predictive Models in the Diagnosis of Parkinson’s Disease Through Voice Analysis. In Proceedings of the Intelligent Systems and Applications, 2024; Arai, K., Ed.; Springer: Cham, Switzerland, 2024; pp. 591–610. [Google Scholar]
Figure 1. Layer-wise evaluation flow.
Figure 2. LOSO evaluation flow.
Figure 3. Average F1-score per layer.
Figure 4. Average UAR per layer.
Figure 5. Maximum F1-score achieved by each PTM for every database in layer-wise evaluation.
Figure 6. Maximum UAR achieved by each PTM for every database in layer-wise evaluation.
Figure 7. Max F1-score by databases and PTMs in LOSO validation.
Figure 8. Max UAR by databases and PTMs in LOSO validation.
Figure 9. Average F1 by PTMs in LOSO validation.
Figure 10. Average UAR by PTMs in LOSO validation.
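The two flow diagrams (Figures 1 and 2) describe a feature-extraction-plus-classification loop: embeddings are pulled from every hidden layer of a PTM, a classifier is trained on each layer, and the same features are additionally evaluated with Leave-One-Speaker-Out folds. The sketch below is a minimal illustration of such a pipeline using Hugging Face Transformers and scikit-learn; the checkpoint name, mean pooling over time, the SVM-only classifier, and the 5-fold stratified evaluation are assumptions made for the example and are not taken from the paper.

```python
# Minimal sketch of the layer-wise (Figure 1) and LOSO (Figure 2) evaluation flows.
# Assumptions not taken from the paper: placeholder checkpoint, mean pooling over time,
# SVM-only classification, and 5-fold stratified CV for the layer-wise part.
import numpy as np
import torch
from transformers import AutoFeatureExtractor, AutoModel
from sklearn.model_selection import cross_val_score, LeaveOneGroupOut
from sklearn.svm import SVC
from sklearn.metrics import recall_score

CKPT = "facebook/wav2vec2-large-xlsr-53"  # placeholder PTM checkpoint
extractor = AutoFeatureExtractor.from_pretrained(CKPT)
model = AutoModel.from_pretrained(CKPT, output_hidden_states=True).eval()

def layer_embeddings(waveform_16k):
    """Return one mean-pooled embedding per hidden layer for a 16 kHz mono waveform."""
    inputs = extractor(waveform_16k, sampling_rate=16000, return_tensors="pt")
    with torch.no_grad():
        hidden = model(**inputs).hidden_states  # tuple of (num_layers + 1) x [1, T, D]
    return [h.mean(dim=1).squeeze(0).numpy() for h in hidden]

def layerwise_f1(waveforms, labels):
    """Figure 1: evaluate a classifier on every layer and return one macro-F1 per layer."""
    feats = [layer_embeddings(w) for w in waveforms]
    labels = np.asarray(labels)
    scores = []
    for layer in range(len(feats[0])):
        X = np.stack([f[layer] for f in feats])
        scores.append(cross_val_score(SVC(), X, labels, cv=5, scoring="f1_macro").mean())
    return scores  # the best-performing layer is the one reported per PTM and database

def loso_uar(X, y, speakers):
    """Figure 2: Leave-One-Speaker-Out; each fold holds out all utterances of one speaker."""
    fold_uars = []
    for tr, te in LeaveOneGroupOut().split(X, y, groups=speakers):
        pred = SVC().fit(X[tr], y[tr]).predict(X[te])
        fold_uars.append(recall_score(y[te], pred, average="macro"))  # UAR per fold
    return float(np.mean(fold_uars))
```

In the study itself the same loop is run for KNN, MLP, and SVM and for every PTM and database (Tables 4 and 6); the sketch fixes a single classifier only to keep the example short.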
Table 1. Summary of recent SER studies in Spanish.
Authors | Dataset(s) | Techniques/Models | Best Result
Kerkeni et al. [62] | ELRA-S0329 | MFCC, MS + RNN, SVC, MLR | 90.05% Acc.
Ortega-Beltrán et al. [63] | ELRA-S0329, EmoMatchSpanishDB | DeepSpectrum with attention (DS-AM) | 98.4% (ELRA) Acc., 68.3% (EmoMatch) Acc.
García-Cuesta et al. [64] | EmoMatchSpanishDB | ComParE, eGeMAPS + SVC/XGBoost | 64.2% Precision
Pan & García-Díaz [68] | MEACorpus | Late Fusion + Feature Concatenation | 90.06% F1-Score
Begazo et al. [65] | MESD | CNN-1D, CNN-2D, MLP + Spectral features | 96% F1-Score
Pérez-Espinosa et al. [66] | EmoWisconsin | SVM + handcrafted features | 40.7% F1-Score
Casals-Salvador et al. [67] | MEACorpus | RoBERTa + XLSR-Wav2Vec 2.0 + Attention Pooling | 86.69% F1-Score
Table 2. Databases used in this work.
Database | Samples | Emotions | Type
EM-SpDB | 2020 | Surprise, Disgust, Fear, Anger, Happiness, Sadness, Neutrality | Acted
MESD | 864 | Anger, Disgust, Fear, Happiness, Sadness, Neutrality | Acted
MEACorpus | 5129 | Disgust, Anger, Joy, Sadness, Fear, Neutrality | Natural
EmoWisconsin | 3098 | Annoyed, Motivated, Nervous, Neutral, Doubtful, Confident | Induced
INTER1SP | 5520 | Anger, Sadness, Joy, Fear, Disgust, Surprise | Acted
EmoFilm-ES | 342 | Anger, Sadness, Happiness, Fear, Contempt | Acted
Table 3. Summary of PTMs used in the study.
PTM (Acronym) | Training Data/h | Parameters | Languages
W2V2-L-R-Libri960h | 60 K h + fine-tuning (960 h) | 317 M | English
W2V2-XLSR-ES | 436 K h + fine-tuning (Spanish) | 1 B | Multilingual (Spanish-focused)
W2V2-L-XLSR53-ES | 56 K h + fine-tuning (Spanish) | 317 M | 53 languages (Spanish-focused)
Whisper-L-v3 | 680 K h | 1.55 B | 96 languages
Whisper-L-v3-ES | Fine-tuned from Whisper-L-v3 | 1.55 B | Spanish
HuBERT-L-1160k | 1160 K h | 316 M | English
WavLM-L | 94 K h | 300 M | English
TRILLsson | Distilled from 900 M+ h (YT-U dataset) | N/A | Multilingual
L-CLAP-G | Multimodal audio-text pairs | N/A | Multilingual
Table 4. Performance comparison across PTMs, databases, and classifiers.
PTM | Database | KNN (Accuracy, F1, UAR) | MLP (Accuracy, F1, UAR) | SVM (Accuracy, F1, UAR)
HuBERT-L-1160k EmoFilm E S 69.44867.76865.64879.17578.43976.53584.72984.33982.279
EM-SpDB66.33764.87760.91386.78586.75585.2686.03585.93584.33
EmoWisconsin54.91248.831227.171556.131054.371038.191057.841354.07832.97
INTER1SP97.6397.59398.07399.75599.75599.81599.83699.83699.886
MESD86.71386.65386.76391.33791.34791.36791.91391.96791.953
MEACorpus90.26490.24487.27789.09389.06386.01291.63691.56688.796
W2V2-L-XLSR53-ES EmoFilm E S 76.39675.26675.28686.11485.68884.97486.11685.67684.396
EM-SpDB65.84364.92360.51388.28688.32688.02686.03686.08685.786
EmoWisconsin58.581252.771233.011257.35655.91635.691560.54756.61734.2316
INTER1SP96.86496.84497.5499.5499.5699.63499.75599.75599.815
MESD87.28486.63487.34493.64493.51493.68494.22394.25494.253
MEACorpus90.35490.23489.08589.96389.93387.43691.72391.56390.83
W2V2-L-R-Libri960h EmoFilm E S 70.83369.97469.18481.94881.352081.88883.331982.791981.7219
EM-SpDB62.09060.77056.69086.78586.9585.93585.79685.81684.956
EmoWisconsin55.64849.93829.93456.37653.54636.08257.61152.881132.665
INTER1SP95.78395.76396.57399.34399.34399.32199.34299.34499.54
MESD85.55385.4385.57391.33291.33491.38491.91391.83391.953
MEACorpus76.24277.31280.18687.15287.07286.85590.46390.38386.223
WavLM-L EmoFilm E S 80.56379.96381.33386.11685.85685.2787.5987.471186.969
EM-SpDB68.33267.19262.97286.78486.61485.34487.28387.24386.043
EmoWisconsin54.411148.291126.541156.131053.791035.57359.561055.331033.986
INTER1SP97.68397.67398.26399.42599.42599.57699.67599.67599.755
MESD89.02388.83389.08393.06493.04493.08494.22694.25694.256
MEACorpus90.75690.65689.92690.17490.1489.95492.02391.96389.833
Whisper-L-v3 EmoFilm E S 75.01274.191273.441987.51587.171986.531587.51687.251686.0716
EM-SpDB61.6759.6754.76785.292085.292083.871483.792083.722082.1615
EmoWisconsin55.391850.361828.381858.331856.441835.59760.051855.52031.55
INTER1SP91.151491.141492.271499.171399.171699.381699.341499.341499.514
MESD83.241183.121183.311194.8794.79794.83793.061393.051393.113
MEACorpus86.351286.151283.271488.79988.73987.51190.061289.891288.8312
W2V2-XLSR-ES EmoFilm E S 80.56579.88580.58587.5686.91686.31688.891188.881188.2711
EM-SpDB66.08564.71559.79386.781586.591584.661585.041184.881183.535
EmoWisconsin58.091551.811628.56160.781458.411440.01462.51554.752531.155
INTER1SP96.031196.01197.021199.67899.67899.75899.751299.751299.8112
MESD87.28786.91787.34795.38695.4695.38693.64893.63893.688
MEACorpus90.841190.741189.47590.64690.59690.44692.69592.53590.823
Whisper-L-v3-ES EmoFilm E S 80.562080.02078.552091.672291.522291.832287.51687.32386.0716
EM-SpDB61.1659.39755.05785.542485.542683.742684.292684.222682.9630
EmoWisconsin55.881750.531728.631760.783158.923136.97860.781956.02731.9216
INTER1SP92.391492.381492.941499.09999.091599.32999.171499.171799.3816
MESD83.82983.771483.891494.22994.21994.25994.22994.221094.259
MEACorpus85.881285.751285.121288.71488.611487.871489.681489.591485.0219
TRILLsson EmoFilm E S 63.8963.0161.3179.1778.577.8873.6172.369.88
EM-SpDB57.1154.8650.1480.880.7579.1181.0580.978.12
EmoWisconsin56.1348.9823.4156.3754.2735.3160.2952.9626.09
INTER1SP88.3488.3188.2797.7797.7798.1997.9397.9398.31
MESD70.5270.370.6587.2887.1187.3484.3984.4284.48
L-CLAP-G EmoFilm E S 61.1158.5956.8762.559.4457.2163.8962.9360.35
EM-SpDB53.3749.7744.7663.5962.8958.7366.0865.3160.72
EmoWisconsin50.4942.4319.6853.1949.9425.2650.2541.0218.96
INTER1SP80.480.2982.5891.0791.0691.489.6689.6590.75
MESD68.7968.2968.8878.0377.7678.0879.1979.0879.21
MEACorpus75.3475.0872.6778.1778.1369.3380.2180.2175.29
Note: The subscript indicates the layer from which the corresponding metric was obtained. Bold values represent the highest Accuracy, UAR, and F1-Score achieved by each PTM for each database.
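For reference, the three metrics reported in Tables 4–8 can be computed with scikit-learn [82]: Accuracy is the overall hit rate, UAR is recall averaged over classes without weighting by class frequency (macro recall), and the F1-Score is assumed here to be macro-averaged. The snippet below is only an illustration with placeholder labels, not the authors' evaluation code.

```python
# Placeholder labels; shown only to make the metric definitions concrete.
from sklearn.metrics import accuracy_score, f1_score, recall_score

y_true = ["anger", "joy", "anger", "sadness", "joy"]
y_pred = ["anger", "joy", "sadness", "sadness", "anger"]

acc = accuracy_score(y_true, y_pred)                 # Accuracy: overall hit rate
f1 = f1_score(y_true, y_pred, average="macro")       # F1-Score (macro average assumed)
uar = recall_score(y_true, y_pred, average="macro")  # UAR: unweighted (macro) average recall
print(f"Accuracy={acc:.2f}  F1={f1:.2f}  UAR={uar:.2f}")
```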
Table 5. Summary of best results by database, PTM, and classifier.
Database | Metric | PTM | Classifier | Score (Layer)
INTER1SP | Accuracy | HuBERT-L-1160k | SVM | 99.83% (6)
INTER1SP | F1-Score | HuBERT-L-1160k | SVM | 99.83% (6)
INTER1SP | UAR | HuBERT-L-1160k | SVM | 99.88% (6)
EmoFilm-ES | Accuracy | Whisper-L-v3-ES | MLP | 91.67% (22)
EmoFilm-ES | F1-Score | Whisper-L-v3-ES | MLP | 91.52% (22)
EmoFilm-ES | UAR | Whisper-L-v3-ES | MLP | 91.83% (22)
EM-SpDB | Accuracy | W2V2-L-XLSR53-ES | MLP | 88.28% (6)
EM-SpDB | F1-Score | W2V2-L-XLSR53-ES | MLP | 88.32% (6)
EM-SpDB | UAR | W2V2-L-XLSR53-ES | MLP | 88.02% (6)
EmoWisconsin | Accuracy | W2V2-XLSR-ES | MLP | 62.50% (15)
EmoWisconsin | F1-Score | Whisper-L-v3-ES | MLP | 58.92% (31)
EmoWisconsin | UAR | W2V2-XLSR-ES | SVM | 40.0% (14)
MESD | Accuracy | W2V2-XLSR-ES | MLP | 95.38% (6)
MESD | F1-Score | W2V2-XLSR-ES | MLP | 95.40% (6)
MESD | UAR | W2V2-XLSR-ES | MLP | 95.38% (6)
MEACorpus | Accuracy | W2V2-XLSR-ES | MLP | 92.69% (5)
MEACorpus | F1-Score | W2V2-XLSR-ES | MLP | 92.53% (5)
MEACorpus | UAR | W2V2-XLSR-ES | SVM | 90.82% (3)
Table 6. Performance comparison across databases and PTMs with LOSO validation.
Database | PTM (Layer) | KNN (UAR / Accuracy / F1, %) | MLP (UAR / Accuracy / F1, %) | SVM (UAR / Accuracy / F1, %)
EM-SpDB | HuBERT-L-1160k (9) | 51.42 / 61.45 / 58.11 | 67.88 / 75.92 / 74.38 | 68.46 / 75.80 / 74.34
EM-SpDB | L-CLAP-G | 41.32 / 51.79 / 48.07 | 50.59 / 58.94 / 56.99 | 49.87 / 59.58 / 57.59
EM-SpDB | TRILLsson | 49.76 / 60.03 / 56.26 | 70.82 / 78.17 / 76.99 | 71.42 / 79.66 / 78.48
EM-SpDB | W2V2-L-R-Libri960h (3) | 48.99 / 57.72 / 54.16 | 68.02 / 74.98 / 73.36 | 64.33 / 71.38 / 69.62
EM-SpDB | W2V2-L-XLSR53-ES (6) | 52.25 / 62.96 / 59.87 | 73.24 / 79.56 / 78.15 | 70.63 / 77.66 / 76.35
EM-SpDB | W2V2-XLSR-ES (11) | 51.68 / 60.66 / 56.96 | 70.94 / 78.21 / 76.68 | 70.37 / 77.40 / 75.89
EM-SpDB | WavLM-L (6) | 49.08 / 59.20 / 55.35 | 66.64 / 73.84 / 72.06 | 69.05 / 75.30 / 73.82
EM-SpDB | Whisper-L-v3 (14) | 41.57 / 50.71 / 46.47 | 65.23 / 73.41 / 71.96 | 65.53 / 72.34 / 70.81
EM-SpDB | Whisper-L-v3-ES (14) | 44.28 / 54.28 / 50.08 | 64.68 / 72.91 / 71.48 | 63.98 / 70.33 / 68.84
EmoFilm-ES | HuBERT-L-1160k (9) | 50.94 / 61.72 / 63.39 | 69.87 / 79.93 / 81.60 | 63.79 / 82.95 / 83.24
EmoFilm-ES | L-CLAP-G | 48.77 / 57.69 / 60.37 | 51.44 / 54.44 / 56.14 | 42.16 / 50.06 / 53.21
EmoFilm-ES | TRILLsson | 51.19 / 64.58 / 69.32 | 67.68 / 80.09 / 82.08 | 66.76 / 70.69 / 73.27
EmoFilm-ES | W2V2-L-R-Libri960h (3) | 56.34 / 73.74 / 74.71 | 65.56 / 74.54 / 74.80 | 57.36 / 68.88 / 70.45
EmoFilm-ES | W2V2-L-XLSR53-ES (6) | 62.45 / 76.05 / 76.96 | 69.09 / 83.68 / 83.01 | 65.13 / 79.13 / 80.66
EmoFilm-ES | W2V2-XLSR-ES (11) | 53.47 / 71.22 / 72.21 | 72.68 / 81.93 / 83.86 | 67.15 / 82.27 / 83.78
EmoFilm-ES | WavLM-L (6) | 57.29 / 70.85 / 72.33 | 66.78 / 81.15 / 81.86 | 62.01 / 74.41 / 76.13
EmoFilm-ES | Whisper-L-v3 (14) | 53.12 / 64.16 / 66.99 | 58.56 / 75.40 / 78.82 | 63.34 / 79.48 / 82.08
EmoFilm-ES | Whisper-L-v3-ES (14) | 59.35 / 68.96 / 71.15 | 57.67 / 73.37 / 77.24 | 71.72 / 77.60 / 79.61
EmoWisconsin | HuBERT-L-1160k (9) | 25.22 / 43.07 / 40.53 | 30.19 / 47.46 / 45.47 | 29.35 / 43.97 / 43.21
EmoWisconsin | L-CLAP-G | 22.89 / 41.38 / 38.93 | 28.37 / 48.23 / 44.66 | 25.68 / 43.39 / 42.54
EmoWisconsin | TRILLsson | 28.40 / 47.60 / 43.81 | 34.48 / 52.65 / 50.33 | 32.66 / 54.03 / 49.72
EmoWisconsin | W2V2-L-R-Libri960h (3) | 25.21 / 46.13 / 40.23 | 30.55 / 46.97 / 44.94 | 28.11 / 49.38 / 42.23
EmoWisconsin | W2V2-L-XLSR53-ES (6) | 25.61 / 47.37 / 43.10 | 32.72 / 50.74 / 48.37 | 32.14 / 54.85 / 48.89
EmoWisconsin | W2V2-XLSR-ES (11) | 25.19 / 44.61 / 42.36 | 34.71 / 53.08 / 50.75 | 30.47 / 43.67 / 43.66
EmoWisconsin | WavLM-L (6) | 22.98 / 40.92 / 38.76 | 29.81 / 47.80 / 45.35 | 28.21 / 41.10 / 41.56
EmoWisconsin | Whisper-L-v3 (14) | 25.60 / 43.76 / 41.09 | 30.22 / 50.92 / 49.15 | 29.01 / 43.51 / 43.70
EmoWisconsin | Whisper-L-v3-ES (14) | 25.51 / 43.36 / 40.04 | 28.79 / 48.92 / 46.95 | 28.41 / 42.65 / 42.22
INTER1SP | HuBERT-L-1160k (9) | 50.20 / 52.96 / 50.32 | 61.08 / 63.83 / 63.47 | 55.84 / 64.73 / 62.04
INTER1SP | L-CLAP-G | 33.68 / 33.80 / 30.40 | 39.99 / 37.95 / 34.41 | 36.40 / 38.78 / 34.43
INTER1SP | TRILLsson | 53.40 / 57.92 / 56.99 | 52.46 / 57.07 / 55.32 | 60.50 / 65.61 / 64.52
INTER1SP | W2V2-L-R-Libri960h (3) | 49.66 / 49.38 / 45.54 | 55.32 / 55.45 / 51.88 | 52.25 / 59.67 / 55.57
INTER1SP | W2V2-L-XLSR53-ES (6) | 58.55 / 59.89 / 57.45 | 65.99 / 67.34 / 65.45 | 60.98 / 70.04 / 67.44
INTER1SP | W2V2-XLSR-ES (11) | 54.94 / 57.15 / 55.32 | 61.25 / 66.37 / 64.30 | 60.45 / 68.63 / 66.52
INTER1SP | WavLM-L (6) | 51.61 / 53.83 / 50.41 | 50.94 / 54.43 / 49.97 | 50.56 / 57.83 / 53.35
INTER1SP | Whisper-L-v3 (14) | 42.10 / 43.98 / 41.08 | 49.75 / 51.79 / 49.53 | 54.91 / 61.01 / 57.68
INTER1SP | Whisper-L-v3-ES (14) | 46.46 / 48.18 / 43.97 | 42.43 / 44.67 / 41.23 | 52.69 / 59.75 / 55.97
MEACorpus | HuBERT-L-1160k (9) | 16.33 / 41.25 / 47.63 | 44.82 / 68.37 / 59.36 | 53.22 / 70.13 / 62.83
MEACorpus | L-CLAP-G | 13.80 / 31.36 / 37.54 | 24.37 / 49.59 / 46.35 | 18.83 / 46.17 / 49.64
MEACorpus | TRILLsson | 33.28 / 59.67 / 58.38 | 54.92 / 75.85 / 71.64 | 50.19 / 66.62 / 59.82
MEACorpus | W2V2-L-R-Libri960h (3) | 38.69 / 61.79 / 67.55 | 51.70 / 68.37 / 61.34 | 55.87 / 71.88 / 65.26
MEACorpus | W2V2-L-XLSR53-ES (6) | 27.95 / 61.33 / 60.43 | 36.74 / 63.47 / 56.17 | 49.05 / 66.62 / 57.87
MEACorpus | W2V2-XLSR-ES (11) | 21.56 / 51.80 / 54.20 | 50.38 / 70.49 / 64.61 | 36.77 / 66.62 / 61.62
MEACorpus | WavLM-L (6) | 27.70 / 53.00 / 56.85 | 57.95 / 73.64 / 67.07 | 59.47 / 75.39 / 68.77
MEACorpus | Whisper-L-v3 (14) | 23.54 / 57.85 / 63.76 | 53.79 / 70.13 / 63.36 | 42.55 / 63.11 / 57.06
MEACorpus | Whisper-L-v3-ES (14) | 28.03 / 57.38 / 62.62 | 46.59 / 70.13 / 63.41 | 50.76 / 66.72 / 62.03
MESD | HuBERT-L-1160k (9) | 33.73 / 33.75 / 32.13 | 45.20 / 45.24 / 41.60 | 47.17 / 47.21 / 42.86
MESD | L-CLAP-G | 33.59 / 33.65 / 28.67 | 37.75 / 37.82 / 33.29 | 39.60 / 39.68 / 35.38
MESD | TRILLsson | 35.70 / 35.72 / 32.73 | 47.30 / 47.31 / 44.71 | 48.20 / 48.23 / 46.30
MESD | W2V2-L-R-Libri960h (3) | 33.13 / 33.17 / 30.18 | 41.80 / 41.87 / 38.10 | 44.59 / 44.65 / 40.42
MESD | W2V2-L-XLSR53-ES (6) | 40.45 / 40.49 / 38.22 | 49.83 / 49.88 / 48.19 | 52.48 / 52.54 / 50.59
MESD | W2V2-XLSR-ES (11) | 36.86 / 36.88 / 34.18 | 48.21 / 48.26 / 44.59 | 53.09 / 53.12 / 50.25
MESD | WavLM-L (6) | 36.02 / 36.08 / 32.35 | 43.11 / 43.15 / 41.31 | 41.60 / 41.64 / 38.33
MESD | Whisper-L-v3 (14) | 30.24 / 30.28 / 26.49 | 39.51 / 39.56 / 36.63 | 37.42 / 37.47 / 35.64
MESD | Whisper-L-v3-ES (14) | 30.11 / 30.16 / 25.23 | 39.48 / 39.55 / 35.71 | 36.83 / 36.89 / 34.67
Bold values represent the highest Accuracy, UAR, and F1-Score achieved by each PTM for each database.
Table 7. Summary of best results (layer-wise) and SOTA comparison by database and metric.
Database | Metric | Our Result | Our PTM (Layer) | SOTA Result | SOTA Reference
INTER1SP | Accuracy | 99.83% | HuBERT-L-1160k (6) | 98.40% | [63]
INTER1SP | F1-Score | 99.83% | HuBERT-L-1160k (6) | - | -
INTER1SP | UAR | 99.88% | HuBERT-L-1160k (6) | - | -
EmoFilm-ES | Accuracy | 91.67% | Whisper-L-v3-ES (6) | - | -
EmoFilm-ES | F1-Score | 91.52% | Whisper-L-v3-ES (6) | - | -
EmoFilm-ES | UAR | 91.83% | Whisper-L-v3-ES (6) | - | -
EM-SpDB | Accuracy | 88.28% | W2V2-L-XLSR53-ES (6) | 68.30% | [63]
EM-SpDB | F1-Score | 88.32% | W2V2-L-XLSR53-ES (6) | - | -
EM-SpDB | UAR | 88.02% | W2V2-L-XLSR53-ES (6) | - | -
EmoWisconsin | Accuracy | 62.50% | W2V2-XLSR-ES (15) | - | -
EmoWisconsin | F1-Score | 58.92% | Whisper-L-v3-ES (31) | 40.70% | [66]
EmoWisconsin | UAR | 40.0% | W2V2-XLSR-ES (14) | - | -
MESD | Accuracy | 95.38% | W2V2-XLSR-ES (6) | 96.00% | [65]
MESD | F1-Score | 95.40% | W2V2-XLSR-ES (6) | 96.00% | [65]
MESD | UAR | 95.38% | W2V2-XLSR-ES (6) | - | -
MEACorpus | Accuracy | 92.69% | W2V2-XLSR-ES (5) | - | -
MEACorpus | F1-Score | 92.53% | W2V2-XLSR-ES (5) | 90.06% | [68]
MEACorpus | UAR | 90.82% | W2V2-XLSR-ES (3) | - | -
Note: Bold values represent the highest obtained values.
Table 8. Summary of best results of LOSO validation by database and metric.
Database | Metric | Result | PTM (Layer)
EM-SpDB | Accuracy | 79.66% | TRILLsson
EM-SpDB | F1-Score | 78.48% | TRILLsson
EM-SpDB | UAR | 73.24% | W2V2-L-XLSR53-ES (6)
EmoFilm-ES | Accuracy | 83.68% | W2V2-L-XLSR53-ES (6)
EmoFilm-ES | F1-Score | 83.86% | W2V2-XLSR-ES (11)
EmoFilm-ES | UAR | 72.68% | W2V2-XLSR-ES (11)
INTER1SP | Accuracy | 70.04% | W2V2-L-XLSR53-ES (6)
INTER1SP | F1-Score | 67.44% | W2V2-L-XLSR53-ES (6)
INTER1SP | UAR | 65.99% | W2V2-L-XLSR53-ES (6)
EmoWisconsin | Accuracy | 54.85% | W2V2-L-XLSR53-ES (6)
EmoWisconsin | F1-Score | 50.75% | W2V2-XLSR-ES (11)
EmoWisconsin | UAR | 34.71% | W2V2-XLSR-ES (11)
MEACorpus | Accuracy | 75.85% | TRILLsson
MEACorpus | F1-Score | 71.64% | TRILLsson
MEACorpus | UAR | 59.47% | WavLM-L (6)
MESD | Accuracy | 53.12% | W2V2-XLSR-ES (11)
MESD | F1-Score | 50.59% | W2V2-L-XLSR53-ES (6)
MESD | UAR | 53.09% | W2V2-XLSR-ES (11)
Disclaimer/Publisher’s Note: The statements, opinions and data contained in all publications are solely those of the individual author(s) and contributor(s) and not of MDPI and/or the editor(s). MDPI and/or the editor(s) disclaim responsibility for any injury to people or property resulting from any ideas, methods, instructions or products referred to in the content.
