Article

Mathematical Analysis and Performance Evaluation of CBAM-DenseNet121 for Speech Emotion Recognition Using the CREMA-D Dataset

1 IL3CUB Laboratory, University of Mohamed Khider Biskra, Biskra 07000, Algeria
2 VSC Laboratory, University of Mohamed Khider Biskra, Biskra 07000, Algeria
3 Department of Mathematics, Faculty of Science, Islamic University of Madinah, Medinah 42351, Saudi Arabia
4 Department of Mathematics and Statistics, Imam Mohammad Ibn Saud Islamic University (IMSIU), Riyadh 13318, Saudi Arabia
5 Department of Mathematics, College of Science, Qassim University, Buraydah 51452, Saudi Arabia
6 Scientific and Technical Research Centre for Arid Areas, CRSTRA, Biskra 07000, Algeria
* Authors to whom correspondence should be addressed.
Appl. Sci. 2025, 15(17), 9692; https://doi.org/10.3390/app15179692
Submission received: 28 May 2025 / Revised: 9 July 2025 / Accepted: 29 August 2025 / Published: 3 September 2025
(This article belongs to the Section Computing and Artificial Intelligence)

Abstract

Emotion recognition from speech is essential for human–computer interaction (HCI) and affective computing, with applications in virtual assistants, healthcare, and education. Although deep learning has advanced Automatic Speech Emotion Recognition (ASER) considerably, the task remains challenging because of speaker variation, subtle emotional expressions, and environmental noise. Practical deployment in this context depends on a robust, fast, and scalable recognition system. This work introduces a new framework that combines DenseNet121, fine-tuned for the Crowd-sourced Emotional Multimodal Actors Dataset (CREMA-D), with the Convolutional Block Attention Module (CBAM). DenseNet121's efficient feature propagation captures rich, hierarchical patterns in the speech data, while CBAM sharpens the model's focus on emotionally significant elements by applying both channel-wise and spatial attention. An advanced preprocessing pipeline, including log-Mel spectrogram transformation and normalization, further enhances the input spectrograms and strengthens robustness against environmental noise. The proposed model demonstrates superior performance. To ensure that the evaluation remains reliable under class imbalance, we report an Unweighted Average Recall (UAR) of 71.01% and an F1 score of 71.25%, alongside a test accuracy of 71.26% and a precision of 71.30%. These results establish the model as a promising solution for real-world speech emotion detection, highlighting its strong generalization, computational efficiency, and focus on emotion-specific features compared with recent work. The improvements also demonstrate practical flexibility, enabling the integration of established image recognition techniques and substantial adaptability across application contexts.

1. Introduction

Given its significant consequences for human–computer interaction (HCI) and affective computing [1,2], emotion recognition from audio data has become a focal point of study. The use of deep learning techniques has enabled machines to interpret and react to audio by analyzing audio features and extracting complex patterns that correlate with emotional expressions [3,4]. Among these methods, Automatic Speech Emotion Recognition (ASER) has become particularly popular because it uses speech data to predict emotional states, a necessary step towards better affective interactions [5,6]. A common approach in ASER is to represent audio signals as spectrograms, two-dimensional time-frequency representations that transform the temporal information of audio into a visual format [7]. Spectrograms allow the application of deep learning models originally designed for computer vision, such as convolutional neural networks (CNNs), which are well suited for extracting features from image-like inputs [8,9]. For instance, CNN-based models like ResNet [10] and DenseNet [11] have been effectively adapted to analyze spectrogram features for audio-related applications. DenseNet, specifically, is a strong candidate for ASER because its feature concatenation mechanism allows it to acquire deeper feature representations. While CNNs excel at extracting detailed local features from spectrograms, recent advancements in transformer-based architectures [12] and attention mechanisms [13,14] provide complementary advantages. Hybrid designs that integrate attention modules into CNN backbones, such as the CBAM [4], combine the strengths of local and global feature extraction to enhance classification performance in tasks like ASER [15].
Despite these advancements, emotion recognition remains an inherently challenging task. Emotional expressions in speech signals are influenced by subtle and non-linear features, such as intonation, pitch, and rhythm, which can vary significantly across speakers and contexts [16,17]. External factors like background noise and recording conditions further complicate the problem, often leading to reduced model accuracy and generalization [18]. A critical challenge in modern ASER research lies in the computational trade-off between model performance and practical deployability. While large, complex architectures have pushed the boundaries of accuracy, their efficiency is often a primary bottleneck for real-world use. For example, transformer-based models like AST achieve strong global context modeling but are computationally intensive and can lack the inductive biases of CNNs required for fine-grained feature analysis [12,19]. Other state-of-the-art hybrid architectures achieve high accuracy but often at the expense of significant computational overhead, limiting their feasibility for real-time applications [4,15]. This situation presents a clear and compelling research gap: the need for an architecture that can deliver competitive accuracy without prohibitive computational demands. This work directly addresses this gap by proposing a framework specifically engineered to strike an optimal balance between robust performance and efficiency.
This work presents a novel ASER framework designed to address the aforementioned challenges. Our key contributions are summarized as follows:
  • To improve clarity and retain essential emotional cues, the raw audio signals are processed through log-Mel spectrogram conversion and normalization [20,21]. This preprocessing step minimizes the impact of environmental noise while ensuring the retention of the most relevant emotional features in the speech signal.
  • Our architecture closes the gap between transformer-like attention mechanisms, able to capture both local and global context, and convolutional feature extraction, which catches localized time-frequency characteristics. This hybrid architecture generates a strong framework that improves the accuracy and efficiency of emotion recognition tasks.
  • We use a strict data partitioning technique to evaluate the performance of the model consistently among several emotional expressions. This method reduces data leakage risk, so guaranteeing dependable and generalizable findings during training and validation [22,23].
  • We evaluate our model extensively against other state-of-the-art models, including ResNet-18 + self-paced ensemble learning (SPEL) [5], the Audio Spectrogram Transformer [19], and SpectoResNet [15]. The experimental results reveal that our approach achieves outstanding accuracy and robustness, underlining the efficiency of the proposed architecture in addressing the complexity of ASER.
  • The deliberate placement of attention mechanisms inside the bottleneck region makes our architecture more flexible, enabling parameter optimization and task-specific adaptation across several ASER application environments.
The rest of this paper is arranged as follows: related work in audio-based emotion recognition is covered in Section 2, underlining the rising relevance of spectrogram representations and deep learning models for ASER. Section 3 describes the proposed architecture. Section 4 presents the results and discussion, focusing on quantitative evaluation, comparisons with related works, and an analysis of the strengths, limitations, and potential future directions for the proposed framework. Finally, Section 5 concludes the paper, summarizing the contributions of this work and outlining perspectives for future advancements in ASER.

2. Related Work

Automatic Speech Emotion Recognition (ASER) has been a major area of study for decades because of its importance in making human–computer interfaces more natural and empathetic. The field has evolved from handcrafted features fed to traditional machine learning models to more advanced deep learning architectures that learn representations directly from audio data. As this evolution continues, researchers keep seeking more accurate and efficient ways to capture the complex emotional cues hidden in human speech.

2.1. Traditional ASER and Handcrafted Features

Machine learning models such as Gaussian Mixture Models (GMMs), Hidden Markov Models (HMMs), and Support Vector Machines (SVMs) were historically the most common approaches to ASER [24]. The performance of these models depended heavily on the quality of the handcrafted acoustic features extracted from the speech signal. These features are usually grouped into three categories: prosodic features (e.g., pitch, energy, and duration); spectral features (e.g., Mel-Frequency Cepstral Coefficients (MFCCs) and Linear Prediction Cepstral Coefficients (LPCCs)); and voice quality features (e.g., jitter, shimmer, and the harmonics-to-noise ratio) [25]. Although these methods established a strong baseline for ASER, they were often limited by their reliance on expert-driven feature engineering and struggled to capture the high-level, non-linear relationships inherent in emotional expression.

2.2. Deep Learning and Spectrogram-Based Recognition

The advent of deep learning changed how ASER is performed. Deep neural networks circumvented many of the limitations of traditional models by automatically learning feature hierarchies. A key change that enabled this transition was the use of spectrograms as input representations [7]. A spectrogram turns a one-dimensional audio signal into a two-dimensional, image-like format that shows the signal's frequency content over time. This representation lets the powerful tools of computer vision, such as Convolutional Neural Networks (CNNs), be applied to audio tasks, and it exposes patterns related to pitch, formants, and energy distribution over time that are important for telling emotions apart. For instance, SpectoResNet [26] illustrates how effectively CNNs can find complex patterns in spectrograms to classify emotions. Using advanced data augmentation methods, such as noise addition and pitch shifting, on the CREMA-D dataset, SpectoResNet shows how well residual connections can analyze complex emotional features, reaching an accuracy of 65.20%.

2.3. Architectural Advancements in ASER

There have been several studies on deep learning architectures for ASER. CNNs are very effective at finding patterns in local time and frequency, while other models offer complementary strengths. Recurrent Neural Networks (RNNs) and their more advanced variants, such as Long Short-Term Memory (LSTM) networks and Gated Recurrent Units (GRUs), are good at modelling the temporal sequences in speech [27]. Hybrid models, such as CNN-RNN architectures, were designed to get the best of both worlds: CNN layers extract localized features from spectrogram frames, and the resulting sequence of feature vectors is fed into an RNN to model long-range temporal dependencies [28]. Attention mechanisms and transformer-based models have become more popular recently. The Audio Spectrogram Transformer (AST) [19] is an example of a pure transformer-based architecture for audio classification that does not use any convolutional layers at all. AST achieves good results on a number of benchmarks by modelling global relationships in spectrograms. However, pure transformers like AST do not always have the strong inductive biases that CNNs do, which makes it hard for them to find small, localized patterns; AST reaches only 67.81% accuracy on this ASER task, which suggests that hybrid approaches might work better. Hybrid attention models and specialized training strategies have been proposed to close this gap. The Separable Transformer (SepTr) and its improved version with a Learning Rate Curriculum (LeRaC) use axis-wise transformer attention and systematic curriculum learning. LeRaC + SepTr attains a competitive accuracy of 70.95%, but this method makes the training process much more complicated, and the standalone SepTr model (70.47%) still falls short of the best models, which shows that feature prioritization can be improved.

2.4. Ensemble Learning and Current Challenges

Ensemble learning is another promising area of ASER. It combines several models to make them better at generalizing. For example, the ResNet-18 + SPEL method [29] uses pseudo-labeling and self-paced ensemble learning. Its innovative training strategy is limited by the relatively shallow ResNet-18 backbone, which makes it harder to pick up on subtle spectrogram patterns. This leads to an accuracy value of 68.12%.

2.5. Semi-Supervised Deep Learning

Recent ASER research also explores semi-supervised methods to enhance model fairness. Ref. [30] utilizes pseudo-labeling and unsupervised clustering to infer demographic attributes, aiming to mitigate performance disparities across subgroups without requiring explicit demographic labels. While this approach improved fairness metrics on the CREMA-D dataset, it contrasts with our fully supervised method, which focuses directly on emotion classification using the provided emotional labels. This direct optimization on ground-truth annotations explains the high accuracy achieved (82%) on the primary classification task.

2.6. Dataset Fusion

M. Alam et al. [31] presented a novel method termed TMNet (Transformer-fused Multimodal Network), which takes physiological data from EEG and audio data from speech as separate inputs to be fused. These inputs are later fed to a transformer architecture to integrate and find correlations between the features of the different modalities (EEG and speech). Their primary goal is to improve emotion recognition accuracy and robustness by leveraging complementary information from brain activity (EEG) and vocal expression (speech). However, this method requires datasets that have simultaneous EEG and speech recordings, such as DEAP, SEED, or MAHNOB-HCI. Despite significant progress, architectural paradigms still present inherent trade-offs. CNNs excel at extracting local features but struggle with long-range dependencies, while transformers capture global context but lack convolutional inductive biases for fine-grained analysis. Although hybrid architectures and complex training strategies can improve performance, they introduce computational overhead that limits real-time applicability. Our approach addresses these limitations by combining an efficient CNN backbone (DenseNet121) with a lightweight attention mechanism (CBAM) for unimodal speech emotion recognition. This architecture achieves high accuracy while maintaining computational efficiency, operating in a fully supervised manner without complex preprocessing or additional data labeling requirements. The model is designed to work directly with existing CREMA-D dataset labels, ensuring fair evaluation protocols within the unimodal domain.

3. Methodology

This section describes the methodology used in our model, starting with a thorough analysis of the DenseNet121 architecture, which forms the backbone for feature extraction. We then examine how CBAM can improve these features, sharpening the model's focus on relevant patterns. The core of our method is the integration of a DenseNet121-inspired backbone with CBAM, producing a more accurate and efficient feature representation. Finally, we discuss the classification stage, in which the processed features are mapped to emotional categories. Each of these elements is important to the overall architecture, and the following sections describe their roles in detail.

3.1. Audio Feature Extraction

The Short-Time Fourier Transform (STFT) converts each audio sample into a two-dimensional time-frequency representation. This transformation produces an image-like matrix, which is necessary for applying deep learning methods. The discrete STFT is computed as
STFT(m, k) = \sum_{n=-\infty}^{\infty} x[n] \, w[n - mH] \, e^{-j \frac{2\pi}{N} k n}
where x[n] is the input discrete signal and w[n] is the window function (a Hamming window) of length L, H is the hop size, and N is the number of discrete Fourier transform (DFT) points, i.e., frequency bins. STFT(m, k) is the STFT coefficient for the k-th frequency bin and the m-th time frame [32].
After the STFT is calculated, the spectrogram is log-transformed and Mel-scaled to match human auditory perception. The Mel scale is an auditory scale that approximates how people perceive different frequencies: the human ear readily distinguishes frequencies below 1000 Hz but is less sensitive to differences above 10 kHz. The spectrogram is therefore mapped to the Mel scale, which is linear up to 1 kHz and logarithmic for frequencies above that.
We get the log-Mel spectrogram by passing the spectrogram from the STFT through a Mel-filter bank. This change makes the model better at picking up on emotional cues by making the feature extraction process more like how people hear things [32,33]. Mathematically, the Mel scale is defined as:
Mel(f) = 2595 \log_{10}\!\left(1 + \frac{f}{700}\right)
where f is the frequency in hertz. The log-Mel spectrogram is a compressed representation that retains the sound features most relevant for recognizing emotions.
To ensure reproducibility, we used the Librosa library with the following settings to create the log-Mel spectrograms for this study. The 16 kHz audio was processed with a Short-Time Fourier Transform (STFT) using a window size of 2048 points, a Fast Fourier Transform (FFT) size of 2048 points, and a hop length of 512 points, yielding a 75% overlap between consecutive frames. A filter bank with 128 Mel filters and a maximum frequency of 8000 Hz was then used to map the power spectrogram onto the Mel scale.
The flowchart in Figure 1 gives a clear picture of the whole feature extraction pipeline.
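To make these settings concrete, the following minimal Librosa sketch shows how such a log-Mel extraction can be implemented; the file path, function name, and the per-sample normalization step are illustrative assumptions rather than the exact code used in this work.

```python
import numpy as np
import librosa

def extract_log_mel(path, sr=16000, n_fft=2048, win_length=2048,
                    hop_length=512, n_mels=128, fmax=8000):
    # Load the waveform and resample it to 16 kHz.
    y, _ = librosa.load(path, sr=sr)
    # Power Mel spectrogram computed from the STFT (75% frame overlap).
    mel = librosa.feature.melspectrogram(
        y=y, sr=sr, n_fft=n_fft, win_length=win_length,
        hop_length=hop_length, n_mels=n_mels, fmax=fmax, power=2.0)
    # Logarithmic (dB) compression, i.e., the log-Mel transformation.
    log_mel = librosa.power_to_db(mel, ref=np.max)
    # Simple per-sample normalization (an assumed choice).
    return (log_mel - log_mel.mean()) / (log_mel.std() + 1e-8)
```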

3.2. DenseNet121

At the core of the architecture, DenseNet121 offers a strong basis for feature extraction. Dense blocks, designed to reuse features effectively, form the basis of the model, ensuring efficient parameter use while mitigating problems such as vanishing gradients. DenseNet121 consists of four dense blocks, each generating feature maps at a specified growth rate, as detailed in Figure 2. Transition blocks between the dense blocks reduce the feature map size through convolution and pooling, preserving computational efficiency while retaining necessary information [11].
The DenseNet121 architecture comprises a sequence of convolutional blocks, each intended to progressively extract and refine feature representations; the main elements of the model are described below.
A. 
Dense Block Function (Dense_Block) 
Leveraging its densely connected structure to maximize feature propagation, the dense block is essential for feature extraction within the DenseNet architecture. The dense block's input, denoted x, passes through a sequence of convolutional transformations, and the output is the stack of feature maps produced by these operations.
Depending on the block size, the input feature map is first passed through a convolutional block comprising four filters in each layer, enabling the first stage of feature extraction. A second convolutional block then applies filters with a 3 × 3 kernel, further refining the feature maps and enhancing their representation. The output of the second convolutional block is concatenated with the input feature map to create a densely connected structure, ensuring unimpeded information flow across the layers.
This architecture ensures that the dense block captures both local and global patterns in the data. Concatenating features from several layers strengthens the representation and increases the network's capacity to learn complex features. Consequently, the dense block is central to the DenseNet architecture's ability to handle difficult tasks effectively.
B. 
Transition Layer Function (Transition_Layer) 
Downsampling the feature maps and enabling the flow of information between dense blocks depend critically on the transition layer. First, the input x passes through a convolution block that halves the last dimension of the input, thereby reducing the number of filters. This preserves pertinent features while lowering the computational load. Then, x undergoes an average pooling operation (Avg Pool 1D) with a pool size of 2, a stride of 2, and padding set to "same". This pooling step down-samples the feature map, lowering its spatial dimensions while preserving significant information.
DenseNet121 is designed as follows:
The model begins with a 1D convolutional layer with 64 filters, a kernel size of 7, a stride of 2, and "same" padding applied to the input. A max pooling operation (Max Pool 1D) with a kernel size of 3, a stride of 2, and "same" padding follows to reduce the spatial dimensions of the feature map.
The model then works through several dense blocks with block sizes [6, 12, 24, 16]. Each block size defines a dense block whose output is passed through a matching transition layer, enabling downsampling and control of the number of feature maps.
After processing through all the dense blocks, the output of the last dense block undergoes a global average pooling operation (Global Average Pooling 1D), reducing the feature maps to a single vector per channel.
Finally, the output passes through a dense layer with a softmax activation function, which predicts the class probabilities for the input sample, completing the classification process.
The model takes 128 × 256 log-Mel spectrograms as input. Although spectrograms are naturally single-channel (grayscale), we replicate the single channel three times so that the input matches the format expected by the DenseNet121 backbone, which was pre-trained on three-channel RGB images from ImageNet and thus enables transfer learning. As the spectrogram passes through the DenseNet121 architecture, its spatial dimensions are progressively reduced, producing a 1024-channel feature map. This map, containing the extracted features, is then passed to an enhancement module for further refinement.
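As a minimal illustration of the channel replication and backbone output described above, the following sketch assumes a standard ImageNet-pretrained DenseNet121 from torchvision; the batch size and variable names are illustrative.

```python
import torch
import torchvision

# ImageNet-pretrained DenseNet121; `.features` returns the convolutional backbone.
backbone = torchvision.models.densenet121(weights="IMAGENET1K_V1").features

spec = torch.randn(8, 1, 128, 256)      # batch of single-channel 128 x 256 log-Mel spectrograms
spec_rgb = spec.repeat(1, 3, 1, 1)      # replicate the channel to match the RGB input format
features = backbone(spec_rgb)           # -> torch.Size([8, 1024, 4, 8]) feature map
print(features.shape)
```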
The architectural design of DenseNet121 is especially suited to datasets like CREMA-D, which involve complex emotion recognition from audio-visual data. Its dense connections reduce information loss across layers and help the network learn subtle patterns across facial expressions and vocal cues. The capacity for fine-grained feature retention and effective gradient flow makes DenseNet121 an excellent candidate for high-resolution, emotionally complex datasets such as CREMA-D.

3.3. Convolutional Block Attention Module (CBAM)

Including a dual attention mechanism, CBAM improves the feature maps produced by the DenseNet121 backbone. Two main elements comprise this module: the Channel Attention Module and the Spatial Attention Module [4]. CBAM operates sequentially on a feature map F ∈ ℝ^{C×H×W}.
Using global average pooling and max pooling, the Channel Attention Module identifies significant channels within the feature maps and passes the pooled descriptors through shared multi-layer perceptron (MLP) layers. This generates a 1D channel attention map M_c ∈ ℝ^{C×1×1}, which refines the input via element-wise multiplication:
F' = M_c(F) \otimes F
The Spatial Attention Module focuses on salient spatial regions by performing average and max pooling along the channel axis, followed by a convolution layer that generates spatial attention weights. This yields a 2D spatial attention map M_s ∈ ℝ^{1×H×W}, which is applied to the channel-refined feature map F' via element-wise multiplication:
F'' = M_s(F') \otimes F'
This process produces the final refined feature map F'', incorporating both channel-wise and spatial attention and enhancing performance for tasks such as ASER. The detailed computation of the attention maps is shown in Figure 3, with further explanations in Figure 4.
CBAM enhances a network’s representational power by sequentially applying channel and spatial attention mechanisms, allowing the model to focus on what and where to attend in feature maps. This selective emphasis enables the model to prioritize emotionally salient regions in both facial expressions and acoustic signals. When applied to tasks like emotion recognition on the CREMA-D dataset, CBAM refines intermediate feature maps by suppressing irrelevant information and amplifying critical cues—such as micro-expressions or tonal fluctuations. Its lightweight, plug-and-play nature allows for easy integration with backbones like DenseNet121, offering improved performance with minimal computational overhead.
A. 
Channel Attention Mechanisms: 
Woo et al. [4] developed the channel attention module to enhance the most important features of the input image by exploiting the inter-channel dependencies within the feature map. As each channel in the feature map functions as a detector [35], channel attention emphasizes the most informative channels. To efficiently compute this attention, we first reduce the spatial dimensions of the input feature map, commonly using average pooling to aggregate spatial information. This method is recommended by Zhou et al. [36] to capture target object regions effectively and has been incorporated by Hu et al. [37] into their attention mechanism for computing spatial statistics. Building on this, we argue that max-pooling can offer complementary insights into distinctive object features, further refining channel-specific attention. Therefore, we employ both average and max-pooled features simultaneously.
The process begins with the application of both average- and max-pooling to the feature map, resulting in two distinct spatial context descriptors: F^c_{avg} for average-pooled features and F^c_{max} for max-pooled features. These descriptors are then processed through a shared multi-layer perceptron (MLP) with a single hidden layer, yielding the channel attention map M_c ∈ ℝ^{C×1×1}. To optimize parameter efficiency, the hidden activation size is set to ℝ^{C/r×1×1}, where r represents the reduction ratio. After passing each descriptor through the shared network, the output vectors are combined using element-wise summation. The channel attention map is thus computed as follows:
M_c(F) = \sigma\big(\mathrm{MLP}(\mathrm{AvgPool}(F)) + \mathrm{MLP}(\mathrm{MaxPool}(F))\big) = \sigma\big(W_1(W_0(F^c_{avg})) + W_1(W_0(F^c_{max}))\big)
where σ is the sigmoid function, and W_0 ∈ ℝ^{C/r×C} and W_1 ∈ ℝ^{C×C/r} are the learned weight matrices shared across both inputs, with a ReLU activation applied after W_0.
B. 
Spatial Attention Mechanism: 
While channel attention focuses on the importance of the channels, the spatial attention mechanism, as introduced by Woo et al. [4], addresses the question of "where" to focus attention within the image. Spatial attention complements channel attention by identifying the most informative regions of the feature map. The process begins with average-pooling and max-pooling operations along the channel axis, as these operations effectively highlight key spatial areas. The resulting pooled features are concatenated to form an efficient spatial descriptor, which is subsequently passed through a convolutional layer to generate the spatial attention map M_s(F) ∈ ℝ^{H×W}, signaling regions where attention should be emphasized or suppressed.
The concatenated feature descriptor, consisting of the pooled features, is processed through a standard convolution operation with a 7 × 7 filter to produce the spatial attention map. This operation can be formalized as:
M_s(F) = \sigma\big(f^{7 \times 7}([\mathrm{AvgPool}(F); \mathrm{MaxPool}(F)])\big) = \sigma\big(f^{7 \times 7}([F^s_{avg}; F^s_{max}])\big)
where σ represents the sigmoid function and f^{7×7} denotes a convolution operation with a 7 × 7 kernel. This approach allows the network to focus on the most significant spatial regions of the input, thereby improving its ability to localize critical information.
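For concreteness, the following PyTorch sketch implements the channel and spatial attention operations defined by the two equations above; the reduction ratio r = 16 and the exact module layout are assumed defaults rather than the precise configuration used in this work.

```python
import torch
import torch.nn as nn

class CBAM(nn.Module):
    def __init__(self, channels, reduction=16, kernel_size=7):
        super().__init__()
        # Shared MLP for channel attention (W_0 followed by ReLU and W_1).
        self.mlp = nn.Sequential(
            nn.Linear(channels, channels // reduction),
            nn.ReLU(inplace=True),
            nn.Linear(channels // reduction, channels),
        )
        # 7x7 convolution producing the 2D spatial attention map.
        self.spatial_conv = nn.Conv2d(2, 1, kernel_size, padding=kernel_size // 2)

    def forward(self, x):                               # x: (B, C, H, W)
        b, c, _, _ = x.shape
        # Channel attention: M_c = sigmoid(MLP(AvgPool) + MLP(MaxPool)).
        avg = self.mlp(x.mean(dim=(2, 3)))              # (B, C)
        mx = self.mlp(x.amax(dim=(2, 3)))               # (B, C)
        m_c = torch.sigmoid(avg + mx).view(b, c, 1, 1)
        x = x * m_c                                     # F' = M_c(F) * F
        # Spatial attention: M_s = sigmoid(conv7x7([AvgPool; MaxPool])).
        pooled = torch.cat([x.mean(dim=1, keepdim=True),
                            x.amax(dim=1, keepdim=True)], dim=1)   # (B, 2, H, W)
        m_s = torch.sigmoid(self.spatial_conv(pooled))
        return x * m_s                                  # F'' = M_s(F') * F'
```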

3.4. CBAM-DenseNet121

The flowchart in Figure 5 depicts the architecture of our classification pipeline, which begins with the input spectrogram—a time-frequency representation of an audio signal. Features are extracted using DenseNet-121, a pre-trained convolutional neural network known for its dense connectivity, which ensures feature reuse and efficient gradient propagation.
The extracted features are subsequently refined through the use of the CBAM. CBAM integrates two distinct attention mechanisms: the channel attention mechanism, which emphasizes the most informative feature channels, and the spatial attention mechanism, which identifies and prioritizes the most significant regions within the feature map. Once enhanced, the refined feature map is forwarded to the classifier, which assigns a label to the input. By combining DenseNet-121 with CBAM, the model’s capacity to focus on the most pertinent audio features is significantly improved, resulting in more precise and accurate classifications.

3.5. Final Classifier

The final classifier of the architecture maps the refined feature representations to the target emotion classes. Global Adaptive Average Pooling first reduces the spatial dimensions of the feature maps to 1 × 1, aggregating the spatial information into a single vector. The features are then normalized by a BatchNorm layer, guaranteeing stable and effective convergence during training. A dropout layer with a probability of 0.5 is included as a regularization mechanism to reduce overfitting. A fully connected (FC) layer then maps the 1024-dimensional feature vector to the six output classes corresponding to the emotions happiness, sadness, anger, fear, disgust, and neutral. Finally, a softmax activation function translates the outputs into class probabilities, allowing the model to generate predictions based on the highest probability. This structure guarantees robust classification while maintaining computational efficiency.
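A minimal sketch of how the classification head described above can be assembled on top of the backbone and attention module (continuing the sketches from the previous subsections); the class name and exact assembly are illustrative, and the softmax is left to the loss function or inference step.

```python
import torch.nn as nn

class CBAMDenseNet121Classifier(nn.Module):
    def __init__(self, backbone, cbam, num_classes=6, feat_dim=1024):
        super().__init__()
        self.backbone = backbone          # DenseNet121 feature extractor
        self.cbam = cbam                  # attention refinement module
        self.head = nn.Sequential(
            nn.AdaptiveAvgPool2d(1),      # global adaptive average pooling -> (B, 1024, 1, 1)
            nn.Flatten(),
            nn.BatchNorm1d(feat_dim),
            nn.Dropout(p=0.5),
            nn.Linear(feat_dim, num_classes),   # logits for the six emotion classes
        )

    def forward(self, x):
        return self.head(self.cbam(self.backbone(x)))
```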
The proposed CBAM-DenseNet121 architecture was trained on the CREMA-D dataset using a carefully chosen set of hyperparameters and optimization techniques tailored to the challenges of the emotion recognition task. The training environment and associated methods are described in the next section.

4. Experiments and Results

Focusing on the key elements of its evaluation, we present the experimental setup and results of the proposed model in this section. We start by describing the CREMA-D dataset used for training and testing, then the evaluation metrics used to assess the model's performance. The section then covers the implementation, including the hyperparameters tuned during training, the data augmentation methods, and the hardware used. We also discuss some of the constraints of current ASER methods. We then present the performance of our model, compare it with state-of-the-art models, and highlight the contributions of the CBAM-DenseNet121 architecture. Finally, an ablation study assesses how important elements, such as dropout rates and the CBAM attention mechanism, affect model performance across several architectures.

4.1. Dataset: CREMA-D

For this work, we used the CREMA-D dataset [38], which is widely known in the field of emotion recognition for its diversity and multimodal representations of human emotional expression. The dataset contains high-quality speech recordings from professional actors who were assigned to read 12 predefined sentences intended to evoke six basic emotions: happiness, sadness, anger, fear, disgust, and neutral.
Comprising 7442 audio recordings from a varied cohort of 91 actors—48 male and 43 female speakers—the dataset reflects a diversity of ethnic groups and covers a broad age range from 20 to 74 years, ensuring variation in both speech qualities and emotional expression. Every audio file comes with an emotional label, which has been validated by crowd-sourced annotations, guaranteeing consistency and dependability. To provide a clear picture of the dataset’s composition, the distribution of samples across the six emotion classes is detailed in Table 1.
Focusing on emotion classification from speech, CREMA-D presents major benefits for our research. First, the excellent recording conditions guarantee that the audio samples were gathered with minimal noise in a controlled environment, producing accurate and clean data for processing. Second, the emotional variety of the dataset, covering six different emotions, allows a robust comparison across a broad spectrum of human experience. Third, the naturalness of the speech is a major advantage, since the actors were directed to produce emotions organically, which makes the dataset well suited to practical uses. These characteristics make CREMA-D an excellent resource for advancing research in ASER.
Speech signals were converted into Mel spectrogram representations to pre-process the dataset for our work. This time-frequency representation, extracted using the Librosa library [39], allows for capturing frequency-based and temporal features, which are crucial for understanding the underlying emotional content. Spectrograms capture rich acoustic patterns, such as pitch intensity, rhythm, and formant structures, all of which effectively differentiate emotions.
By leveraging CREMA-D, we maintained a realistic and challenging experimental setting, as the dataset's diversity and rich emotional annotations align well with our task of developing an audio-based emotion recognition system.
As shown in Table 2, the dataset was divided into three subsets: a training set (70% of the dataset), a validation set (20%), and a testing set (10%). This splitting methodology ensures a balanced distribution of emotions across the subsets while maintaining a clear separation between training and evaluation, mitigating the risk of data leakage and promoting generalization and robust performance on new data.
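One way to realize such a stratified 70/20/10 split is sketched below, assuming scikit-learn is available; `paths`, `labels`, and the random seed are illustrative placeholders, since the paper does not specify the splitting tool.

```python
from sklearn.model_selection import train_test_split

# First carve off 30% of the data, stratified by emotion label.
train_p, rest_p, train_y, rest_y = train_test_split(
    paths, labels, test_size=0.30, stratify=labels, random_state=42)
# Split the remaining 30% into validation (20% overall) and test (10% overall).
val_p, test_p, val_y, test_y = train_test_split(
    rest_p, rest_y, test_size=1/3, stratify=rest_y, random_state=42)
```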

4.2. Evaluation Metrics

In evaluating the performance of the proposed model for ASER, several key metrics were employed to provide a comprehensive assessment of its classification accuracy and emotional recognition capabilities. These metrics, which include accuracy, precision, recall, UAR, and the F1 score, are particularly relevant for evaluating the effectiveness of ASER models, considering both overall performance and emotion-specific classification results.
Accuracy measures the overall percentage of correct predictions across all classes, but it can be misleading in the presence of class imbalance. Consequently, complementary metrics such as the F1 score and UAR are also considered for a more balanced assessment [16,38].
Accuracy = \frac{\text{Number of Correct Predictions}}{\text{Total Number of Predictions}}
Precision measures the proportion of positive predictions that are correct for each emotion class, indicating how well the model avoids false alarms. As with accuracy, it is interpreted alongside complementary metrics such as the F1 score and UAR for a balanced assessment under class imbalance [16,38].
Precision = \frac{\text{True Positives}}{\text{True Positives} + \text{False Positives}}
Recall and UAR test whether the model can accurately identify every real example of each emotion. High recall in ASER guarantees consistent recognition of emotional expressions without ignoring pertinent cases [6].
Recall = \frac{\text{True Positives}}{\text{True Positives} + \text{False Negatives}}
Ensuring that all emotional events are identified depends on recall, hence lowering the possibility of ignoring important emotional signals. High recall in ASER improves the dependability and efficacy of the model in real-world, variable speech environments [16]. The metric “UAR” provides a fair assessment by averaging recall over all emotion classes, independent of their frequency. In ASER especially, it helps to avoid bias toward dominant emotions and guarantee equitable evaluation over the whole emotional spectrum [38].
UAR = \frac{1}{N} \sum_{i=1}^{N} \frac{\text{True Positives for Class } i}{\text{True Positives for Class } i + \text{False Negatives for Class } i}
where N is the number of emotion classes.
The F1 score balances precision and recall, making it a key metric for evaluating ASER models, especially under class imbalance. The average F1 score across emotions (F1 Score Total) reflects the model’s overall ability to detect and correctly classify emotional expressions.
F1\ Score = 2 \times \frac{\text{Precision} \times \text{Recall}}{\text{Precision} + \text{Recall}}
By employing these metrics, we ensure a comprehensive evaluation of the model’s performance, highlighting both its overall classification accuracy and its effectiveness in handling emotion-specific features. Each of these metrics was carefully chosen to capture the nuances of ASER, especially considering the challenges posed by speaker variability and the complexity of emotional expression in the CREMA-D dataset.
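These metrics can be computed, for example, with scikit-learn, where UAR corresponds to macro-averaged recall; whether the precision and F1 totals reported here are macro- or weighted-averaged is an assumption of this sketch. `y_true` and `y_pred` are assumed arrays of ground-truth and predicted labels.

```python
from sklearn.metrics import (accuracy_score, precision_score,
                             recall_score, f1_score)

accuracy = accuracy_score(y_true, y_pred)
uar = recall_score(y_true, y_pred, average="macro")          # Unweighted Average Recall
precision = precision_score(y_true, y_pred, average="macro")
f1 = f1_score(y_true, y_pred, average="macro")
```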

4.3. Implementation Details and Material

As detailed in Table 3, training was conducted on a high-performance workstation featuring an Intel i7 10th generation CPU, 64GB RAM, and an NVIDIA RTX 3060 GPU paired with CUDA for accelerated computing. The following table shows more details on the working material.

4.4. Data Augmentation

Data augmentation techniques were employed to enhance the model's generalization capabilities and mitigate the risk of overfitting. During training, spectrogram images were subjected to random color jittering, which included adjustments to brightness, contrast, saturation, and hue, applied with a 50% probability. These augmented images were then resized to a fixed dimension of 128 × 256 and normalized using dataset-specific mean and standard deviation values (mean = [0.488, 0.460, 0.486], std = [0.230, 0.244, 0.265]). To guarantee the integrity of the assessment process, validation spectrograms were resized and normalized without any further augmentation. This approach lets the model learn robust features from varied training data while being evaluated consistently on unaltered validation data.
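A sketch of this augmentation pipeline with torchvision transforms follows; the jitter strengths are assumed values, since only the 50% application probability and the normalization statistics are specified above.

```python
from torchvision import transforms

NORM = dict(mean=[0.488, 0.460, 0.486], std=[0.230, 0.244, 0.265])

train_transform = transforms.Compose([
    # Random color jitter applied with 50% probability (strengths are assumptions).
    transforms.RandomApply(
        [transforms.ColorJitter(brightness=0.2, contrast=0.2,
                                saturation=0.2, hue=0.1)], p=0.5),
    transforms.Resize((128, 256)),
    transforms.ToTensor(),
    transforms.Normalize(**NORM),
])

val_transform = transforms.Compose([     # validation: resize and normalize only
    transforms.Resize((128, 256)),
    transforms.ToTensor(),
    transforms.Normalize(**NORM),
])
```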

4.5. Hyperparameters

Careful design of the training process guarantees effective convergence and maximizes the performance of the model. We used the AdamW optimizer, which combines Adam optimization with decoupled weight decay regularization and is well suited to training deep neural networks because it dynamically adjusts the learning rate for every parameter. The learning rate (LR) was initialized at 1 × 10⁻³, and a weight decay of 1 × 10⁻⁴ was applied to avoid overfitting. To further improve training stability, the learning rate was varied cyclically using a cosine annealing scheduler; this approach helps the model avoid local minima and facilitates smoother convergence. Training and validation were carried out with a batch size of 32 samples, balancing computational efficiency with gradient stability. The model was trained for 100 epochs, providing sufficient iterations for convergence while keeping the training process computationally feasible. These configurations collectively contribute to the model's robust training and generalization capabilities.
We used the cross-entropy loss, suitable for multi-class classification problems, as the objective function. It calculates the divergence between the predicted probability distribution (via Softmax) and the ground truth.
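The training configuration described above can be set up roughly as follows; `model`, `train_loader`, and the per-epoch scheduler step are assumptions of this sketch rather than details taken from the paper.

```python
import torch

optimizer = torch.optim.AdamW(model.parameters(), lr=1e-3, weight_decay=1e-4)
scheduler = torch.optim.lr_scheduler.CosineAnnealingLR(optimizer, T_max=100)
criterion = torch.nn.CrossEntropyLoss()   # softmax + negative log-likelihood

for epoch in range(100):
    model.train()
    for spectrograms, targets in train_loader:
        optimizer.zero_grad()
        loss = criterion(model(spectrograms), targets)
        loss.backward()
        optimizer.step()
    scheduler.step()                      # cosine-annealed learning rate update per epoch
```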

4.6. Results

As shown in Figure 6, from epochs 1 to 100, the model demonstrated significant improvement in training and validation accuracies. The training accuracy rapidly increased and reached almost 100% (99.98%), reflecting the model’s strong performance on the training dataset. In contrast, the validation accuracy plateaued around 67.49%. The model likely avoids severe overfitting, but the mid-training instability warrants further investigation into data quality.
The loss curves provide additional insight into this behavior. The training loss steadily decreased, indicating effective learning during training. The validation loss, however, plateaued after epoch 40 with slight fluctuations, which may reflect sensitivity to noise in the validation set or challenges in optimizing generalization performance. The gap between training and validation accuracy, while not extreme, points to room for improving generalization; data augmentation, regularization (e.g., dropout or L2), or fine-tuning of hyperparameters could help offset this effect. Moreover, the stability of the validation accuracy after around epoch 40 suggests that early stopping could be used to cut unnecessary computational cost. Overall, these curves show the model's learning strengths as well as areas where its ability to generalize to unseen data could be improved.

4.6.1. Model Architecture and Performance Analysis

We present a thorough analysis of the neural network architecture in Table 4 to evaluate the structure and complexity of the proposed model. The table lists the layer configurations, output shapes, and trainable parameter counts at every stage of the model. Through dense blocks and transition layers, the architecture progressively extracts and refines features, after which CBAM improves the feature representation. The final classifier layer produces a 6-dimensional output corresponding to the target emotion classes. With 7,092,138 trainable parameters, the model balances complexity and performance, making it appropriate for the ASER task.

4.6.2. Test Set Performance

The quantitative evaluation of the proposed CBAM-DenseNet121 model on the CREMA-D spectrograms reveals its performance across six emotion classes: neutral (NEU), angry (ANG), sad (SAD), happy (HAP), disgust (DIS), and fear (FEA). The confusion matrix in Figure 7 summarizes this evaluation together with key performance measures for multi-class classification, including recall, precision, F1 score, and accuracy.
We focus on the F1 score and the Unweighted Average Recall (UAR) as the main performance indicators to evaluate the model as thoroughly as possible, especially since the dataset has a small class imbalance. The model achieves an F1 score of 71.25%, indicating a good balance between precision and recall, and a UAR of 71.01%, showing that it performs consistently across all six emotion classes with each given equal weight. Alongside these metrics, the model attains an overall test accuracy of 71.26% and a precision of 71.30%. These results show that the model not only makes accurate predictions but also balances finding true positives with avoiding false alarms.
Despite the overall strong performance metrics, a detailed analysis of the confusion matrix reveals areas of strength and weakness. The model excels at classifying emotions such as angry (ANG) and happy (HAP), which achieve high correct predictions (104 and 95 samples, respectively), and neutral (NEU) emotions, which see minimal misclassifications. However, challenges arise with emotions like fear (FEA) and disgust (DIS), which exhibit significant overlaps with others in the feature space. For instance, fear is frequently misclassified as sad (21 samples) or happy (12 samples), while disgust shows confusion with sad (16 samples). These errors reduce class-level accuracy for fear and disgust, impacting the overall performance.
The CBAM-DenseNet121 model shows good quantitative results, and the high UAR and F1 score show that it can classify things in a balanced way. While the confusion matrix indicates strong performance across most emotion classes, the errors between overlapping categories—particularly fear and sad—highlight areas for improvement. These results establish a competitive baseline for emotion classification on CREMA-D spectrograms, but further refinements are needed to address confusion in ambiguous emotional states.

4.6.3. Computational Efficiency Analysis

To provide a quantitative justification for our claims of computational efficiency, we analyze the number of trainable parameters, which is a key indicator of a model’s complexity and its computational and memory requirements.
As detailed in our state-of-the-art comparison in Table 5, our proposed model is exceptionally lightweight, containing only ~7.1 M trainable parameters. This represents a significant reduction in complexity compared to other high-performing models like the Vision Transformer (~86 M parameters) or SepTr (~15 M parameters). This quantitative difference in model size is the primary driver of our model’s efficiency and directly supports our central contribution of balancing high performance with practical applicability.
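The trainable-parameter figure can be checked directly from an instantiated network; in the sketch below, `model` is assumed to be the CBAM-DenseNet121 network described above.

```python
# Count only parameters that are updated during training.
n_trainable = sum(p.numel() for p in model.parameters() if p.requires_grad)
print(f"Trainable parameters: {n_trainable / 1e6:.1f} M")
```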

4.7. Comparison with State-of-the-Art (SOTA)

To properly contextualize our model’s performance, we compare it against both established baselines and recent state-of-the-art (SOTA) models on the CREMA-D dataset. We focus not just on accuracy but also on model complexity, a critical factor for practical application. The results are summarized in Table 5.
As shown in the table, our proposed CBAM-DenseNet121 model achieves a test accuracy of 71.26% with only ~7.1 M trainable parameters. This performance is highly competitive, modestly surpassing strong transformer-based models like LeRaC + SepTr (70.95%). While LeRaC + SepTr uses dynamic learning rate adjustments and axis-wise attention, our model’s unique hybrid architecture—combining the deep feature extraction of DenseNet121 with the refinement of CBAM—achieves this result with approximately half the number of parameters. This demonstrates a significant advantage in computational efficiency.
Our model’s strength is further highlighted when compared to other approaches. Models like ResNet-18 + SPEL (68.12%) and the much larger Vision Transformer (ViT) at 86M parameters (67.81%) fall short. Their performance is likely limited by either shallower architectures (ResNet-18) that cannot capture nuanced spectrogram patterns, or in the case of ViT, a lack of the strong convolutional priors needed for fine-grained, localized feature analysis. Our model’s densely-connected architecture is explicitly designed to overcome these issues.
Finally, it is important to contextualize our work against the latest SOTA models, such as the 82% accuracy reported by Lin et al. [30]. This impressive result was achieved using a complex, semi-supervised training regimen focused on a different primary goal (improving fairness). Therefore, our model occupies a crucial position: for a standard, supervised, audio-only task, it provides a highly practical and scalable solution that delivers a state-of-the-art balance between accuracy and computational efficiency.

4.7.1. Key Contributions of CBAM-DenseNet121

Several architectural choices make CBAM-DenseNet121 stand out from other models and explain its strong performance. First, adding CBAM gives the model a principled way to focus on the most important features in the spectrogram, in both the channel and spatial domains. This ability is crucial for analyzing complicated emotional patterns in audio data, where both local and global dependencies matter. Unlike pure attention-based models such as SepTr and ViT, which mainly model global feature interactions and may miss local details that are important for ASER tasks, CBAM can dynamically shift its focus to relevant parts of the spectrogram, making the features easier to discriminate. Unlike these pure attention mechanisms, CBAM combines convolution-based feature extraction (through DenseNet121) with attention refinement. This balanced approach captures fine-grained local features, which CNNs are good at, while also modelling long-range dependencies, which attention mechanisms are good at, so CBAM-DenseNet121 processes spectrograms more effectively than models that rely on convolutions or attention alone.
DenseNet121 is the central component of our model. Its dense connections make it easier to reuse features, help with the vanishing gradient problem, and support deep feature extraction, which is important for picking out small emotional cues in spectrograms. DenseNet121 outperforms shallower architectures like ResNet-18, which reaches an accuracy of only 68.12% and has trouble capturing complex spectrogram patterns, because its deeper layers and dense connectivity allow richer, hierarchical feature extraction.
In addition, techniques such as Global Adaptive Average Pooling, BatchNorm, and Dropout make the final classifier of CBAM-DenseNet121 more general and less prone to overfitting. Although there is a noticeable gap between the training accuracy (99.98%) and the validation accuracy (67.49%), these regularization techniques help the highly complex model retain reasonable generalization to new data. A class-wise confusion analysis shows even more clearly what the model does well and where it struggles: CBAM-DenseNet121 is good at telling apart emotions that are acoustically distinct, such as "happy" and "angry", but has trouble with emotions that sound similar, such as "fear" and "neutral". This limitation is not unique to our model but is a common challenge across ASER tasks due to the subtle acoustic differences between such emotions. Still, the architectural ideas and analysis provide useful insight into the model's strengths and weaknesses, especially when dealing with closely related emotional categories.

4.7.2. Comparison with Related Techniques

CBAM-DenseNet121 demonstrates competitive performance compared with several state-of-the-art ASER techniques. For instance, LeRaC + SepTr, which combines the Learning Rate Curriculum with the Separable Transformer, achieves an accuracy of 70.95%, closely rivaling CBAM-DenseNet121's 71.26%. However, CBAM-DenseNet121 attains this performance without the computational overhead associated with training-specific methodologies like LeRaC. The standalone SepTr model, which leverages axis-wise transformer attention, achieves an accuracy of 70.47%. While effective, it lacks the fine-grained spatial and channel prioritization offered by CBAM, which enhances feature refinement. ResNet-18 + SPEL, despite incorporating SPEL to improve model collaboration and pseudo-labeling, achieves a lower accuracy of 68.12%. This performance gap can be attributed to ResNet-18's shallower architecture, which limits its ability to capture nuanced spectrogram features compared with the deeper and more feature-rich DenseNet121 backbone. Similarly, the Vision Transformer (ViT), adapted for audio spectrogram processing, achieves an accuracy of 67.81%. While ViT excels at capturing global dependencies, it lacks the convolutional priors necessary for processing localized time-frequency patterns in spectrograms, a strength inherent in CBAM-DenseNet121's hybrid design. Finally, SpectoResNet, an improved form of ResNet specifically designed for ASER, achieves an accuracy of 65.20%. SpectoResNet demonstrates the robustness of deep CNNs in extracting complex patterns from spectrograms, even if its performance is somewhat lower than that of transformer-based and attention-enhanced models. To replicate real-world variability and enhance generalization, it uses advanced data augmentation methods, including noise addition and pitch modification, and its architecture highlights the efficiency of convolutional residual connections in analyzing emotional audio features, especially on datasets like CREMA-D. Taken together, these comparisons show how well CBAM-DenseNet121 balances accuracy, computational efficiency, and feature refinement, establishing its competitiveness in ASER research.

4.7.3. Architectural Trade-offs and Practical Benefits

SepTr and ViT are two transformer-based models that have attracted considerable interest for their capacity to capture long-range dependencies and global context. Usually, though, these benefits come at the expense of higher memory use and computational complexity, especially when processing high-dimensional spectrogram data. By combining the convolutional efficiency of DenseNet121 with the lightweight yet effective CBAM, CBAM-DenseNet121 instead provides a balanced approach. This hybrid architecture offers several practical advantages. First, it outperforms simple transformer-based or ensemble approaches with a higher accuracy of 71.26%. Second, it keeps training complexity reasonable, which makes it better suited to practical uses where computational resources are often limited. By prioritizing both performance and efficiency, CBAM-DenseNet121 offers a convincing solution for tasks that require strong yet resource-conscious models.

4.8. Ablation Study

In this part, we investigate the effect of architectural changes and important components on the performance of the model by ablation. This work aims to methodically assess how changes in dropout rates and the integration of advanced blocks, such as the CBAM attention mechanism, influence the accuracy and robustness of several network architectures, including DenseNet121, ResNet, and MobileNet. By isolating these factors and analyzing their individual and combined impacts, we aim to identify the most effective configurations for optimizing model performance in the given task. The subsequent subsections will present the results of each individual experiment, followed by a discussion on the significance of the findings.

4.8.1. Evaluating the Impact of CBAM on Model Performance Across Architectures

As shown in Table 6 and Figure 8, the results clearly indicate that every modification played a significant role in enhancing the overall performance of the DenseNet121 model.
Starting with the baseline performance, the MOBILENETv2 model achieved a reasonable accuracy of 69.92% and UAR of 69.72%, but struggled in recognizing facial expressions, particularly in Class 5 (FEA), where the recall was 55.96%. The recall values for the other classes were relatively consistent, but the model’s performance in recognizing Class 5 (FEA) was particularly weak. The RESNET50 model, on the other hand, performed much worse, with an accuracy of 55.21% and UAR of 55.40%. The recall for Class 2 (SAD) was 38.13%, and for Class 4 (DIS), it was 39.06%, suggesting a major deficiency in distinguishing these emotions, particularly those associated with negative expressions like sadness and disgust.
When fine-tuned, the DenseNet121 model demonstrated better performance, with an accuracy of 69.25%, an F1 score of 69.05%, and a UAR of 69.34%, a substantial improvement over both MobileNetV2 and ResNet50. The recall for Class 4 (DIS) improved to 52.34%, while Class 5 (FEA) showed a modest improvement to 59.95%. Nevertheless, the recall in these categories still left room for improvement, especially for the more subtle and complex emotional categories.
Upon integrating CBAM into the models, we observed a variety of effects. An interesting outcome of the ablation study was the divergent effect of CBAM on MobileNetV2. While it boosted other architectures, it slightly diminished MobileNetV2’s accuracy from 69.92% to 67.91%. We attribute this phenomenon to the intrinsic design philosophy of MobileNetV2. The model’s architecture is predicated on extreme efficiency, using linear bottlenecks within its inverted residual blocks to maintain a compressed feature representation. It is plausible that the explicit attention mechanism of CBAM, with its own pooling and MLP computations, interferes with this carefully tuned information pathway. The attention module may also be functionally redundant; the expansion-projection design of MobileNetV2’s core blocks can be interpreted as an inherent form of feature gating, meaning the addition of CBAM offers little benefit and instead risks disrupting a process that is already highly optimized. In contrast, the DenseNet121 model, when fine-tuned with CBAM, demonstrated impressive performance gains. This version of DenseNet121 achieved an accuracy of 71.26%, an F1 score of 71.25%, and a UAR of 71.01%, marking a substantial improvement over the baseline DenseNet121 (Finetuned). The recall for Class 4 (DIS) improved significantly to 66.41%, and Class 5 (FEA) saw an improvement to 54.78%, underscoring the positive impact of CBAM in focusing the model’s attention on the relevant features of these more challenging emotional categories.
The performance of DenseNet121 fine-tuned with CBAM is especially notable because, unlike MobileNetV2 or ResNet50, DenseNet121 benefits markedly from the attention module. This underscores the value of attention mechanisms in improving the model’s ability to extract discriminative features, particularly in difficult tasks such as emotion classification.
According to the analysis in Table 6 and Figure 8, DenseNet121 fine-tuned with CBAM outperforms all other architectures in overall accuracy and recall. The recall improvements, especially for Class 4 (DIS) and Class 5 (FEA), are substantial, making this model well suited to real-world emotion classification tasks. Where the baseline model previously struggled, DenseNet121 fine-tuned with CBAM can concentrate on the features that support recognition of subtle emotional expressions, which justifies its selection as our proposed model. Unlike the other backbones, DenseNet121 clearly shows that CBAM is essential to its improved performance, confirming the value of including attention mechanisms in deep learning architectures.

4.8.2. Impact of Dropout Rate on Model Performance

The effect of several dropout rates on the performance of DenseNet121 fine-tuned with CBAM is shown in Table 7. Dropout 0.5 clearly produces the best overall performance, achieving the highest test accuracy (71.26%), mean UAR (71.01%), total precision (71.30%), and total F1 score (71.25%).
This configuration strikes the best balance between avoiding overfitting and retaining the discriminative features the model needs. Performance generally improves as the dropout rate increases from 0.1 to 0.5, although the trend is not strictly monotonic: dropout 0.3 and 0.4 improve on 0.1 and 0.2, yet 0.4 falls slightly below 0.3. Dropout 0.5 stands out as the best configuration, providing the highest accuracy together with a well-balanced trade-off between recall and precision. These findings underline the importance of tuning dropout as a regularization method, especially in tasks such as emotion classification, where small differences matter.
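As a minimal sketch of how such a dropout sweep can be set up, the snippet below rebuilds a DenseNet121 classifier head with each candidate rate. For brevity it omits the CBAM block and the training/evaluation loops, and the head layout (Dropout followed by a six-way linear layer) is an assumption consistent with the sketch in Section 4.7.3.

```python
import torch.nn as nn
from torchvision.models import densenet121

def build_variant(dropout: float, num_classes: int = 6) -> nn.Module:
    """DenseNet121 whose ImageNet classifier is replaced by Dropout(p) + a 6-way linear head."""
    model = densenet121(weights="IMAGENET1K_V1")
    in_features = model.classifier.in_features          # 1024 for DenseNet121
    model.classifier = nn.Sequential(
        nn.Dropout(p=dropout),
        nn.Linear(in_features, num_classes),
    )
    return model

for p in (0.1, 0.2, 0.3, 0.4, 0.5):
    variant = build_variant(p)
    # train(variant, ...) and evaluate(variant, ...) stand in for the usual fine-tuning
    # and metric-computation loops; only the regularization of the head changes between runs.
    n_params = sum(q.numel() for q in variant.parameters())
    print(f"dropout={p}: {n_params:,} parameters")
```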

5. Discussion

The primary contribution of this work is the development and validation of a computationally efficient deep learning architecture that achieves a competitive balance between accuracy and model complexity for speech emotion recognition. While many state-of-the-art models push for marginal gains in accuracy at the cost of significant computational overhead, our proposed CBAM-DenseNet121 framework demonstrates that a carefully synthesized hybrid model can rival more complex systems while remaining practical for real-world applications. The following discussion analyzes the architectural strengths that enable this balance, contextualizes our performance against recent findings, and addresses the limitations of our study.
The CBAM-DenseNet121 architecture is notably effective at extracting and prioritizing relevant features in audio spectrograms. Integrating CBAM helps the model handle the complexity of emotional audio data by concentrating on salient spatial and channel-level details, while the DenseNet121 backbone reuses features efficiently, preventing information loss and improving generalization. By combining convolution-driven feature extraction with attention-based refinement, these architectural choices enable CBAM-DenseNet121 to outperform notable models such as ResNet-18 + SPEL. Furthermore, the model’s higher accuracy (71.26%) on the CREMA-D test set shows its capacity to handle complex emotion recognition tasks, surpassing related approaches including LeRaC + SepTr and SepTr. Its manageable computational complexity and practical scalability make it an attractive option for real-world use, particularly where computational resources are limited.
Despite its advantages, the CBAM-DenseNet121 model has certain limitations. First, although it achieves a competitive test accuracy of 71.26%, there is still room for improvement, especially in handling small inter-class differences. This difficulty is highlighted by misclassifications between emotions with acoustically similar characteristics, such as “fear” and “neutral.” The confusion matrix analysis reveals that the model struggles where emotional boundaries are less clear, underscoring the need for further refinement of feature extraction and classification. Second, the notable discrepancy between the training accuracy (99.98%) and the validation/test performance raises concerns about overfitting. This disparity implies that the model might be too specialized to the training data; further investigation is needed to ensure that the training set sufficiently reflects the variety of emotional spectrogram patterns. Furthermore, although our model improves on several baselines, a direct statistical significance test was not performed because the prediction outputs of the compared models were unavailable; we acknowledge this as a limitation of our comparative analysis. Finally, the performance of the model depends heavily on the quality of spectrogram preprocessing. Variations or inconsistencies in the spectrogram-generation process could compromise its robustness, highlighting the need for consistent, high-quality preprocessing pipelines. Addressing these constraints will improve the generalization and practical relevance of the model in real environments.
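As an illustration of such a pipeline, the sketch below computes a normalized log-Mel spectrogram with librosa [39]. The sampling rate, FFT size, hop length, 128 Mel bands, and min-max normalization are plausible example settings rather than the exact parameters of our preprocessing, and the CREMA-D filename is only a placeholder.

```python
import numpy as np
import librosa

def log_mel_spectrogram(path: str, sr: int = 16000, n_mels: int = 128,
                        n_fft: int = 1024, hop_length: int = 256) -> np.ndarray:
    """Load an audio clip and return a normalized log-Mel spectrogram (n_mels x frames)."""
    y, sr = librosa.load(path, sr=sr)                    # resample to a fixed rate
    mel = librosa.feature.melspectrogram(y=y, sr=sr, n_fft=n_fft,
                                         hop_length=hop_length, n_mels=n_mels)
    log_mel = librosa.power_to_db(mel, ref=np.max)       # log compression
    # Per-utterance min-max normalization to [0, 1]; the exact normalization scheme
    # used in this work may differ, so this step is only indicative.
    log_mel = (log_mel - log_mel.min()) / (log_mel.max() - log_mel.min() + 1e-8)
    return log_mel

spec = log_mel_spectrogram("1001_DFA_ANG_XX.wav")        # hypothetical CREMA-D filename
print(spec.shape)                                        # (128, frames); frames depend on clip length
```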
Several directions could address these constraints and further improve the CBAM-DenseNet121 model. First, training on broader and more varied datasets could reduce overfitting and increase generalizability; expanding beyond CREMA-D would expose the model to a wider spectrum of emotional expressions and mitigate the dataset’s limited scale and its coverage of subtle, less frequent emotional nuances. Second, advanced data augmentation techniques, including noise injection, pitch-shifting, and time-stretching, could improve the model’s robustness to real-world variability and its performance in noisy environments (see the augmentation sketch below). Third, hybrid architectures such as audio–text fusion models, which merge audio-based characteristics with semantic cues from textual transcripts, may further improve emotion recognition; investigating lightweight attention modules or other efficient designs could also help balance model complexity against accuracy. Fourth, refining the attention mechanisms inside CBAM, for example by adding temporal or multi-scale attention, may reduce the observed misclassification patterns and improve per-class performance. Finally, more aggressive regularization, including early stopping, stronger data augmentation, or higher dropout, could bring training and test performance into closer alignment. Together, these directions aim to resolve the present limitations and push the boundaries of speech emotion recognition research.
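The sketch below illustrates the waveform-level augmentations mentioned above, using librosa [39] and NumPy; the noise level, pitch step, and stretch factor are arbitrary example values rather than settings validated in this study, and the input filename is a placeholder.

```python
import numpy as np
import librosa

rng = np.random.default_rng(0)

def add_noise(y: np.ndarray, noise_level: float = 0.005) -> np.ndarray:
    """Inject Gaussian noise scaled relative to the signal amplitude."""
    return y + noise_level * rng.standard_normal(len(y)) * np.abs(y).max()

def pitch_shift(y: np.ndarray, sr: int, n_steps: float = 2.0) -> np.ndarray:
    """Shift pitch by n_steps semitones while keeping duration fixed."""
    return librosa.effects.pitch_shift(y, sr=sr, n_steps=n_steps)

def time_stretch(y: np.ndarray, rate: float = 1.1) -> np.ndarray:
    """Speed the clip up (rate > 1) or slow it down (rate < 1) without changing pitch."""
    return librosa.effects.time_stretch(y, rate=rate)

y, sr = librosa.load("1001_DFA_ANG_XX.wav", sr=16000)    # hypothetical CREMA-D filename
augmented = [add_noise(y), pitch_shift(y, sr), time_stretch(y)]
print([len(a) for a in augmented])                       # time stretching changes the clip length
```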

6. Conclusions

This work presents a new ASER framework that combines DenseNet121 with CBAM. The integration improves feature reuse and, through attention, highlights emotionally relevant cues. On the CREMA-D dataset, the model achieved strong results, underscored by metrics that are robust to class imbalance: an Unweighted Average Recall (UAR) of 71.01% and an F1 score of 71.25%, demonstrating balanced performance across all emotions, supported by an overall test accuracy of 71.26%. These results highlight how effectively merging convolutional networks with attention mechanisms strengthens ASER capability.
Because DenseNet121’s dense connections allow the learning of subtle emotional patterns while minimizing information loss, its architecture is particularly suited to sophisticated emotion recognition. By emphasizing significant features, CBAM improves the representational power of the model and enables efficient identification of emotionally salient regions. The CBAM-DenseNet121 framework lays a strong basis for more flexible affective computing systems and marks significant progress in ASER by balancing classification performance and efficiency.
Despite these advantages, the model shows signs of overfitting and has difficulty distinguishing between similar emotions, such as fear and sadness, which emphasizes the importance of more varied training data and better feature strategies. Consistent performance also depends on careful, uniform preprocessing. Future research could include cross-dataset testing to improve generalizability and the integration of multimodal data, such as visual or physiological signals, to capture more complex emotional cues.

Author Contributions

Conceptualization, N.T. and I.B.; formal analysis, Z.S.K. and O.O.; investigation, D.E.B.; writing—original draft, Z.S.K.; writing—review and editing, E.I.H. and K.A.; project administration, K.A.; funding acquisition, O.O.; supervision, N.T. and I.B. All authors have read and agreed to the published version of the manuscript.

Funding

This research received no external funding.

Data Availability Statement

The experiments were conducted using the CREMA-D dataset, which is a publicly available resource.

Acknowledgments

The researchers would like to thank the Deanship of Graduate Studies and Scientific Research at Qassim University for financial support (QU-APC-2025).

Conflicts of Interest

The authors declare no conflicts of interest.

References

  1. Betka, A.; Ferhat, Z.; Barka, R.; Boutiba, S.; Kahhoul, Z.; Lakhdar, T.; Abdelali, A.; Dahmani, H. On Enhancing Fine-Tuning for Pre-trained Language Models. In Proceedings of ArabicNLP 2023; Sawaf, H., El-Beltagy, S., Zaghouani, W., Magdy, W., Abdelali, A., Tomeh, N., Abu Farha, I., Habash, N., Khalifa, S., Keleg, A., et al., Eds.; Hybrid: Singapore, 2023; pp. 405–410. [Google Scholar] [CrossRef]
  2. Ferhat, Z.; Betka, A.; Barka, R.; Kahhoul, Z.; Boutiba, S.; Tiar, M.; Dahmani, H.; Abdelali, A. Functional Text Dimensions for Arabic Text Classification. In Proceedings of the Second Arabic Natural Language Processing Conference; Habash, N., Bouamor, H., Eskander, R., Tomeh, N., Abu Farha, I., Abdelali, A., Touileb, S., Hamed, I., Onaizan, Y., Alhafni, B., et al., Eds.; Association for Computational Linguistics: Bangkok, Thailand, 2024; pp. 352–360. [Google Scholar] [CrossRef]
  3. Schuller, D.M.; Schuller, B.W. A review on five recent and near-future developments in computational processing of emotion in the human voice. Emot. Rev. 2021, 13, 44–50. [Google Scholar] [CrossRef]
  4. Woo, S.; Park, J.; Lee, J.Y.; Kweon, I.S. Cbam: Convolutional block attention module. In Proceedings of the European Conference on Computer Vision (ECCV), Munich, Germany, 8–14 September 2018; pp. 3–19. [Google Scholar]
  5. Waheed, H.; Hassan, S.U.; Nawaz, R.; Aljohani, N.; Chen, G.; Gasevic, D. Early prediction of learners at risk in self-paced education: A neural network approach. Expert Syst. Appl. 2022, 213, 118868. [Google Scholar] [CrossRef]
  6. Shah Fahad, M.; Ranjan, A.; Yadav, J.; Deepak, A. A survey of speech emotion recognition in natural environment. Digit. Signal Process. 2021, 110, 102951. [Google Scholar] [CrossRef]
  7. Davis, S.; Mermelstein, P. Comparison of parametric representations for monosyllabic word recognition in continuously spoken sentences. IEEE Trans. Acoust. Speech Signal Process. 1980, 28, 357–366. [Google Scholar] [CrossRef]
  8. Krizhevsky, A.; Sutskever, I.; Hinton, G.E. Imagenet classification with deep convolutional neural networks. Adv. Neural Inf. Process. Syst. 2012, 25. Available online: https://proceedings.neurips.cc/paper/2012/hash/c399862d3b9d6b76c8436e924a68c45b-Abstract.html (accessed on 28 August 2025). [CrossRef]
  9. Simonyan, K. Very deep convolutional networks for large-scale image recognition. arXiv 2014, arXiv:1409.1556. [Google Scholar]
  10. He, K.; Zhang, X.; Ren, S.; Sun, J. Deep residual learning for image recognition. In Proceedings of the IEEE Conference on Computer Vision And pattern Recognition, Las Vegas, NV, USA, 27–30 June 2016; pp. 770–778. [Google Scholar]
  11. Huang, G.; Liu, Z.; Van Der Maaten, L.; Weinberger, K.Q. Densely connected convolutional networks. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Honolulu, HI, USA, 21–26 July 2017; pp. 4700–4708. [Google Scholar]
  12. Vaswani, A.; Shazeer, N.; Parmar, N.; Uszkoreit, J.; Jones, L.; Gomez, A.N.; Kaiser, Ł.; Polosukhin, I. Attention is all you need. Adv. Neural Inf. Process. Syst. 2017, 30. Available online: https://proceedings.neurips.cc/paper/2017/hash/3f5ee243547dee91fbd053c1c4a845aa-Abstract.html (accessed on 28 August 2025).
  13. Lin, Z.; Feng, M.; Santos, C.N.d.; Yu, M.; Xiang, B.; Zhou, B.; Bengio, Y. A structured self-attentive sentence embedding. arXiv 2017, arXiv:1703.03130. [Google Scholar] [CrossRef]
  14. Yang, Z.; Yang, D.; Dyer, C.; He, X.; Smola, A.; Hovy, E. Hierarchical attention networks for document classification. In Proceedings of the 2016 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, San Diego, CA, USA, 12–17 June 2016; pp. 1480–1489. [Google Scholar]
  15. Das, A.K.; Naskar, R. A deep learning model for depression detection based on MFCC and CNN generated spectrogram features. Biomed. Signal Process. Control. 2024, 90, 105898. [Google Scholar] [CrossRef]
  16. El Ayadi, M.; Kamel, M.S.; Karray, F. Survey on speech emotion recognition: Features, classification schemes, and databases. Pattern Recognit. 2011, 44, 572–587. [Google Scholar] [CrossRef]
  17. Busso, C.; Bulut, M.; Lee, C.C.; Kazemzadeh, A.; Mower, E.; Kim, S.; Chang, J.N.; Lee, S.; Narayanan, S.S. IEMOCAP: Interactive emotional dyadic motion capture database. Lang. Resour. Eval. 2008, 42, 335–359. [Google Scholar] [CrossRef]
  18. Chen, Y.W.; Hirschberg, J.; Tsao, Y. Noise robust speech emotion recognition with signal-to-noise ratio adapting speech enhancement. arXiv 2023, arXiv:2309.01164. [Google Scholar]
  19. Gong, Y.; Chung, Y.A.; Glass, J. Ast: Audio spectrogram transformer. arXiv 2021, arXiv:2104.01778. [Google Scholar] [CrossRef]
  20. Logan, B. Mel frequency cepstral coefficients for music modeling. In Proceedings of the Ismir, Plymouth, MA, USA, 23–25 October 2000; Volume 270, p. 11. [Google Scholar]
  21. Razavi, M.; Ziyadidegan, S.; Mahmoudzadeh, A.; Kazeminasab, S.; Baharlouei, E.; Janfaza, V.; Jahromi, R.; Sasangohar, F. Machine Learning, Deep Learning, and Data Preprocessing Techniques for Detecting, Predicting, and Monitoring Stress and Stress-Related Mental Disorders: Scoping Review. JMIR Ment. Health 2024, 11, e53714. [Google Scholar] [CrossRef]
  22. Çelik, B.; Baslak, M.; Genç, M.; Celik, M. Automated segmentation of dental restorations using deep learning: Exploring data augmentation techniques. Oral Radiol. 2025, 41, 207–215. [Google Scholar] [CrossRef] [PubMed]
  23. Samala, R.K.; Chan, H.P.; Hadjiiski, L.; Helvie, M.A. Risks of feature leakage and sample size dependencies in deep feature extraction for breast mass classification. Med. Phys. 2021, 48, 2827–2837. [Google Scholar] [CrossRef] [PubMed]
  24. Schuller, B.; Batliner, A. Computational Paralinguistics: Emotion, Affect and Personality in Speech and Language Processing; John Wiley & Sons: Hoboken, NJ, USA, 2013. [Google Scholar]
  25. Ververidis, D.; Kotropoulos, C. Emotional speech recognition: Resources, features, and methods. Speech Commun. 2006, 48, 1162–1181. [Google Scholar] [CrossRef]
  26. Kahhoul, Z.; Terki, N.; Benaissa, I.; Baarir, Z.E. SpectoResNet: Advancing Speech Emotion Recognition Through Deep Learning and Data Augmentation on the CREMA-D Dataset. 2024. Available online: https://sciforum.net/paper/view/20789 (accessed on 28 August 2025).
  27. Tarantino, L.; Garner, P.N.; Lazaridis, A. Self-attention for speech emotion recognition. In Proceedings of the 9th International on Speech Emotion Recognition and Understanding (IS-SERU), Graz, Austria, 15–19 September 2019; pp. 2578–2582. [Google Scholar]
  28. Zhao, Z.; Bao, Z.; Zhao, Y.; Zhang, Z.; Cummins, N.; Schuller, B.; Tao, J. Speech emotion recognition using deep 3D convolutional neural networks with a bidirectional LSTM network. In Proceedings of the AAAI Conference on Artificial Intelligence, Honolulu, HI, USA, 27 January–1 February 2019; Volume 33, pp. 7546–7553. [Google Scholar]
  29. Ristea, N.C.; Ionescu, R.T. Self-paced ensemble learning for speech and audio classification. arXiv 2021, arXiv:2103.11988. [Google Scholar] [CrossRef]
  30. Lin, Y.C.; Chou, H.C.; yi Lee, H. Mitigating Subgroup Disparities in Multi-Label Speech Emotion Recognition: A Pseudo-Labeling and Unsupervised Learning Approach. arXiv 2025, arXiv:2505.14449. Available online: http://arxiv.org/abs/2505.14449 (accessed on 28 August 2025).
  31. Alam, M.M.; Dini, M.A.; Kim, D.S.; Jun, T. TMNet: Transformer-fused multimodal framework for emotion recognition via EEG and speech. ICT Express 2025, 11, 657–665. [Google Scholar] [CrossRef]
  32. Huang, X.; Acero, A.; Hon, H.W.; Reddy, R. Spoken Language Processing: A Guide to Theory, Algorithm, and System Development; Prentice Hall PTR: Upper Saddle River, NJ, USA, 2001. [Google Scholar]
  33. Kim, J.Y.; Lee, S.H. Accuracy enhancement method for speech emotion recognition from spectrogram using temporal frequency correlation and positional information learning through knowledge transfer. IEEE Access 2024, 12, 62235–62247. [Google Scholar] [CrossRef]
  34. Tareq, I.; Elbagoury, B.; El-Regaily, S.; El-Horbaty, E.S. Analysis of ToN-IoT, UNW-NB15, and Edge-IIoT Datasets Using DL in Cybersecurity for IoT. Appl. Sci. 2022, 12, 9572. [Google Scholar] [CrossRef]
  35. Zeiler, M.D.; Fergus, R. Visualizing and understanding convolutional networks. In Proceedings of the Computer Vision–ECCV 2014: 13th European Conference, Zurich, Switzerland, 6–12 September 2014; Proceedings, Part I 13. Springer: Berlin/Heidelberg, Germany, 2014; pp. 818–833. [Google Scholar]
  36. Zhou, B.; Khosla, A.; Lapedriza, A.; Oliva, A.; Torralba, A. Learning deep features for discriminative localization. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Las Vegas, NV, USA, 27–30 June 2016; pp. 2921–2929. [Google Scholar]
  37. Hu, J.; Shen, L.; Sun, G. Squeeze-and-excitation networks. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Salt Lake City, UT, USA, 18–23 June 2018; pp. 7132–7141. [Google Scholar]
  38. Cao, H.; Cooper, D.G.; Keutmann, M.K.; Gur, R.C.; Nenkova, A.; Verma, R. Crema-d: Crowd-sourced emotional multimodal actors dataset. IEEE Trans. Affect. Comput. 2014, 5, 377–390. [Google Scholar] [CrossRef] [PubMed]
  39. McFee, B.; Raffel, C.; Liang, D.; Ellis, D.P.; McVicar, M.; Battenberg, E.; Nieto, O. librosa: Audio and music signal analysis in python. SciPy 2015, 2015, 18–24. [Google Scholar]
  40. Ristea, N.; Ionescu, R.; Khan, F. SepTr: Separable Transformer for Audio Spectrogram Processing. arXiv 2022, arXiv:2203.09581. [Google Scholar] [CrossRef]
  41. Croitoru, F.A.; Ristea, N.C.; Ionescu, R.T.; Sebe, N. Learning Rate Curriculum. Int. J. Comput. Vis. 2024, 133, 291–314. [Google Scholar] [CrossRef]
Figure 1. Flowchart summarizing the feature extraction process from Speech Emotion Signal to log-Mel spectrogram.
Figure 2. DenseNet121 architecture [34].
Figure 3. The overview of the convolutional block attention module [4].
Figure 4. Diagram illustrating each sub-module of attention [4].
Figure 5. Spectrogram-based audio classification using DenseNet-121 and convolutional attention mechanisms. The output numbers (0–5) correspond to the six emotion classes in the dataset.
Figure 6. Analysis of training and validation performance: insights from loss and accuracy curve.
Figure 7. Confusion matrix of test using CBAM-DenseNet121 on CREMA-D spectrograms.
Figure 8. Recall for each model and category.
Table 1. Label distribution of the CREMA-D dataset.
| Emotion | Abbreviation | Number of Samples |
|---|---|---|
| Anger | ANG | 1271 |
| Disgust | DIS | 1271 |
| Fear | FEA | 1271 |
| Happy | HAP | 1271 |
| Sad | SAD | 1271 |
| Neutral | NEU | 1087 |
| Total | | 7442 |
Table 2. Data split and image resize information.
| Total | Image Resize | Train (70%) | Validation (20%) | Test (10%) |
|---|---|---|---|---|
| 7442 audio | 128 × 256 | 5209 audio | 1488 audio | 745 audio |
Table 3. Training workstation hardware and software setup.
| Device | Value |
|---|---|
| CPU | Intel i7 (10th Gen) |
| RAM | 64 GB |
| GPU | RTX 3060 12 GB |
| CUDA version | 11.8 |
| Python version | 3.10.14 |
| PyTorch version | 2.3.1 |
| Torchvision version | 0.18.1 |
Table 4. Layer-wise parameter summary of a DenseNet-based model.
| Block | Output Shape | Parameters |
|---|---|---|
| Input | [4, 3, 128, 256] | 0 |
| Conv0 | [4, 64, 64, 128] | 9,408 |
| DenseBlock 1 | [4, 256, 32, 64] | 335,040 |
| Transition 1 | [4, 128, 16, 32] | 33,280 |
| DenseBlock 2 | [4, 512, 16, 32] | 919,680 |
| Transition 2 | [4, 256, 8, 16] | 132,096 |
| DenseBlock 3 | [4, 1024, 8, 16] | 2,837,760 |
| Transition 3 | [4, 512, 4, 8] | 526,336 |
| DenseBlock 4 | [4, 1024, 4, 8] | 2,158,080 |
| CBAM | [4, 1024, 4, 8] | 132,260 |
| Classifier | [4, 6] | 8,198 |
| Total Parameters | | 7,092,138 |
Table 5. Performance and complexity comparison of the proposed model against baseline and SOTA architectures on CREMA-D. Our model is in bold.
| Model | Accuracy (%) | Parameters (Approx.) |
|---|---|---|
| Lin et al. (2025) [30] | 82.00 | Large (not specified) |
| **CBAM-DenseNet121 (Ours)** | 71.26 | ~7.1 M |
| LeRaC + SepTr [40,41] | 70.95 | ~15 M |
| SepTr [40] | 70.47 | ~15 M |
| ResNet-18 + SPEL [29] | 68.12 | ~11.7 M |
| ViT (Audio Spectrogram Transformer) [19] | 67.81 | ~86 M |
| SpectoResNet [26] | 65.20 | ~11.3 M |
Table 6. Rearranged model performance evaluation with mean recall.
| Model | Accuracy (%) | F1 (%) | Precision (%) | UAR (%) | Recall (%) |
|---|---|---|---|---|---|
| MobileNetV2 | 69.92 | 69.79 | 69.97 | 69.72 | 69.73 |
| MobileNetV2 + CBAM | 67.91 | 67.93 | 68.17 | 67.92 | 68.51 |
| ResNet50 | 55.21 | 54.56 | 56.64 | 55.40 | 57.74 |
| ResNet50 + CBAM | 56.95 | 56.41 | 56.87 | 56.95 | 57.78 |
| DenseNet121 | 62.57 | 61.44 | 63.86 | 62.56 | 59.95 |
| DenseNet121 (Finetuned) | 69.25 | 69.05 | 69.44 | 69.34 | 69.67 |
| DenseNet121 (FT + CBAM) | 71.26 | 71.25 | 71.30 | 71.01 | 70.34 |
Table 7. Effect of dropout on DenseNet121 fine-tuned with CBAM model performance metrics.
| Dropout | Accuracy (%) | UAR (Mean) (%) | Precision Total (%) | F1 Score Total (%) |
|---|---|---|---|---|
| 0.1 | 67.65 | 67.70 | 67.79 | 67.61 |
| 0.2 | 68.18 | 68.05 | 68.60 | 67.79 |
| 0.3 | 69.92 | 69.66 | 70.04 | 69.57 |
| 0.4 | 68.05 | 67.96 | 67.88 | 67.87 |
| 0.5 | 71.26 | 71.01 | 71.30 | 71.25 |
