This section describes the methodology of our model, beginning with a thorough analysis of the DenseNet121 architecture, which forms the backbone for feature extraction. We then examine how CBAM enhances these features, sharpening the model's focus on emotionally relevant patterns. The core of our method is the integration of the DenseNet121 backbone with CBAM, producing a more accurate and efficient feature representation. Finally, we discuss the classification stage, in which the processed features are mapped to emotional categories. Each of these elements is essential to the overall architecture, and the following subsections describe their roles in detail.
3.1. Audio Feature Extraction
The Short-Time Fourier Transform (STFT) converts each audio sample into a two-dimensional time-frequency representation. This transformation yields an image-like matrix, which is required for applying deep learning methods. The discrete STFT is computed as:

$$X(m,k) = \sum_{n=0}^{L-1} x(n + mH)\, w(n)\, e^{-j 2\pi k n / N}, \qquad k = 0, 1, \ldots, N-1,$$

where $x(n)$ is the input discrete signal and $w(n)$ is the window function (a Hamming window) of length L. H is the hop size, and N is the number of discrete Fourier transform (DFT) points, i.e., frequency bins. $X(m,k)$ is the STFT coefficient for the k-th frequency bin and the m-th time frame [32].
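For illustration, the following is a minimal NumPy sketch of the discrete STFT defined above (the function name and default values are illustrative; in practice we rely on the optimized Librosa implementation described later in this subsection):

```python
import numpy as np

def stft(x, L=2048, H=512, N=2048):
    """Minimal discrete STFT: X[m, k] = sum_n x[n + m*H] * w[n] * exp(-2j*pi*k*n/N)."""
    w = np.hamming(L)                       # Hamming window of length L
    n_frames = 1 + (len(x) - L) // H        # number of complete frames
    X = np.empty((n_frames, N // 2 + 1), dtype=complex)
    for m in range(n_frames):
        frame = x[m * H : m * H + L] * w    # windowed segment starting at sample m*H
        X[m] = np.fft.rfft(frame, n=N)      # N-point DFT, keeping non-negative frequency bins
    return X                                # shape: (time frames, frequency bins)
```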
After the STFT is calculated, the spectrogram is log-transformed and Mel-scaled to better match human auditory perception. The Mel scale is a perceptual scale that approximates how people perceive differences in pitch. The human ear discriminates frequencies below 1000 Hz relatively easily but becomes progressively less sensitive to frequency differences at higher frequencies. The spectrogram is therefore mapped onto the Mel scale, which is approximately linear up to 1 kHz and logarithmic above that.
The log-Mel spectrogram is obtained by passing the STFT spectrogram through a Mel-filter bank. This transformation makes the feature extraction process more consistent with human hearing, which improves the model's ability to capture emotional cues [32,33]. Mathematically, the Mel scale is defined as:

$$m = 2595 \log_{10}\left(1 + \frac{f}{700}\right),$$

where f is the frequency in Hertz (for example, f = 1000 Hz maps to approximately 1000 Mel). The log-Mel spectrogram is a compressed representation that retains the acoustic features most relevant for emotion recognition.
To ensure reproducibility, the log-Mel spectrograms in this study were generated with the Librosa library using the following settings. The 16 kHz audio was processed with a Short-Time Fourier Transform (STFT) using a window size of 2048 points, a Fast Fourier Transform (FFT) size of 2048 points, and a hop length of 512 points, yielding a 75% overlap between consecutive frames. The power spectrogram was then mapped onto the Mel scale using a filter bank with 128 Mel filters and a maximum frequency of 8000 Hz.
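A minimal Librosa sketch reproducing these settings is shown below (the file path and variable names are illustrative; this is not the exact preprocessing script used in our experiments):

```python
import librosa
import numpy as np

# Illustrative path; any audio clip loaded (or resampled) at 16 kHz works.
y, sr = librosa.load("audio_clip.wav", sr=16000)

# Power Mel spectrogram with the settings reported above:
# 2048-point window/FFT, 512-point hop (75% overlap), 128 Mel filters, fmax = 8 kHz.
mel = librosa.feature.melspectrogram(
    y=y, sr=sr,
    n_fft=2048, win_length=2048, hop_length=512,
    n_mels=128, fmax=8000, power=2.0,
)

# Log compression (dB scale) gives the log-Mel spectrogram fed to the network.
log_mel = librosa.power_to_db(mel, ref=np.max)   # shape: (128, time frames)
```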
The flowchart in
Figure 1 gives a clear picture of the whole feature extraction pipeline.
3.2. DenseNet121
At the core of the architecture, DenseNet121 provides a strong foundation for feature extraction. The model is built from dense blocks designed for effective feature reuse, which promotes efficient parameter usage and mitigates problems such as vanishing gradients. DenseNet121 comprises four dense blocks, each generating feature maps at a specified growth rate, as detailed in Figure 2. Transition blocks are placed between these dense blocks to reduce the feature map size through convolution and pooling operations, preserving computational efficiency while retaining the necessary information [11].
The DenseNet121 architecture comprises a sequence of convolutional blocks, each intended to progressively extract and refine feature representations; the main elements of the model are enumerated here.
- A.
Dense Block Function (Dense_Block)
Leveraging its densely connected structure to maximize feature propagation, the dense block is the principal feature extractor within the DenseNet architecture. The dense block's input, denoted as x, is passed through a sequence of convolutional transformations, and the output is the stack of feature maps produced by these operations.
Within each layer of the block (the number of layers is set by the block size), the input feature map first passes through a convolutional block whose filter count is four times the growth rate (the standard bottleneck design), performing the first stage of feature extraction. A second convolutional block then applies filters with a 3 × 3 kernel, further refining the feature maps and enhancing their representation. The output of this second convolutional block is concatenated with the input feature map, creating the densely connected structure that ensures unimpeded information flow across layers.
This design allows the dense block to capture both local and global patterns in the data. Because features from several layers are concatenated, the representation is strengthened and the network's capacity to learn complex features increases. The dense block is therefore central to DenseNet's ability to handle difficult tasks and to its overall performance. A schematic code sketch of the dense block and the transition layer is provided after the transition-layer description below.
- B.
Transition Layer Function (Transition_Layer)
The transition layer is critical for downsampling the feature maps and carrying information between dense blocks. First, the input x passes through a convolution block that halves the channel (last) dimension, reducing the number of filters; this lowers the computational load while preserving the relevant features. The result then undergoes an average pooling operation (Avg Pool 1D) with a pool size of 2, a stride of 2, and padding set to “same”, which down-samples the spatial dimension of the feature map while retaining the significant information.
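The following Keras sketch illustrates the Dense_Block and Transition_Layer functions described above. Function and argument names are illustrative, the growth-rate default is an assumption, and the BN–ReLU–Conv ordering follows the standard DenseNet composite function rather than details stated in this paper:

```python
import tensorflow as tf
from tensorflow.keras import layers

def conv_block(x, filters, kernel_size):
    """BN -> ReLU -> Conv1D, the basic composite unit used below."""
    x = layers.BatchNormalization()(x)
    x = layers.Activation("relu")(x)
    return layers.Conv1D(filters, kernel_size, padding="same")(x)

def dense_block(x, block_size, growth_rate=32):
    """Stack `block_size` bottleneck layers, concatenating each output with its input."""
    for _ in range(block_size):
        y = conv_block(x, 4 * growth_rate, kernel_size=1)   # bottleneck: 4x growth-rate filters
        y = conv_block(y, growth_rate, kernel_size=3)        # 3-wide conv producing growth_rate maps
        x = layers.Concatenate()([x, y])                     # dense connectivity
    return x

def transition_layer(x):
    """Halve the channel dimension, then average-pool to halve the temporal dimension."""
    x = conv_block(x, x.shape[-1] // 2, kernel_size=1)
    return layers.AveragePooling1D(pool_size=2, strides=2, padding="same")(x)
```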
DenseNet121 is designed as follows:
The model begins with a 1D convolutional layer with 64 filters, a kernel size of 7, strides of 2, and “same” padding applied to the input. A max pooling operation (Max Pool 1D) with a kernel size of 3, strides of 2, and “same” padding follows, reducing the spatial dimensions of the feature map.
The model then proceeds through several dense blocks with different block sizes (6, 12, 24, and 16 in the standard DenseNet121 configuration). Each block size generates a dense block whose output is passed through the matching transition layer for downsampling and control of the feature map count.
After all dense blocks have been processed, the output of the final dense block undergoes a global average pooling operation (Global Average Pooling 1D), which reduces each feature map to a single value per channel, yielding one feature vector.
Finally, this output passes through a dense layer with a softmax activation function, which predicts the class probabilities for the input sample and completes the classification. A compact sketch assembling these stages from the helper functions above is given below.
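The outline below reuses the dense_block and transition_layer helpers sketched earlier; the build function name, input shape, and class count are assumptions for illustration only, not the exact training code:

```python
import tensorflow as tf
from tensorflow.keras import layers
# Assumes dense_block and transition_layer from the previous sketch are in scope.

def build_densenet121_1d(input_shape=(256, 128), num_classes=6):
    """Rough 1D outline of the backbone described above."""
    inputs = layers.Input(shape=input_shape)

    # Stem: 7-wide conv with 64 filters, stride 2, then 3-wide max pooling, stride 2.
    x = layers.Conv1D(64, 7, strides=2, padding="same")(inputs)
    x = layers.MaxPooling1D(pool_size=3, strides=2, padding="same")(x)

    # Four dense blocks with the standard DenseNet121 block sizes, each followed
    # by a transition layer, as described above.
    for block_size in (6, 12, 24, 16):
        x = dense_block(x, block_size)
        x = transition_layer(x)

    # Global average pooling collapses the temporal axis to one vector per sample.
    x = layers.GlobalAveragePooling1D()(x)
    outputs = layers.Dense(num_classes, activation="softmax")(x)
    return tf.keras.Model(inputs, outputs)
```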
The model takes 128 × 256 log-Mel spectrograms as input. Although spectrograms are naturally single-channel (grayscale), the single channel is replicated three times so that the input matches the format expected by the DenseNet121 backbone, which was pre-trained on three-channel RGB images from ImageNet and therefore enables transfer learning. As the spectrogram passes through DenseNet121, its spatial dimensions are progressively reduced, producing a 1024-channel feature map. This map, which holds the extracted features, is then passed to the enhancement (attention) module for further processing.
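As a small illustration, the single-channel spectrogram can be tiled to three channels before being fed to the ImageNet-pretrained backbone (the array below is a placeholder, not real data):

```python
import numpy as np

log_mel = np.random.randn(128, 256).astype("float32")   # placeholder single-channel spectrogram

# Replicate the single channel three times to mimic an RGB image: (128, 256) -> (128, 256, 3).
x = np.repeat(log_mel[..., np.newaxis], 3, axis=-1)
print(x.shape)   # (128, 256, 3)
```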
The architectural design of DenseNet121 is especially well suited to datasets like CREMA-D, which involve complex emotion recognition from audio-visual data. The dense connections reduce information loss across layers while helping the network learn subtle patterns in facial expressions and vocal cues. Its capacity for fine-grained feature retention and effective gradient flow makes DenseNet121 a strong candidate for high-resolution, emotionally complex datasets such as CREMA-D.
3.3. Convolutional Block Attention Module (CBAM)
Incorporating a dual attention mechanism, CBAM improves the feature maps produced by the DenseNet121 backbone. The module comprises two main elements: the Channel Attention Module and the Spatial Attention Module [4]. CBAM operates sequentially on an intermediate feature map $F \in \mathbb{R}^{C \times H \times W}$.
Using global average pooling and max pooling followed by a shared multi-layer perceptron (MLP), the Channel Attention Module identifies the significant channels within the feature map. This produces a 1D channel attention map $M_c \in \mathbb{R}^{C \times 1 \times 1}$, which refines the input via element-wise multiplication:

$$F' = M_c(F) \otimes F.$$

The Spatial Attention Module focuses on salient spatial regions by performing average and max pooling along the channel axis, followed by a convolution layer that generates the spatial attention weights. This yields a 2D spatial attention map $M_s \in \mathbb{R}^{1 \times H \times W}$, which is applied to the channel-refined feature map $F'$ via element-wise multiplication:

$$F'' = M_s(F') \otimes F'.$$

This process produces the final refined feature map $F''$, incorporating both channel-wise and spatial attention and enhancing performance for tasks such as ASER. The detailed computation of the attention maps is shown in Figure 3, with further explanations in Figure 4.
CBAM enhances a network’s representational power by sequentially applying channel and spatial attention mechanisms, allowing the model to focus on what and where to attend in feature maps. This selective emphasis enables the model to prioritize emotionally salient regions in both facial expressions and acoustic signals. When applied to tasks like emotion recognition on the CREMA-D dataset, CBAM refines intermediate feature maps by suppressing irrelevant information and amplifying critical cues—such as micro-expressions or tonal fluctuations. Its lightweight, plug-and-play nature allows for easy integration with backbones like DenseNet121, offering improved performance with minimal computational overhead.
- A.
Channel Attention Mechanisms:
Woo et al. [
4] developed the channel attention module to enhance the most important features of the input image by exploiting the inter-channel dependencies within the feature map. As each channel in the feature map functions as a detector [
35], channel attention emphasizes the most informative channels. To efficiently compute this attention, we first reduce the spatial dimensions of the input feature map, commonly using average pooling to aggregate spatial information. This method is recommended by Zhou et al. [
36] to capture target object regions effectively and has been incorporated by Hu et al. [
37] into their attention mechanism for computing spatial statistics. Building on this, we argue that max-pooling can offer complementary insights into distinctive object features, further refining channel-specific attention. Therefore, we employ both average and max-pooled features simultaneously.
The process begins with the application of both average- and max-pooling to the feature map, resulting in two distinct spatial context descriptors: $F^c_{avg}$ for the average-pooled features and $F^c_{max}$ for the max-pooled features. These descriptors are then processed through a shared multi-layer perceptron (MLP) with a single hidden layer, yielding the channel attention map $M_c \in \mathbb{R}^{C \times 1 \times 1}$. To optimize parameter efficiency, the hidden layer's activation size is set to $\mathbb{R}^{C/r \times 1 \times 1}$, where r represents the reduction ratio. After each descriptor has passed through the shared network, the output vectors are combined by element-wise summation. The channel attention map is thus computed as follows:

$$M_c(F) = \sigma\big(\mathrm{MLP}(\mathrm{AvgPool}(F)) + \mathrm{MLP}(\mathrm{MaxPool}(F))\big) = \sigma\big(W_1(W_0(F^c_{avg})) + W_1(W_0(F^c_{max}))\big),$$

where $\sigma$ is the sigmoid function, and $W_0 \in \mathbb{R}^{C/r \times C}$ and $W_1 \in \mathbb{R}^{C \times C/r}$ are the learned weight matrices shared across the two inputs, with a ReLU activation applied after $W_0$.
- B.
Spatial Attention Mechanism:
While channel attention focuses on the importance of the channels, the spatial attention mechanism, as introduced by Woo et al. [4], addresses the question of “where” to focus attention within the image. Spatial attention complements channel attention by identifying the most informative regions of the feature map. The process begins with average-pooling and max-pooling operations along the channel axis, as these operations effectively highlight key spatial areas. The resulting pooled features are concatenated to form an efficient spatial descriptor, which is subsequently passed through a convolutional layer to generate the spatial attention map $M_s(F) \in \mathbb{R}^{H \times W}$, signaling the regions where attention should be emphasized or suppressed.
The concatenated feature descriptor, consisting of the pooled features, is processed through a standard convolution operation with a $7 \times 7$ filter to produce the spatial attention map. This operation can be formalized as:

$$M_s(F) = \sigma\big(f^{7 \times 7}([\mathrm{AvgPool}(F); \mathrm{MaxPool}(F)])\big) = \sigma\big(f^{7 \times 7}([F^s_{avg}; F^s_{max}])\big),$$

where $\sigma$ represents the sigmoid function and $f^{7 \times 7}$ denotes a convolution with a $7 \times 7$ kernel. This approach allows the network to focus on the most significant spatial regions of the input image, thereby improving its ability to localize critical information.
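To make the two modules concrete, the following Keras sketch implements channel and spatial attention roughly as formalized above. The function names, the reduction-ratio default, and the 2D formulation are assumptions for illustration, not the exact implementation used in our experiments:

```python
import tensorflow as tf
from tensorflow.keras import layers

def channel_attention(F, reduction_ratio=8):
    """M_c(F) = sigmoid(MLP(AvgPool(F)) + MLP(MaxPool(F))), applied channel-wise."""
    channels = F.shape[-1]
    # Shared two-layer MLP (W0 with ReLU, then W1), reused for both descriptors.
    w0 = layers.Dense(channels // reduction_ratio, activation="relu")
    w1 = layers.Dense(channels)
    avg = w1(w0(layers.GlobalAveragePooling2D()(F)))    # F_avg^c descriptor
    mx = w1(w0(layers.GlobalMaxPooling2D()(F)))         # F_max^c descriptor
    m_c = tf.sigmoid(avg + mx)                          # (batch, C)
    return F * layers.Reshape((1, 1, channels))(m_c)    # F' = M_c(F) * F, element-wise

def spatial_attention(F, kernel_size=7):
    """M_s(F) = sigmoid(f^{7x7}([AvgPool(F); MaxPool(F)])) pooled along the channel axis."""
    avg = tf.reduce_mean(F, axis=-1, keepdims=True)     # F_avg^s, shape (batch, H, W, 1)
    mx = tf.reduce_max(F, axis=-1, keepdims=True)       # F_max^s, shape (batch, H, W, 1)
    concat = layers.Concatenate(axis=-1)([avg, mx])
    m_s = layers.Conv2D(1, kernel_size, padding="same", activation="sigmoid")(concat)
    return F * m_s                                       # F'' = M_s(F') * F', element-wise

def cbam(F, reduction_ratio=8):
    """Sequential channel-then-spatial refinement of a feature map F."""
    return spatial_attention(channel_attention(F, reduction_ratio))
```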
3.5. Final Classifier
The final classifier of the architecture maps the refined feature representations to the target emotion classes. Global adaptive average pooling first reduces the spatial dimensions of the feature maps to 1 × 1, aggregating the spatial information into a single vector. A BatchNorm layer then normalizes these features, ensuring stable and efficient convergence during training. A dropout layer with a probability of 0.5 is included as a regularization mechanism to reduce overfitting. A fully connected (FC) layer then maps the 1024-dimensional feature vector to the six output classes corresponding to the emotions happiness, sadness, anger, fear, disgust, and neutral. Finally, a softmax activation function converts the outputs into class probabilities, allowing the model to make predictions based on the highest probability. This structure ensures robust classification while maintaining computational efficiency.
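A minimal Keras sketch of this classification head is shown below. The 1024-channel input matches the backbone output described earlier, while the spatial size of the incoming feature map and the function name are illustrative assumptions:

```python
import tensorflow as tf
from tensorflow.keras import layers

EMOTIONS = ["happiness", "sadness", "anger", "fear", "disgust", "neutral"]

def classifier_head(feature_map):
    """Global pooling -> BatchNorm -> Dropout(0.5) -> FC(6) with softmax."""
    x = layers.GlobalAveragePooling2D()(feature_map)    # (batch, 1024) vector per sample
    x = layers.BatchNormalization()(x)                  # normalize pooled features
    x = layers.Dropout(0.5)(x)                          # regularization against overfitting
    return layers.Dense(len(EMOTIONS), activation="softmax")(x)

# Illustrative usage: a dummy 1024-channel feature map such as the backbone might emit.
features = layers.Input(shape=(4, 8, 1024))
model = tf.keras.Model(features, classifier_head(features))
```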
The proposed CBAM-DenseNet121 architecture was trained on the CREMA-D dataset using a carefully selected set of hyperparameters and optimization techniques tailored to the challenges of the emotion recognition task. The training environment and the associated methods are described below.