Enhancing Embedded Space with Low–Level Features for Speech Emotion Recognition

Smietanka, Lukasz; Maka, Tomasz

doi:10.3390/app15052598

by Lukasz Smietanka *,† and Tomasz Maka †

Faculty of Computer Science and Information Technology, West Pomeranian University of Technology, Szczecin, Zolnierska 52, 71-210 Szczecin, Poland

* Author to whom correspondence should be addressed.

† These authors contributed equally to this work.

Appl. Sci. 2025, 15(5), 2598; https://doi.org/10.3390/app15052598

Submission received: 4 February 2025 / Revised: 20 February 2025 / Accepted: 25 February 2025 / Published: 27 February 2025

Abstract

This work proposes an approach that builds a feature space by combining the representation obtained in an unsupervised learning process with manually selected features describing the prosody of the utterances. In the experiments, we used two time-frequency representations (Mel and CQT spectrograms) and the EmoDB and RAVDESS databases. As the results show, the proposed system improved the classification accuracy for both representations: by 1.29% for CQT and 3.75% for the Mel spectrogram compared to the typical CNN architecture on the EmoDB dataset, and by 3.02% for CQT and 0.63% for the Mel spectrogram on RAVDESS. Additionally, for the best models trained on EmoDB, the results show a significant increase in classification performance of around 14% for happiness and disgust using Mel spectrograms and around 20% for happiness and disgust using CQT. For the models that achieved the highest result on the RAVDESS database, the most significant improvement, around 16%, was observed in the classification of the neutral state using the Mel spectrogram. For the CQT representation, the most significant improvement, around 9%, occurred for fear and surprise. The average results over all prepared models also showed a positive impact of the method on the classification quality of most emotional states. For the EmoDB database, the highest average improvement was observed for happiness (14.6%). For other emotions, it ranged from 1.2% to 8.7%. The only exception was sadness, for which the classification quality decreased on average by 1% when using the Mel spectrogram. For the RAVDESS database, the most significant improvement also occurred for happiness (7.5%), while for other emotions it ranged from 0.2% to 7.1%, except for disgust and calm, whose classification deteriorated for the CQT representation and the Mel spectrogram, respectively.

Keywords: speech emotion recognition; deep learning; audio features

1. Introduction

Identifying the emotional state from the speech signal is an integral part of human–computer interaction systems. Knowing the speaker’s emotional state allows such systems to add context to the interaction process. Such information may also be helpful in speech recognition tasks, and taking the emotional condition into account in speech synthesis can significantly improve the realism of the produced speech signal. Effectively recognizing the speaker’s emotional state from the speech signal is complex because, besides low–level attributes estimated from the signal, prosodic information, the speaker’s characteristics, the language used and the content of the speech significantly impact effectiveness. Because of these factors, the search for a convenient representation that effectively determines the emotional state from a speech signal is still being pursued by many research groups around the world.

Despite the dominance of unsupervised learning systems in classifying emotional states from the speech signal, there are solutions based on modifying the feature space by introducing additional attributes to increase the effectiveness of the classification process. For example, in [1] the authors proposed a feature combination method that fuses representations from a convolutional neural network (CNN) with heuristic-based discriminative features and uses an extreme learning machine in the learning process. Experiments with the EmoDB database show a 40% relative error reduction in F1-score compared to a solution based on a CNN trained on spectrograms with a bidirectional long short-term memory learning mechanism. An interesting approach to feature fusion for emotional speech can be found in [2].
The authors proposed a new feature fusion model called Dual-TBNet, which uses two 1D convolutional layers for feature calculation, two Transformers and two bidirectional long short-term memory (BiLSTM) modules. The attention mechanism captures the connections between features, and the BiLSTM enhances the contextual information of the fusion. The model was tested on five popular emotional speech datasets, on which the authors achieved decent results. Cao et al. proposed in [3] a hierarchical network that integrates static and dynamic features for speech emotion recognition. The system uses an encoder to encode static and dynamic features, a gated multi-features unit to determine intermediate representations, and an attention module to perform the final prediction. The experiments show that the proposed approach outperforms state-of-the-art baselines. The combination of a convolutional neural network-based multiperspective awareness module with a frame-level fine-grained fusion strategy for speech emotion recognition was presented in [4]. The authors showed that combining a multiperspective–aware module to obtain emotional information from speech, an attention mechanism to focus on the salient attributes, and a fine–grained fusion strategy leads to a classification accuracy of over 72% on the IEMOCAP dataset. Another transformer-based system for speech emotion recognition is presented in [5]. The authors proposed a cross–attention transformer that fuses the raw speech signal, spectrogram, and mel–frequency cepstral coefficient (MFCC) features. They reported a classification accuracy of over 73%, which outperforms existing approaches. A hybrid representation of speech including temporal and spatial information is proposed in [6]. The system uses a CNN with spectrograms and a deep neural network (DNN) for features, and long short–term memory (LSTM) for extracting temporal information from MFCC features.
Using the obtained representations and applying two different fusion strategies, the final classification accuracy was 91.25% for the EmoDB dataset and 72.02% for the IEMOCAP dataset. A static and dynamic time–frequency network for speech emotion recognition was proposed by Liu et al. [7]. In their work, a time–frequency convolutional neural network and time–frequency transformer modules were used to generate static and dynamic emotional features. A dedicated SD–Cross module was introduced to integrate both feature sets and generate meaningful representations. In the last step, an augmentation stage was added to reduce the risk of overfitting. In the performed experiments, the proposed system outperformed state–of–the–art methods. Another transformer-based system for speech emotion recognition is presented in [8]. The authors proposed a multimodal embedding module using pretrained models, which provide a priori knowledge of multimodal information to the model. A mutual transformer was introduced to learn multimodal emotional features and improve contextual emotional semantic understanding. The experimental results show that the proposed architecture achieved over 82% classification accuracy. In [9], the authors examined various ASR (Automatic Speech Recognition) outputs and fusion methods for speech emotion recognition. The results showed that using the ASR hidden and text outputs with a hierarchical co-attention fusion improves speech emotion recognition accuracy in joint automatic speech and speech emotion recognition training. A self-attention mechanism and multi-scale fusion were evaluated in [10] for speech emotion recognition. The presented approach uses a self-attentional multi-channel convolutional neural network for learning dedicated features from text and a multi-scale fusion. The results show an improvement of weighted accuracy by 1.48% and of unweighted accuracy by 3% on the IEMOCAP dataset.
A supervised approach based on a fusion of low-level feature sets for speech emotion classification is presented in [11], where the authors proposed a technique that uses two-stage feature selection and two fusion strategies. In the experimental stage, three sets of features and four popular datasets were used. With the proposed scheme, the classification accuracy for all datasets exceeded 85% and was highest for EmoDB, at 95.29%. Yao et al. [12] proposed a framework that integrates three types of neural networks in the classification stage: DNN, CNN, and RNN. The proposed scheme for recognizing four emotions is based on speech descriptors at the frame, segment and utterance levels, fed separately to the classifiers. A confidence-based fusion strategy was then designed to integrate the results of the separate classifiers. The authors reported 51.1% weighted and 58.3% unweighted accuracy on the IEMOCAP corpus. A speech emotion recognition approach with co-attention-based fusion was proposed in [13]. Multi-level acoustic information extracted from low-level features and embedded high-level information were treated as multimodal inputs to a dedicated co-attention module working as the fusion stage. The performed experiments show that the proposed technique achieves competitive performance with different cross-validation methods. A different approach is presented in [14], where the authors use separate Multitask Transformer architectures to fuse audio and text data. The described study focuses on three emotional speech datasets, IEMOCAP, MSP-IMPROV, and EmoDB, in a multimodal cross-corpus classification. The obtained results show an improvement in classification compared to using only audio or text data. Another fusion mechanism is presented in [15]. This technique uses a CNN to extract multi-scale features, which are then fused.
The authors present results obtained for six SER datasets (CASIA, IEMOCAP, EmoDB, RAVDESS, EMOVO, and SAVEE), on which an improvement in classification was achieved. The paper [16], in turn, presents the fusion of features generated using Wav2Vec 2.0 and a Pitch Encoder, which are based on the raw audio and the pitch sequence, respectively. The authors conducted studies on tonal languages, for which they achieved a significant improvement.

In this study, we have shown how low–level features describing fundamental frequency contour and the vocal tract’s resonances can improve the convolutional neural network’s embedded space and the final classification accuracy.

The main contributions of our work include:

Improvement of classification accuracy by using modified embedded space in ResNet architecture with prosody features.
Confirmation that Constant-Q transform (CQT) spectrograms describe emotional speech better than Mel spectrograms and are thus more suitable for emotional speech classification tasks.
An analysis of negative and positive influences of the injection of low–level prosody features of speech signals to classify emotional states.

Our motivation to perform such experiments was to determine how low-level speech features can influence the embedded space obtained in the learning process in an unsupervised manner and how it can improve discriminative power in the classification of emotional speech.

The paper is organized as follows: the methodology, audio representations and a brief description of the deep neural network architecture are presented in the next section. Section 3 describes the datasets used, the experimental setup, the evaluation approach, and the results. Finally, the last section concludes the work.

2. Materials and Methods

2.1. Methodology

The proposed method uses low–level features of the speech signal, which are combined with the features generated by the selected neural network model at the final stage of classification.

Figure 1 and Algorithm 1 give a general outline of the proposed solution. First, for each recording $s_{i}$ of emotional speech in the set, two representations are determined: base features $r_{1}^{i} = \gamma_{1}(s_{i})$ and injected features $r_{2}^{i} = \gamma_{2}(s_{i})$, resulting in two sets, $R_{1} = \{r_{1}^{i}\}_{i=1}^{N}$ and $R_{2} = \{r_{2}^{i}\}_{i=1}^{N}$, where N is the number of recordings in the considered set. In other words, each recording $s_{i}$ is represented by a pair $\{r_{1}^{i}, r_{2}^{i}\}$. The functions $\gamma_{1}$ and $\gamma_{2}$ return the time–frequency representation of the signal and the selected low–level features, respectively. In the next step, each pair of features is processed as follows. The representation $r_{1}$ is processed by the neural network and then projected to a given size in the dense layer $FC_{1}$. As a result, a vector $x_{1}$ of size $Z = M - K$ is created, where K is the number of injected features and M is the size of the feature space at the final classification layer $FC_{2}$. In both cases, $FC$ denotes a single fully connected layer. The representation $r_{2}$ is normalized in the $NL$ layer [17], which creates the vector $x_{2}$. Then, the vectors $x_{1}$ and $x_{2}$ are concatenated, giving the final feature vector $X = [x_{1} \mid x_{2}]$ of length $M = Z + K$, which is processed by the final classification layer $FC_{2}$, after which the $softmax$ operation is applied.
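The fusion step described above can be sketched in a few lines of plain Python. This is only an illustration under stated assumptions: the backbone output is replaced by a random vector of hypothetical size D, the weights are random rather than trained, and the $NL$ layer is assumed to be a layer normalization without learnable parameters. The values M = 128, K = 99 and seven classes are taken from the experimental setup.

```python
import math
import random

def layer_norm(v, eps=1e-5):
    """Zero-mean, unit-variance normalization of one vector (assumed form of the NL layer)."""
    mean = sum(v) / len(v)
    var = sum((a - mean) ** 2 for a in v) / len(v)
    return [(a - mean) / math.sqrt(var + eps) for a in v]

def dense(v, weights, bias):
    """Single fully connected layer; weights holds one row per output unit."""
    return [sum(w * a for w, a in zip(row, v)) + b for row, b in zip(weights, bias)]

def softmax(z):
    """Numerically stable softmax."""
    m = max(z)
    e = [math.exp(a - m) for a in z]
    s = sum(e)
    return [a / s for a in e]

def random_layer(n_out, n_in, rng):
    """Hypothetical random initialization that stands in for trained weights."""
    weights = [[rng.uniform(-0.1, 0.1) for _ in range(n_in)] for _ in range(n_out)]
    return weights, [0.0] * n_out

def inject_and_classify(backbone_out, r2, fc1, fc2):
    x1 = dense(backbone_out, *fc1)  # projection to size Z = M - K
    x2 = layer_norm(r2)             # normalized injected features, size K
    X = x1 + x2                     # list concatenation, size M
    return X, x2, softmax(dense(X, *fc2))

rng = random.Random(0)
M, K, C = 128, 99, 7   # feature space size, injected features, emotion classes
D = 16                 # hypothetical size of the backbone output
Z = M - K
fc1 = random_layer(Z, D, rng)
fc2 = random_layer(C, M, rng)
backbone_out = [rng.gauss(0.0, 1.0) for _ in range(D)]  # stand-in for CNN features
r2 = [rng.uniform(50.0, 400.0) for _ in range(K)]       # stand-in for prosody features
X, x2, probs = inject_and_classify(backbone_out, r2, fc1, fc2)
predicted_label = probs.index(max(probs))
print(len(X), predicted_label)
```

Normalizing the injected features before concatenation keeps values such as frequencies in Hertz on a scale comparable to the projected network features.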

2.2. Audio Representations

In the performed experiments, we used two time–frequency representations as base features ($\gamma_{1}$): CQT (Constant–Q Transform) and Mel spectrograms. This choice is based on previous research [18], where these representations obtained significantly higher results than the regular spectrogram in classifying emotional states using convolutional networks. Moreover, these representations use two different frequency scales. For the constant–Q spectrogram, the frequency distribution is geometric and the ratio of each band’s center frequency to its width is constant. As a result, its frequency scale resolves the low and medium frequency ranges differently from the Mel scale [19].

Algorithm 1 Injection mechanism

Input: Audio representations $\{r_{1}^{i}, r_{2}^{i}\}_{i=1}^{N}$

Output: Predicted labels $\{Y^{i}\}_{i=1}^{N}$

$Y = \{\}$
for $i = 1$ to $N$ do
    Input $r_{1}^{i}$ into the neural network and then into $FC_{1}$ to map it to the given size and get the feature vector $x_{1}^{i}$.
    Send $r_{2}^{i}$ to the $norm$ layer to get the vector $x_{2}^{i}$.
    Concatenate $x_{1}^{i}$ and $x_{2}^{i}$ to get the final feature vector $X^{i}$.
    Send the features $X^{i}$ to the final classification layer $FC_{2}$, execute the $softmax$ operation and get the predicted label $y^{i}$.
    $Y^{i} \leftarrow y^{i}$
end for

The Mel spectrogram is calculated from the spectrogram obtained using the STFT (Short–Time Fourier Transform) algorithm, whose frequencies are then converted to the Mel scale according to Formula (1):

$$f_{mel}(f_{hz}) = 2595 \log_{10}\left(1 + \frac{f_{hz}}{700}\right), \tag{1}$$

where $f_{mel}$ is the Mel frequency and $f_{hz}$ is the frequency in Hertz. In CQT, in contrast to a standard spectrogram, each signal frame is transformed using the Constant–Q Transform [20]. This method uses a bank of filters corresponding to subsequent octaves. The center frequencies of successive filters are determined according to Formula (2):

$$f_{m} = \left(2^{\frac{1}{b}}\right)^{m} \cdot f_{min}, \tag{2}$$

where $f_{m}$ is the center frequency of the m–th filter, b is the number of filters per octave and $f_{min}$ is the minimum frequency. The constant Q is then calculated according to Equation (3):

$$Q = \frac{f_{m}}{\Delta f_{m}} = \frac{f_{m}}{f_{m+1} - f_{m}} = \left(2^{\frac{1}{b}} - 1\right)^{-1}. \tag{3}$$

In the final step, the frequency-domain representation is computed for subsequent signal frames according to Formula (4):

$$X_{p}[m] = \frac{1}{N[m]} \cdot \sum_{n=0}^{N[m]-1} W[m, n] \cdot x_{p}[n] \cdot e^{\frac{-j 2 \pi Q n}{N[m]}}, \tag{4}$$

where W is the window function, p is the signal frame number, and $N[m]$ is the frame length for bin m, which is determined according to the following equation:

$$N[m] = Q \cdot \frac{f_{s}}{f_{m}}, \tag{5}$$

where $f_{s}$ denotes the sampling rate. This representation was initially proposed for reproducing the Western musical scale [21], but it can also be used to classify emotional states based on speech signals [18,22].
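Formulas (1) to (5) can be checked numerically with a short, dependency-free sketch. It is an illustration rather than the implementation used in the experiments: the Hann window for $W$, the rounding of $N[m]$ to whole samples, and the test tone are assumptions, while the 84 bins, 12 bins per octave, minimum frequency of 32.70 Hz and 16 kHz sampling rate come from the experimental setup and the EmoDB description.

```python
import cmath
import math

def hz_to_mel(f_hz):
    """Formula (1): convert a frequency in Hertz to the Mel scale."""
    return 2595.0 * math.log10(1.0 + f_hz / 700.0)

def cqt_center_freqs(f_min, bins_per_octave, n_bins):
    """Formula (2): geometrically spaced center frequencies."""
    return [f_min * 2.0 ** (m / bins_per_octave) for m in range(n_bins)]

def cqt_q(bins_per_octave):
    """Formula (3): constant ratio of center frequency to bandwidth."""
    return 1.0 / (2.0 ** (1.0 / bins_per_octave) - 1.0)

def cqt_frame_lengths(freqs, q, fs):
    """Formula (5), rounded to whole samples (the rounding is an assumption)."""
    return [max(1, round(q * fs / f)) for f in freqs]

def cqt_bin(frame, m, q, lengths):
    """Formula (4) for a single bin m, using a Hann window as an assumed W."""
    n_m = lengths[m]
    acc = 0j
    for n in range(n_m):
        w = 0.5 - 0.5 * math.cos(2.0 * math.pi * n / n_m)
        acc += w * frame[n] * cmath.exp(-2j * math.pi * q * n / n_m)
    return acc / n_m

fs = 16000                                 # sampling rate in Hz
freqs = cqt_center_freqs(32.70, 12, 84)    # parameters from the experiments
q = cqt_q(12)
lengths = cqt_frame_lengths(freqs, q, fs)

# Hypothetical test signal: a pure tone at the center frequency of bin 24.
tone = [math.cos(2.0 * math.pi * freqs[24] * n / fs) for n in range(lengths[0])]
mag_on = abs(cqt_bin(tone, 24, q, lengths))    # bin tuned to the tone
mag_off = abs(cqt_bin(tone, 30, q, lengths))   # bin half an octave higher
print(round(q, 3), lengths[0], lengths[-1], mag_on > mag_off)
```

Because $N[m]$ shrinks as $f_{m}$ grows, low bins are analyzed with long frames and high bins with short ones, which is what gives the CQT its constant relative bandwidth.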

As the set of injected low–level features ($\gamma_{2}$), we used statistical features calculated from the fundamental frequency signal $F_{0}$, three consecutive formants $F_{1}$, $F_{2}$, $F_{3}$ and the derivatives of these values, resulting in 99 features in total. Our motivation for selecting the fundamental frequency and its harmonics is based on numerous studies [16,23,24,25] that confirm the crucial role of $F_{0}$ in emotion recognition from speech signals. The fundamental frequency and its harmonics are strongly linked to speech intonation, which is an important element in expressing emotions. Moreover, we also considered previous studies focused on the interpretation of convolutional networks in the context of emotion recognition [26]. These studies demonstrated a significant relationship between the fundamental frequency and how CNNs learn to recognize emotions in time-frequency representations, further supporting the rationale for choosing this feature in our approach. A general list of the feature space and the sizes of the individual groups is presented in Figure 2. The letters U, V and G represent the unvoiced part, the voiced segment and the gradient, respectively. The symbols $LR$, $\Delta$, J, and $SH$ represent linear regression, derivative, jitter and shimmer, respectively. The letter S represents a set of statistical descriptors comprising the mean, minimum and maximum values, standard deviation, range, and interquartile range. The numbers in brackets at the bottom give the number of low–level features in each group.
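As an illustration of the descriptor set S, the six statistics can be computed for a single contour as follows. This is a sketch under assumptions: the population standard deviation, the inclusive quantile method for the interquartile range, the first difference as a stand-in for the derivative, and the sample $F_{0}$ values are all hypothetical choices, since the text does not specify them.

```python
import statistics

def descriptors(values):
    """Mean, minimum, maximum, standard deviation, range and interquartile range."""
    q1, _, q3 = statistics.quantiles(values, n=4, method="inclusive")
    lo, hi = min(values), max(values)
    return {
        "mean": statistics.fmean(values),
        "min": lo,
        "max": hi,
        "std": statistics.pstdev(values),  # population form, an assumption
        "range": hi - lo,
        "iqr": q3 - q1,
    }

def delta(values):
    """First difference, used here as a simple stand-in for the derivative."""
    return [b - a for a, b in zip(values, values[1:])]

f0_voiced = [180.0, 185.0, 190.0, 200.0, 195.0]  # hypothetical F0 values in Hz
f0_stats = descriptors(f0_voiced)
f0_delta_stats = descriptors(delta(f0_voiced))
print(f0_stats)
print(f0_delta_stats)
```

Applying the same function to each contour ($F_{0}$, the formants and their derivatives) yields one group of descriptors per contour, which are then collected into a single feature vector.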

Additionally, Figure 3 depicts examples of the individual representations of the speech signal.

2.3. Deep Neural Network Architectures

Because this work uses time–frequency representations, we chose the ResNet [27] convolutional architecture as the base model of the neural network. This architecture is prevalent in image processing and in the classification of emotions in speech signals [28,29], and it is also used with time–frequency representations [30]. It was designed to solve the vanishing gradient problem that appears in models composed of many layers. Thanks to the residual connections in the residual blocks from which it is built, it is possible to build deep models and train their layers without significant risk of getting stuck in a local minimum. Moreover, this mechanism keeps the architecture highly efficient even when many layers are used. The ResNet152 variant was used in this work.

3. Experiments and Results

We tested the proposed mechanism in various proportions $\beta = K / M$, where K is the number of injected features and M specifies the total number of attributes in the final classification stage. Then, we compared the results with those obtained by base architectures with the same dimensionality of the final feature space but without the injection mechanism. The size of the final feature space M took values from two different sets. For the base models, these were 32, 64, 128, 256, 512, 1024 and 2048. The models with the injection mechanism used the same values except the first two, 32 and 64. This was necessary because the number of injected features, $K = 99$, would otherwise be greater than the target feature space. Additionally, to obtain more reliable results, we prepared five models with different starting seeds for each variant, and part of the results is presented as averages. As a result, jointly for both base representations and the different values of M, we prepared 120 models.
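The proportions and the model count follow directly from these numbers, as the short calculation below shows. The variable names are illustrative; all values are taken from the paragraph above.

```python
K = 99                                           # number of injected features
base_sizes = [32, 64, 128, 256, 512, 1024, 2048]
# Injection requires Z = M - K > 0, which rules out M = 32 and M = 64.
injection_sizes = [m for m in base_sizes if m > K]
betas = {m: K / m for m in injection_sizes}      # beta = K / M

representations = ["CQT", "Mel"]
seeds_per_variant = 5
n_models = (len(base_sizes) + len(injection_sizes)) * len(representations) * seeds_per_variant

for m in injection_sizes:
    print(f"M={m:5d}  Z={m - K:5d}  beta={betas[m]:.2f}")
print("models:", n_models)
```

For M = 2048 the injected features make up about 5% of the final feature space, while for M = 128 they make up about 77% of it.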

We used the following parameters to calculate the CQT representation: 84 bins, 12 bins per octave and a minimum frequency of 32.70 Hz. For Mel spectrograms, we used 128 filters. The individual models were trained for 100 epochs with a learning rate of 0.001, the Adam optimizer and cross entropy as the loss function. All recordings were randomly divided into training, validation and test subsets in proportions of 60%, 20% and 20%, respectively.
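The setup above can be summarized as a configuration together with a random 60/20/20 split. This is a sketch only: the dictionary keys, the `split_indices` helper and the seed are hypothetical, and the text does not state whether the split was stratified by emotion or speaker, so a plain shuffle is assumed here.

```python
import random

# Values from the text; the key names are illustrative.
config = {
    "cqt": {"n_bins": 84, "bins_per_octave": 12, "f_min_hz": 32.70},
    "mel": {"n_filters": 128},
    "training": {"epochs": 100, "learning_rate": 0.001,
                 "optimizer": "Adam", "loss": "cross_entropy"},
    "split": {"train": 0.6, "val": 0.2, "test": 0.2},
}

def split_indices(n_items, seed, train=0.6, val=0.2):
    """Shuffle item indices and cut them into train, validation and test parts."""
    indices = list(range(n_items))
    random.Random(seed).shuffle(indices)
    n_train = round(n_items * train)
    n_val = round(n_items * val)
    return (indices[:n_train],
            indices[n_train:n_train + n_val],
            indices[n_train + n_val:])

# 535 is the number of EmoDB recordings; the seed is an arbitrary choice.
train_idx, val_idx, test_idx = split_indices(
    535, seed=0, train=config["split"]["train"], val=config["split"]["val"])
print(len(train_idx), len(val_idx), len(test_idx))
```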

3.1. Datasets

The experiments were conducted on the Berlin Database of Emotional Speech (EmoDB) [31] and RAVDESS dataset [32].

The EmoDB database contains 535 German utterances recorded in mono with a sampling rate of 16 kHz. The utterances are spoken by ten actors, five men and five women. The actors speak ten different sentences in the following emotional states: anger (ang), disgust (dis), fear (fea), happiness (hap), sadness (sad), boredom (bor) and neutral (neu). Figure 4 shows the distribution of recordings for each emotion. Additionally, Table 1 shows the distribution of recordings for individual speakers and emotional states.

The RAVDESS database contains 1440 English utterances recorded in mono with a 48 kHz sampling rate. The utterances are spoken by 24 actors, 12 men and 12 women, with two sentences per actor in eight emotional states: neutral, calm (cal), happiness, sadness, anger, fear, disgust, and surprise (sur). Unlike in the EmoDB database, recordings representing particular emotions in the RAVDESS database have two levels of intensity. For a more accurate comparison of the results between databases, we used only the RAVDESS recordings with an emotional intensity equal to 2, except for the neutral state, where all recordings have the same intensity level. The final set contained 96 recordings for every emotional state, giving 768 examples.

3.2. Evaluation Metric

Due to an imbalance problem in the EmoDB database (Table 1 and Figure 4), we adopt two different evaluation metrics to compare speech emotion recognition results: unweighted accuracy (UA) and weighted accuracy (WA) [8,33,34].

In the case of the RAVDESS database, where the distribution of recordings across emotions is equal, we present the results only as weighted accuracy. The weighted accuracy is calculated as:

$$WA = \frac{\sum_{c=1}^{C} TP_{c}}{\sum_{c=1}^{C} (TP_{c} + FN_{c})}, \tag{6}$$

whereas unweighted accuracy is calculated using the following equation:

$$UA = \frac{1}{C} \sum_{c=1}^{C} \frac{TP_{c}}{TP_{c} + FN_{c}}, \tag{7}$$

where $TP$ (True Positives) represents the number of samples correctly predicted as positive, $FP$ (False Positives) is the number of samples wrongly predicted as positive, $FN$ (False Negatives) is the number of samples wrongly predicted as negative, and C is the number of emotion classes.
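Both metrics can be computed directly from a confusion matrix, as in the minimal sketch below. It assumes that rows correspond to true classes and columns to predicted classes and that every class has at least one sample; the three-class matrix is a made-up example, not data from the experiments.

```python
def weighted_accuracy(conf):
    """Equation (6): correctly classified samples divided by all samples."""
    correct = sum(conf[c][c] for c in range(len(conf)))
    total = sum(sum(row) for row in conf)
    return correct / total

def unweighted_accuracy(conf):
    """Equation (7): mean of per-class recalls TP_c / (TP_c + FN_c)."""
    recalls = [row[c] / sum(row) for c, row in enumerate(conf)]
    return sum(recalls) / len(recalls)

# Hypothetical confusion matrix: rows are true classes, columns are predictions.
conf = [
    [8, 2, 0],
    [1, 1, 0],
    [0, 1, 3],
]
wa = weighted_accuracy(conf)
ua = unweighted_accuracy(conf)
print(f"WA={wa:.4f}  UA={ua:.4f}")
```

The small second class pulls UA below WA in this example, which is why both values are reported for the imbalanced EmoDB database.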

3.3. Results

Figure 5 shows the distribution of classification results on the test set for all prepared models without the injection mechanism, grouped by the size of the target feature space M. The graph on the left (Figure 5a) shows the $WA$ results obtained by the models trained on the EmoDB database, and the chart on the right (Figure 5b) presents the results for the models trained on the RAVDESS database. Figure 6 illustrates the distribution of classification results in the same manner for the models with the injection mechanism. As in Figure 5, the left graph (Figure 6a) presents results for the EmoDB database and the right graph (Figure 6b) for the RAVDESS database. As can be observed, for both databases the models trained on CQT representations achieved significantly better classification accuracy than the models trained on Mel spectrograms in both variants, with and without the injection mechanism. Additionally, for both EmoDB and RAVDESS, these graphs illustrate a noticeable improvement in classification with the feature injection mechanism for both the CQT and the Mel spectrogram. Moreover, for EmoDB and the models trained on the CQT representation, the injection mechanism significantly reduced the range of results obtained by individual models within a given variant, which is especially visible for $M = 128$ and $M = 2048$. Furthermore, comparing the highest accuracy obtained by the baseline models with the best result using the injection mechanism gives an improvement of 3.9% for CQT and 6.25% for the Mel spectrogram on EmoDB. For RAVDESS, it was 1.06% for the Mel spectrogram and 4.12% for the CQT representation.

Table 2, Table 3 and Table 4 show averaged results for the individual base representations and sizes of the target feature space M, as well as the corresponding values of $\beta$. The results are presented as $WA$ for the base models and $WA_{I}$ for the models with the injection mechanism. For the EmoDB database, the results are additionally presented as $UA$ for the base models and $UA_{I}$ for the models with the injection mechanism. Additionally, for each measure, the values $WA_{\Delta}$ and $UA_{\Delta}$ give the improvement obtained after applying the feature injection mechanism for the individual sizes M. For EmoDB, the highest average score for the base models, 74.01% (WA), was obtained for the CQT representation and $M = 32$, while the highest average score for the variant using the injection mechanism was 75.31% (WA) for $M = 128$, which gives an improvement of 1.3% (WA). For the same value $M = 128$, the improvement was 4.69% (WA).

The highest average improvement, 8.91% (WA), was observed for $M = 2048$, where $\beta$ was 0.05, so the injected features represented only 5% of the final feature space. The lowest improvement, 2.5% (WA), was obtained for $M = 256$, where the injected features constituted about 40% of the feature space. For the Mel spectrogram, the highest average accuracy for the base models was 62.5% (WA), obtained for $M = 512$, and for the variant with the injection mechanism it was 66.25% (WA), obtained for $M = 1024$, which gives an improvement of 3.75% (WA). However, for $M = 1024$, for which the overall highest result was obtained for the Mel spectrogram when injecting features, the average improvement was 8.75% (WA), with the injected features constituting 10% of the total. The lowest average difference, 0.87% (WA), was observed for $M = 512$ and $\beta = 0.19$.

For the RAVDESS database, the highest average accuracy for the base models, 66.04% (WA), was also achieved for the CQT representation and $M = 32$. For the variant with the injection mechanism, it was 69.06% (WA) for $M = 256$, which gives an improvement of 3.02% (WA). For the same value $M = 256$, the improvement was 4.27% (WA). The best average improvement, 4.48% (WA), was achieved for $M = 1024$, where $\beta$ was equal to 0.10. The lowest improvement, 1.77%, was observed for $M = 512$ and $\beta = 0.19$. For the base models trained on Mel spectrograms, the highest average result was 56.77% (WA), which, compared to the best score of 57.4% (WA) for the variant with the injection mechanism, gives an improvement of 0.63% (WA).

The graphs in Figure 7 show the relationship between classification quality ($WA$) and the size of the target feature space. Figure 7a,c present the results for models without the injection mechanism for the EmoDB and RAVDESS databases, respectively. Figure 7b,d illustrate the results for models with the injection mechanism, also for EmoDB and RAVDESS. Comparing the graphs for EmoDB shows that the injection mechanism not only had a positive impact on the average classification result but also reduced the average spread of results obtained for different values of M. Additionally, Figure 7b shows how the $\beta$ coefficient changes with M.

Moreover, changes in the $\beta$ coefficient do not significantly affect the average level of classification, which remains roughly constant throughout the entire range of $\beta$. This, together with the average improvement in results after using the injection mechanism, may indicate the importance of the injected features, which partially reduce the complexity of the neural network architecture while maintaining its level of prediction.

Figure 8 and Figure 9 show the confusion matrices of the models trained on EmoDB, without and with the injection mechanism, that received the highest scores for CQT and Mel spectrograms, respectively. Similarly, Figure 10 and Figure 11 illustrate the confusion matrices for the RAVDESS database. Table 5 and Table 6 present the classification accuracy improvement in individual emotional states for the models depicted in the confusion matrices. For EmoDB, injecting additional features ($\beta = 0.05$) for the CQT representation significantly improved the classification accuracy for the dis emotional state, by 30% (WA). A slight improvement was also observed for the fea and ang states, equal to 16.7% (WA) and 6.7% (WA), respectively. The classification accuracy of neu and bor decreased, while the results for hap and sad did not change. A different result occurred for the Mel spectrogram ($\beta = 0.1$), where only hap got worse, by 16.6% (WA). A significant improvement was observed for neu, equal to 21% (WA), and for fea, equal to 38.9% (WA). There were no changes for ang, sad, dis and bor. A different situation can be observed in the results for the RAVDESS database. The most significant improvement occurred for dis, 12.5% (WA), in the case of the Mel spectrogram and for hap and dis, 16.7% (WA), in the case of the CQT representation. On the other hand, the classification accuracy decreased for hap and cal in the model trained on Mel spectrograms. For the CQT representation, the classification of neu, ang, sad, and sur deteriorated. For EmoDB, the classification of the fea emotional state improved noticeably for both representations, while the accuracy for sad remained unchanged in both cases. For RAVDESS, the highest positive change occurred for dis for both representations. For both EmoDB and RAVDESS, with the Mel spectrogram the accuracy for ang remained unchanged, but the accuracy for hap decreased. For the models trained on the CQT representation, the classification of neu also decreased for both databases.

Moreover, Figure 12 and Figure 13 show the classification distribution of individual emotional states for the models with the size of the final feature space M for which the highest average improvement was observed when using the injection mechanism. For EmoDB, it was $M = 2048$ for the CQT representation and $M = 1024$ for the Mel spectrogram. For RAVDESS, it was $M = 1024$ for both representations. The ranges of some results narrowed significantly: hap and fea for CQT, and neu for the Mel spectrogram. In both cases, the range of results for the bor state widened.

Additionally, Table 7 and Table 8 present the average values over all models, without the injection mechanism ($WA$) and with the injection mechanism (${WA}_{I}$), for individual emotional states. Furthermore, for every emotion, the change in classification accuracy after using the injection mechanism, ${WA}_{Δ}$, is shown. Table 7 contains the results for the EmoDB database. As can be seen, in the case of the CQT representation, the highest average improvement was recorded for hap (14.6%) and neu (6.5%). Classification of the other emotional states also improved slightly, in the range from 1.2% to 4%. In turn, for Mel spectrograms, the classification of the neu emotional state improved the most, by 8.7%; the other emotions also improved, but to a lesser extent, in the range 3.5–5.9%. The exception was the sad emotion, where the injection mechanism worsened the average classification level by 1%. The situation is different in the case of the RAVDESS database, shown in Table 8. The results for the CQT representation show a significant improvement for hap (7.5%) and sad (7.1%). However, a minimal deterioration of 0.1% in dis classification accuracy was observed. For the remaining emotional states, a slight improvement occurred, from 0.5% to 3.5%. For models trained on Mel spectrograms, a significant deterioration was observed for cal (6.6%), while the highest improvement occurred for the sad emotional state, equal to 6.2%. For the rest of the emotions, accuracy improved slightly, from 0.2% to 4.8%.
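As a worked example of how the per-emotion change relates to the two accuracy columns, the sketch below recomputes ${WA}_{Δ} = {WA}_{I} - WA$ for the CQT columns of Table 7 (EmoDB). The names `table7_cqt` and `accuracy_change` are hypothetical and introduced only for illustration. Because the tabulated accuracies are already rounded to one decimal, a recomputed difference may deviate from the reported ${WA}_{Δ}$ by 0.1 (here, for bor).

```python
# Average per-emotion accuracy (%) for EmoDB with the CQT spectrogram,
# copied from Table 7: emotion -> (WA without injection, WA_I with injection).
# Names are illustrative assumptions, not taken from the authors' code.
table7_cqt = {
    "neu": (64.0, 70.5),
    "ang": (91.2, 92.4),
    "fea": (47.1, 49.3),
    "hap": (49.9, 64.5),
    "sad": (88.6, 90.7),
    "dis": (50.0, 54.0),
    "bor": (77.9, 79.6),
}


def accuracy_change(wa, wa_injected):
    """Change in accuracy (percentage points) after adding the injection mechanism."""
    return round(wa_injected - wa, 1)


changes = {emotion: accuracy_change(*pair) for emotion, pair in table7_cqt.items()}
largest = max(changes, key=changes.get)

for emotion, delta in changes.items():
    print(f"{emotion}: {delta:+.1f}")
print("largest average improvement:", largest)
```

The largest recomputed change is for hap, in line with the 14.6% reported in Table 7.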

4. Conclusions

We have proposed a simple approach to fuse the embedded space of convolutional neural networks and low–level features to obtain a feature space with higher discriminative power for emotion recognition in speech signals. The experiments showed that the CQT spectrogram as a source representation achieves better classification results than the popular Mel spectrogram for emotional speech. The results show that the models that used additional information from low–level features achieved better overall classification performance. In the case of the EmoDB dataset, the average improvement in weighted accuracy was 1.29% for the models trained on CQT representations and 3.75% for the models trained on Mel spectrograms. For RAVDESS, it was 3.02% and 0.63% for the CQT representation and the Mel spectrogram, respectively. Furthermore, for the variants with the highest change in classification score, the results demonstrate that the average classification performance improved for almost all emotions compared to the variants without an injection mechanism for both databases. In the case of EmoDB, for the best models with Mel spectrograms as the base representation, the accuracy of all emotions improved. The highest improvement, around 14%, was obtained for the sad and dis states, and the lowest, around 5%, for the ang, hap, and bor emotional states. For models trained on CQT representations, the most significant improvement in classification occurred for the hap and dis emotional states (more than 20%). On the other hand, the classification accuracy of the anger and boredom states decreased by around 2%. For the models that achieved the best accuracy on the RAVDESS database, the classification of neu improved by around 16% using the Mel spectrogram, and in the case of CQT representations, the classification of emotional states such as sur and fea improved by around 9%. In most cases, the resulting average classification accuracy improved for individual emotional states.
The most positive impact of the injection mechanism was observed for the hap (14.6%) and neu (8.7%) emotions for the EmoDB database. For the RAVDESS dataset, the most significant improvements, equal to 7.5% and 7.1%, occurred for the hap and sad emotional states. Additionally, the classification of the sad emotion for EmoDB combined with the Mel spectrogram deteriorated by 1%. A similar situation occurred in the case of the RAVDESS database, where the Mel spectrogram showed the highest average accuracy decline, 6.6%, but this time for the cal emotional state.
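The overall improvement figures quoted in these conclusions (1.29%, 3.75%, 3.02%, and 0.63%) can be related to Tables 2–4 under one possible reading, which is an assumption made here and not stated in the text: each figure is the difference between the highest mean $WA$ with injection and the highest mean $WA$ without injection, taken over all M. The Python sketch below applies that reading; the names `mean_wa` and `best_gain` are hypothetical.

```python
# Mean WA (%) for each M, copied from Tables 2-4. "without" covers
# M = 32..2048, "with" covers M = 128..2048 (no injection results are
# tabulated for M = 32 and 64). All names are illustrative assumptions.
mean_wa = {
    ("EmoDB", "CQT"): {
        "without": [74.01, 68.59, 70.62, 72.03, 69.84, 67.5, 65.78],
        "with": [75.31, 75.16, 72.97, 72.66, 74.69],
    },
    ("EmoDB", "Mel"): {
        "without": [60.78, 62.19, 58.44, 58.44, 62.5, 58.28, 59.22],
        "with": [64.69, 65.02, 63.75, 66.25, 63.28],
    },
    ("RAVDESS", "CQT"): {
        "without": [66.04, 63.85, 64.06, 64.79, 63.44, 62.71, 64.90],
        "with": [66.49, 69.06, 65.21, 67.19, 68.65],
    },
    ("RAVDESS", "Mel"): {
        "without": [56.25, 54.58, 53.96, 54.27, 56.77, 53.85, 54.37],
        "with": [55.10, 56.35, 56.67, 57.40, 56.98],
    },
}


def best_gain(scores):
    """Best mean WA with injection minus best mean WA without it (assumed reading)."""
    return round(max(scores["with"]) - max(scores["without"]), 2)


gains = {key: best_gain(scores) for key, scores in mean_wa.items()}
for (database, representation), gain in gains.items():
    print(f"{database} / {representation}: {gain:.2f}")
```

Under this reading, three of the four values are reproduced exactly (3.75, 3.02, and 0.63), and EmoDB with CQT gives 1.30 against the reported 1.29, a gap that is presumably due to rounding of the tabulated means.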

In future work, we plan to use other low–level features, including other attributes that combine the properties of a signal in different domains. Additionally, we plan to apply our method to recurrent neural networks and transformer architectures.

Author Contributions

Conceptualization, L.S. and T.M.; methodology, T.M.; software, L.S.; validation, L.S. and T.M.; formal analysis, L.S. and T.M.; investigation, L.S. and T.M.; data curation, L.S.; writing—original draft preparation, L.S. and T.M.; writing—review and editing, L.S. and T.M.; visualization, L.S. and T.M.; supervision, T.M. All authors have read and agreed to the published version of the manuscript.

Funding

This research received no external funding.

Institutional Review Board Statement

Not applicable.

Informed Consent Statement

Not applicable.

Data Availability Statement

The list of all used low-level features with brief descriptions, information about dataset split strategy and code examples based on the model that achieved the best results with the test subset are available in the project repository located at https://github.com/staticvoice/emoinjection (accessed on 19 February 2025).

Conflicts of Interest

The authors declare no conflicts of interest.

Figure 1. The proposed architecture of the injection mechanism.

Figure 2. Low–level features ($γ_{2}$) used in injection process.

Figure 3. Audio representation for selected recording (03a04Nc.wav) with a neutral emotional state. From top to bottom: CQT spectrogram, the Mel spectrogram, fundamental frequency, and formants.

Figure 4. The distribution of recordings in EmoDB database.

Figure 5. Accuracy distribution for models trained on CQT and Mel representations without injection mechanism for EmoDB (a) and RAVDESS (b).

Figure 6. Accuracy distribution for models trained on CQT and Mel representations with injection mechanism for EmoDB (a) and RAVDESS (b).

Figure 7. The classification accuracy relationship with the target feature space size and $β$ coefficient for models without (a,c) and with (b,d) injection mechanism. The top row is for the EmoDB database, and the bottom is for the RAVDESS database.

Figure 8. The confusion matrices for the best models trained on CQT representations without (a) and with injection mechanism (b)—EmoDB.

Figure 9. The confusion matrices for the best models trained on Mel spectrograms without (a) and with injection mechanism (b)—EmoDB.

Figure 10. The confusion matrices for the best models trained on CQT representations without (a) and with injection mechanism (b)—RAVDESS.

Figure 11. The confusion matrices for the best models trained on Mel representations without (a) and with injection mechanism (b)—RAVDESS.

Figure 12. The accuracy distribution of individual emotions for models with M size for which the highest improvement occurred. Models without injection mechanisms trained on CQT and Mel spectrograms for EmoDB database (a,b) for RAVDESS database.

Figure 13. The accuracy distribution of individual emotions for models with M size for which the highest improvement occurred. Models with injection mechanisms trained on CQT and Mel spectrograms for EmoDB database (a,b) for RAVDESS database.

Table 1. Distribution of recordings for individual speakers and the emotional states.

Speaker ID	ang	bor	fea	hap	neu	sad	dis	Sum
03	14	5	4	7	11	7	1	49
08	12	10	6	11	10	9	0	58
09	13	4	1	4	9	4	8	43
10	10	8	8	4	4	3	1	38
11	11	8	10	8	9	7	2	55
12	12	5	6	2	4	4	2	35
13	12	10	7	10	9	5	8	61
14	16	8	12	8	7	10	8	69
15	13	9	8	6	11	4	5	56
16	14	14	7	11	5	9	11	71
Total	127	81	69	71	79	62	46	535

Table 2. Mean values for $UA$, $WA$, and their changes ${WA}_{Δ}$, ${UA}_{Δ}$ for each $M$ and $β$ value (CQT spectrogram), EmoDB.

M	$WA$	$UA$	$β$	${WA}_{I}$	${UA}_{I}$	${WA}_{Δ}$	${UA}_{Δ}$
32	74.01	71.61	-	-	-	-	-
64	68.59	66.07	-	-	-	-	-
128	70.62	68.41	0.77	75.31	72.48	4.69	4.07
256	72.03	69.81	0.39	75.16	72.31	3.13	2.5
512	69.84	67.47	0.19	72.97	70.36	3.13	2.89
1024	67.5	63.40	0.10	72.66	69.67	5.16	6.27
2048	65.78	62.03	0.05	74.69	72.98	8.91	10.95

Table 3. Mean values for $UA$, $WA$, and their changes ${WA}_{Δ}$, ${UA}_{Δ}$ for each $M$ and $β$ value (Mel spectrogram), EmoDB.

M	$WA$	$UA$	$β$	${WA}_{I}$	${UA}_{I}$	${WA}_{Δ}$	${UA}_{Δ}$
32	60.78	59.52	-	-	-	-	-
64	62.19	61.95	-	-	-	-	-
128	58.44	56.91	0.77	64.69	62.69	6.24	5.78
256	58.44	56.48	0.39	65.02	62.89	6.58	6.41
512	62.5	61.15	0.19	63.75	62.02	1.24	0.87
1024	58.28	56.73	0.10	66.25	65.48	7.82	8.75
2048	59.22	56.39	0.05	63.28	61.79	4.06	5.4

Table 4. Mean values for $WA$ and their changes ${WA}_{Δ}$ for each $M$ and $β$ value (CQT and Mel spectrogram), RAVDESS.

	CQT Spectrogram			Mel Spectrogram
$M$	$WA$	${WA}_{I}$	${WA}_{Δ}$	$WA$	${WA}_{I}$	${WA}_{Δ}$	$β$
32	66.04	-	-	56.25	-	-	-
64	63.85	-	-	54.58	-	-	-
128	64.06	66.49	2.43	53.96	55.10	1.14	0.77
256	64.79	69.06	4.27	54.27	56.35	2.08	0.39
512	63.44	65.21	1.77	56.77	56.67	−0.1	0.19
1024	62.71	67.19	4.48	53.85	57.40	3.55	0.10
2048	64.90	68.65	3.75	54.37	56.98	2.61	0.05

Table 5. Differences between classification accuracy $WA$ of individual emotions with and without injection mechanisms for the best models trained on the EmoDB database.

	neu	ang	fea	hap	sad	dis	bor
Mel	21	0	38.9	−16.6	0	0	0
CQT	−5.3	6.7	16.7	0	0	30	−11.1

Table 6. Differences between classification accuracy $WA$ of individual emotions with and without injection mechanisms for the best models trained on the RAVDESS database.

	neu	ang	fea	hap	sad	dis	cal	sur
Mel	0	0	4.2	−8.3	0	12.5	−4.2	4.2
CQT	−12.5	−8.3	8.4	16.7	−4.1	16.7	12.5	−4.1

Table 7. Average accuracy of all models without ($WA$) and with (${WA}_{I}$) injection mechanism and their changes (${WA}_{Δ}$) for individual emotional states, EmoDB.

	CQT Spectrogram			Mel Spectrogram
	$WA$	${WA}_{I}$	${WA}_{Δ}$	$WA$	${WA}_{I}$	${WA}_{Δ}$
neu	64.0	70.5	6.5	48.6	57.2	8.7
ang	91.2	92.4	1.2	78.3	81.7	3.5
fea	47.1	49.3	2.2	35.4	41.3	5.9
hap	49.9	64.5	14.6	41.3	46.9	5.6
sad	88.6	90.7	2.1	88.8	87.7	−1.0
dis	50.0	54.0	4.0	51.1	55.2	4.1
bor	77.9	79.6	1.6	65.7	70.7	5.0

Table 8. Average accuracy of all models without ($WA$) and with (${WA}_{I}$) injection mechanism and their changes (${WA}_{Δ}$) for individual emotional states, RAVDESS.

	CQT Spectrogram			Mel Spectrogram
	$WA$	${WA}_{I}$	${WA}_{Δ}$	$WA$	${WA}_{I}$	${WA}_{Δ}$
neu	76.8	77.3	0.5	77.0	81.8	4.8
ang	77.3	78.7	1.4	62.1	65.2	3.0
fea	61.8	63.8	2.1	46.7	46.8	0.2
hap	37.9	45.3	7.5	25.2	26.3	1.1
sad	49.8	56.8	7.1	40.6	46.8	6.2
dis	57.7	57.7	−0.1	41.4	43.2	1.8
cal	71.7	75.2	3.5	76.1	69.5	−6.6
sur	81.2	83.7	2.5	69.8	72.3	2.6

© 2025 by the authors. Licensee MDPI, Basel, Switzerland. This article is an open access article distributed under the terms and conditions of the Creative Commons Attribution (CC BY) license (https://creativecommons.org/licenses/by/4.0/).
