1. Introduction
Humans produce a wide range of sounds in conversation, articulating words across many languages [1,2]. Humans understand these words effortlessly when communicating with one another, but emulating this understanding in machines requires considerable speech signal processing. Human speech signals reveal exploitable information about the language, message content, intonation, accent, gender, and other distinctive properties of the speaker. Making machines capable of inferring emotional information from speakers' utterances by processing and classifying speech signals is referred to as Speech Emotion Recognition (SER) [3].
Researchers have recently paid considerable attention to the creation of efficient human-technology communication interfaces, a topic that falls under human-computer interaction (HCI). More broadly, emotion recognition from speech signals has been used in a variety of applications, including speech synthesis, robotics, the automobile industry, multimedia technologies, healthcare, education, forensics, customer service, and entertainment [4]. Text-to-speech technology, for instance, aids in the development of screen readers for the blind [5]. Robotic systems increasingly process sensor data from internal and external environments, and emotion recognition can inform their decision-making. Experimental paradigms that identify emotions from publicly available datasets must be carefully evaluated, since they may fail when used in a real-world context [6]. In the automobile industry, advanced image/video and voice-based signal processing methods are employed to identify drowsiness and fatigue in drivers [7,8,9].
The abstraction of hidden features of an individual or a system, extracted from heterogeneous multimedia content for recommendation systems, has received the attention of SER researchers [10]. Detecting the severity of viral infections through patients' voice signals has opened a new challenge area in medical science since the COVID-19 outbreak [11]. Speech recognition techniques have recently been incorporated into online learning platforms for emotional analysis: positive emotions strengthen students' motivation, dedication, and self-learning abilities, whereas negative emotions adversely affect their performance [12]. Recent studies of human brain imaging have revealed how the brain performs cognition-based speech synthesis, giving rise to new research trends in automatic speech recognition (ASR) and forensic systems [11,12,13]. To provide high-quality customer care services, SER techniques are also being applied to estimate the emotions of telemarketing agents and customers from their conversations.
Emotion detection can be performed directly on speech signals. A classic SER system includes signal pre-processing, feature extraction, feature selection, and a classifier that maps the signal to emotional labels such as happiness, sadness, fear, surprise, and anger. Each phase of the SER process is detailed as follows.
Speech Signal Processing: Speech data collected from speakers' utterances are generally uneven and noisy, and require suitable pre-processing and post-processing before appropriate acoustic features can be extracted [13]. Pre-emphasis, speech segmentation or block framing, window weighting functions, voice activity detection (VAD), normalization, noise reduction, and feature selection are examples of commonly used speech processing techniques; a brief sketch of the first two appears below.
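As an illustration, the following minimal Python sketch shows pre-emphasis and block framing with a window weighting function. It assumes NumPy and uses conventional settings (a 0.97 pre-emphasis coefficient, 25 ms frames with a 10 ms hop, and a Hamming window) that are not specified in this paper.

```python
import numpy as np

def preemphasis(signal, alpha=0.97):
    # y[n] = x[n] - alpha * x[n-1]; boosts the high-frequency content
    return np.append(signal[0], signal[1:] - alpha * signal[:-1])

def frame_signal(signal, sr, frame_ms=25, hop_ms=10):
    # Split the signal into overlapping frames and apply a Hamming window
    frame_len = int(sr * frame_ms / 1000)
    hop_len = int(sr * hop_ms / 1000)
    n_frames = 1 + max(0, (len(signal) - frame_len) // hop_len)
    frames = np.stack([signal[i * hop_len : i * hop_len + frame_len]
                       for i in range(n_frames)])
    return frames * np.hamming(frame_len)

# Example: 1 second of random audio at 16 kHz
x = np.random.randn(16000)
frames = frame_signal(preemphasis(x), sr=16000)
print(frames.shape)  # (n_frames, frame_len)
```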
Feature Extraction: The success of SER classifiers depends mainly on the extraction of discriminative and relevant feature sets. According to experimental studies, there is currently no globally accepted feature set for SER; whether global or local features yield better classification outcomes depends on the research methodology. In addition, inappropriate features add overhead for the classifier. The continuous, variable-length speech signal encloses rich information about emotions. Global features, also known as long-term or suprasegmental features, summarize overall statistics such as minimum and maximum values, mean, median, and standard deviation. Local features, on the other hand (also known as short-term or segmental features) [14], describe the temporally changing aspects of the signal over intervals in which it can be assumed stationary [15].
In recent research, handcrafted feature extraction has remained under intense consideration; however, it requires substantial domain knowledge from experts. Speech features in SER systems [18,19] can be analyzed using the four main categories listed below [16,17].
Prosodic Features: Prosodic or paralinguistic features carry some of the most distinguishing properties of emotional information for SER systems. Prosodic features such as intonation and rhythm are extracted from large units of human utterances and are therefore categorized as long-term features. Typical prosodic features are based on pitch, energy, duration, and frequency.
Spectral Features: The human vocal tract acts as a filter that shapes the sound produced during speech. Vocal tract properties are better explained in the frequency domain, and transforming time-domain signals into equivalent frequency-domain signals through the Fourier transform underlies spectral feature extraction. The segmented acoustic data of variable length (usually 20 to 30 ms) obtained from the window weighting function serves as input to spectral filters. Various types of spectral parameters in speech have been explored using dimensionality reduction methods such as Linear Predictive Coding (LPC), Linear Prediction Cepstral Coefficients (LPCC), Log Frequency Power Coefficients (LFPC), Gammatone Frequency Cepstral Coefficients (GFCC), and Mel-Frequency Cepstral Coefficients (MFCC).
Among the above-mentioned filter banks, MFCC is the most widely implemented in SER because of its robustness and resistance to noisy backgrounds. After segmentation of the utterances, MFCC features are obtained that describe the short-term power spectrum of the acoustic signal. The resonant frequencies of the vocal tract are known as formants, which define the phonetic features of vowels.
Voice Quality Features: Voice quality features describe the physical characteristics of the glottal source, and there exists a strong correlation between voice quality and the emotional content of speech. The set of features associated with voice quality includes jitter (a per-cycle frequency variation parameter), shimmer (an amplitude variation parameter of the sound wave), fundamental frequency, and the harmonics-to-noise ratio (HNR) [20].
TEO-Based Features: The Teager Energy Operator (TEO) was presented by Teager [21] for the detection of stress in speech signals. In a stressful state, muscular tension in the speaker's glottis may change the airflow while producing sound. The non-linear TEO was mathematically formulated by Kaiser as shown in Equation (1):

Ψ[x(n)] = x²(n) − x(n + 1) x(n − 1),  (1)

where Ψ[·] represents the TEO and x(n) depicts the acoustic signal under consideration.
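As a concrete illustration, a minimal NumPy sketch of Kaiser's TEO follows (the helper name and test tone are illustrative, not from the paper):

```python
import numpy as np

def teager_energy(x):
    # Kaiser's Teager Energy Operator: psi[n] = x[n]^2 - x[n+1] * x[n-1]
    x = np.asarray(x, dtype=float)
    return x[1:-1] ** 2 - x[2:] * x[:-2]

# Example: for a pure tone, the TEO tracks (amplitude * frequency)^2 energy
n = np.arange(1000)
tone = np.sin(2 * np.pi * 0.05 * n)
print(teager_energy(tone).mean())
```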
SER Classification: Emotion classification in an SER system is accomplished by a variety of classifiers trained on the master feature vector. Traditional Machine Learning (ML) classifiers [4] used in SER systems include K-Nearest Neighbours (k-NN), Support Vector Machines (SVM), Hidden Markov Models (HMM), Gaussian Mixture Models (GMM), and Artificial Neural Networks (ANN). Additionally, instead of using an individual classifier, different classifiers have also been combined to improve the efficacy of the emotion detection process [22].
Classic ML-based techniques have several limitations when applied to speech emotion detection [23]. First, no ML algorithm is known to extract discriminative and accurate features for multilingual speech databases, and frequency- and time-domain acoustic features alone cannot train a model for multilingual SER. Second, features extracted manually by knowledgeable domain experts are domain-dependent and cannot be reused in other, even similar, domains. Third, efficient modelling of ML classifiers requires a massive number of labelled handcrafted features to attain maximum performance for speech emotion detection. For some languages, such as Arabic, collecting large amounts of labelled data to construct a database may require a soundproof, echo-free recording environment. Fourth, because of the generic nature of ML-based SER classifiers, classification performance may drop drastically for varying linguistic content, speakers, and acoustic settings.
In contrast to longstanding ML techniques, Deep Learning (DL) methods reduce the manual effort required to extract discriminative features from acoustic signals. Deep Learning comprises multi-layered artificial neural network architectures that learn features hierarchically. The most widely practiced DL-based speech emotion classifiers are Deep Neural Networks (DNN), Restricted Boltzmann Machines (RBM), Convolutional Neural Networks (CNN), Long Short-Term Memory networks (LSTM), and Recurrent Neural Networks (RNN). Additionally, enhanced DL methods such as Deep Autoencoders (AE), multitask learning, attention mechanisms, transfer learning, and adversarial training have also been employed for SER. Most importantly, multiple DL classification methods can be employed together in different layers of one model, which is robust and improves overall performance and flexibility. Thus, in this study, a novel transformer-based method is proposed to improve the recognition rate of emotion detection and reduce model training time under limited computational resources. The performance of the proposed transformer is compared with baseline deep learning methods in terms of accuracy. In addition, this study also investigates the optimal set of features and data augmentation methods to reduce the overfitting problem caused by inadequate data.
The rest of this paper is structured as follows. Section 2 presents the literature review on SER techniques. Section 3 describes the datasets employed to perform the experiments, feature extraction, the transformer model adopted for SER, and the evaluation measures. Section 4 presents the experimental results and the significance of the observed findings. The conclusion and future research directions are provided in Section 5.
2. Literature Review
Numerous works have addressed emotion detection from speech signals using both machine and deep learning techniques. However, to a great extent, the performance and efficacy of a proposed model depend on the dataset used for training. This section is divided into two parts: the first analyses several pre-constructed speech datasets for SER along with their shortcomings, and the second reviews recent research contributions to the development of effective SER classification models using DL techniques.
An Urdu database for speech emotion recognition has been introduced for the training of ML models. The collection comprises recorded audio for seven emotions, with 734 Urdu and 701 Sindhi utterances [24]. The authors performed baseline classification using the OpenSMILE feature extraction toolkit, reporting results in terms of unweighted average recall (UAR). The results show that regression models applied to the ComParE feature set perform best, achieving 56.03% accuracy for Urdu and Sindhi SER. The Urdu-Sindhi corpus encompasses only 10 scripted sentences per emotion. The major limitation of such recordings is that a given scripted sentence may represent more than one emotion; moreover, the speaker is not representing a real-world situation. In another acted Urdu SER database, ref. [25] collected utterances of 25 words from 50 speakers (25 male and 25 female) for three emotions. Nonetheless, applying DL models to such small-scale Urdu databases would be inappropriate and may not be useful for developing real-time SER applications. Ref. [26] formulated URDU, the first custom dataset for the Urdu language, in which 400 audio recordings representing 4 basic emotions were collected from Urdu television talk shows. Although the collected data contain unscripted, spontaneous conversations of anchors with their guests, the representation of actual emotions in TV talk shows is quite unusual, because the most dominant emotion in such conversations is neutral or aggressive. Since training deep learning models requires a huge amount of data, the small URDU dataset may lead to overfitting or underfitting problems.
In one study, ref. [27] used an end-to-end multi-learning trick (MLT) based on a 1D enhanced CNN model for automatic extraction of local and global emotional features from acoustic signals. To improve the recognition rate, the proposed solution extracted discriminative features using a dynamic fusion framework. The proposed multi-learning model captured both short- and long-term relative dependencies and was evaluated over two benchmark SER databases, IEMOCAP and EMO-DB, with 73% and 90% accuracy, respectively. However, the method takes relatively more time to train and to test real-time speech signals than other models. Exploring two datasets, RECOLA (for regression) and IEMOCAP (for the classification task), ref. [28] detected emotions in speech using a novel end-to-end DNN algorithm. The authors found that SER performance in the simulation results was best when the proposed method was applied with RMS aggregation and context stacking. The proposed DiCCOSER-CS model improved the arousal CCC by 9.5% and the valence CCC by 12.7% in the regression task compared with CNN-LSTM.
Recurrent Neural Network models have also been implemented, with slight variations, for feature learning in SER systems. One such study [29] presents a deep RNN that learns frame-level representations and their temporal aggregation over long intervals. The authors also presented a weighted time-pooling scheme with a simple attention mechanism to identify emotionally salient fragments of the speech signal. The model improved weighted classification accuracy by 5.7% and unweighted accuracy by 3.1% on the IEMOCAP dataset compared with a conventional SVM-based SER model. Although the model introduced a novel technique for emotion recognition, the experiments were conducted on only one database.
To address imbalanced SER, ref. [30] introduced an attention-integrated convolutional RNN. Experimental results on EMO-DB and IEMOCAP demonstrate better performance on imbalanced speech emotions. In another related work, a self-attention mechanism [31] was added to Bidirectional Long Short-Term Memory (BLSTM) based classification. The proposed solution not only captured the correlation of speech signals within statements but also enriched the diverse information using directional self-attention (DSA). The algorithm performed well on the IEMOCAP (complete, scripted, and spontaneous subsets) and EMO-DB databases.
There are few works in the domain of Arabic speech emotion recognition; therefore, this study focuses on recognizing emotion in Arabic speech. According to [32], the first study on emotion recognition in spoken Arabic data was proposed in 2018. The speech corpus was collected from Arabic TV shows, and the videos were labeled into three categories: happy, angry, or surprised. The study compared 35 classification methods; the best performance was provided by the Sequential Minimal Optimization (SMO) classifier, which achieved 95.5% accuracy.
Another study on Arabic speech emotion was introduced in [33]. It provides a dataset called the Semi-natural Egyptian Arabic Speech Emotion (EYASE) dataset, focused on Egyptian sentences, and adopts SVM and KNN classifiers. The highest accuracy achieved was 95%, using SVM, and was obtained for male emotion recognition. Happiness was the most difficult emotion to detect, while anger was the easiest.
A study [34] built a dataset based on real-world Arabic speech dialogues for detecting anger in Arabic conversation. The results revealed that acoustic features such as fundamental frequency, energy, and formants are well suited to detecting anger. The experimental findings demonstrated that support vector machine classifiers can identify anger in real time at a detection rate of more than 77%.
3. Proposed Methodology
This section discusses the detailed research methodology adopted for emotion recognition from audio files. In the proposed methodology, augmentation methods (white noise, time stretch, pitch shift) are first applied to each file to enhance the training data. Secondly, a variety of features, including MFCCs, Mel-spectrograms, chromagrams, tonnetz, and spectral contrast, are extracted from each file and fed as input to train the transformer model for SER. Finally, the proposed method is evaluated on four datasets in terms of robustness, accuracy, and time. The proposed transformer model has been implemented using the Keras library, which simplifies running the simulations. Moreover, the Librosa library is used to process and evaluate the sound signals; it provides the data augmentation and feature extraction methods used here. All experiments are performed on a laptop with a 2.5 GHz Dual-Core Intel® Core i5 processor, 12 GB of memory, and a 512 GB SSD. A detailed description of each subsection is presented below.
3.1. Datasets
As the initial step of the proposed methodology, four publicly available datasets were collected. The Basic Arabic Vocal-Emotions Dataset (BAVED) is selected to evaluate the proposed method. This dataset is a collection of Arabic audio (wav) recordings of seven words ([“like”] أعجبني, [“dislike”] لم يعجبني, [“this”] هذا, [“film”] الفيلم, [“fabulous”] رائع, [“good”] معقول, [“bad”] سيئ), indexed 0 to 6, used to assess emotions. Each word recording is further assigned to one of three higher-level emotion classes: level 0 denotes low emotion (tired or exhausted), level 1 denotes neutral emotion, and level 2 represents a high level of either positive or negative emotion, such as happiness or anger. The complete BAVED dataset consists of 1935 recordings from 61 individuals (45 men and 16 women), recorded on different occasions.
We also used the following datasets to evaluate the proposed model: EMO-DB, SAVEE, and EMOVO.
EMO-DB [35] is employed as one of the datasets for our experiments because it is widely used by researchers in the field of SER. It is a German dataset comprising 535 audio files of varying duration recorded by 5 male and 5 female speakers. The files are classified into seven classes: disgust, anger, boredom, happiness, fear, neutral, and sadness. As voice quality depends on the sample rate, encoding method, bit rate, and file format, all files are in wav format with a mono channel, a 16 kHz sample rate, and 16 bits per sample.
The Surrey Audio-Visual Expressed Emotion (SAVEE) dataset is another well-known multimodal dataset, containing seven types of emotions recorded from 4 male actors. It contains 480 utterances in British English. Ten evaluators examined the audio, video, and combined audio-visual conditions to assess the quality of the recordings. The emotion labels used in our study are neutral, happiness, sadness, anger, surprise, fear, and disgust. The standard TIMIT corpus script was used to record the statements.
The EMOVO dataset contains seven emotional states: disgust, fear, anger, joy, surprise, sadness, and neutral. EMOVO is the first dataset of Italian-language utterances. It contains recordings of 6 actors speaking the content of 14 sentences. All 588 utterances were further annotated by two groups of 24 annotators each. All utterances were recorded in the Fondazione Ugo Bordoni laboratories.
3.2. Data Augmentation
Data augmentation is commonly used to prevent overfitting when applying machine learning techniques to speech emotion classification. Common techniques such as adding Gaussian noise, stretching time, or shifting the pitch are applied to the training data. Besides reducing overfitting, data augmentation also serves other purposes, such as improving generalization accuracy, robustness, and data distribution with lower variance. To enhance the performance of the proposed transformer model, this study used the three augmentation techniques mentioned above. The Gaussian noise technique increases the training data by adding noise whose amplitude is denoted by σ. Choosing σ carefully is important: a value that is too low degrades performance, while a value that is too high makes model optimization more difficult. The pitch shift technique generates new speech signals whose pitch is shifted by up to n steps; pitch shifting does not affect the duration of the signal. Finally, the time stretching approach alters the tempo of the signal. These three data augmentation techniques increase the training data to improve the performance of the proposed model and avoid overfitting. The results of the three augmentation methods applied in our proposed model are shown in Figure 1.
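A minimal sketch of these three augmentations using Librosa follows; the σ value, step count, stretch rate, and input filename are illustrative assumptions, not the paper's settings:

```python
import numpy as np
import librosa

y, sr = librosa.load("sample.wav", sr=None)  # hypothetical input file

# 1. Additive Gaussian noise with amplitude sigma
sigma = 0.005
y_noise = y + sigma * np.random.randn(len(y))

# 2. Pitch shift by n semitone steps (signal duration is unchanged)
y_pitch = librosa.effects.pitch_shift(y, sr=sr, n_steps=2)

# 3. Time stretch (rate > 1 speeds up the tempo, rate < 1 slows it down)
y_stretch = librosa.effects.time_stretch(y, rate=0.9)
```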
3.3. Feature Extraction
For speech emotion classification, feature extraction can help to reduce computational errors, computation time, and model complexity. However, it is necessary to extract acoustic features that provide valid information about the emotion. For the proposed model, Mel-spectrogram, chromagram, MFCC, delta MFCC, delta-delta MFCC, tonnetz, and spectral contrast features were extracted from the BAVED, EMO-DB, SAVEE, and EMOVO datasets. The following list provides some details about the extracted features; a combined extraction sketch is given after the list.
Mel-spectrogram features: to extract these features, the audio signal is broken down into frames and a Fast Fourier Transform is applied to each frame. The frequency spectrum is then grouped into equally spaced bands on the Mel scale for each signal frame.
Chromagram features: 12 distinct pitch classes are used to extract chromagram features via the STFT and a binning method. These features help to differentiate harmony and pitch classes.
MFCC, delta MFCC, delta-delta MFCC based features: MFCC features represent the short-term power spectrum; the Mel-frequency cepstrum has equally spaced frequency bands on the Mel scale and provides insights for better classification by modelling how the human ear perceives the audio signal. There are 40 coefficients in each of the three Mel-cepstrum feature sets (MFCCs, delta MFCCs, and delta-delta MFCCs). All audio signals were divided into equal-length frames to extract MFCC features. The frames were revised with windowing operations to remove silent portions at both the start and end of each frame. The time-domain signal was then converted to the frequency domain using the Fast Fourier Transform (FFT), and a Mel-scale filter bank was applied to the FFT-computed frequencies. The Mel-scale filter bank maps a frequency f (in Hz) to the Mel scale as

Mel(f) = 2595 · log10(1 + f/700).

After the computation of the frequency values, log powers are computed for each Mel frequency. Finally, the Discrete Cosine Transform (DCT) transforms the log-Mel spectrum back to the time domain, and the resulting amplitudes are known as MFCCs.
Tonnetz based features: the relationship between the fall and rise of the speech signal in a harmonic network is described by a six-dimensional pitch space. Tonal features of audio frames play an important role in distinguishing environmental sounds.
Spectral contrast-based features: the root mean square difference between the spectral valleys and the peaks of the signal frames is computed to obtain spectral contrast features.
In total, 273 features were extracted for this study: 128 Mel-spectrogram, 12 chromagram, 40 MFCC, 40 delta MFCC, 40 delta-delta MFCC, 6 tonnetz, and 7 spectral contrast features (128 + 12 + 40 + 40 + 40 + 6 + 7 = 273). These extracted features are then used to train the proposed model.
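The following Librosa sketch illustrates how such a 273-dimensional vector could be assembled; pooling each frame-level feature by its mean over time is an assumption here, since the paper does not specify how frame-level features are aggregated:

```python
import numpy as np
import librosa

def extract_features(path):
    y, sr = librosa.load(path, sr=None)
    mel = librosa.feature.melspectrogram(y=y, sr=sr, n_mels=128)             # 128
    chroma = librosa.feature.chroma_stft(y=y, sr=sr)                         # 12
    mfcc = librosa.feature.mfcc(y=y, sr=sr, n_mfcc=40)                       # 40
    d_mfcc = librosa.feature.delta(mfcc)                                     # 40
    dd_mfcc = librosa.feature.delta(mfcc, order=2)                           # 40
    tonnetz = librosa.feature.tonnetz(y=librosa.effects.harmonic(y), sr=sr)  # 6
    contrast = librosa.feature.spectral_contrast(y=y, sr=sr)                 # 7
    # Pool each frame-level feature over time (mean), then concatenate
    feats = [mel, chroma, mfcc, d_mfcc, dd_mfcc, tonnetz, contrast]
    return np.concatenate([f.mean(axis=1) for f in feats])

vector = extract_features("sample.wav")  # hypothetical file
print(vector.shape)  # (273,)
```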
3.4. Evaluation Metrics
Different evaluation metrics, namely accuracy, precision, recall, and F1-score, were applied to evaluate each SER model. For each prediction, a confusion matrix was used to count true positives (TP), false positives (FP), true negatives (TN), and false negatives (FN).
Accuracy: this metric measures the fraction of sound classes that are correctly determined over the entire set of speech signals:

Accuracy = (TP + TN) / (TP + TN + FP + FN).

Recall: the proportion of positive instances that are accurately detected by the proposed model is measured by recall:

Recall = TP / (TP + FN).

Precision: the proportion of detected utterances that are actually correct is measured by precision:

Precision = TP / (TP + FP).
The evaluation metrics used to evaluate the proposed transformer model are widely used to measure the performance of detection, classification, and prediction systems.
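For reference, these metrics can be computed with scikit-learn; the labels below are illustrative, and for multi-class emotion labels an averaging mode such as macro is assumed:

```python
from sklearn.metrics import accuracy_score, precision_score, recall_score, f1_score

y_true = ["angry", "happy", "sad", "happy", "neutral"]
y_pred = ["angry", "sad", "sad", "happy", "neutral"]

print(accuracy_score(y_true, y_pred))
print(precision_score(y_true, y_pred, average="macro", zero_division=0))
print(recall_score(y_true, y_pred, average="macro", zero_division=0))
print(f1_score(y_true, y_pred, average="macro"))
```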
3.5. Transformer Model
Initially, the transformer model was built for machine translation. It later became popular throughout Natural Language Processing (NLP) and has replaced recurrent neural networks (RNN) thanks to its efficient performance [36,37]. The transformer model uses an internal attention mechanism, known as self-attention, that eliminates recurrence from processing. It uses linear transformations to generate significant features for a given statement or utterance.
Structurally, the transformer consists of a block with encoder and decoder sub-blocks, which perform different activities: the encoder transforms the audio signal into a sequence of intermediate representations, which are then passed to the decoder block to produce the desired output. The encoder encodes the input into feature vectors X = (x1, …, xT), and the decoder decodes these representations into W = (w1, …, wM). Because of the autoregressive nature of the transformer model, each stage takes as input the output of the previous stage, and each block has multiple interconnected self-attention layers. The encoder and decoder blocks operate individually.
Traditional speech recognition models used attention mechanisms with single blocks of encoder and decoder: the sound signal given to the encoder is encoded into an alternative representation, which the decoder then decodes, predicting the output labels from the sequence provided by the encoder. The attention mechanism highlights the significant parts of the speech to predict the output. Compared with these traditional models, the transformer model has multiple encoders and decoders, each block has an internal self-attention mechanism, and these attention layers are interconnected.
In the standard formulation, the encoder of a transformer model contains six coder layers placed in a top-to-bottom sequence. All coders have the same structure, although the number of coders is not fixed and can vary between models. The structure of the coders is similar within a block, but the weights can differ. Usually, the encoder of the transformer model takes as input features extracted as frequency coefficients or by convolutional neural networks. Each encoder in the sequence performs the same task: it transforms the extracted features into an alternative representation using self-attention and passes the transformed vectors to the next encoder through a simple feed-forward neural network. The last encoder transmits the transformed vectors to the decoder block.
The decoder block has the same number of decoders as the encoder block, arranged in a sequence. Each decoder has a two-layered structure similar to the encoders, where the first layer takes the input and transforms the features into vectors and the second layer feeds these encoded vectors onward. The difference in the decoder is an additional attention layer that helps it focus on the significant parts of the encoded feature vectors. Here, the self-attention layer considers the prior outputs to decode the feature vectors and predict the incoming tokens. The second layer then outputs the posterior probabilities of the decoded words or characters, and the same procedure is applied for each word of the statement.
The self-attention mechanism makes the transformer model successful in speech recognition, and it can be further extended to improve performance by expanding it into a multi-head self-attention (MHA) mechanism. First, the input sequence is split into chunks that are projected into multiple lower-dimensional subspaces. Second, each chunk of input features is passed through its own independent attention mechanism. Third, the outputs of the heads are concatenated before the final output projection. The multi-head self-attention mechanism can be described as

MHA(Q, K, V) = Concat(e1, …, eH) WO,  where ei = Attention(QWQi, KWKi, VWVi),

in which H denotes the total number of heads, din is the dimension of the input sequence with dk = din/H, ei is the encoding generated by head i, WO ∈ Rdin×din, WQi ∈ Rdin×dk, WKi ∈ Rdin×dk, and WVi ∈ Rdin×dv.
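A compact NumPy sketch of this computation follows (random weights, a single sequence, and scaled dot-product attention per head; purely illustrative, not the paper's implementation):

```python
import numpy as np

def softmax(z):
    z = z - z.max(axis=-1, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=-1, keepdims=True)

def multi_head_attention(Q, K, V, H):
    T, d_in = Q.shape
    d_k = d_in // H
    rng = np.random.default_rng(0)
    W_O = rng.normal(size=(d_in, d_in))
    heads = []
    for i in range(H):
        # Per-head projections W_Qi, W_Ki, W_Vi of shape (d_in, d_k)
        W_Qi, W_Ki, W_Vi = (rng.normal(size=(d_in, d_k)) for _ in range(3))
        q, k, v = Q @ W_Qi, K @ W_Ki, V @ W_Vi
        # e_i = Attention(Q W_Qi, K W_Ki, V W_Vi) via scaled dot-product
        e_i = softmax(q @ k.T / np.sqrt(d_k)) @ v
        heads.append(e_i)
    # Concat(e_1, ..., e_H) W_O
    return np.concatenate(heads, axis=-1) @ W_O

X = np.random.randn(10, 64)           # 10 time steps, d_in = 64
out = multi_head_attention(X, X, X, H=8)
print(out.shape)                      # (10, 64)
```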
Transformer models provide higher-dimensional representations from multiple subspaces because of the multi-head self-attention mechanism. The structure of the proposed transformer model is presented in Table 1.
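Since the paper states the model is implemented in Keras but Table 1 is not reproduced here, the following is a minimal sketch of a transformer encoder block operating on the 273-dimensional feature vectors; the head count, hidden sizes, and seven-class output are illustrative assumptions rather than the paper's configuration:

```python
import tensorflow as tf
from tensorflow.keras import layers, Model

def transformer_block(x, num_heads=4, key_dim=64, ff_dim=128):
    # Multi-head self-attention with a residual connection and layer norm
    attn = layers.MultiHeadAttention(num_heads=num_heads, key_dim=key_dim)(x, x)
    x = layers.LayerNormalization()(x + attn)
    # Position-wise feed-forward sub-layer
    ff = layers.Dense(ff_dim, activation="relu")(x)
    ff = layers.Dense(x.shape[-1])(ff)
    return layers.LayerNormalization()(x + ff)

inputs = layers.Input(shape=(273,))
x = layers.Reshape((1, 273))(inputs)   # treat the pooled feature vector as one token
x = transformer_block(x)
x = layers.Flatten()(x)
outputs = layers.Dense(7, activation="softmax")(x)  # e.g., seven emotion classes
model = Model(inputs, outputs)
model.compile(optimizer="adam", loss="sparse_categorical_crossentropy",
              metrics=["accuracy"])
model.summary()
```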