### **2. Related Studies**

Speech emotion recognition has been an active research area for a long time, and several papers have presented different approaches to building systems that recognize human emotions. The authors in [3] presented a majority voting technique (MVT) for detecting speech emotion using the fast correlation-based filter (FCBF) and Fisher score algorithms for feature selection. They extracted 16 low-level features and tested their method with several machine learning algorithms, including artificial neural network (ANN), classification and regression tree (CART), support vector machine (SVM) and K-nearest neighbor (KNN), on the Berlin emotional speech database (Emo-DB). Kerkeni et al. [11] proposed a speech emotion recognition method that extracted Mel frequency cepstral coefficient (MFCC) and modulation spectral (MS) features and used a recurrent neural network (RNN) learning algorithm to classify seven emotions from Emo-DB and the Spanish database. The authors in [15] proposed a speech emotion recognition model in which glottal information was used for feature compensation. They extracted glottal compensation to zero crossings with maximal Teager (GCZCMT) energy operator features using the Taiyuan University of Technology (TYUT) speech database and Emo-DB.

Evaluation of feature selection algorithms on a combination of linear predictive coefficient (LPC), MFCC and prosodic features with three different multiclass learning algorithms for detecting speech emotion was discussed in [19]. Luengo et al. [20] used 324 spectral and 54 prosodic features combined with five voice quality features to test their proposed speech emotion recognition method on the Surrey audio-visual expressed emotion (SAVEE) database, after applying the minimal redundancy maximal relevance (mRMR) criterion to discard less discriminating features. In [28], a method for recognizing emotions in an audio conversation based on speech and text was proposed and tested on the SemEval-2007 database using SVM. Liu et al. [29] used the extreme learning machine (ELM) method for feature selection, applied to 938 features based on a combination of spectral and prosodic features from Emo-DB, for speech emotion recognition. The authors in [30] extracted 988 spectral and prosodic features from three different databases using the OpenSmile toolkit with SVM for speech emotion recognition. Stuhlsatz et al. [31] introduced the generalized discriminant analysis (GerDA) based on DNN to recognize emotions from speech, using the OpenEar specialized software to extract 6552 acoustic features based on 39 functionals of 56 acoustic low-level descriptors (LLDs) from Emo-DB and the speech under simulated and actual stress (SUSAS) database. Zhang et al. [32] applied the cooperative learning method to recognize speech emotion from the FAU Aibo and SUSAS databases using GCZCMT based acoustic features.

In [33], Hu moments based weighted spectral features (HuWSF) were extracted from the Emo-DB, SAVEE and Chinese Academy of Sciences Institute of Automation (CASIA) databases to classify emotions from speech using SVM. The authors used HuWSF with the multicluster feature selection (MCFS) algorithm to reduce feature dimensionality. In [34], extended Geneva minimalistic acoustic parameter set (eGeMAPS) features were used to classify speech emotion, gender and age from the Ryerson audio-visual database of emotional speech and song (RAVDESS) using a multilayer perceptron (MLP) neural network. Pérez-Espinosa et al. [35] analyzed 6920 acoustic features from the interactive emotional dyadic motion capture (IEMOCAP) database and found that feature groups based on MFCC, LPC and cochleagrams are important for estimating valence, activation and dominance in speech, respectively. The authors in [36] proposed a DNN architecture for extracting informative feature representations from heterogeneous acoustic feature groups that may contain redundant and unrelated information. The architecture was tested by training the fusion network to jointly learn highly discriminating acoustic feature representations from the IEMOCAP database for speech emotion recognition, using SVM to obtain an overall accuracy of 64.0%. In [37], speech features based on MFCC and facial features based on maximally stable extremal regions (MSER) were combined to recognize human emotions through a systematic study on the Indian face database (IFD) and Emo-DB.

Narendra and Alku [38] proposed a new dysarthric speech classification method for coded telephone speech using glottal features obtained with a DNN based glottal inverse filtering method. They considered two sets of glottal features based on time and frequency domain parameters, plus parameters based on principal component analysis (PCA). Their results showed that combining glottal and acoustic features improved classification after applying PCA. In [39], four pitch and spectral energy features were combined with two prosodic features to distinguish the high activation states of anger and happiness from the low activation states of sadness and boredom for speech emotion recognition using SVM on Emo-DB. Alshamsi et al. [40] proposed a smartphone method for automated facial expression and speech emotion recognition using SVM with MFCC features extracted from the SAVEE database.

Li and Akagi [41] presented a method for recognizing emotions expressed in multilingual speech using the Fujitsu, Emo-DB, CASIA and SAVEE databases. The highest weighted average precision was obtained after performing speaker normalization and feature selection. The authors in [42] used MFCC related features for recognizing speech emotion based on an improved brain emotional learning (BEL) model inspired by the emotional processing mechanism of the limbic system in the human brain. They tested their method on the CASIA, SAVEE and FAU Aibo databases using linear discriminant analysis (LDA) and PCA for dimensionality reduction, obtaining the highest average accuracy of 90.28% on the CASIA database. Mao et al. [43] proposed the emotion-discriminative and domain-invariant feature learning method (EDFLM) for recognizing speech emotion. In their method, both domain divergence and emotion discrimination were considered to learn emotion-discriminative and domain-invariant features using emotion label and domain label constraints. Wang et al. [44] extracted MFCC, Fourier parameters, fundamental frequency, energy and zero crossing rate (ZCR) from three different databases, Emo-DB, the Chinese elderly emotion database (EESDB) and CASIA, for recognizing speech emotion. Muthusamy et al. [45] extracted a total of 120 wavelet packet energy and entropy features from speech signals and glottal waveforms from the Emo-DB, SAVEE and Sahand emotional speech (SES) databases for speech emotion recognition. The extracted features were enhanced using a Gaussian mixture model (GMM), with ELM as the learning algorithm.

Zhu et al. [46] used a combination of acoustic features based on MFCC, pitch, formants, short-term ZCR and short-term energy to recognize speech emotion. They extracted the most discriminating features and performed classification using a deep belief network (DBN) with SVM. Their results showed an average accuracy of 95.8% on the CASIA database across six emotions, an improvement over other related studies. Álvarez et al. [47] proposed a classifier subset selection (CSS) method for stacked generalization to recognize speech emotion. They used the estimation of distribution algorithm (EDA) to select optimal features from a collection that included eGeMAPS, with SVM for classification, achieving an average accuracy of 82.45% on Emo-DB. Bhavan et al. [48] used a combination of MFCCs, spectral centroids and MFCC derivatives as spectral features with a bagged ensemble of Gaussian kernel SVMs for recognizing speech emotion, achieving an accuracy of 92.45% on Emo-DB. Shegokar et al. [49] proposed the use of the continuous wavelet transform (CWT) with prosodic features to recognize speech emotion. They used PCA for feature transformation with a quadratic kernel SVM as the classification algorithm, achieving an average accuracy of 60.1% on the RAVDESS database. Kerkeni et al. [50] proposed a model for recognizing speech emotion using empirical mode decomposition (EMD) based on optimal features that included the reconstructed signal based on Mel frequency cepstral coefficients (SMFCC), energy cepstral coefficients (ECC), modulation frequency features (MFF), modulation spectral (MS) features and frequency weighted energy cepstral coefficients (FECC). They achieved an average accuracy of 91.16% on the Spanish database using the RNN algorithm for classification.

Emotion recognition remains a great challenge for several reasons, as alluded to in the introduction. Further reasons include the existence of a gap between acoustic features and human emotions [31,36] and the absence of a solid theoretical foundation relating the characteristics of voice to the emotions of a speaker [20]. These intrinsic challenges have led to disagreement in the literature on which features are best for speech emotion recognition [20,36]. The trend in the literature shows that a combination of heterogeneous acoustic features is promising for speech emotion recognition [20,29,30,36,39], but how to effectively unify the different features remains highly challenging [21,36]. The importance of selecting relevant features to improve the reliability of speech emotion recognition systems is strongly emphasized in the literature [3,20,29]. Researchers frequently apply specialized software such as OpenEar [31,32,35], OpenSmile [30,32,34,36] and Praat [35] for the extraction, selection and unification of speech features from heterogeneous sources to ease the intricacy inherent in these processes. Moreover, the review of the literature has revealed that different learning algorithms are trained and validated with specific features extracted from public databases for speech emotion recognition. Table 1 shows the result of the comparative analysis of our method against the related methods in terms of experimental database, type of recording, number of emotions, feature set, classification method and maximum percentage accuracy obtained. The comparative analysis indicates that our emotion recognition method is highly promising, as it achieved the highest average accuracy of 99.55% across two dissimilar public experimental databases.
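For illustration only, the snippet below sketches how a standard acoustic parameter set such as eGeMAPS can be obtained with the openSMILE toolkit's Python wrapper; the audio file path and the choice of the eGeMAPSv02 functionals are assumptions for demonstration and do not reproduce the exact configurations used in the cited studies.

```python
# Minimal sketch: extracting eGeMAPS functionals with the openSMILE
# Python wrapper (pip install opensmile). The audio path is a placeholder.
import opensmile

smile = opensmile.Smile(
    feature_set=opensmile.FeatureSet.eGeMAPSv02,       # 88 utterance-level functionals
    feature_level=opensmile.FeatureLevel.Functionals,
)

# Returns a pandas DataFrame with one row of features for the utterance.
features = smile.process_file("example_utterance.wav")
print(features.shape)        # e.g. (1, 88)
print(list(features.columns[:5]))  # first few low-level descriptor functionals
```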


### **3. Materials and Methods**

Materials used for this study included the speech emotion multimodal databases RAVDESS [51] and SAVEE [52]. These databases were chosen because of their popularity to test the effectiveness of the proposed HAF against two other sets of features. The study method followed three phases of feature extraction, feature selection and classification, which are described below. The method began with the feature extraction phase, which involved extracting prosodic and spectral features from the raw audio files of the RAVDESS and SAVEE databases. This phase was followed by the feature selection process, which involved filtering, collecting and combining the features with high discriminating power for recognizing emotions in human speech. Classification was the last phase; it involved applying the selected learning algorithm to recognize human emotions, together with a comparative analysis of the experimental classification results. In the classification phase, several experiments were carried out on a computer with an i7 2.3 GHz processor and 8 GB of random access memory (RAM). The purpose of the experiments was to apply the standard metrics of accuracy, precision, recall and F1-score to evaluate the effectiveness of the proposed HAF with respect to a given database and ensemble learning algorithm. A hedged sketch of such a three-phase pipeline is shown below.
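The sketch below is not the exact HAF pipeline; it is a minimal illustration, under stated assumptions, of the three phases: extracting simple prosodic and spectral descriptors, filtering them with a univariate selector, and evaluating an ensemble classifier with accuracy, precision, recall and F1-score. The file paths, labels, librosa/scikit-learn choices and parameter values are all illustrative assumptions.

```python
# Minimal sketch of a feature extraction -> feature selection -> ensemble
# classification pipeline for speech emotion recognition. Not the authors'
# exact method; paths, labels and parameters are placeholders.
import numpy as np
import librosa
from sklearn.feature_selection import SelectKBest, f_classif
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split
from sklearn.metrics import classification_report

def extract_features(path):
    """Extract simple prosodic and spectral descriptors from one audio file."""
    y, sr = librosa.load(path, sr=16000)
    mfcc = librosa.feature.mfcc(y=y, sr=sr, n_mfcc=13)        # spectral
    zcr = librosa.feature.zero_crossing_rate(y)                # temporal
    rms = librosa.feature.rms(y=y)                             # energy
    centroid = librosa.feature.spectral_centroid(y=y, sr=sr)   # spectral
    # Utterance-level statistics (means) as a fixed-length feature vector.
    return np.hstack([m.mean(axis=1) for m in (mfcc, zcr, rms, centroid)])

# Hypothetical file list and emotion labels (e.g. built from RAVDESS/SAVEE).
wav_paths = ["speech_01.wav", "speech_02.wav"]   # placeholder paths
labels = ["happy", "sad"]                        # placeholder labels

X = np.vstack([extract_features(p) for p in wav_paths])
y = np.array(labels)

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=0)

# Phase 2: keep the most discriminating features (ANOVA F-score as an example).
selector = SelectKBest(f_classif, k=10).fit(X_train, y_train)
X_train_sel = selector.transform(X_train)
X_test_sel = selector.transform(X_test)

# Phase 3: ensemble classifier with standard evaluation metrics.
clf = RandomForestClassifier(n_estimators=300, random_state=0)
clf.fit(X_train_sel, y_train)
print(classification_report(y_test, clf.predict(X_test_sel)))
```

In practice, the placeholder file list would be replaced by the full set of RAVDESS or SAVEE utterances, and the ensemble learner and selector would be tuned per database.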
