*3.4. Classification*

Emotion classification aims to obtain an emotional state for the input feature vector using a machine learning algorithm. In this study, classification experiments were conducted using ensemble learning algorithms to verify the effectiveness of the proposed HAF. Ensemble learning is considered a state-of-the-art approach for solving different machine learning problems by combining the predictions of multiple base learners [75]. It improves overall predictive performance, decreases the risk of settling in a local minimum and provides a better fit to the data space by combining the predictions of several weak learners into a strong learning algorithm. In this study, we applied bagging and boosting because they are widely used and effective approaches for constructing ensemble learning algorithms [76–79]. Bagging uses bootstrap sampling to reduce the variance of a decision tree and improve the accuracy of a learning algorithm by creating a collection of learners that are trained in parallel [79–81]. It performs random sampling with replacement and simple averaging of the predictions from different decision trees to give a more robust result than a single decision tree. Boosting creates a collection of learners that are trained sequentially, with early learners fitting decision trees to the data and analyzing it for errors. It performs random sampling with replacement over a weighted average and reduces both the classification bias and the variance of a decision tree [79]. One of the bagging ensemble algorithms investigated in this study was the random decision forest (RDF), an extension of bagging that is popularly called random forest [75,82]. An RDF can be constructed by applying bagging to trees grown with the CART approach [83].
The other bagging ensemble learning algorithms investigated in this study were bagging with SVM as the base learner (BSVM) and bagging with a multilayer perceptron neural network as the base learner (BMLP). The boosting ensemble algorithms investigated were the gradient boosting machine (GBM), which extends boosting by combining the gradient descent optimization algorithm with the boosting technique [75,82,84], and AdaBoost with CART as the base learner (ABC), one of the most widely used boosting algorithms for reducing sensitivity to class label noise [79,85]. AdaBoost is an iterative learning algorithm that constructs a strong classifier by enhancing weak classifiers, and it can improve data classification ability by reducing both bias and variance through continuous learning [81].

The standard metrics of accuracy, precision, recall and F1-score were used to measure the performance of the learning algorithms with respect to a particular set of features. The accuracy of a classification algorithm is often regarded as the most intuitive performance measure. It is the ratio of the number of instances correctly recognized to the total number of instances. Precision is the ratio of the number of positive instances correctly recognized to the total number of instances predicted as positive. Recall is the ratio of the number of positive instances correctly recognized to the number of instances of the actual positive class [86]. The F1-score is the harmonic mean of precision and recall [87]. Moreover, the training times of the learning algorithms are also compared to analyze the computational complexity of training the different ensemble learning classifiers. The flow chart for the different configurations of the experimental methods is illustrated in Figure 3.
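For a multi-class task such as emotion recognition, these metrics are computed per class and then averaged; a toy illustration with scikit-learn (the labels below are invented for the example, not taken from the paper's data):

```python
from sklearn.metrics import accuracy_score, precision_score, recall_score, f1_score

# Toy three-class ground truth and predictions, for illustration only.
y_true = [0, 0, 1, 1, 2, 2]
y_pred = [0, 1, 1, 1, 2, 0]

# Accuracy: correctly recognized instances over all instances.
acc = accuracy_score(y_true, y_pred)
# Macro averaging gives each class equal weight, which is one common
# choice when the classes are not equally distributed.
prec = precision_score(y_true, y_pred, average="macro")
rec = recall_score(y_true, y_pred, average="macro")
f1 = f1_score(y_true, y_pred, average="macro")
```

Here four of six predictions are correct, so `acc` is 4/6; the macro scores average the per-class precision, recall and F1 values.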

**Figure 3.** Flowchart for different configurations of emotion recognition methods.

### **4. Results and Discussion**

The present authors extracted the set of 404 HAFs from the RAVDESS and SAVEE experimental databases. The features were used to train five well-known ensemble learning algorithms in order to determine their effectiveness using four standard performance metrics. These metrics were selected because the experimental databases were not equally distributed. Data were subjected to 10-fold cross-validation and divided into training and testing groups accordingly. Gaussian noise was added to the training data to provide a regularizing effect, reduce overfitting, increase resilience and improve the generalization performance of the ensemble learning algorithms.
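A minimal sketch of this evaluation protocol, assuming scikit-learn and NumPy; the noise standard deviation is a hypothetical value (the paper does not report it), and noise is added to the training folds only so the test folds remain clean:

```python
import numpy as np
from sklearn.model_selection import StratifiedKFold
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score

def cross_validate_with_noise(X, y, noise_std=0.01, n_splits=10, seed=0):
    """10-fold stratified CV; Gaussian noise is added to the training
    folds only, as a regularizer. noise_std is an assumed value."""
    rng = np.random.default_rng(seed)
    skf = StratifiedKFold(n_splits=n_splits, shuffle=True, random_state=seed)
    scores = []
    for train_idx, test_idx in skf.split(X, y):
        # Perturb only the training features with zero-mean Gaussian noise.
        X_train = X[train_idx] + rng.normal(0.0, noise_std, X[train_idx].shape)
        clf = RandomForestClassifier(n_estimators=100, random_state=seed)
        clf.fit(X_train, y[train_idx])
        scores.append(accuracy_score(y[test_idx], clf.predict(X[test_idx])))
    return float(np.mean(scores))
```

Stratified folds preserve the class proportions in each split, which matters here because the databases are not equally distributed.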

Table 3 shows the CPU time required to train each ensemble learning algorithm with the three different sets of features across the two experimental databases. The training time analysis shows that RDF required the least training time compared to the other ensemble learning algorithms. The lowest training times were recorded for MFCC1 and MFCC2, irrespective of database. However, the training time of RDF (0.43) was slightly higher than that of BSVM (0.39) on the SAVEE HAF, although RDF remained superior on the RAVDESS HAF. The reason for this inconsistency may be the smaller size (480) of SAVEE compared to the size (1432) of RAVDESS. The experimental results of a study that fully assessed the predictive performance of SVM and SVM ensembles over small and large scale breast cancer datasets show that an SVM ensemble based on the BSVM method can be the better choice for a small scale dataset [78]. Moreover, GBM and BMLP were generally time consuming compared to the other ensemble learning algorithms, taking 1172.3 milliseconds to fit the RAVDESS MFCC2 features and 1319.39 milliseconds to fit the SAVEE MFCC2 features, respectively.


**Table 3.** Processing time of learning algorithms trained with Mel-frequency cepstral coefficient 1 (MFCC1), Mel-frequency cepstral coefficient 2 (MFCC2) and the hybrid acoustic feature (HAF).

The experimental results of the overall recognition performance obtained are given in Table 4. The percentage average recognition rates differ depending on critical factors such as the type of database, the accent of the speakers, the way data were collected, the type of features and the learning algorithm used. In particular, RDF consistently recorded the highest average accuracy, while BMLP consistently recorded the lowest across the sets of features, irrespective of database. These results indicate that RDF gave the highest performance because its precision, recall, F1-score and accuracy values are generally impressive compared to those of the other ensemble learning algorithms. However, the average recognition accuracies of 61.5% and 66.7% obtained by RDF for SAVEE MFCC1 and MFCC2 were respectively low, although still better than those of BMLP (55.4% and 60.7%) for SAVEE MFCC1 and MFCC2, respectively. In general, an improvement can be observed when a combination of features (MFCC2) was used, drawing inspiration from the work in [46], where MFCC features were combined with pitch and ZCR features. In addition, we borrowed the same line of thought from Sarker and Alam [3] and Bhaskar et al. [28], where some of the above-mentioned features were combined to form hybrid features.


**Table 4.** Percentage average precision, recall, F1-score and accuracy with confidence intervals of learning algorithms trained with MFCC1, MFCC2 and HAF.

The results in Table 4 demonstrate the effectiveness of the proposed HAF. The precision, recall, F1-score and accuracy values are generally high for all five ensemble learning algorithms using the HAF, irrespective of database. Specifically, RDF and GBM both recorded the highest accuracy of 99.3% for the SAVEE HAF, but RDF claims superiority on the RAVDESS HAF with an accuracy of 99.8% compared to 92.6% for GBM. Consequently, the results of RDF learning of the proposed HAF were exceptionally inspiring, with an average accuracy of 99.55% across the two databases. The ranking of the investigated ensemble learning algorithms in terms of percentage average accuracy computed for the HAF across the databases is RDF (99.55%), GBM (95.95%), ABC (95.00%), BSVM (93.90%) and BMLP (92.95%). It can be inferred that applying ensemble learning to the HAF is highly promising for recognizing emotions in human speech. In general, the high accuracy values computed by all learning algorithms across the databases indicate the effectiveness of the proposed HAF for speech emotion recognition. Moreover, the confidence intervals show that the margins around the true classification accuracies of all the investigated ensemble learning algorithms lie between 0.002% and 0.045% across all the corpora. The results in Table 4 show that the HAF significantly improved emotion recognition performance compared to the methods based on the SAVEE MFCC-related features [42] and the RAVDESS wavelet transform features [49], which recorded highest recognition accuracies of 76.40% and 60.10%, respectively.
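The paper does not state how the confidence intervals in Table 4 were computed; one common choice is the normal-approximation (Wald) interval for an accuracy estimated from n test instances, sketched below as an assumption rather than the authors' method:

```python
import math

def accuracy_confidence_interval(accuracy, n, z=1.96):
    """Normal-approximation (Wald) interval for a classification accuracy
    estimated from n test instances; z=1.96 gives a 95% interval."""
    margin = z * math.sqrt(accuracy * (1.0 - accuracy) / n)
    return accuracy - margin, accuracy + margin
```

For example, under this approximation an accuracy of 0.995 estimated on 1432 instances (the RAVDESS size quoted earlier) carries a margin of under ±0.4 percentage points.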

The results of precision, recall, F1-score and accuracy per emotion, learning algorithm and database are discussed next. Table 5 shows that the angry, calm and disgust emotions were easier to recognize using RDF and GBM ensemble learning of RAVDESS MFCC1 features. The results were particularly impressive for RDF, which computed perfect precision, recall and F1-score and also recorded the highest accuracy values across all emotions. However, BMLP generally gave the poorest result across all emotions. The RAVDESS MFCC1 features achieved a precision rate of 92.0% for the fear emotion using RDF ensemble learning, which is relatively high compared to the results in [27].


**Table 5.** Percentage precision, recall and F1-score and accuracy on RAVDESS MFCC1.

Table 6 shows the results of the percentage precision, recall, F1-score and accuracy analysis of RAVDESS MFCC2 features. The performance values were generally higher for recognizing the angry, calm and disgust emotions using RDF. However, the values were relatively low for the happy and neutral emotions, showing that the hybridization of MFCC, ZCR, energy and fundamental frequency features yielded relatively poor results in recognizing these two emotions with ensemble learning. The MFCC2 combination of acoustic features achieved an improved recognition performance for the fear (90.0%) emotion using RDF in comparison with the report in [44], where 81.0% recognition was achieved for the fear emotion on the CASIA database.


**Table 6.** Percentage precision, recall and F1-score and accuracy on RAVDESS MFCC2.

The emotion recognition results shown in Table 7 with respect to the RAVDESS HAF were generally impressive across emotions, irrespective of the ensemble learning algorithm used. RDF was highly effective in recognizing all eight emotions, achieving a perfect accuracy of 100.0% on almost all emotions except fear (98.0%). Moreover, perfect precision and accuracy values were achieved in recognizing the neutral emotion, which was difficult to identify using the other sets of features. In addition, the 98.0% accuracy and perfect precision, recall and F1-score obtained for the fear emotion are impressive given the difficulty of recognizing this emotion reported in the literature [25–27]. All the other learning algorithms performed relatively well, considering that noise was added to the training data. These results attest to the effectiveness of the proposed HAF for recognizing emotions in human speech and point to the significance of sourcing highly discriminating features for recognizing human emotions. They further show that the fear emotion could be recognized efficiently compared to the techniques used in [44]. The recognition of the fear emotion was successfully improved using the proposed HAF; in addition, the HAF gave better performance than the bottleneck features used for six emotions from the CASIA database [27], which achieved a recognition rate of 60.5% for the fear emotion.


**Table 7.** Percentage precision, recall and F1-score and accuracy on RAVDESS HAF.

Table 8 shows the results of the percentage precision, recall, F1-score and accuracy analysis of SAVEE MFCC1 features. In these results, the maximum precision and F1-score for the neutral emotion reached 94.0% using the RDF ensemble algorithm. This F1-score is quite high compared to that obtained from the RAVDESS database using the same features. The percentage increase in overall precision for the neutral emotion was 133.33%, 77.36%, 18.57%, 12.66% and 10.39% using GBM, RDF, BMLP, ABC and BSVM, respectively. Moreover, the RDF and GBM algorithms achieved perfect prediction (100.0%) in recognizing the surprise emotion. The results show that MFCC features still achieved low recognition rates for the fear emotion, as reported by other authors [25]. However, RDF (94.0%) and GBM (91.0%) show precision rates for the neutral emotion that are higher than what was reported in [11].


**Table 8.** Percentage precision, recall and F1-score and accuracy on SAVEE MFCC1.

Table 9 shows the analysis results of the percentage precision, recall, F1-score and accuracy obtained after performing classification with SAVEE MFCC2 features. It can be observed from the table that all the learning algorithms performed well in recognizing the surprise emotion, but performed poorly on the other emotions. In general, all learning algorithms achieved low prediction of emotions with the exception of the surprise emotion, for which a perfect F1-score was obtained using the RDF algorithm; even the lowest precision value for the surprise emotion, 93.0%, achieved with BMLP, was high. These results indicate the limited suitability of the combined MFCC, ZCR, energy and fundamental frequency features for recognizing most human emotions. However, RDF and GBM gave higher recognition results for the happy and sad emotions, RDF gave a higher recognition result for the fear emotion, and ABC and BSVM gave higher recognition results for the sad emotion than the traditional and bottleneck features [27].


**Table 9.** Percentage precision, recall and F1-score and accuracy on SAVEE MFCC2.

Table 10 shows the analysis results of the percentage precision, recall, F1-score and accuracy obtained after performing classification with the SAVEE HAF. All the emotions were much easier to recognize using the HAF, which tremendously improved the recognition of the fear emotion, reported in the literature to be difficult, with lower performance recorded for the traditional and bottleneck features [27]. Moreover, Kerkeni et al. [11] obtained a recognition accuracy of 76.16% for the fear emotion using MFCC and MS features on the Spanish database of seven emotions. These results show that using the HAF with ensemble learning is highly promising for recognizing emotions in human speech. The study findings are consistent with the literature in that ensemble learning gives better predictive performance through the fusion of the predictions of multiple inducers [75,85]. The ability of ensemble learning algorithms to mimic human nature by seeking opinions from several inducers before making an informed decision distinguishes them from single inducers [73]. Moreover, it can be seen across Tables 4–10 that the HAF set presented the most effective acoustic features for speech emotion recognition, while the MFCC1 set presented the worst. In addition, the study results support the hypothesis that a combination of acoustic features based on prosodic and spectral characteristics reflects the important properties of speech [42,53,54].


**Table 10.** Percentage precision, recall and F1-score and accuracy on SAVEE HAF.

Tables 4–10 show that the HAF features had higher discriminating power, because higher performance results were achieved across all emotion classes in the SAVEE and RAVDESS corpora. The performance of each ensemble learning algorithm was generally lower when MFCC1 and MFCC2 were applied because these feature sets had much lower discriminating power than the HAF. The recognition results differed across the experimental databases because RAVDESS has more instances than SAVEE. Moreover, SAVEE contains male speakers only, while RAVDESS comprises both male and female speakers; according to the literature, this diversity has an impact on emotion recognition performance [35]. The literature review revealed that most emotion recognition models have not been able to recognize the fear emotion with a classification accuracy as high as 98%, but the results in Table 4 show that the HAF significantly improved emotion recognition performance beyond expectation.
