**1. Introduction**

Emotion plays an important role in daily interpersonal interactions and is considered an essential skill for human communication [1]. It helps humans understand the opinions of others by conveying feelings and giving feedback. Emotion also benefits affective computing and cognitive activities such as rational decision making, perception and learning [2]. It has opened up an exciting research agenda because constructing an intelligent robotic dialogue system that can recognize emotions and respond precisely in the manner of human conversation remains difficult. The demand for emotion recognition is steadily increasing with the pervasiveness of intelligent systems [3]. Huawei intelligent video surveillance systems, for instance, can support real-time tracking of a person in distress through emotion recognition. The capability to recognize human emotions is considered an essential future requirement of intelligent systems that are expected to interact with people with a certain degree of emotional intelligence [4]. Developing emotionally intelligent systems is exceptionally important for the modern internet of things (IoT) society because such systems have a great impact on decision making, social communication and smart connectivity [5].

Practical applications of emotion recognition systems can be found in many domains such as audio/video surveillance [6], web-based learning, commercial applications [7], clinical studies, entertainment [8], banking [9], call centers [10], computer games [11] and psychiatric diagnosis [12]. Other real applications include remote tracking of persons in distress, communication between humans and robots, mining the sentiments of sports fans and customer care services [13], where emotion is perpetually expressed. These numerous applications have led to the development of emotion recognition systems that use facial images, video files or speech signals [14]. In particular, speech signals carry emotional messages during their production [15] and have led to the development of intelligent systems commonly called speech emotion recognition systems [16]. Speech signals have intrinsic socioeconomic advantages that make them a good source for affective computing: they are economically easier to acquire than other biological signals such as electroencephalograms, electrooculograms and electrocardiograms [17], which makes speech emotion recognition research attractive [18]. Machine learning algorithms extract a set of speech features, with a variety of transformations, to classify emotions into different classes. However, the set of features chosen to train the selected learning algorithm is one of the most important factors in developing an effective speech emotion recognition system [3,19]. Research has suggested that the features extracted from the speech signal have a great effect on the reliability of speech emotion recognition systems [3,20], but selecting an optimal set of features is challenging [21].
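To make this feature-extraction step concrete, the following is a minimal sketch of how frame-level acoustic features of the kind discussed here and below (MFCCs, energy, zero crossing rate and fundamental frequency) can be pooled into a fixed-length vector and passed to a classifier. It assumes the librosa and scikit-learn libraries; the file names, sampling rate, pooling statistics and classifier are illustrative placeholders, not the configuration used in this study.

```python
# Illustrative sketch only: extracting common acoustic features and
# training a simple classifier. Assumes librosa and scikit-learn; the
# paths, labels and parameters below are hypothetical placeholders.
import numpy as np
import librosa
from sklearn.svm import SVC

def acoustic_features(path: str) -> np.ndarray:
    """Pool frame-level features into one fixed-length utterance vector."""
    y, sr = librosa.load(path, sr=16000)                       # mono, 16 kHz
    mfcc = librosa.feature.mfcc(y=y, sr=sr, n_mfcc=13)         # spectral shape
    rms = librosa.feature.rms(y=y)                             # frame energy
    zcr = librosa.feature.zero_crossing_rate(y)                # noisiness
    f0 = librosa.yin(y, fmin=50, fmax=400, sr=sr)[np.newaxis]  # pitch track
    # A simple "transformation": summarize each feature stream by its
    # mean and standard deviation over time, then concatenate them all.
    streams = [mfcc, rms, zcr, f0]
    return np.concatenate(
        [np.r_[s.mean(axis=1), s.std(axis=1)] for s in streams]
    )

# Hypothetical labelled utterances; one pooled vector per file.
X = np.vstack([acoustic_features(p) for p in ["clip1.wav", "clip2.wav"]])
labels = ["anger", "fear"]
clf = SVC(kernel="rbf").fit(X, labels)
```

In a sketch like this, the choice and combination of feature streams, rather than the classifier alone, largely determines recognition reliability, which is the difficulty the present study addresses.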

Speech emotion recognition is a difficult task for several reasons, such as the ambiguous definition of emotion [22] and the blurred boundaries between different emotions [23]. Researchers are investigating different heterogeneous sources of features to improve the performance of speech emotion recognition systems. In [24], the performances of Mel-frequency cepstral coefficient (MFCC), linear predictive cepstral coefficient (LPCC) and perceptual linear prediction (PLP) features were examined for recognizing speech emotion, achieving a maximum accuracy of 91.75% on an acted corpus with PLP features. This accuracy is relatively low compared to the 95.20% obtained by fusing audio features based on a combination of MFCC and pitch [25]. Other researchers have tried to combine different acoustic features in the hope of boosting the accuracy and precision rates of speech emotion recognition [25,26]. This technique has shown some improvement; nevertheless, it has yielded low accuracy rates for the fear emotion in comparison with other emotions [25,26]. Semwal et al. [26] fused acoustic features such as MFCC, energy, zero crossing rate (ZCR) and fundamental frequency, which gave an accuracy of 77.00% for the fear emotion. Similarly, Sun et al. [27] used a deep neural network (DNN) to extract bottleneck features, achieving an accuracy of 62.50% for the fear emotion. The overarching objective of this study was to construct a set of hybrid acoustic features (HAFs) to improve the recognition of the fear emotion and other emotions from speech signals with high precision. This study has contributed to the understanding of the topical theme of speech emotion recognition in the following unique ways.

The content of this paper is succinctly organized as follows. Section 1 provides the introduction, including the objective and contributions of the study. Section 2 discusses related studies in chronological order. Section 3 describes the experimental databases and the study methods. The details of the results and the concluding statements are given in Sections 4 and 5, respectively.
