*3.2. Feature Extraction*

The speech signal carries a wealth of useful information reflecting emotion as well as speaker characteristics such as gender, age, stuttering and identity. Feature extraction is an important mechanism in audio processing for capturing this information, and most related studies have emphasized the extraction of low-level acoustic features for speech recognition. Feature representation plays a prominent role in distinguishing speakers from one another. Since there is no universal consensus on the best features for speech recognition, some authors have argued that prosody carries most of the useful information about emotion and can provide a suitable feature representation [20]. Spectral features, in turn, describe the properties of a speech signal in the frequency domain, capturing important aspects of speech and reflecting the correlation between vocal movements and changes in vocal tract shape [35,42]. Prosodic and spectral acoustic features have been suggested as the most important characteristics of speech because their integration improves recognition results [25,42,53,54]. Blending prosodic and spectral features can enhance emotion recognition performance, giving better recognition accuracy in a gender-dependent model than many existing systems with the same number of emotions [25]. Banse et al. [54] examined vocal cues for 14 emotions, combining prosodic speech features with spectral information from voiced and unvoiced segments, and achieved impressive results.

The present authors applied the selective approach [35] to extract prosodic and spectral features, focusing on features that could be useful for improving emotion recognition. These are features that have produced positive results in related studies and in other comparable tasks. We used the specialized jAudio software [55] to construct MFCC1, MFCC2 and HAF, three feature sets representing different aspects of the voice. MFCC1 consists of MFCC features only, inspired by previous applications of MFCCs [24,40,42]. MFCC2 combines MFCCs with energy, ZCR and fundamental frequency, inspired by the fusion of MFCCs with other acoustic features [11,25,26,44,46]. HAF is the proposed set of hybrid acoustic features, combining prosodic and spectral features carefully selected on the basis of promising results in the literature [11,20,24–26,29,30,39,40,42,44,46,48].

The feature extraction process was supported by jAudio, which helped manage the inherent complexity of the selection process. jAudio is a digital signal processing library of advanced audio feature extraction algorithms designed to eliminate the duplication of effort involved in manually computing emotion features from an audio signal. It provides a unique method of handling multidimensional features and feature dependencies, and allows new features to be developed iteratively from existing ones through its meta-feature template. Moreover, it reduces the rigidity of existing feature extraction methods by allowing low-level features to be fused into increasingly high-level, musically meaningful features. In addition, it represents audio samples as simple feature vectors and offers aggregation functions that collapse a sequence of separate feature sets into a single feature vector.
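To make the aggregation step concrete, the short sketch below collapses a frame-level feature matrix into one fixed-length vector per utterance by taking the mean and standard deviation over frames. This mirrors the role of jAudio's aggregation functions in spirit only; the choice of statistics is an assumption for illustration, not jAudio's actual implementation.

```python
import numpy as np


def aggregate(frame_features):
    """Collapse a (n_features, n_frames) matrix into a single per-utterance vector
    by concatenating the per-feature mean and standard deviation over frames."""
    return np.concatenate([frame_features.mean(axis=1),
                           frame_features.std(axis=1)])


# Usage: turn the HAF frame sequence from the previous sketch into a
# fixed-length vector suitable for a classifier.
# haf_vector = aggregate(haf)   # shape: (2 * n_features,)
```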
