Article
Peer-Review Record

A Novel Heterogeneous Parallel Convolution Bi-LSTM for Speech Emotion Recognition

Appl. Sci. 2021, 11(21), 9897; https://doi.org/10.3390/app11219897
by Huiyun Zhang 1,2, Heming Huang 1,2,* and Henry Han 3
Reviewer 1: Anonymous
Reviewer 2:
Submission received: 15 August 2021 / Revised: 3 October 2021 / Accepted: 11 October 2021 / Published: 22 October 2021

Round 1

Reviewer 1 Report

The article is within the scope of the journal. The subject of the article is interesting, and the results constitute an advance in the area of knowledge.

It is well written and easy to read.

However, it is necessary to make some changes:
a) The related work is short. An extension of it should be done.
b) The Discussion and Conclusions section should be separated. The discussion section should be aimed at comparing the results presented with other similar works showing the progress and limitations. The conclusions section should be a synthesis of the scientific contribution. Likewise, future lines of work must be included.
c) The bibliography does not follow the style of the journal. For example, in journals, the year should appear in bold.

Author Response

Original Manuscript ID: applsci-1362292

Original Article Title: “A Novel Heterogeneous Parallel Convolution Bi-LSTM for Speech Emotion Recognition”

To: Applied Sciences Editorial Office

Dear Editor,

It is our pleasure to receive your helpful review. We sincerely thank you for allowing a resubmission of our manuscript, with an opportunity to address the reviewers’ comments. We have learned much from the reviewers’ comments, which are reasonable, encouraging, and constructive. After carefully studying the comments, we have made the corresponding changes. Moreover, the revised manuscript and the detailed responses to the comments are attached to this letter.

We are uploading (a) our point-by-point response to the comments (below) (response to reviewers), (b) an updated manuscript with ‘*’ highlighting indicating changes, and (c) a clean updated manuscript without highlights.


Best regards,

Huiyun Zhang, Heming Huang*, and Henry Han

Re: Response to reviewers

It is our pleasure to receive your helpful comments and advice, which offer great support for our work. The following are the revisions we have made in response to the reviewers’ suggestions on an item-by-item basis.

Reviewer#1, Concern # 1: The related work is short. An extension of it should be done.

Author response: The authors are grateful for your valuable suggestion!

Author action: In response to the reviewer’s comment, the authors have supplemented the related work according to the expert’s opinions and have reorganized Section II. In the revised manuscript, we have added detailed explanations of speech emotion feature extraction and acoustic modeling. All the modifications are marked up using *Track Changes* (see the related work section for details).

That is, we revised “Generally, SER contains the undermentioned steps: corpus recording, signal pre-processing, emotion feature extraction, and classifier construction [7], etc. Among these, emotion feature extraction is a principal step that extracts representative features for the downstream classification, and the classifier is the key part of a SER system that produces the final SER results.

So far, a variety of Low-Level Descriptor (LLD) features have been used for SER [8], and MFCC is one of them [9]. There are other approaches (e.g., openSMILE) that extract higher-level derivatives of the LLD features to obtain deeper feature representations [9]. Besides, Chroma features can also represent emotion well. Compared to their peers, these features are more representative in capturing the affective information from both the frequency and time domains for each frame in SER [7].

Traditionally, features are fed into acoustic models, and the recognition results are acquired through such machine learning based acoustic models as Gaussian Mixture Model (GMM), Hidden Markov Model (HMM), Support Vector Machine (SVM), and so on [10-12]. These models usually achieve good performance on small-scale data rather than large-scale data.

With the development of deep learning technology, a variety of Artificial Neural Networks (ANNs) [13] have been introduced to construct speech recognition classifiers. Compared with the early methods, when handling large-scale data, ANNs have better performance for their powerful capabilities in feature extraction and learning. Some representative deep acoustic models have been proposed, such as the Convolutional Neural Network (CNN), Recurrent Neural Network (RNN), and Long Short-Term Memory (LSTM) [14-16].

The successful applications of the deep learning models have obtained exciting outcomes in SER research, and this motivates us to develop more powerful models to recognize speech emotion. The recognition capability of a single network is usually limited. Therefore, the combination of different neural networks is suggested in quite a few previous works. Chen et al. proposed an ACRNN model that integrated CNN with LSTM, and 3D spectral features were used as the input of the acoustic model [17]. Trigeorgis et al. combined CNN with RNN, and they segmented the original audio data into equal-length speech fragments as the input of the classifier [18]. Sainath et al. proposed a CLDNN model consisting of a few convolution layers, LSTM layers, and fully connected layers in that order [19]. The CLDNN model, trained on the log-Mel filter bank energies [20] and on the raw waveform speech signals [21], outperformed both CNN and LSTM.

Inspired by the above research works, a distinctive classifier, called Heterogeneous Parallel Convolution Bi-LSTM and abbreviated as HPCB hereafter, is proposed for SER. To exploit the spatiotemporal information more effectively, HPCB employs novel heterogeneous parallel learning structures. Furthermore, multi-features are used to mine and learn the complete emotional details in a more robust and effective way. HPCB demonstrates an advantage over the previous methods in the literature on the benchmark databases EMODB, CASIA, and SAVEE [22-24].

The remainder of the study is arranged as follows. Section 2 describes the details of the proposed model HPCB. Section 3 presents the experiments, and Section 4 concludes with the advantages of the proposed model and possible research directions.”

as “*As the fundamental research of affective computing, SER has become an active research area.* Generally, SER contains the undermentioned steps: corpus recording, signal preprocessing, emotion feature extraction, and classifier construction [9], etc. Among these, emotion feature extraction is a principal step that extracts representative features for the downstream classification, and the classifier is the key part of a SER system that produces the final SER results.

*Conventionally, emotion recognition systems are trained with supervised learning solutions. The generalization of the models is often emphasized by training on a variety of samples with diverse labels [10]. Generally, labels for emotion recognition tasks are collected with perceptual evaluations from multiple evaluators. The raters annotate samples by listening to or watching the stimulus. This evaluation procedure is cognitively intense and expensive [11]. Therefore, standard benchmark datasets for SER have a limited number of sentences with emotional labels, often collected from a limited number of evaluators.*

*2.1. Emotion feature extraction

One important step towards intelligent HCI is SER, since human emotion contains important information for machine response generation. Two of the most challenging problems in feature extraction are the extraction of frame-based high-level feature representations and the construction of utterance-level features [12, 13]. Speech signals are considered to be approximately stationary in small frames. Some acoustic features extracted from short frames, e.g., pitch, Mel-Frequency Cepstral Coefficients (MFCC), Linear Prediction Cepstral Coefficients (LPCC), and prosodic features, are believed to be influenced by emotions and can provide detailed, emotionally relevant local information. These frame-based features are often referred to as low-level features [14].* So far, a variety of Low-Level Descriptor (LLD) features have been used for SER. *For example, Schmitt et al. used a bag-of-audio-words (BoAW) approach, created from MFCC and energy Low-Level Descriptors (LLDs), as the feature vector and a simple Support Vector Regression (SVR) to predict the emotion [15].*

*Based on the frame-based low-level features, neural networks are utilized to extract neural representations frame by frame, which are referred to as frame-based high-level feature representations [16]. Recently, SER has made great progress by introducing neural networks to extract high-level neural hidden representations. For example, [16, 17] propose to apply neural networks to low-level features, e.g., pitch and energy, to learn high-level features, i.e., the neural network outputs. Trigeorgis et al. proposed an end-to-end model comprising a CNN architecture used to extract features before feeding a Bi-directional LSTM (BLSTM) to model the temporal dynamics in the data [18]. Neumann et al. proposed an attentive convolutional neural network (ACNN) that combines CNNs with attention [19].

However, emotion recognition at the utterance level requires a global feature representation, which contains both detailed local information and global characteristics related to emotion. Based on the frame-based high-level features learned, various methods are used to construct the utterance-level features. [17] proposes to use an extreme learning machine (ELM) upon utterance-level statistical features. The utterance-level features are statistics of segment-level deep neural network (DNN) output probabilities, where each segment is a stack of neighboring frames. [20] proposes to introduce recurrent networks to increase the ability of the model to capture temporal information. However, only the final states of the recurrent layers are used for classification, which may lead to a loss of detailed information for emotion classification, since all information is stored in the fixed-size final states. [21] explores pooling utterance-level features from high-level features output by CNNs with attention weights, because not all regions in the spectrogram contain information useful for emotion recognition. By pooling, it also avoids squeezing all the information into one fixed-size vector.*

Based on this, we propose a new multi-feature fusion representation method for discrete affect recognition. Unlike most of the studies in the literature, our feature-fusion method was inspired by the way conventional speech features, like Mel-Frequency Cepstral Coefficients (MFCCs), are computed. That is, 32D Low-Level Descriptor (LLD) features, including 12D Chroma [22] and 20D MFCC [23], are extracted. The High-Level Statistical Functions (HSF), such as the mean of Chroma and the mean, variance, and maximum of MFCC, are calculated. In total, 72D acoustic features are used as the input of the model. Compared to their peers, these features are more representative in capturing the affective information from both the frequency and time domains for each frame in SER [24]. *Finally, our model surpasses the state-of-the-art studies for the databases EMODB [25], CASIA [26], and SAVEE [27].*
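[Editorial note: as a concrete reading aid, the 72D representation described above (12D Chroma mean + mean, variance, and maximum of 20D MFCC) can be assembled with librosa, the toolkit the authors mention later in this record. The sketch below is illustrative only; the sampling rate and exact extraction parameters are assumptions, not the authors' implementation.]

```python
import librosa
import numpy as np

def extract_features(wav_path, sr=16000):
    """Illustrative 72D utterance-level vector: HSF over frame-level LLDs (12D Chroma + 20D MFCC)."""
    y, _ = librosa.load(wav_path, sr=sr)
    n_fft = int(0.025 * sr)        # 25 ms analysis window, as described in the response
    hop_length = int(0.010 * sr)   # 10 ms shifting step

    # Frame-level LLDs: 20D MFCC and 12D Chroma (32D per frame)
    mfcc = librosa.feature.mfcc(y=y, sr=sr, n_mfcc=20, n_fft=n_fft, hop_length=hop_length)
    chroma = librosa.feature.chroma_stft(y=y, sr=sr, n_fft=n_fft, hop_length=hop_length)

    # Sentence-level HSF statistics:
    #   mean of Chroma (12D) + mean, variance, and maximum of MFCC (3 x 20D) = 72D
    hsf = np.concatenate([
        chroma.mean(axis=1),
        mfcc.mean(axis=1),
        mfcc.var(axis=1),
        mfcc.max(axis=1),
    ])
    return hsf  # shape: (72,)
```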

*2.2. Speech emotion recognition model*

Traditionally, features are fed into acoustic models, and the recognition results are acquired through such machine learning-based acoustic models as Gaussian Mixture Model (GMM), Hidden Markov Model (HMM), Support Vector Machine (SVM), and so on [28-30]. These models usually achieve good performance on small-scale data rather than large-scale data.

In recent years, with the development of deep learning technology, a variety of Artificial Neural Networks (ANNs) [31] have been introduced to construct SER classifiers. A number of studies in the literature have focused on predicting emotion from speech using DNNs. For example, Wöllmer et al. [32] were among the first to propose a DNN architecture for affective computing, which comprised a three-layer LSTM and was trained on functionals of acoustic Low-Level Descriptors (LLDs). Stuhlsatz et al. [33] used Restricted Boltzmann Machines (RBMs) to extract discriminative features from the raw signal and proposed a Generalized Discriminant Analysis (GerDA). Sainath et al. [34, 35] proposed a Convolutional, Long Short-Term Memory Deep Neural Network (CLDNN) model for a speech recognition task that is able to reduce temporal and frequency variations.

Compared with the early methods, when handling large-scale data, ANNs have better performance for their powerful capabilities in feature extraction and learning. Some representative deep acoustic models have been proposed, such as the Convolutional Neural Network (CNN), Recurrent Neural Network (RNN), and Long Short-Term Memory (LSTM) [36-38]. Among these state-of-the-art deep learning-based models, the CNN-based models show more powerful performance in representation learning. The structure of a CNN usually consists of pooling and convolutional layers. The max-pooling layer, one of the most common pooling layers, inevitably drops the non-maximal information, which can also be useful in SER. The convolutional layer considers the information in a receptive field and extracts features in this local region. However, the temporal information in the emotional speech is ignored. To overcome this disadvantage of CNNs, Bi-LSTM is introduced to process time-series information.
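[Editorial note: the information loss mentioned above can be seen in a minimal, generic max-pooling example (Keras); this toy snippet is not taken from the manuscript.]

```python
import numpy as np
from tensorflow.keras import layers

# 1D max-pooling: non-maximal values inside each pool window are discarded
x = np.array([[[1.0], [4.0], [2.0], [3.0]]])   # shape (batch=1, steps=4, channels=1)
pooled = layers.MaxPooling1D(pool_size=2)(x)
print(np.squeeze(pooled.numpy()))               # [4. 3.] -- the 1.0 and 2.0 are dropped
```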

The successful applications of the deep learning models have obtained exciting outcomes in SER research, and this motivates us to develop more powerful models to recognize speech emotion. The recognition capability of a single network is usually limited. Therefore, the combination of different neural networks is suggested in quite a few previous works. Chen et al. proposed an ACRNN model that integrated CNN with LSTM, and 3D spectral features were used as the input of the acoustic model [39]. Trigeorgis et al. combined CNN with RNN, and they segmented the original audio data into equal-length speech fragments as the input of the classifier [18]. Sainath et al. proposed a CLDNN model consisting of a few convolution layers, LSTM layers, and fully connected layers in that order [40]. The CLDNN model, trained on the log-Mel filter bank energies and on the waveform speech signals, outperformed both CNN and LSTM [41].

*Inspired by the above research works, a distinctive classifier, called Heterogeneous Parallel Convolution Bi-LSTM and abbreviated as HPCB hereafter, is proposed for SER. To exploit the spatiotemporal information more effectively, HPCB employs novel heterogeneous parallel learning structures. Furthermore, multi-features are used to mine and learn the complete emotional details in a more robust and effective way. Besides, HPCB demonstrates an advantage over the previous methods in the literature on the benchmark databases EMODB, CASIA, and SAVEE [25-27].

The model with the heterogeneous parallel structure achieves improvements over the baselines in most cases. This study includes the following contributions:

(1) 32D LLD features at the frame level are extracted, and their HSF features at the sentence level are computed. In total, 72D acoustic features are used as the input of the model.

(2) The proposed HPCB architecture can be trained with features extracted at the sentence-level (high-level descriptors), or at the frame-level (low-level descriptors).

(3) We provide a comprehensive analysis of training Heterogeneous Parallel networks for SER, showing its capability in within-corpus evaluations, where we observe performance gains.

The rest of the paper is organized as follows. Section III presents the proposed Heterogeneous Parallel architecture. Section IV gives details on the experimental setup, including the databases and features used in this study. Section V presents the exhaustive experimental evaluations, showing the benefits of the proposed architecture. Finally, Section VI provides the concluding remarks, discussing potential areas of improvement.*

Reviewer#1, Concern # 2: The Discussion and Conclusions section should be separated. The discussion section should be aimed at comparing the results presented with other similar works showing the progress and limitations. The conclusions section should be a synthesis of the scientific contribution. Likewise, future lines of work must be included.

Author response: Thank you very much for your helpful comment.

Author action: We have written the Discussion and Conclusions sections separately according to the reviewer’s comments. That is, the content of the discussion section is newly added, as follows:

*In this section, we further analyzed the effectiveness and robustness of the proposed system on the databases CASIA, EMODB, and SAVEE. We used the common evaluation metrics, such as weighted accuracy, unweighted accuracy, and F1-score, to estimate the class-level and the overall accuracy. In order to measure the model prediction performance between the actual and the predicted labels, the confusion matrix of each dataset is shown. The confusion between the actual and the predicted labels of each class is shown in the corresponding rows and columns of the confusion matrix. We conducted comprehensive experiments on the three datasets to show the model prediction performance in terms of precision, recall, F1-score, weighted, and unweighted results. We choose an optimal model combination for an efficient SER system.

The experimental results show that, for SER on different datasets, HPCB achieves higher performance compared with the other methods. The advantage of the CNN network is that it shares convolution kernels and automatically performs feature extraction, making it suitable for high-dimensional data. At the same time, however, the pooling layer loses a lot of valuable information by ignoring the correlation between the local and the whole. This makes CNN fail to obtain high accuracy in learning time series. When dealing with tasks related to time series, LSTM is usually more appropriate. However, in terms of classification (including SER), LSTM has an obvious disadvantage in performance. Therefore, HPCB gets an obvious advantage in the variable mapping, preserving the more valuable information, and it also performs well on tasks sensitive to time series. It can be concluded that the proposed model has excellent generalization ability for SER tasks.*
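[Editorial note: for readers unfamiliar with the metrics named above, they are commonly computed as in the generic scikit-learn sketch below, using the usual SER convention that weighted accuracy is the overall accuracy and unweighted accuracy is the mean per-class recall. The label arrays are toy values; the exact definitions used by the authors may differ in detail.]

```python
from sklearn.metrics import (accuracy_score, balanced_accuracy_score,
                             confusion_matrix, f1_score)

# Toy predictions for a 3-class example (labels are hypothetical)
y_true = [0, 0, 1, 1, 1, 2, 2, 2, 2]
y_pred = [0, 1, 1, 1, 2, 2, 2, 2, 0]

wa = accuracy_score(y_true, y_pred)             # weighted accuracy: overall accuracy
ua = balanced_accuracy_score(y_true, y_pred)    # unweighted accuracy: mean per-class recall
f1 = f1_score(y_true, y_pred, average="macro")  # macro-averaged F1-score
cm = confusion_matrix(y_true, y_pred)           # rows: actual labels, columns: predicted labels

print(f"WA={wa:.3f}  UA={ua:.3f}  macro-F1={f1:.3f}")
print(cm)
```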

Reviewer#1, Concern # 3: The bibliography does not follow the style of the journal. For example, in journals, the year should appear in bold.

Author response: Thank you very much for your suggestion.

Author action: In response to the reviewer’s comment, we have revised the format of all references in the paper according to the reviewer’s requirements. See the references for details.

Author Response File: Author Response.pdf

Reviewer 2 Report

The authors present a system for speech emotion recognition, based on Parallel Convolution Bi-LSTM deep learning model. The paper is quite well written and the results seem very promising.

However, it is not completely clear to me the author’s rationale behind the adopted model design. Recently, in literature, we are seeing an increasing adoption of deep learning models as black-box systems; this is the reason why explainable deep learning and explainable artificial intelligence topics are gaining more and more interest in the scientific community. In the specific case of the presented paper, it would be interesting to better understand why the authors have chosen a parallel branch (in the specific, a two dense-layer and BiLSTM branch, and a dense-convolutional-BiLSTM branch). As the authors state: “The purpose of designing the two heterogeneous branches is to project the original data into different transformation spaces for calculation, so as to better represent the original emotional speech”. Why the parallel representation “better represents” the original signal? Actually, the dense-convolutional-BiLSTM branch can extract spatio-temporal features of the input signal, so what is the rationale of adding a parallel dense-BiLSTM branch for an additional temporal features extraction?

The proposed model has been derived by simply tuning/optimizing different combinations of parameters, hyperparameters, layer architectures etc. (and this could lead to results which may depend on the specific datasets used for training, possibly not being able to generalize well with other datasets?), or is there a more theoretical background behind this choice? It would be interesting to have further information on this.

Moreover, it is not clear to me how the procedure adopted by authors for splitting the datasets into train and test sets can preserve the temporal information to be learned by the BiLSTM layers. Actually, as the authors state, “The samples of each dataset are randomly divided into 5 equal parts, and 4 parts are used as the training data while the remaining one is used as the testing set”. According to the input data preparation described by the authors, “Each speech is segmented into frames with a 25ms window and 10ms shifting step 166 size”; does this mean that each sample is represented by a single frame? Or, rather, the input features are computed on sequences of signal frames, as typically done in LSTM implementation? I think this should be better clarified, since random split for train and test sets for cross-validation is not typically implemented in the design of recurrent and LSTM neural network architectures, at least if you want to keep the long term memory (i.e., to preserve the state cell information). Is this not the case? Further explanation regarding these aspects would be of interest, in my opinion.

Finally, some minor comments regarding the English form: the form is clear, I would only suggest an additional proof-reading, there are just a few typos which I am listing in the following:

Line 110 (page 3): “…it also contributes to capture and retrieval…” should be “…it also contributes to capture and retrieve…”

Line 153 (page 4): “…if the samples of database EMODB is projected…” should be “…if the samples of database EMODB are projected…”

Author Response

Original Manuscript ID: applsci-1362292

Original Article Title: “A Novel Heterogeneous Parallel Convolution Bi-LSTM for Speech Emotion Recognition”

To: Applied Sciences Editorial Office

Dear Editor,

It is our pleasure to receive your helpful review. We sincerely thank you for allowing a resubmission of our manuscript, with an opportunity to address the reviewers’ comments. We have learned much from the reviewers’ comments, which are reasonable, encouraging, and constructive. After carefully studying the comments, we have made the corresponding changes. Moreover, the revised manuscript and the detailed responses to the comments are attached to this letter.

We are uploading (a) our point-by-point response to the comments (below) (response to reviewers), (b) an updated manuscript with ‘*’ highlighting indicating changes, and (c) a clean updated manuscript without highlights.

Best regards,

Huiyun Zhang, Heming Huang*, and Henry Han

Re: Response to reviewers

It is our pleasure to receive your helpful comments and advice, which offer great support for our work. The following are the revisions we have made in response to the reviewers’ suggestions on an item-by-item basis.

Comments and Suggestions for Authors: The authors present a system for speech emotion recognition, based on Parallel Convolution Bi-LSTM deep learning model. The paper is quite well written and the results seem very promising.

Reviewer#2, Concern # 1: However, it is not completely clear to me the author’s rationale behind the adopted model design. Recently, in literature, we are seeing an increasing adoption of deep learning models as black-box systems; this is the reason why explainable deep learning and explainable artificial intelligence topics are gaining more and more interest in the scientific community. In the specific case of the presented paper, it would be interesting to better understand why the authors have chosen a parallel branch (in the specific, a two dense-layer and BiLSTM branch, and a dense-convolutional-BiLSTM branch). As the authors state: “The purpose of designing the two heterogeneous branches is to project the original data into different transformation spaces for calculation, so as to better represent the original emotional speech”. Why the parallel representation “better represents” the original signal?

Author response: The authors are grateful for your valuable suggestion!

Author action: The structures of the two branches are different, which means that two different operations are carried out on the original data; that is, the original data are projected into two subspaces for transformation operations, so that speech emotion features can be extracted at different levels. In addition, a single branch may lose information useful for emotion recognition during dropout. The two heterogeneous branches can achieve information complementarity, so as to better characterize the emotion features of speech.

Reviewer#2, Concern # 2: Actually, the dense-convolutional-BiLSTM branch can extract spatio-temporal features of the input signal, so what is the rationale of adding a parallel dense-BiLSTM branch for an additional temporal features extraction?

Author response: Thank you very much for your helpful comment.

Author action: Firstly, the dense-convolutional-BiLSTM branch can extract the temporal and spatial information of speech emotion: it first extracts the low-level features of the original data and then extracts their spatial and temporal information, while the dense-BiLSTM branch first extracts the low-level features of the original data and then extracts their temporal information. The operation processes of the two branches are different; that is, the original data are transformed differently. Secondly, the ablation study shows that adding parallel branches can improve the network performance.
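[Editorial note: to make the two-branch layout discussed above easier to picture, here is a minimal Keras sketch of a heterogeneous parallel model with a dense–Bi-LSTM branch and a dense–convolution–Bi-LSTM branch fused by concatenation. The layer sizes, kernel size, input shaping (TIME_STEPS), and fusion strategy are assumptions for illustration; this is not the authors' exact HPCB configuration.]

```python
from tensorflow.keras import layers, models

NUM_CLASSES = 7                 # e.g., EMODB distinguishes 7 emotion classes
TIME_STEPS, FEAT_DIM = 10, 72   # 72D fused features; sequence length is a placeholder

inp = layers.Input(shape=(TIME_STEPS, FEAT_DIM))

# Branch 1: dense layers followed by Bi-LSTM (temporal view of the projected features)
b1 = layers.TimeDistributed(layers.Dense(128, activation="relu"))(inp)
b1 = layers.TimeDistributed(layers.Dense(64, activation="relu"))(b1)
b1 = layers.Bidirectional(layers.LSTM(64))(b1)

# Branch 2: dense -> 1D convolution -> Bi-LSTM (spatio-temporal view)
b2 = layers.TimeDistributed(layers.Dense(128, activation="relu"))(inp)
b2 = layers.Conv1D(64, kernel_size=3, padding="same", activation="relu")(b2)
b2 = layers.Bidirectional(layers.LSTM(64))(b2)

# Heterogeneous parallel fusion of the two branches, then classification
merged = layers.concatenate([b1, b2])
merged = layers.Dropout(0.5)(merged)
out = layers.Dense(NUM_CLASSES, activation="softmax")(merged)

model = models.Model(inp, out)
model.compile(optimizer="adam", loss="categorical_crossentropy", metrics=["accuracy"])
model.summary()
```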

Reviewer#2, Concern # 3: The proposed model has been derived by simply tuning/optimizing different combinations of parameters, hyperparameters, layer architectures etc. (and this could lead to results which may depend on the specific datasets used for training, possibly not being able to generalize well with other datasets?), or is there a more theoretical background behind this choice? It would be interesting to have further information on this.

Author response: Thank you very much for your helpful comment.

Author action: The authors fully agree with you. Indeed, strict parameter tuning on a single database would lead to strong model dependence and weak generalization ability. However, the authors only fine-tuned the parameters and carried out experiments with the same parameter configuration on three different databases. The final experiments show that the proposed model achieves ideal performance on all three databases after parameter tuning.

Reviewer#2, Concern # 4: Moreover, it is not clear to me how the procedure adopted by authors for splitting the datasets into train and test sets can preserve the temporal information to be learned by the BiLSTM layers. Actually, as the authors state, “The samples of each dataset are randomly divided into 5 equal parts, and 4 parts are used as the training data while the remaining one is used as the testing set”. According to the input data preparation described by the authors, “Each speech is segmented into frames with a 25ms window and 10ms shifting step 166 size”; does this mean that each sample is represented by a single frame? Or, rather, the input features are computed on sequences of signal frames, as typically done in LSTM implementation? I think this should be better clarified, since random split for train and test sets for cross-validation is not typically implemented in the design of recurrent and LSTM neural network architectures, at least if you want to keep the long term memory (i.e., to preserve the state cell information). Is this not the case? Further explanation regarding these aspects would be of interest, in my opinion.

Author response: Thank you very much for your constructive comment.

Author action: For the database division: in each experiment, the database is randomly divided into five parts, of which four parts are used as training data and the remaining part is used as test data. Experiments are repeated 10 times, and the average value over all trials is computed. There is no direct causal relationship between the database division and the model construction. Generally, data preparation is a stage prior to model training, and the models are trained based on the data.

For the data preprocessing stage: each speech is segmented into frames with a 25ms window and a 10ms shifting step size. Low-Level Descriptor (LLD) features, such as MFCC and Chroma, are extracted from each frame; then, the High-Level Statistical Functions (HSF) of these LLD features, i.e., the mean and the variance, are calculated, so as to obtain the whole sentence-level representation. The LLD features here are extracted by librosa rather than being deep features extracted by Bi-LSTM. The feature extraction using heterogeneous branches mentioned in this paper refers to the recalculation of the manually extracted features (i.e., LLD and its HSF); it plays two roles: reprocessing the features and acting as a classifier.
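[Editorial note: the division procedure described above (a random 4:1 train/test split per run, repeated 10 times with results averaged) can be sketched as follows. The use of stratification, the classifier, and the scoring are placeholders, not the authors' setup.]

```python
import numpy as np
from sklearn.model_selection import train_test_split

def evaluate(model_factory, X, y, n_runs=10, test_size=0.2, seed=0):
    """One random 4:1 (train:test) split per run, repeated n_runs times and averaged."""
    scores = []
    for run in range(n_runs):
        X_tr, X_te, y_tr, y_te = train_test_split(
            X, y, test_size=test_size, stratify=y, random_state=seed + run)
        model = model_factory()            # placeholder: build a fresh classifier each run
        model.fit(X_tr, y_tr)
        scores.append(model.score(X_te, y_te))
    return float(np.mean(scores))

# Usage (hypothetical): evaluate(lambda: SVC(), X, y) with X of shape (n_utterances, 72)
```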

Reviewer#2, Concern # 5: Finally, some minor comments regarding the English form: the form is clear, I would only suggest an additional proof-reading, there are just a few typos which I am listing in the following: Line 110 (page 3): “…it also contributes to capture and retrieval…” should be “…it also contributes to capture and retrieve…” Line 153 (page 4): “…if the samples of database EMODB is projected…” should be “…if the samples of database EMODB are projected…”

Author response: Thank you very much for your kind comment.

Author action: The authors have revised the grammatical errors in the paper according to the reviewer’s comments, and the modifications are marked with *.

Author Response File: Author Response.pdf

Round 2

Reviewer 1 Report

The paper can be accepted in current form.

Reviewer 2 Report

I would like to thank the authors for having provided answers to the proposed comments.
