Article
Peer-Review Record

Combining Facial Expressions and Electroencephalography to Enhance Emotion Recognition

Future Internet 2019, 11(5), 105; https://doi.org/10.3390/fi11050105
by Yongrui Huang, Jianhao Yang, Siyu Liu and Jiahui Pan *
Reviewer 1: Anonymous
Reviewer 2: Anonymous
Reviewer 4: Anonymous
Submission received: 28 February 2019 / Revised: 10 April 2019 / Accepted: 16 April 2019 / Published: 2 May 2019
(This article belongs to the Special Issue on the Future of Intelligent Human-Computer Interface)

Round 1

Reviewer 1 Report

The structure of the paper is difficult to comprehend. The Introduction section is expected to be followed by the Materials and Methods section. However, the Materials and Methods section is presented after the Discussion section. Please revise the manuscript structure to improve readability for the audience.

The order of citations is not in ascending numerical order. This is probably because the Materials and Methods section is placed after the Discussion? Unless this is the format of the journal, it is not clear why the authors adopted such a format.

Certain paragraphs in the Results section would be more appropriate in the Materials and Methods section (such as the description of the datasets).

What is the novelty of this study? Previous studies in the literature have used the same database (such as MAHNOB-HCI) and similar data/sensor fusion methods:

 Fusion of facial expressions and EEG for implicit affective tagging 

Continuous emotion detection using EEG signals and facial expressions.

There is no consistency in the title names. Titles should always start with capital letters.

The state of the art is described very briefly. Many studies have performed EEG-based, EMG-based, video-based, or other modality-based emotion recognition with performance better than or equal to that reported in the current study. The literature review needs to be more elaborate.

In the introduction section, you argue that non-physiological signals (videos of faces) can be easily faked. On the other hand, you use pictures of faces for the CNN training, so the same argument can be made about the proposed approach. It would be good to argue why the multi-modal approach can circumvent such issues.

Line 53-54: Facial expressions (EMG, bio-electrical), speech (bio-acoustic) and movement are still considered physiological signals. Maybe they should be classified as voluntary and involuntary bio-signals.

Line 60: Again, it should be involuntary signals instead of physiological. Most body signals can be classified under physiological.

Can you add a figure or table describing the results of the statistical tests (paired t-tests) to understand better the differences between performance?

Author Response

Responses to the First Reviewer

We would like to thank you for your constructive comments and suggestions on our manuscript “Combining Facial Expressions and EEGs to Enhance Emotion Recognition”. In light of your comments and suggestions, the paper has been revised. In our responses, each paragraph extracted from the present manuscript is enclosed in double quotation marks, with the new revisions marked in boldface; these changes are also marked in the revised manuscript. Please see our point-by-point responses below.

 

1.       The structure of the paper is difficult to comprehend. The Introduction section is expected to be followed by the Materials and Methods section. However, the Materials and Methods section is presented after the Discussion section. Please revise the manuscript structure to improve readability for the audience.

Response: In our previous version, we placed the Materials and Methods section after the Discussion according to the recommended template given on the journal website. According to your suggestions, we have moved the Materials and Methods section to directly after the Introduction section to improve readability.

 

2.       The order of citations is not in ascending numerical order. This is probably because the Materials and Methods section is placed after the Discussion? Unless this is the format of the journal, it is not clear why the authors adopted such a format.

Response: According to your suggestions, we have reordered the citations in ascending numerical order.

 

3.       Certain paragraphs in the Results section would be more appropriate in the Materials and Methods section (such as the description of the datasets).

Response: According to your suggestions, we have moved the description of the datasets to the Materials and Methods section (Section 2.1, Dataset Description and Data Acquisition).

 

4.       What is the novelty of this study? Previous studies in the literature have used the same database (such as MAHNOB-HCI) and similar data/sensor fusion methods:

Fusion of facial expressions and EEG for implicit affective tagging

Continuous emotion detection using EEG signals and facial expressions.

Response: Compared with the two studies mentioned above, our novelty is twofold. First, we used transfer learning to learn more reliable facial expression features and improve model performance. Our results also suggest that our performance surpasses that of the first study, “Fusion of facial expressions and EEG for implicit affective tagging”. Specifically, for the face-based method, the first study achieved an average accuracy of 64.5% in the valence space, while our study achieved an average accuracy of 73.33%. Note that we cannot directly compare performance with the second study because it used a regression model, although its facial expression methods are quite similar to ours. Second, the fusion method used in both studies corresponds to our first fusion method; we additionally developed a second fusion method. As mentioned in the Discussion section, the second fusion method can reduce the computational cost as additional modalities (such as EMG) are added.
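For illustration, the Python/Keras sketch below shows the pretrain-then-fine-tune idea described above. The architecture, layer sizes, optimizer settings and placeholder arrays are assumptions chosen to make the example runnable; they are not the authors' exact model or data pipeline.

```python
# Sketch of the pretrain-then-fine-tune scheme for the facial-expression branch:
# pretrain a small CNN on a large expression dataset (e.g., FER2013), then
# fine-tune on a small subject-specific set. All shapes, layer sizes and the
# placeholder arrays are illustrative assumptions, not the authors' exact model.
import numpy as np
from tensorflow import keras
from tensorflow.keras import layers

def build_cnn(input_shape=(48, 48, 1), n_classes=2):
    return keras.Sequential([
        layers.Conv2D(32, 3, activation="relu", input_shape=input_shape),
        layers.MaxPooling2D(),
        layers.Conv2D(64, 3, activation="relu"),
        layers.MaxPooling2D(),
        layers.Flatten(),
        layers.Dense(128, activation="relu"),
        layers.Dense(n_classes, activation="softmax"),
    ])

# 1) Pretraining on a large dataset (random arrays stand in for FER2013 images).
pretrain_x = np.random.rand(256, 48, 48, 1).astype("float32")
pretrain_y = np.random.randint(0, 2, 256)
model = build_cnn()
model.compile(optimizer="adam", loss="sparse_categorical_crossentropy",
              metrics=["accuracy"])
model.fit(pretrain_x, pretrain_y, epochs=1, batch_size=32, verbose=0)

# 2) Fine-tuning on the much smaller subject-specific set: freeze the
#    convolutional base and retrain only the dense head at a lower learning rate.
for layer in model.layers[:-2]:
    layer.trainable = False
model.compile(optimizer=keras.optimizers.Adam(1e-4),
              loss="sparse_categorical_crossentropy", metrics=["accuracy"])
subject_x = np.random.rand(40, 48, 48, 1).astype("float32")
subject_y = np.random.randint(0, 2, 40)
model.fit(subject_x, subject_y, epochs=1, batch_size=8, verbose=0)
```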

According to your suggestions, we have compared our study with the two studies mentioned above and stated the highlights in the Discussion section. Please take a look at the following paragraph extracted from the present version.

 

“In this study, we explored two methods for the fusion of facial expressions and EEGs for emotion recognition. Two data sets containing facial videos and EEG data were used to evaluate these methods. Additionally, an online experiment was conducted to validate the robustness of the model. In a binary classification of valence and arousal, significant results were obtained for both single modalities. Moreover, the two fusion methods outperformed the single modalities. Compared with studies in the recent literature [5-7], which used single modalities to detect emotion, our major novelty is combining EEG with facial expression. In addition, in emotion experiments, deep learning methods are often difficult to apply due to the limited number of samples. Our other highlight was solving this problem by pretraining our deep learning model on a large data set before fine-tuning it with the target data set. Thus, we achieved higher performance in detecting emotion compared to [9]; in particular, for the face-based method, [9] achieved 64.5% accuracy whereas ours achieved 73.33% accuracy in the valence space. In addition to implementing the widely used fusion method based on enumerating different weights between two models [9-10], we also explored a novel fusion method applying a boosting technique, which can reduce the computational cost as additional modalities (such as EMG) are added.”

(p. 15, first paragraph)

 

5.       There is no consistency in the title names. Titles should always start with capital letters.

Response: According to your suggestions, we have made all titles start with capital letters.

 

6.       The state of the art is described very briefly. Many studies have performed EEG-based, EMG-based, video-based, or other modality-based emotion recognition with performance better than or equal to that reported in the current study. The literature review needs to be more elaborate.

Response: According to your suggestion, we have added content to the Introduction and Discussion sections to describe the state of the art. Specifically, we added some EMG-based, video-based, and other modality-based emotion recognition studies to the Introduction. Some studies may achieve better or equal performance on those datasets by using more modalities. However, it should be stressed that it is difficult for studies that use only facial expression and EEG for emotion recognition, such as “Fusion of facial expressions and EEG for implicit affective tagging”, to achieve such high performance. In the current version of the Introduction section, we mainly focus on studies using EEG-based emotion recognition and analyze their corresponding methods. In the Discussion section, we have provided more context for our results. Please take a look at the following paragraphs extracted from the present version.

 

“In recent years, with the development of multisource heterogeneous information fusion processing, it has become possible to fuse features from multicategory reference emotion states. Jinyan Xie et al. proposed a new emotion recognition framework based on multi-channel physiological signals, including ECG, EMG and SCL, using the BioVid Emo DB dataset, evaluated a series of feature selection and fusion methods, and achieved 94.81% accuracy on their dataset [37]. In a study using the MAHNOB-HCI dataset, Sander Koelstra et al. performed binary classification based on the valence-arousal-dominance emotion model using a fusion of EEG and facial expression and found that the accuracies of valence, arousal, and control were 68.5%, 73%, and 68.5%, respectively [9]. Mohammad et al. proposed a method for continuously detecting valence from EEG signals and facial expressions in response to videos and studied the correlations of EEG and facial expression features with continuous valence in the MAHNOB-HCI dataset [10]. In addition, our previous study proposed two multimodal fusion methods between brain and peripheral signals for emotion recognition and reached 81.25% and 82.75% accuracy for four categories of emotional states (happiness, neutral, sadness, and fear), both higher than the accuracies of facial expression detection (74.38%) and EEG detection (66.88%) [8]. Specifically, in our previous study, we applied principal component analysis to analyze facial expression data and extract high-level features, and we used the fast Fourier transform to extract various power spectral density (PSD) features from raw EEG signals. Due to the limited amount of data, we used a simple model, namely, a two-layer neural network for facial expression and a support vector machine (SVM) for EEG, rather than a deep learning model, and two simple fusion rules were applied to combine EEG and facial expressions. This simple approach helps prevent overfitting when data availability is limited.”

(p. 2, third paragraph)
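For illustration, a minimal Python sketch of FFT-based PSD feature extraction from a raw EEG trial follows. The sampling rate, band boundaries and the use of Welch's estimator are assumptions chosen for a runnable example, not the authors' exact feature pipeline.

```python
# Sketch of FFT-based power spectral density (PSD) feature extraction from one
# raw EEG trial. The sampling rate, band definitions and Welch's estimator are
# illustrative assumptions, not the authors' exact feature pipeline.
import numpy as np
from scipy.signal import welch

FS = 128  # assumed sampling rate in Hz
BANDS = {"theta": (4, 8), "alpha": (8, 13), "beta": (13, 30), "gamma": (30, 45)}

def psd_features(eeg, fs=FS):
    """eeg: array of shape (n_channels, n_samples) -> band-power feature vector."""
    freqs, psd = welch(eeg, fs=fs, nperseg=2 * fs, axis=-1)
    feats = []
    for low, high in BANDS.values():
        band = (freqs >= low) & (freqs < high)
        feats.append(psd[:, band].mean(axis=-1))  # mean band power per channel
    return np.concatenate(feats)

# Example: a synthetic 32-channel, 60-second trial -> one feature vector.
trial = np.random.randn(32, FS * 60)
print(psd_features(trial).shape)  # (n_channels * n_bands,) -> (128,)
```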

“Early work on emotion recognition was mostly based on voluntary signals, such as facial expressions, speech and gestures. In one study, Abhinav Dhall et al. used videos of faces to classify facial expressions into seven categories (anger, disgust, fear, happiness, neutral, sadness, and surprise) and achieved 53.62% accuracy on a test set [5]. Pavitra Patel et al. applied the Boosted-GMM (Gaussian mixture model) algorithm to speech-based emotion recognition and classified emotion into five categories (angry, happy, sad, normal, and surprise) [6]. Although voluntary signals are easier to obtain, they are less reliable; for example, we can fake our facial expression to cheat the machine. Recently, more research has been done based on involuntary signals. Wei-Long Zheng et al. investigated stable patterns of EEG over time for emotion recognition. They systematically evaluated the performance of various popular feature extraction, feature selection, feature smoothing and pattern classification methods on the DEAP dataset and a newly developed dataset called SEED [7]. They found that a discriminative graph-regularized extreme learning machine with differential entropy features achieves the best average accuracies of 69.67% and 91.07% on the DEAP and SEED datasets, respectively. Yong Zhang et al. investigated feature extraction from EEG-based emotional data focusing on empirical mode decomposition (EMD) and autoregressive (AR) models, constructed an EEG-based emotion recognition method to classify these emotional states, and reported average accuracies between 75.8% and 86.28% on the DEAP dataset [33]. The methods based on involuntary signals seem to be more effective and reliable, but these signals are often mixed with noise. A common instance is that facial muscle movement often causes fluctuations in the EEG.”

(p. 2, first paragraph)

“Our performance for both valence and arousal surpassed that of previous studies that used EEG-based and face-based emotion recognition in terms of binary valence/arousal classification on two public datasets, partly because we performed end-to-end deep learning that maps the face from image pixels directly to emotion states instead of trying to manually extract features from the images. CNNs are widely used in image-based recognition tasks and often achieve better performance than traditional methods [24]. On the other hand, as we argued in the Introduction section, voluntary signals (videos of faces) can be easily faked. In this respect, the drawbacks of facial expression detection can be compensated for by EEG detection to a very large extent. Thus, facial expression detection and EEG detection are irreplaceable and complementary to each other, and the multimodal fusion should achieve higher accuracies using both detections than using either one alone.”

(p. 15, second paragraph)

 

7.       In the introduction section, you argue that non-physiological signals (videos of faces) can be easily faked. On the other hand, you use pictures of faces for the CNN training, so the same argument can be made about the proposed approach. It would be good to argue why the multi-modal approach can circumvent such issues.

Response: According to your suggestion, we have modified the Discussion section to argue why the multimodal approach can circumvent the issue that voluntary signals (videos of faces) can be easily faked. The drawbacks of facial expression detection can be compensated for by EEG detection to a very large extent. In this respect, facial expression detection and EEG detection are irreplaceable and complementary to each other, and the multimodal fusion should achieve higher accuracies using both detections than using either one alone. Please see the following paragraph extracted from the present version.

“Our performance for both valence and arousal surpassed that of previous studies that used EEG-based and face-based emotion recognition in terms of binary valence/arousal classification on two public datasets, partly because we performed end-to-end deep learning that maps the face from image pixels directly to emotion states instead of trying to manually extract features from the images. CNNs are widely used in image-based recognition tasks and often achieve better performance than traditional methods [24]. On the other hand, as we argued in the Introduction section, voluntary signals (videos of faces) can be easily faked. In this respect, the drawbacks of facial expression detection can be compensated for by EEG detection to a very large extent. Thus, facial expression detection and EEG detection are irreplaceable and complementary to each other, and the multimodal fusion should achieve higher accuracies using both detections than using either one alone.”

(p. 15, second paragraph)

 

8.       Line 53-54: Facial expressions (EMG, bio-electrical), speech (bio-acoustic) and movement are still considered physiological signals. Maybe they should be classified as voluntary and involuntary bio-signals.

Line 60: Again, it should be involuntary signals instead of physiological. Most body signals can be classified under physiological.

Response: Yes, we have replaced the terms ‘physiological/non-physiological’ with ‘voluntary/involuntary’. Please see the following paragraph extracted from the present version.

“Early work on emotion recognition was mostly based on voluntary signals, such as facial expressions, speech and gestures. In one study, Abhinav Dhall et al. used videos of faces to classify facial expressions into seven categories (anger, disgust, fear, happiness, neutral, sadness, and surprise) and achieved 53.62% accuracy on a test set [5]. Pavitra Patel et al. applied the Boosted-GMM (Gaussian mixture model) algorithm to speech-based emotion recognition and classified emotion into five categories (angry, happy, sad, normal, and surprise) [6]. Although voluntary signals are easier to obtain, they are less reliable; for example, we can fake our facial expression to cheat the machine. Recently, more research has been done based on involuntary signals. Wei-Long Zheng et al. investigated stable patterns of EEG over time for emotion recognition [7]. Yong Zhang et al. proposed to take advantage of a feature extraction technique based on EMD and autoregressive (AR) models to improve emotion recognition performance and reported average accuracies between 75.8% and 86.28% on the DEAP dataset [33]. The methods based on involuntary signals seem to be more effective and reliable, but these signals are often mixed with noise. A common instance is that facial muscle movement often causes fluctuations in the EEG.”

(p. 2, first paragraph)

 

9.       Can you add a figure or table describing the results of the statistical tests (paired t-tests) to understand better the differences between performance?

Response: According to your suggestion, we have added tables describing the results of the statistical tests. Specifically, we first performed a normality test on the accuracy of each of the four detectors, and the data were considered normal when the p-value of the normality test was above 0.05. This test was followed by a paired t-test (normal data) or the Nemenyi procedure (non-normal data). We provide Tables 4-9 to show the statistical test matrices between the single-modality and multimodality detectors. Please see the tables in the present version of the manuscript.
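For illustration, the Python sketch below mirrors the reported procedure of a normality check followed by a paired comparison of two detectors' per-subject accuracies. The accuracy arrays are placeholders, and the Wilcoxon test is shown only as a generic non-parametric fallback; the manuscript's Nemenyi procedure is not reproduced here.

```python
# Sketch: check normality of per-subject accuracies for two detectors, then run
# a paired t-test when both look normal, or a non-parametric test otherwise.
# The accuracy values are placeholders; the Nemenyi procedure is not shown.
import numpy as np
from scipy.stats import shapiro, ttest_rel, wilcoxon

rng = np.random.default_rng(0)
acc_single = rng.uniform(0.60, 0.80, size=20)  # per-subject accuracy, single modality
acc_fusion = rng.uniform(0.70, 0.90, size=20)  # per-subject accuracy, fusion detector

_, p_a = shapiro(acc_single)
_, p_b = shapiro(acc_fusion)

if p_a > 0.05 and p_b > 0.05:            # both samples consistent with normality
    stat, p = ttest_rel(acc_single, acc_fusion)
    test = "paired t-test"
else:                                     # non-normal data: non-parametric fallback
    stat, p = wilcoxon(acc_single, acc_fusion)
    test = "Wilcoxon signed-rank test"
print(f"{test}: statistic = {stat:.3f}, p = {p:.4f}")
```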


Author Response File: Author Response.pdf

Reviewer 2 Report

This paper considers a multimodal emotion recognition framework that combines facial expressions and EEG based on the valence-arousal emotional model. Multitask convolutional neural network (CNN) architectures are employed. The paper contains sufficient material, including significant experimental results based on public datasets.

So I would like to recommend the paper for acceptance.

Author Response

Thanks a lot! 

Reviewer 3 Report

Some issues throughout the manuscript must be corrected. Some examples are:

1) Equations require editing throughout the document.

2) The position of Figure 4 affects the explanation in Section 4.


The results section requires a comparison with other emotion recognition techniques reported in the field to provide readers with context for the improvements that the authors report from their experiments.

The conclusions section could be improved by providing more directions that invite further research or experiments on the case study.

Author Response

Responses to the Third Reviewer

We would like to thank you for your constructive comments and suggestions on our manuscript “Combining Facial Expressions and EEGs to Enhance Emotion Recognition”. In light of your comments and suggestions, the paper has been revised. In our responses, each paragraph extracted from the present manuscript is enclosed in double quotation marks, with the new revisions marked in boldface; these changes are also marked in the revised manuscript. Please see our point-by-point responses below.

 

1. Some issues throughout the manuscript must be corrected. Some examples are:

1) Equations require editing throughout the document.

2) The position of Figure 4 affects the explanation in Section 4.

Response: According to your suggestions, we have corrected the equations and adjusted the position of the figure.

 

2. The results section requires a comparison with other emotion recognition techniques reported in the field to provide readers with context for the improvements that the authors report from their experiments.

Response: According to your suggestions, we have moved the comparison with other emotion recognition techniques reported in the field from the Discussion section to the Results section. Specifically, we compared our approach with studies using the same datasets and the same modalities. Our performance for both the valence and arousal spaces surpassed previous results in terms of binary valence/arousal classification on two public datasets. Please see the following paragraphs extracted from the present version.

3.3. Comparison with Other Literature

In the literature, based on the MAHNOB-HCI data set, the authors of [9] developed a method of mapping face action units to high-level emotional states for facial feature extraction and used SVM-RFE to select EEG features before classifying them using a Gaussian naive Bayes classifier. Following fusion, they achieved 73.0% valence detection and 68.5% arousal detection rates. The authors of [22] proposed a novel emotion recognition method using a hierarchical Bayesian network to accommodate the generality and specificity of emotions simultaneously and achieved 56.9% and 58.2% accuracy for valence and arousal, respectively. In the literature, based on the DEAP data set, the authors of [23] addressed the single-trial binary classification of emotion dimensions (arousal, valence, dominance, and liking) using EEG signals and achieved 76.9% and 68.4% accuracy for valence and arousal, respectively. The authors of [22] also tested their model on the DEAP data set and achieved 58.0% and 63.0% accuracy for valence and arousal, respectively. Our performance for both the valence and arousal spaces surpassed these previous results.

(p. 11, second paragraph)

 

3. The conclusions section could be improved by providing more directions that invite further research or experiments on the case study.

Response: According to your suggestions, we have modified the Conclusions section. We provided more directions that invite further research or experiments on this case study, for example, applying emotion recognition technology to special populations such as infants. Moreover, people who suffer from disorders of consciousness are also an interesting population to work with. Please see the following paragraph extracted from the present version.

 

Certain limitations of this study should be considered in the future. Currently, due to the limited time participants can use the equipment before fatigue sets in and before the effectiveness of the electrode gel degrades, it is often challenging to obtain a large data set for EEG. Thus, the number of samples was insufficient to provide a definitive answer regarding the benefits of fusion in this study. In the future, we will attempt to either collect more EEG data or generate EEG data using a generative model (e.g., a generative adversarial network [32]) for semi-supervised learning. Furthermore, identifying the additional benefits of information fusion for multimodal emotion recognition is an interesting topic that we intend to investigate. Finally, applying emotion recognition technology to special populations, such as infants and people who suffer from disorders of consciousness, is also an interesting area for future work.

(p. 17, first paragraph)


Author Response File: Author Response.pdf

Reviewer 4 Report

This paper extends previous published results from the same authors using state-of-the-art CNN methods and multimodal signal analyses. The topic is interesting and challenging, but there is a number of unclear points in the current version of the manuscript that would require revision, as follows:

- Minor: The current version contains some weird text, such as the first paragraph of Section 2 (Results);

- Minor: Although the authors state that the time cost of the CNN is acceptable, it is not clear how long such training has taken in practice;

- Major: How have the face images and EEG signals been co-registered? This can be a major problem in multimodal signal analysis, and such correspondence must be established to properly allow signal comparison;

- Major: Why are the electrodes analysed online and offline different? It is quite difficult to understand how the PSD feature extraction has worked and how it can be compared in such different contexts of training and testing;

- Major: The convex parameter k seems to me one of the most important pieces of information in the proposed multimodal method, since it regulates the multimodality fusion and might highlight the importance and benefits of combining face and EEG data. What was the value of k obtained experimentally? Is there an optimal value for this parameter?

Author Response

Responses to the Fourth Reviewer

We would like to thank you for your constructive comments and suggestions on our manuscript “Combining Facial Expressions and EEGs to Enhance Emotion Recognition”. In light of your comments and suggestions, the paper has been revised. In our responses, each paragraph extracted from the present manuscript is enclosed in double quotation marks, with the new revisions marked in boldface; these changes are also marked in the revised manuscript. Please see our point-by-point responses below.

1.       This paper extends previous published results from the same authors using state-of-the-art CNN methods and multimodal signal analyses. The topic is interesting and challenging, but there is a number of unclear points in the current version of the manuscript that would require revision, as follows:

Response: Thanks a lot!

 

2.       Minor: The current version contains some weird text, such as the first paragraph of Section 2 (Results);

Response: According to your suggestions, we have removed that text.

 

3.       Minor: Although the authors state that the time cost of the CNN is acceptable, it is not clear how long such training has taken in practice;

Response: According to your suggestions, we have added the time required for CNN pre-training and fine-tuning. Training took about 6 hours and 30 minutes in the pre-training step, and the fine-tuning step took 4~6 minutes for each subject. Please see the following paragraph extracted from the present version.

In terms of the computational cost, the proposed CNN model contains 1,019,554 parameters, and it took 0.0647 seconds to conduct one forward pass on a single sample. Training took about 6 hours and 30 minutes in the pre-training step, and the fine-tuning step took 4~6 minutes for each subject. All the experiments were conducted with a GeForce GTX 950.

(p. 15, third paragraph)
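For illustration, the sketch below shows one way to report these quantities for a Keras model: counting parameters and timing a single-sample forward pass. The network definition and input shape are placeholders, not the paper's exact model.

```python
# Sketch of reporting model size and single-sample forward latency for a Keras
# CNN. The network here is a placeholder, not the paper's 1,019,554-parameter model.
import time
import numpy as np
from tensorflow import keras
from tensorflow.keras import layers

model = keras.Sequential([
    layers.Conv2D(32, 3, activation="relu", input_shape=(48, 48, 1)),
    layers.MaxPooling2D(),
    layers.Flatten(),
    layers.Dense(128, activation="relu"),
    layers.Dense(2, activation="softmax"),
])

print("parameters:", model.count_params())

sample = np.random.rand(1, 48, 48, 1).astype("float32")
model.predict(sample, verbose=0)               # warm-up call (graph build)
start = time.perf_counter()
model.predict(sample, verbose=0)               # timed single-sample forward pass
print(f"forward pass: {time.perf_counter() - start:.4f} s")
```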

 

4.       Major: How have the face images and EEG signals been co-registered? This can be a major problem in multimodal signal analysis, and such correspondence must be established to properly allow signal comparison;

Response: We conducted decision-level fusion instead of feature-level fusion, which means we did not have to guarantee that the face images and EEG signals were co-registered at every moment. Each trial took 2~3 minutes, and a trial is treated as a single sample during fusion. Between two trials there was a break during which both the camera and the EEG sensor stopped recording, and the data between two breaks were considered one sample for fusion. We believe this correspondence allows signal comparison. Moreover, because the features do not need to be merged at the feature level and the EEG and face data are collected over a long period, this is not a major problem. According to your suggestions, we have added more context to the Results section.

During the 10-second countdown, the sensors stop recording signals. The data between two 10-second countdowns were used as one sample at the fusion level. For facial expression, the face data were used to fine-tune the CNN model; for EEG, the EEG data were used to train the SVM model. The two sub-models were trained independently and fused at the decision level. This guarantees that the face and EEG data are co-registered for each trial.

(p. 10, second paragraph)

 

5.       Major: Why are the electrodes analysed online and offline different? It is quite difficult to understand how the PSD feature extraction has worked and how it can be compared in such different contexts of training and testing;

Response: There was some misunderstanding because of our unclear description. In fact, the offline experiments refer to the two public datasets, MAHNOB-HCI and DEAP, and the online experiment refers to first conducting 20 trials to collect training data (using 5 electrodes), followed by another 20 trials for testing (using the same 5 electrodes). We did not use 14 electrodes for training and then 5 electrodes for testing. The reason we used 5 electrodes in the online experiment instead of 14 is that we believed we could collect more trials with high-quality data: using more electrodes makes subjects more uncomfortable during the experiment, and they tire more easily when the experiment is long. According to your suggestions, we have modified some content in the Methods and Results sections. Please see the following paragraphs extracted from the present version.

3.1.1. Experiments Based on Public Datasets

In the MAHNOB-HCI dataset, for each subject, leave-one-trial-out cross-validation was performed for binary classification. In this process, tests were performed on one trial, and the other 19 trials were used for training. The video data were used to fine-tune the facial expression classifier (CNN) pretrained on the fer2013 data set, and the EEG data were used to train the EEG classifier (SVM). For each subject, we used the number of trials that were predicted correctly relative to the total number of trials as a metric to measure model performance.

In the DEAP dataset, for each subject, we randomly selected 20 trials as a training set, and the remaining 20 trials were used as a test set. Leave-one-trial-out cross-validation was performed on the training set to select the best hyperparameters for the model, and the trained models were then evaluated on the test set. For each subject, we used the accuracy on the test set as a metric to assess model performance.

3.1.2. Online Experiment

Twenty subjects (50% female), whose ages ranged from 17 to 22 (mean = 20.15, std = 1.19), volunteered to participate in the experiment. We first introduced the meanings of “valence” and “arousal” to the subjects. The subjects were instructed to watch video clips and report their emotional status in terms of valence and arousal at the end of each video. During the experiments, the subjects were seated in a comfortable chair and instructed to avoid blinking or moving their bodies. We also conducted device testing and corrected the camera position to ensure that the subjects' faces appeared in the center of the screen.

In a preliminary study, 40 video clips were manually selected from commercially produced movies as stimuli. They were separated into two parts, for the calibration run and the evaluation run, respectively; each part contained 20 videos. The movie clips ranged in duration from 70.52 to 195.12 seconds (mean = 143.04, std = 33.50).

In order to conduct the evaluation run, we first needed data to train the model. Therefore, we conducted a calibration run to collect data before performing the evaluation run. In the calibration run, the collected data consisted of 20 trials for each subject. At the beginning of each trial, there was a 10-second countdown in the center of the screen to capture each subject's attention and to serve as an indicator of the next clip. After the countdown was complete, movie clips that included different emotional states were presented on the full screen. During this time, we collected 4 face images per second using a camera and 10 groups of EEG signals per second using an Emotiv mobile device. Each movie clip was presented for 2-3 minutes. At the end of each trial, a SAM appeared in the center of the screen to collect the subject's label of valence or arousal [21]. Subjects were instructed to fill in the entire table and to click the “submit” button to proceed to the next trial. There was also a 10-second countdown in the center of the screen between two consecutive trials for emotional recovery. The collected data (EEG signals, face images, and the corresponding valence and arousal labels) were used to train the model described above.

The evaluation run was composed of 20 trials for each subject. The procedure of each trial was similar to that of the calibration run. Note that different movie clips were used as stimuli because reusing the earlier stimuli would have reduced their impact by increasing the subjects' familiarity with them. At the end of each trial, four different detectors (facial expression detector, EEG detector, first fusion detector, and second fusion detector) were used to predict the valence and arousal based on the face images and EEG signals. By comparing the predicted results with the ground-truth labels, the accuracy was subsequently calculated.

During the 10-second countdown, the sensors stop recording signals. The data between two 10-second countdowns were used as one sample at the fusion level. For facial expression, the face data were used to fine-tune the CNN model; for EEG, the EEG data were used to train the SVM model. The two sub-models were trained independently and fused at the decision level. This guarantees that the face and EEG data are co-registered for each trial.

(p. 9, sixth to ninth paragraphs)
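For illustration, a minimal Python sketch of the leave-one-trial-out evaluation described in Section 3.1.1 above follows, using the EEG branch (an SVM over per-trial feature vectors) as an example. The feature dimensionality, labels, and SVM settings are placeholder assumptions, not the paper's exact configuration.

```python
# Sketch of the leave-one-trial-out evaluation from Section 3.1.1, shown for the
# EEG branch (an SVM over per-trial feature vectors). Feature dimensionality,
# labels and SVM settings are placeholder assumptions.
import numpy as np
from sklearn.model_selection import LeaveOneOut
from sklearn.svm import SVC

rng = np.random.default_rng(0)
X = rng.normal(size=(20, 128))        # one feature vector per trial (20 trials)
y = np.array([0, 1] * 10)             # placeholder binary valence (or arousal) labels

correct = 0
for train_idx, test_idx in LeaveOneOut().split(X):
    clf = SVC(kernel="rbf", C=1.0).fit(X[train_idx], y[train_idx])
    correct += int(clf.predict(X[test_idx])[0] == y[test_idx][0])

print(f"leave-one-trial-out accuracy: {correct / len(X):.2f}")
```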

 

6.       Major: The convex parameter k seems to me one of the most important pieces of information in the proposed multimodal method, since it regulates the multimodality fusion and might highlight the importance and benefits of combining face and EEG data. What was the value of k obtained experimentally? Is there an optimal value for this parameter?

Response: We trained models with a different parameter for each subject because we treated this as a subject-dependent problem; that is, the value of k differs for each subject. There is an optimal value for this parameter, but it differs from subject to subject. According to your suggestions, to make this clearer, we have added some context to the Materials and Methods section. Please see the following paragraphs extracted from the present version.

 

2.3. Classification Fusion

After the two classifiers for facial expressions and EEG data were obtained, various modality fusion strategies were used to combine the outputs of these classifiers at the decision level. We employed two fusion methods for both EEG and facial expression detection as follows.

For the first fusion method, we applied the enumerated-weight fusion approach for decision-level fusion, which has been widely used in multimodal fusion research [9,10]. Specifically, the output of this method is given in Equation (8).

R = high if k · P_face + (1 − k) · P_EEG ≥ 0.5, and low otherwise.                                                 (8)

 

where R represents the predicted result (high or low), P_face and P_EEG represent the predicted output scores for the facial expression and the EEG, respectively, and k (ranging from 0 to 1) represents the importance degree of the facial expression. The key objective of this method is to find a proper k that leads to satisfactory performance. To achieve this, k is varied between 0 and 1 in steps of 0.01, and the value that provides the highest accuracy on the training samples is selected. We applied this method separately to the two learning tasks (valence and arousal) and obtained two different values of k, one for the valence space and the other for the arousal space. Note that we trained models with a different parameter for each subject because we treated this as a subject-dependent problem; the value of k therefore differs for each subject. There is an optimal value for this parameter, but it differs from subject to subject.

(p. 7, fourth and fifth paragraphs)
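For illustration, the Python sketch below implements the weight search described above: scan k from 0 to 1 in steps of 0.01 and keep the value with the highest training accuracy. The score arrays, labels, and the 0.5 decision threshold are placeholder assumptions rather than the authors' actual model outputs; the symbols P_face and P_EEG follow the reconstruction of Equation (8) above.

```python
# Sketch of the enumerated-weight fusion of Equation (8): scan k from 0 to 1 in
# steps of 0.01 and keep the value that maximizes accuracy on the training
# trials. Scores, labels and the 0.5 threshold are placeholder assumptions.
import numpy as np

rng = np.random.default_rng(0)
p_face = rng.uniform(0, 1, size=20)   # facial-expression scores per training trial
p_eeg = rng.uniform(0, 1, size=20)    # EEG scores per training trial
labels = np.array([0, 1] * 10)        # ground-truth high/low labels

def fuse(k, p_face, p_eeg, threshold=0.5):
    """Weighted decision-level fusion: predict 'high' (1) when the combined score crosses the threshold."""
    return (k * p_face + (1 - k) * p_eeg >= threshold).astype(int)

best_k, best_acc = 0.0, -1.0
for k in np.arange(0.0, 1.01, 0.01):
    acc = np.mean(fuse(k, p_face, p_eeg) == labels)
    if acc > best_acc:
        best_k, best_acc = k, acc

print(f"selected k = {best_k:.2f}, training accuracy = {best_acc:.2f}")
```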


Author Response File: Author Response.pdf

Round 2

Reviewer 1 Report

Minor typos were found. Please proof-read the manuscript.

The order of citations (numbers) is not ascending. For example, Reference 33 appears after Reference 7 in the Introduction.

Abbreviations should be defined only at the first instance; from then on, only the abbreviation should be used.

Lines 303-352 should be under Materials and Methods

Figure references are not correct. Example: Line 457.


Author Response

 

Responses to the First Reviewer

The authors are grateful to the first reviewer for the insightful comments and constructive suggestions. In light of your comments and suggestions, the paper has been revised. Please see our point-by-point responses below.

 

1. The order of citations (numbers) is not ascending. For example, Reference 33 appears after Reference 7 in the Introduction.

Response: According to your suggestions, we have reordered all citations in the paper.

 

2. Abbreviations should be defined only at the first instance; from then on, only the abbreviation should be used.

Response: According to your suggestions, we have corrected all improperly used abbreviations in the paper. Specifically, we made the following modifications.

1. We added some descriptions of the instance between line 66 and line 67.

2. We removed the description of EEG in line 61.

3. We removed the description of PSD in line 133.

 

3. Lines 303-352 should be under Materials and Methods.

Response: According to your suggestions, we have moved this content to the Materials and Methods section.

 

4. Figure references are not correct. Example: Line 457.

Response: According to your suggestions, we have corrected the figure references.


Reviewer 4 Report

The authors have properly answered my main points previously raised and I have no further comments.

Author Response

Thank you.
