Article
Peer-Review Record

Facial and Speech-Based Emotion Recognition Using Sequential Pattern Mining

Electronics 2025, 14(20), 4015; https://doi.org/10.3390/electronics14204015
by Younghun Song 1 and Kyungyong Chung 2,*
Reviewer 1: Anonymous
Reviewer 2: Anonymous
Reviewer 3: Anonymous
Submission received: 28 August 2025 / Revised: 2 October 2025 / Accepted: 6 October 2025 / Published: 13 October 2025
(This article belongs to the Special Issue Application of Data Mining in Social Media)

Round 1

Reviewer 1 Report

Comments and Suggestions for Authors

The paper proposes a novel multimodal emotion recognition framework that effectively captures the dynamic flow and transitions of emotions in conversations by applying Sequential Pattern Mining (SPM) to fused emotion sequences derived from facial expressions and speech transcriptions. This approach goes beyond traditional static, point-wise emotion classification. The experimental design is sound, and the results clearly demonstrate the effectiveness of the proposed method. The work is well-structured, and the contributions are clearly articulated.

 

To further enhance the academic rigor and readability of the manuscript prior to acceptance, several revisions are recommended:

 

1. In the abstract and introduction, when "speech transcription" and "text emotion classification" are first mentioned, the authors should explicitly clarify that the "text" refers to transcribed speech output, rather than an independent text modality, to avoid potential confusion.

 

2. The term "Sequential Pattern Mining (SPM)" should be used consistently throughout the paper. The variant "Sequence Pattern Mining" appearing in the text should be corrected to the standard terminology "Sequential Pattern Mining," and the abbreviation "SPM" should be defined upon first use and used consistently thereafter.

 

3. The figures in the manuscript are currently of low resolution and appear blurry. All images should be replaced with high-resolution versions to ensure clarity and professional presentation.

 

4. The authors should carefully review all mathematical formulations (e.g., Equations 2, 4, etc.), as they appear distorted or improperly rendered in the PDF version. Proper formatting and correct notation must be ensured for all equations.

 

5. The logical flow and technical details in the methodology section require strengthening. Specifically, the phrases “we first constructed sequence data based on the Emotion Transition Matrix” and “the resulting transition probability sequence” are ambiguous. The authors should provide a clearer explanation in Section 3.3.2 on how the transition matrix is utilized to generate the sequence of transition probabilities, including the exact computational steps involved.

 

6. The paper states that the LSTM input incorporates both the “simple emotion sequence” and “CUSUM-based change-point information,” but this integration is not adequately explained. The authors should precisely describe how these two types of information are combined and formatted for input into the LSTM. Additional clarification is needed on: (i) how the minimum support threshold (min_support) in the PrefixSpan algorithm is determined; (ii) how the mined frequent patterns are transformed into features suitable for LSTM processing; and (iii) how the change-point information is aligned with and fused into the emotion sequence.

 

7. The current “Multimodal (Late Fusion)” baseline is static, while the proposed model leverages temporal dynamics, making the comparison potentially unfair. It is recommended to introduce an additional baseline: “Unimodal + Transition Matrix” (e.g., constructing a transition matrix solely from the textual modality and feeding it into the LSTM), to better isolate and evaluate the individual contributions of multimodal fusion and temporal modeling.

 

8. The ablation study should be expanded to include a condition labeled “+ Sequence Pattern Mining Features,” allowing for a direct quantification of the performance gain attributable specifically to the SPM component.

 

9. While the paper cites a large number of references, several key works directly related to sequential pattern mining—both classical and recent—are missing. The authors are encouraged to include recent advances in emotion recognition and pattern mining in the introduction and related works, such as "EMRNet: enhanced micro-expression recognition network with attention and distance correlation" (Artificial Intelligence Review) and "Anomaly Detection and Localization via Reverse Distillation with Latent Anomaly Suppression" (TCSVT).

Furthermore, references [27], [28], and [29] appear to be tangential to the core topic: [27] concerns fashion recommendation systems, while [28] and [29] focus on AI applications in computer graphics and recommendation systems. The authors should re-evaluate the relevance of these citations and consider replacing them with more pertinent literature.

 

10. The English language and grammar require careful proofreading and refinement. In particular, unnecessary capitalization within sentences should be corrected, and overall academic writing style should be improved to enhance clarity, coherence, and professionalism.

Author Response

Reviewer Comment 1:

In the abstract and introduction, when "speech transcription" and "text emotion classification" are first mentioned, the authors should explicitly clarify that the "text" refers to transcribed speech output, rather than an independent text modality, to avoid potential confusion.

Author response and action: We thank the reviewer for this important suggestion. To enhance clarity, we have revised the Abstract and Introduction to explicitly state that the text used for emotion analysis is derived directly from the transcribed speech.

Revision in Abstract:

Before

We propose a multimodal emotion recognition framework that integrates facial expressions and speech transcription, with a particular focus on effectively modeling the continuous changes and transitions of emotional states during conversation.

After

We propose a multimodal emotion recognition framework that integrates facial expressions and speech transcription (where text is derived from the transcribed speech), with a particular focus on effectively modeling the continuous changes and transitions of emotional states during conversation.

 

Revision in Section 1 (Introduction):

Before

For example, one approach uses Whisper to transcribe speech signals and then applies BERT-based text emotion classification, while another infers emotions from facial images using models such as DeepFace [4].

After

For example, one approach uses Whisper to transcribe speech signals into text and then applies a BERT-based classifier for emotion analysis on the transcribed text, while another infers emotions from facial images using models such as DeepFace [4].

 

Reviewer Comment 2:

The term "Sequential Pattern Mining (SPM)" should be used consistently throughout the paper. The variant "Sequence Pattern Mining" appearing in the text should be corrected to the standard terminology "Sequential Pattern Mining," and the abbreviation "SPM" should be defined upon first use and used consistently thereafter.

Author response and action: We agree with the reviewer's comment regarding terminological consistency. We have corrected all instances of "Sequence Pattern Mining" to "Sequential Pattern Mining" throughout the manuscript, including the title. We have also defined the abbreviation (SPM) upon its first use in the Abstract and have used it consistently thereafter.

 

Reviewer Comment 3:

The figures in the manuscript are currently of low resolution and appear blurry. All images should be replaced with high-resolution versions to ensure clarity and professional presentation.

Author response and action: We appreciate the reviewer pointing this out. All figures in the manuscript, including the updated Figure 1, have been replaced with high-resolution versions (300 DPI) to ensure clarity and professional quality.

 

Reviewer Comment 4:

The authors should carefully review all mathematical formulations (e.g., Equations 2, 4, etc.), as they appear distorted or improperly rendered in the PDF version. Proper formatting and correct notation must be ensured for all equations.

Author response and action: We apologize for the rendering errors, which may have been an issue during the PDF conversion process. We have carefully reviewed and re-formatted all mathematical equations (Equations 1-11) to ensure their structural integrity. To demonstrate that the issue has been resolved, and for your convenience, we have included a sample of the corrected equations in the table below. We have subsequently verified that all equations are now displayed clearly and without distortion in the newly generated PDF of the revised manuscript.

 

Reviewer Comment 5:

The logical flow and technical details in the methodology section require strengthening. Specifically, the phrases “we first constructed sequence data based on the Emotion Transition Matrix” and “the resulting transition probability sequence” are ambiguous. The authors should provide a clearer explanation in Section 3.3.2 on how the transition matrix is utilized to generate the sequence of transition probabilities, including the exact computational steps involved.

Author response and action: Thank you for this crucial feedback. We have replaced the ambiguous sentences in Section 3.3.2 with a more detailed explanation of the process. The revised section now clearly describes the computational steps for generating the transition probability sequence from the emotion label sequence and the master Emotion Transition Matrix, including a specific example for better understanding.

Revision in Section 3.3.2:

Before

In this study, we first constructed sequence data based on the Emotion Transition Matrix. Specifically, we calculated the transition probability matrix from utterance-level emotion labels and then arranged it along the time axis to quantify the continuous changes in the emotional flow. The resulting transition probability sequence reflects the emotional tendencies at each time point, making it suitable for use as input to the change-point detection algorithm.

After

In this study, to prepare the input for the CUSUM algorithm, we first constructed a sequence of utterance-level emotion labels. From this sequence, we calculated the master Emotion Transition Matrix as described in Algorithm 1. To analyze the temporal dynamics of these emotional shifts, we then generated a transition probability sequence. This was achieved by iterating through the emotion label sequence: for each transition from the emotion at utterance t (s_t) to the emotion at utterance t+1 (s_t+1), we looked up the corresponding probability T_ij in the master Emotion Transition Matrix, where i is the index for emotion s_t and j is the index for s_t+1. For example, if a dialogue transitioned from 'joy' (i=0) to 'neutral' (j=6), the value recorded in the transition probability sequence for that time step would be the pre-calculated probability T_0,6. The resulting numeric sequence, where each element represents the likelihood of an observed transition, served as the direct input X for the CUSUM change-point detection algorithm outlined in Algorithm 2.
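For illustration, the following Python sketch shows the lookup procedure described above for turning an utterance-level label sequence into the transition probability sequence fed to CUSUM. The function and variable names, the 7-emotion indexing, and the placeholder matrix T are assumptions for this sketch, not the authors' code.

```python
import numpy as np

# Hypothetical 7-emotion index, matching the joy = 0 ... neutral = 6 example above.
EMOTIONS = ["joy", "surprise", "sadness", "anger", "fear", "disgust", "neutral"]
IDX = {e: i for i, e in enumerate(EMOTIONS)}

def transition_probability_sequence(labels, T):
    """Map each observed transition s_t -> s_t+1 to its probability T[i, j].

    labels : utterance-level emotion labels for one dialogue
    T      : 7x7 master Emotion Transition Matrix (rows sum to 1)
    """
    seq = []
    for s_t, s_next in zip(labels, labels[1:]):
        i, j = IDX[s_t], IDX[s_next]
        seq.append(T[i, j])            # e.g. 'joy' -> 'neutral' records T[0, 6]
    return np.array(seq)               # input X for the CUSUM change-point step

# Toy example with a uniform placeholder matrix.
T = np.full((7, 7), 1 / 7)
X = transition_probability_sequence(["joy", "neutral", "neutral", "sadness"], T)
```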

 

Reviewer Comment 6:

The paper states that the LSTM input incorporates both the “simple emotion sequence” and “CUSUM-based change-point information,” but this integration is not adequately explained. The authors should precisely describe how these two types of information are combined and formatted for input into the LSTM. Additional clarification is needed on: (i) how the minimum support threshold (min_support) in the PrefixSpan algorithm is determined; (ii) how the mined frequent patterns are transformed into features suitable for LSTM processing; and (iii) how the change-point information is aligned with and fused into the emotion sequence.

Author response and action: We thank the reviewer for the detailed feedback and for the opportunity to clarify our methodology. We have revised the description in Section 3.3.4 to better highlight how our existing explanation addresses each of the three specific points raised.

Specifically, the description in the manuscript clarifies the following:

  • For point (i), the determination of the min_support threshold is explained within the "SPM Pattern Features" description, where we state in full: "To generate these, we first determined the minimum support threshold (min_support) for the PrefixSpan algorithm empirically by evaluating a range of values on the validation set, selecting the one that yielded the most meaningful patterns."
  • For point (ii), the transformation of mined patterns into features is detailed in the same section. We explain that: "Then, for each dialogue, we created a multi-hot binary vector where each dimension corresponded to one of the top-k frequent patterns. A '1' was marked in the vector if the dialogue's emotion sequence contained that specific pattern."
  • For point (iii), the alignment and fusion of the features is described in two parts. The "Change-Point Feature" is explicitly aligned at the utterance level ("at utterance t"). The final fusion step is clarified at the end of the section, where we state: "This dialogue-level SPM feature vector was then concatenated with the utterance-level features for every time step in that dialogue."

We believe this structured clarification now provides the precise details requested by the reviewer.

 

 

Revision in Section 3.3.4:

Before

In this study, the LSTM model was trained using the MELD (Multimodal EmotionLines Dataset). Training was conducted for a total of 300 epochs on the MELD training set, using the Adam optimizer. After training, the model's performance was validated on the dev and test sets. A noteworthy aspect of our approach is that the model's input included not only the simple emotion sequence but also the CUSUM-based change-point information derived earlier as an auxiliary feature. This allowed the model to consider not only the continuous flow of emotions but also the abrupt emotional transition points during the conversation, which helped to improve classification performance. For example, the model incorporating change-point information could distinguish between cases where emotions shifted gradually versus those that reversed suddenly, enabling a more precise understanding of the context.

Furthermore, this study integrated the results of the Sequential Pattern Mining (SPM)-based emotion transition analysis as an auxiliary metric in the LSTM training process. By pre-extracting frequently recurring emotional transition patterns within dialogues (e.g., neutral → sadness → anger or joy → surprise → neutral) and applying weights to sequences containing these patterns, the model was able to learn the general tendencies of the conversational flow. This integrated approach contributed to the LSTM not only tracking emotional changes between time points but also internalizing meaningful emotional transition patterns, thereby achieving a higher level of emotion prediction capability.

After

In this study, the LSTM model was trained for 300 epochs on the MELD training set using the Adam optimizer, with performance validated on the dev and test sets. To enable the model to learn the complex temporal dynamics of conversations, we designed a comprehensive input feature vector for each utterance (time step t). The input to the LSTM model at each time step t was a feature vector constructed by concatenating three types of information:

•   Emotion Label: The primary emotion label for the utterance, represented as a one-hot encoded vector (e.g., 'joy' as [1, 0, 0, 0, 0, 0, 0]).

•   Change-Point Feature: A binary feature indicating an abrupt emotional shift. This value was set to '1' if the CUSUM algorithm detected a change-point at utterance t, and '0' otherwise. This allows the model to explicitly recognize moments of sharp transition.

•   SPM Pattern Features: Features derived from the mined frequent sequential patterns. To generate these, we first determined the minimum support threshold (min_support) for the PrefixSpan algorithm empirically by evaluating a range of values on the validation set, selecting the one that yielded the most meaningful patterns. Then, for each dialogue, we created a multi-hot binary vector where each dimension corresponded to one of the top-k frequent patterns. A '1' was marked in the vector if the dialogue's emotion sequence contained that specific pattern. This dialogue-level SPM feature vector was then concatenated with the utterance-level features for every time step in that dialogue.

By aligning and fusing these distinct sources of information, the LSTM model could learn not only from the immediate emotion but also from the broader temporal context, including gradual flows, abrupt shifts, and recurring structural patterns. This integrated approach allows the model to move beyond simple single-utterance classification and learn dialogue-level emotional patterns, thereby achieving a higher level of emotion prediction capability.
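To make the feature layout concrete, here is a minimal sketch of the per-utterance input vector described above. All names, the emotion ordering, the simple one-sided CUSUM thresholding, and the example patterns are illustrative assumptions; the actual frequent patterns would come from the PrefixSpan step described in the text.

```python
import numpy as np

EMOTIONS = ["joy", "surprise", "sadness", "anger", "fear", "disgust", "neutral"]
IDX = {e: i for i, e in enumerate(EMOTIONS)}

def one_hot(label):
    v = np.zeros(len(EMOTIONS))
    v[IDX[label]] = 1.0
    return v

def cusum_flags(x, threshold=0.5):
    """Illustrative one-sided CUSUM: flag utterance t+1 when the cumulative
    deviation from the mean exceeds a threshold (stand-in for Algorithm 2)."""
    flags, s = np.zeros(len(x) + 1), 0.0
    mean = np.mean(x) if len(x) else 0.0
    for t, v in enumerate(x):
        s = max(0.0, s + (v - mean))
        if s > threshold:
            flags[t + 1], s = 1.0, 0.0
    return flags

def contains_pattern(seq, pattern):
    """True if `pattern` occurs in `seq` as a (not necessarily contiguous) subsequence."""
    it = iter(seq)
    return all(any(p == s for s in it) for p in pattern)

def build_lstm_inputs(labels, transition_probs, top_k_patterns):
    flags = cusum_flags(transition_probs)                    # utterance-level change-point flags
    spm_vec = np.array([float(contains_pattern(labels, p))   # dialogue-level multi-hot vector
                        for p in top_k_patterns])
    # One row per utterance: one-hot label + change-point flag + repeated SPM vector.
    return np.stack([np.concatenate([one_hot(lab), [flags[t]], spm_vec])
                     for t, lab in enumerate(labels)])

# Hypothetical top-k frequent patterns (in practice mined with PrefixSpan).
patterns = [("neutral", "sadness", "anger"), ("joy", "surprise", "neutral")]
features = build_lstm_inputs(["joy", "neutral", "neutral", "sadness"],
                             np.array([0.30, 0.05, 0.20]), patterns)  # shape: (4, 10)
```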

 

 

 

Reviewer Comment 7:

The current “Multimodal (Late Fusion)” baseline is static, while the proposed model leverages temporal dynamics, making the comparison potentially unfair. It is recommended to introduce an additional baseline: “Unimodal + Transition Matrix”. to better isolate and evaluate the individual contributions of multimodal fusion and temporal modeling.

Author response and action: This is an excellent point. We thank the reviewer for this suggestion, which helps to create a more robust and fair comparison.

To better isolate the individual contributions of multimodal fusion and temporal modeling, we have conducted a new experiment with the suggested baseline, "Unimodal (Text) + Transition Matrix." The results of this new baseline have been added to Table 6 in the revised manuscript. Furthermore, we have updated the accompanying text in Section 4.2 to discuss the implications of these new results.

To provide a clear visual comparison of all models, Figure 7 has also been updated to include the performance of this new baseline, as shown below.

The results demonstrate that adding temporal information significantly improves the unimodal model, providing a stronger baseline and highlighting the distinct contributions of both multimodal fusion and temporal analysis in our final proposed model.

Revision in Section 4.2:

Before

•   Unimodal (Text): A unimodal model that recognizes emotions using only the BERT model based on text data transcribed by Whisper. As text data has been a primary modality in previous multimodal research, it serves as a useful performance baseline.

•   Unimodal (Face): A unimodal model that performs emotion recognition using only the DeepFace-based facial expression recognition model, recognizing emotions from video frame data. This serves as a baseline for evaluating the standalone performance of the visual modality.

•   Multimodal (Late Fusion): A multimodal fusion model that performs final emotion prediction by combining the emotion results extracted from the text and facial modalities using a simple voting method. This is a basic multimodal approach without any additional time-series analysis techniques applied.

•   Multimodal + Transition Matrix: A model that considers temporal features by applying an Emotion Transition Matrix to the results of the Late Fusion-based multimodal approach. This setup is for evaluating whether utilizing emotion transition characteristics improves model performance.

•   Multimodal + Change Detection: An approach that aims to improve emotion prediction accuracy by adding CUSUM change-point detection to the multimodal emotion sequence to detect and reflect abrupt emotional changes.

•   Multimodal + SPM (Proposed): The final model proposed in this study, which includes all the methods (multimodal fusion, emotion transition matrix, change-point detection, Emotion SPM, and LSTM-based time-series classification).

After

•   Unimodal (Text): A unimodal model that recognizes emotions using only the BERT model based on text data transcribed by Whisper. As text data has been a primary modality in previous multimodal research, it serves as a useful performance baseline.

•   Unimodal (Face): A unimodal model that performs emotion recognition using only the DeepFace-based facial expression recognition model, recognizing emotions from video frame data. This serves as a baseline for evaluating the standalone performance of the visual modality.

•   Unimodal (Text) + Transition Matrix: To better isolate the individual contributions of multimodal fusion and temporal modeling, we introduced this baseline. It constructs an emotion transition matrix solely from the text modality and uses this temporal information as a feature for classification, allowing for a more direct comparison against our proposed multimodal temporal models.

•   Multimodal (Late Fusion): A multimodal fusion model that performs final emotion prediction by combining the emotion results extracted from the text and facial modalities using a simple voting method. This is a basic multimodal approach without any additional time-series analysis techniques applied.

•   Multimodal + Transition Matrix: A model that considers temporal features by applying an Emotion Transition Matrix to the results of the Late Fusion-based multimodal approach. This setup is for evaluating whether utilizing emotion transition characteristics improves model performance.

•   Multimodal + Change Detection: An approach that aims to improve emotion prediction accuracy by adding CUSUM change-point detection to the multimodal emotion sequence to detect and reflect abrupt emotional changes.

•   Multimodal + SPM (Proposed): The final model proposed in this study, which includes all the methods (multimodal fusion, emotion transition matrix, change-point detection, Emotion SPM, and LSTM-based time-series classification).

 

 

 Revision in Section 4.2:

Before

In the case of the Multimodal (Late Fusion) model, a performance improvement to 67.7% was observed simply by combining the two modalities of text and facial expression. This demonstrates that a multimodal approach provides richer information than the limited information from a single modality, contributing to performance enhancement. For the Multimodal + Transition Matrix model, which additionally applied an Emotion Transition Matrix, the performance increased to 70.2% on the Macro F1-score, a rise of approximately 2.5 percentage points compared to the Late Fusion model. This proves that emotional transition patterns and temporal dependencies play a significant role in emotion prediction. Furthermore, the Multimodal + Change Detection model, which incorporated the CUSUM change-point detection technique, saw a further increase in its F1-score to 71.3% by detecting and reflecting abrupt emotional changes. This confirmed that effectively modeling the points of sharp transition within an emotion sequence can contribute to performance improvement.

Finally, the Multimodal + Sequence Pattern Mining model proposed in this paper, which integrates all the aforementioned approaches, recorded a Macro F1-score of 72.9%. This is an improvement of approximately 10.8 percentage points over the unimodal baseline and about 5.2 percentage points over the basic multimodal fusion model. In terms of Accuracy, Precision, and Recall, it also showed clearly superior results compared to all baseline models. This validates that explicitly analyzing the temporal patterns of multimodal data makes multimodal emotion recognition models more robust.

After

To better isolate the contributions of multimodal fusion and temporal modeling, we introduced the "Unimodal (Text) + Transition Matrix" baseline. This model, which incorporates temporal information into the strongest unimodal baseline, achieved a Macro F1-score of 67.7%. This result demonstrates a significant performance gain over the static text-only model, confirming that modeling temporal dynamics is beneficial even within a single modality.

Indeed, when we applied the Emotion Transition Matrix to the multimodal results ("Multimodal + Transition Matrix"), the performance increased further to 70.2% on the Macro F1-score. This proves that emotional transition patterns and temporal dependencies play a significant role in emotion prediction, and their impact is amplified when applied to richer, multimodal data. Furthermore, the "Multimodal + Change Detection" model saw a further increase in its F1-score to 71.3% by effectively modeling abrupt emotional changes. Finally, the "Multimodal + Sequence Pattern Mining (Ours)" model, which integrates all the aforementioned approaches, recorded the highest Macro F1-score of 72.9%. This is an improvement of approximately 10.8 percentage points over the initial unimodal baseline and about 5.2 percentage points over the basic multimodal fusion model. This validates that explicitly analyzing the temporal patterns of multimodal data makes multimodal emotion recognition models more robust.

 

Reviewer Comment 8:

The ablation study should be expanded to include a condition labeled “+ Sequence Pattern Mining Features,” allowing for a direct quantification of the performance gain attributable specifically to the SPM component.

Author response and action: We thank the reviewer for this excellent suggestion to better isolate the contribution of the SPM component. Following your recommendation, we have expanded our ablation study by updating Table 7. The table now includes a distinct step labeled "+ Sequence Pattern Mining Features" to directly quantify the performance gain from this specific component. The updated results in the table clearly demonstrate that adding SPM features increased the Macro F1-score from 71.3% to 72.3%, validating its importance within our proposed framework.

Reviewer Comment 9:

Several key works directly related to sequential pattern mining—both classical and recent—are missing... Furthermore, references [27], [28], and [29] appear to be tangential to the core topic... The authors should re-evaluate the relevance of these citations and consider replacing them with more pertinent literature.

Author response and action: We thank the reviewer for the insightful comments on our literature review.

Following your valuable suggestion, we have strengthened our 'Related Work' section by incorporating more recent and key studies in both sequential pattern mining and multimodal emotion recognition to provide a more comprehensive background.

We have also re-evaluated the specific citations you mentioned. We agree that references [28] and [29], which focus on general AI applications, are tangential to our core topic. Accordingly, we have removed them in the revised manuscript and replaced them with more relevant literature.

Regarding reference [27], we appreciate the opportunity to clarify its relevance. While its application domain is different (fashion textiles), this paper is a foundational study by one of the authors that pioneered the use of a sequential model (CRF) to analyze the flow of affective adjectives. The core idea of tracking how affective states transition over time, rather than performing static classification, was a direct conceptual precursor to the emotion flow analysis in our current work. We believe it is valuable to retain this citation to acknowledge this methodological lineage, and we have clarified this connection in Section 2.2 to make the link more explicit for the reader.

Reviewer Comment 10:

The English language and grammar require careful proofreading and refinement. In particular, unnecessary capitalization within sentences should be corrected, and overall academic writing style should be improved.

Author response and action: We appreciate this advice. The entire manuscript has undergone a thorough professional proofreading to correct all grammatical errors, refine phrasing, and ensure a consistent academic tone. We have paid special attention to correcting unnecessary capitalization and improving overall clarity and coherence.

 

Author Response File: Author Response.pdf

Reviewer 2 Report

Comments and Suggestions for Authors

Suggestion for improvement:

  1. Discuss computational cost and scalability from the perspective of real-time interactive systems.
  2. Cross-dataset validation should be considered to test robustness beyond MELD.  
  3. Provide ablation studies to measure the contribution of each methodological component (fusion strategies, change-point detection, pattern mining). 
Comments on the Quality of English Language
  1. Some sentences are long (3+ clauses). Breaking them down would improve readability without losing sophistication.

Example: “The methodological complexity, while technically impressive, raises concerns about interpretability and practical deployment. Without careful analysis of the contributions and computational implications of each component, the model risks being more of an academic exercise than a tool ready for broader application.”

  2. A few phrases could be simplified, and varying the phrasing slightly could make the text more engaging.

  • “idiosyncratic” → “dataset-specific”

  • “genuinely ambitious effort” → “a notably ambitious effort”

Author Response

Responses to Reviewer #2

We sincerely thank Reviewer #2 for the insightful feedback focused on improving the practical relevance, robustness, and readability of our work. We have addressed each of your suggestions as follows.

Reviewer Comment 1:

Discuss computational cost and scalability from the perspective of real-time interactive systems.

Author response and action: Thank you for this crucial suggestion. To address the practical applicability of our model, we have added a new subsection 4.3. Computational Cost and Scalability for Real-Time Systems in the revised manuscript. This section discusses the inference times and resource requirements of each component of our framework. We analyze the main computational bottlenecks and propose potential optimization strategies for real-time deployment, such as using distilled model variants or applying model quantization, to provide a clear path for future work in this area.

Revision in 4.3:

After

4.3. Computational Cost and Scalability for Real-Time Systems.

While our proposed framework demonstrates strong performance, its practical application in real-time interactive systems requires consideration of its computational cost. The framework integrates several deep learning models, including Whisper for speech transcription, BERT for text emotion analysis, and DeepFace for facial analysis, each with considerable computational demands. In its current implementation using base-sized models, the framework is better suited for the offline analysis of conversational data rather than live, low-latency interactions.

For real-time deployment, several optimization strategies could be pursued. Key bottlenecks include the inference times of the BERT and Whisper models. Employing smaller, distilled versions of these models (e.g., TinyBERT, DistilWhisper) or applying model quantization techniques could significantly reduce latency with a manageable trade-off in accuracy. Similarly, a more efficient frame-sampling strategy for facial analysis, rather than processing every frame, could lessen the computational load. Future work will focus on developing a lightweight version of this framework to balance high performance with the stringent requirements of real-time scalability.
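As one concrete illustration of the quantization strategy mentioned above, dynamic int8 quantization can be applied to the linear layers of a fine-tuned BERT classifier to reduce CPU inference latency. This is a generic PyTorch sketch under assumed model names, not the authors' deployment code.

```python
import torch
from transformers import AutoModelForSequenceClassification

# Placeholder checkpoint; in practice this would be the fine-tuned emotion classifier.
model = AutoModelForSequenceClassification.from_pretrained(
    "bert-base-uncased", num_labels=7
)
model.eval()

# Post-training dynamic quantization: weights of Linear layers are stored as int8
# and dequantized on the fly, trading a small accuracy drop for lower latency/memory.
quantized_model = torch.quantization.quantize_dynamic(
    model, {torch.nn.Linear}, dtype=torch.qint8
)
```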

 

Reviewer Comment 2:

Provide ablation studies to measure the contribution of each methodological component (fusion strategies, change-point detection, pattern mining).

Author response and action: We appreciate you highlighting the need for a clear breakdown of each component's contribution. As suggested, a detailed ablation study is provided in our manuscript. We have updated Table 7 and its accompanying description in Section 4.2 to better isolate the impact of each stage. The updated study now clearly quantifies the performance gain from each component, including fusion, the emotion transition matrix, change-point detection, and sequential pattern mining, demonstrating the effectiveness of each part of our framework.

 

 

Updated Table 7:

Before

Table 7. Ablation Study Results.

Model                         Macro F1 (%)
Multimodal (Fusion only)      67.7
Emotion Transition Matrix     70.2
Change-Point Detection        71.3
LSTM Classification (Full)    72.9

After

Table 7. Ablation Study Results.

Model                         Macro F1 (%)
Multimodal (Fusion only)      67.7
Emotion Transition Matrix     70.2
Change-Point Detection        71.3
SPM Features                  72.3
LSTM Classification (Full)    72.9

 

Reviewer Comment 3:

Some sentences are long (3+ clauses). Breaking them down would improve readability without losing sophistication. A few phrases could be simplified and varying phrasing slightly could make the text more engaging.

Author response and action: Thank you for your valuable comments on the language quality. We have thoroughly revised the entire manuscript based on your suggestions. We have broken down long, complex sentences into shorter, clearer ones to improve readability, as exemplified by the changes made to the Abstract. We have also replaced overly idiomatic or jargon-heavy phrases with more direct and accessible terminology as recommended.

 

Author Response File: Author Response.pdf

Reviewer 3 Report

Comments and Suggestions for Authors

In this paper, sequential emotion analysis is performed using Whisper-based speech transcription, BERT-based text emotion analysis, and DeepFace-based facial emotion analysis. Based on the resulting emotion sequence, the type of emotional flow is classified using a Long Short-Term Memory model. The Multimodal EmotionLines Dataset was used to evaluate the proposed approach.

My comments:
  • Not all variables appearing in the formulas are described.
  • The presented results confirm that the proposed solution achieved higher classification accuracy. However, the model is unable to correctly classify some emotions. It would be worthwhile to add a confusion matrix for the selected model (e.g., Multimodal + Change Detection) to compare and demonstrate the advantages of the proposed approach.
  • The study was conducted on only one dataset. I also suggest extending the analysis to other datasets.

Author Response

Responses to Reviewer #3

We sincerely thank Reviewer #3 for the positive assessment and constructive comments on our work. We have addressed each point as detailed below.

Reviewer Comment 1:

Not all variables appearing in the formulas are described.

Author response and action: Thank you for your comment. We have thoroughly re-examined all equations in the manuscript. While we believe that each variable was defined upon its first appearance within the text accompanying the equations, we have double-checked every formula and its description to ensure that all variables are explicitly explained for the reader to guarantee maximum clarity.

Reviewer Comment 2:

The presented results confirm that the proposed solution achieved higher classification accuracy. However, the model is unable to correctly classify some emotions. It would be worthwhile to add a confusion matrix for the selected model (e.g., Multimodal + Change Detection) to compare and demonstrate the advantages of the proposed approach.

Author response and action: Thank you for this insightful suggestion. We completely agree that a confusion matrix is essential for a detailed analysis of the model's performance on a per-class basis and for understanding misclassification patterns. In line with your recommendation, we would like to respectfully draw your attention to Figure 8 in Section 4.2 of our manuscript. This figure presents the confusion matrix for our final proposed model ("Multimodal + Sequence Pattern Mining") on the MELD test set. The accompanying text analyzes these results, noting that while the model performs well on majority classes like 'joy' and 'neutral', it exhibits some confusion on minority classes such as 'fear' and 'disgust'. We believe this analysis provides the deeper insight into the model's performance that you suggested was needed.

Figure 8. Confusion Matrix of the proposed model on MELD Test Set.

Reviewer Comment 3:

The study was conducted on only one dataset. I also suggest extending the analysis to other datasets.

Author response and action: We agree with the reviewer that testing our model on additional datasets is an important step to verify its generalization capabilities. While the current study provides an in-depth analysis of the challenging MELD dataset, we acknowledge the value of cross-dataset validation.

Reflecting this, we have revised the Conclusion (Section 5) to explicitly state this as a key direction for future work. The revised manuscript now includes the following statement:

"Future research will focus on verifying the robustness of our framework through extensive cross-dataset validation on other conversational emotion datasets (e.g., IEMOCAP). Additionally, securing multilingual datasets will be an important task to confirm its broader generalization performance."

Author Response File: Author Response.pdf

Round 2

Reviewer 1 Report

Comments and Suggestions for Authors

The English should be improved to more clearly express the research.
