Article

CDEA: Causality-Driven Dialogue Emotion Analysis via LLM

1 Key Laboratory for Key Technologies of IoT Terminals, Harbin Institute of Technology, Shenzhen 518055, China
2 School of Electronics and Information Engineering, Harbin Institute of Technology, Shenzhen 518055, China
3 Shenzhen Zhili Middle School, Shenzhen 518055, China
* Author to whom correspondence should be addressed.
Symmetry 2025, 17(4), 489; https://doi.org/10.3390/sym17040489
Submission received: 14 February 2025 / Revised: 13 March 2025 / Accepted: 20 March 2025 / Published: 25 March 2025
(This article belongs to the Section Computer)

Abstract

With the rapid advancement of human–machine dialogue technology, sentiment analysis has become increasingly crucial. However, deep learning-based methods struggle with interpretability and reliability due to the subjectivity of emotions and the challenge of capturing emotion–cause relationships. To address these issues, we propose a novel sentiment analysis framework that integrates structured commonsense knowledge to explicitly infer emotional causes, enabling causal reasoning between historical and target sentences. Additionally, we enhance sentiment classification by leveraging large language models (LLMs) with dynamic example retrieval, constructing an experience database to guide the model using contextually relevant instances. To further improve adaptability, we design a semantic interpretation task for refining emotion category representations and fine-tune the LLM accordingly. Experiments on three benchmark datasets show that our approach significantly improves accuracy and reliability, surpassing traditional deep-learning methods. These findings underscore the effectiveness of structured reasoning, knowledge retrieval, and LLM-driven sentiment adaptation in advancing emotion–cause-based sentiment analysis.

1. Introduction

Dialogue sentiment analysis is a crucial branch of natural language processing, focused on identifying and understanding emotional information expressed in dialogue content. Unlike general text-based sentiment analysis, dialogue sentiment analysis requires not only understanding the sentiment of individual sentences but also considering context, interactions between dialogue participants, and dynamic emotional shifts to accurately determine sentiment categories. Sentiment analysis has a wide range of applications, including emotional chatbots [1], social sentiment mining [2], healthcare [3], legal trials [4], and intelligent assistants [5].
Early sentiment analysis methods [6,7] relied on lexicon-based keyword matching, which struggled with sentences lacking explicit emotional cues. Later, feature engineering approaches improved performance but were complex and lacked generalization ability. Recently, deep learning models (e.g., CNNs, RNNs, Transformers) have enabled automatic feature extraction, but these approaches require accurately labeled datasets, and relying solely on context can lead to inconsistent annotations [8]. As shown in Figure 1, one of the most advanced strategies for enhancing the objectivity of dialogue sentiment analysis is incorporating the underlying causes of sentiment generation. If a model can identify the root causes of sentiment within a dialogue and, based on this information, accurately infer the sentiment category of a given statement, its reliability and interpretability would be significantly improved.
However, there has been limited work [9,10] explicitly considering emotional causes to identify sentiment categories, and this field still faces several challenges.
Lack of Explicit Reasoning Paths. Most sentiment analysis methods rely on semantic similarity or implicit feature extraction [8], capturing contextual associations without revealing the causal logic behind sentiment shifts. DialogueCRN [9] introduces Contextual Reasoning Networks (CRNs) to model contextual dependencies in conversations. While it effectively captures the sequential influence of emotions, its approach to emotional cause recognition remains implicit, as it does not explicitly construct reasoning paths. Consequently, its ability to establish causal relationships between dialogue utterances is limited, making it susceptible to errors in complex multi-turn interactions. Causal reasoning in sentiment analysis presents bi-directional symmetry, meaning that emotions arise not only from past contexts (forward causality) but are also influenced by the way speakers respond to them (backward causality) [11,12]. Current models struggle with this causal symmetry, leading to limited interpretability and reasoning accuracy. CauAIN [10] attempts to address this by introducing a causal-aware interaction network, which explicitly models inter-utterance causal dependencies. However, it primarily focuses on local causal inference, failing to capture global sentiment reasoning patterns across entire conversations.
Insufficient Commonsense Knowledge. While some studies incorporate sentiment causes, their reasoning remains constrained by incomplete commonsense knowledge bases and limited contextual understanding [13,14]. Human sentiment cognition relies not only on linguistic cues but also on background knowledge and cultural norms, such as typical emotional responses to specific events. DialogueCRN and CauAIN, despite their advancements in causal reasoning, rely solely on dialogue context, lacking external commonsense knowledge support. This weakens their ability to establish causal links between sentiment triggers and emotional expressions, ultimately reducing classification accuracy and robustness [15,16].
To address these limitations, we designed a dual-module framework. First, we extract causal relationship information from a structured machine commonsense knowledge graph to detect emotional triggers between historical and target utterances in a dialogue. Next, we employ an attention mechanism to transform the extracted sentiment causes into prompts, leveraging a pre-trained language model for integrated reasoning to accurately predict the sentiment category of the target utterance. This process mimics human intuition and deliberate reasoning, significantly enhancing the model’s ability to capture sentiment causality. To better reflect speaker interactions, our sentiment cause detection module differentiates between “other-induced reasoning paths” and “self-induced reasoning paths”, enabling a more precise analysis of the contributions of different roles to emotional expression. With the advancement of large language models (LLMs) such as ChatGPT (https://chat.openai.com/, accessed on 14 March 2023), GPT (version 4.0) [17], and Claude 2 (https://www.anthropic.com/product/, accessed on 11 July 2023), their powerful contextual understanding and commonsense reasoning capabilities offer new opportunities for complex sentiment analysis tasks. However, leveraging LLMs’ contextual learning for dialogue sentiment analysis remains challenging [18], as identifying sentiment causes requires both deep contextual understanding and commonsense reasoning to establish causal relationships [19].
To tackle this challenge, we propose a prompt engineering approach that guides LLMs with high-quality instructions, integrating dialogue context and sentiment knowledge to enhance sentiment cause reasoning and classification, while overcoming the commonsense limitations of traditional methods.
In summary, the main contributions of this paper are threefold:
  • We propose a dialogue emotion analysis method based on explicit reasoning for emotional causes. This method solves the problem of the lack of explicit reasoning path information in current methods by providing a clear reasoning path, allowing for accurate identification of emotional causes and establishing causal symmetry between emotional causes and emotion categories.
  • In addition, we leverage the rich knowledge embedded in GPT-4 and its powerful generalization ability to enhance the effectiveness of the emotion analysis method by constructing instructions that include historical content, emotional causes, and empirical examples. This approach not only effectively compensates for the lack of common knowledge support in current methods that consider emotional causes but also strengthens the accuracy and flexibility of emotional reasoning through a symmetry mechanism.
  • We conducted extensive experiments on three benchmark datasets to validate the model’s effectiveness and advantages in constructing explicit reasoning paths and LLM commonsense reasoning. The experimental results show that our model outperforms existing baseline methods in terms of accuracy and reliability, further highlighting the close connection between the emotion causal reasoning model and the concept of symmetry.

2. Related Work

In this section, we provide a comprehensive overview of existing techniques for sentiment analysis and conversational sentiment analysis, followed by a detailed description of current work related to explicitly considering emotional reasons in conversational sentiment analysis.

2.1. Sentiment Analysis Techniques

Early sentiment analysis relied on rule-based methods using sentiment lexicons [20]. While straightforward, these methods were highly dependent on lexicon quality and struggled with complex sentence structures. The introduction of machine learning improved sentiment classification by learning patterns from labeled data [21,22]. However, these approaches relied on manually extracted features and lacked strong contextual understanding, particularly in dialogues [23,24].
Language models, which predict word sequences probabilistically [25], improve natural language processing (NLP) tasks by capturing contextual semantics; pre-trained on large corpora, they can be fine-tuned for specific applications. Pre-trained models primarily fall into two categories: LSTM-based and Transformer-based. LSTM-based models, such as ELMo [26], use multi-layer bi-directional LSTMs to create contextual word embeddings, effectively addressing polysemy issues. In contrast, Transformer-based models, including GPT [27] and BERT [28], leverage self-attention mechanisms to capture long-range dependencies, significantly improving sentiment classification. Further extensions of BERT have been optimized specifically for sentiment analysis in social media contexts, such as tweets [29,30,31].
Despite these advancements, effectively designing prompts to guide pre-trained models in sentiment reasoning remains challenging. This paper proposes a structured prompt engineering approach that integrates command statements, dialogue history, affective reasoning, task descriptions, and empirical examples. By incorporating conversation context with commonsense knowledge, our method enhances the model’s ability to accurately infer sentiment categories.

2.2. Conversational Sentiment Analysis Techniques

Unlike general text sentiment analysis, dialogue sentiment analysis must account for diverse factors such as dialogue context, speaker interactions, and external information.
Existing approaches typically employ Recurrent Neural Networks (RNNs), Graph Neural Networks (GNNs), and attention mechanisms to model speaker interactions and extract context-relevant sentiment representations. For example, ref. [32] introduced DialogueGCN to capture intra- and inter-speaker dependencies, though it struggles with sentence-level relationships. Similarly, ref. [33] proposed R-GAT, a location-aware graph attention network, while [34] modeled dialogue frames as directed acyclic graphs (DAG-ERCs) to represent speaker relationships and sentence positions. While GNN-based methods effectively model conversational context, they often fall short in capturing speaker-specific nuances and precise sequence information. Furthermore, emotion subjectivity leads to inconsistent annotations, undermining the reliability of existing methods. A promising strategy to address this issue is to focus on the objective causes of subjective emotions. For instance, ref. [9] designed a multi-round inference module using LSTMs to retrieve and integrate emotional causes based on sentence similarity, while [10] distinguished self- and other-induced causes by concatenating their respective features before implicitly integrating them via a fully connected network. However, these methods face two key challenges: (1) a lack of explicit inference paths for capturing accurate causal relationships and (2) insufficient commonsense support. To mitigate these limitations, ref. [35] developed KET, a Transformer-based model that integrates commonsense knowledge through a context-based emotion graph attention mechanism, though its limited external relationship modeling may cause it to miss certain semantic details. In contrast, ref. [36] introduced COSMIC, a commonsense-based architecture that better captures the complex interplay between different commonsense knowledge categories and emotions.
Building on these insights, we propose a dialogue sentiment analysis method based on explicit reasoning of sentiment causes, which addresses the lack of clear inference paths and insufficient commonsense support. Our approach leverages LLMs to enhance the model’s capacity for commonsense reasoning and improve the accuracy of sentiment cause identification.

3. Method

3.1. Task Definition and Model Overview

Sentiment analysis in dialogue is a classification task: given a continuous dialogue and the corresponding speaker information for each sentence, the goal is to identify the sentiment category of the target speaker's utterance from a set of predefined sentiment categories. Specifically, assume that each dialogue consists of $N$ consecutive sentences $C = \{u_1, u_2, \dots, u_N\}$ with corresponding sentiment labels $Y_C = \{y_1, y_2, \dots, y_N\}$, where each $y_i \in E$ and $E$ denotes the set of sentiment categories. Each sentence consists of $M$ tokens, $u_i = \{w_{i,1}, w_{i,2}, \dots, w_{i,M}\}$. Every sentence in a conversation $C$ is uttered by a speaker, represented as $s(C) = [s(u_1), \dots, s(u_i), \dots, s(u_N)]$ with $s(u_i) \in S$, where the function $s$ maps the index of a sentence to its speaker and $S$ denotes the set of speakers. The whole problem can thus be formulated as obtaining the sentiment label of each sentence from the conversation context and the corresponding speaker information: $Y_C = f(C, s(C))$.
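As a minimal illustration of this formulation, the input and output structure can be sketched as follows (the class and function names are illustrative, not part of the paper):

from dataclasses import dataclass
from typing import List

@dataclass
class Utterance:
    tokens: List[str]    # u_i = {w_{i,1}, ..., w_{i,M}}
    speaker: str         # s(u_i), an element of S

@dataclass
class Dialogue:
    utterances: List[Utterance]   # C = {u_1, ..., u_N}

EMOTIONS = ["neutral", "joyful", "sad", "mad", "scared"]  # example label set E

def classify(dialogue: Dialogue) -> List[str]:
    """Y_C = f(C, s(C)): returns one label from EMOTIONS per utterance (stub)."""
    return ["neutral" for _ in dialogue.utterances]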
The LLM dialogue sentiment analysis model based on explicit reasoning for emotional reasons proposed in this study is shown in Figure 2.

3.2. Sentiment Cause Sentence Acquisition

Token representation. First, a context-independent feature representation is obtained for each sentence. As in prior work, the widely used pre-trained language model RoBERTa extracts context-independent sentence feature vectors. Specifically, for each sentence $u_i = \{w_{i,1}, w_{i,2}, \dots, w_{i,M}\}$, a special token $[CLS]$ is prepended, and the sequence $\{[CLS], w_{i,1}, w_{i,2}, \dots, w_{i,M}\}$ is fed into RoBERTa, which is fine-tuned on a sentence-level (context-independent) sentiment classification task: the last-layer feature vector of the $[CLS]$ token is passed to a pooling layer and classified into a sentiment category. After fine-tuning, each sentence is encoded in the same format to obtain its context-independent feature vector $c_i$ from the $[CLS]$ token:

$$c_i = \text{RoBERTa}([CLS], w_{i,1}, w_{i,2}, \dots, w_{i,M})$$

where $c_i \in \mathbb{R}^{d_m}$ and $d_m$ is the hidden-state dimension of RoBERTa. Following previous work [36], the $[CLS]$ states of the last four layers are averaged to obtain the final context-independent feature vector of each sentence.
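As a brief sketch of this step (assuming the Hugging Face transformers library, with roberta-base as an illustrative checkpoint; the paper's fine-tuned weights are not reproduced here), the last-four-layer $[CLS]$ averaging can be written as:

import torch
from transformers import RobertaTokenizer, RobertaModel

tokenizer = RobertaTokenizer.from_pretrained("roberta-base")
model = RobertaModel.from_pretrained("roberta-base", output_hidden_states=True)
model.eval()

def sentence_vector(text: str) -> torch.Tensor:
    inputs = tokenizer(text, return_tensors="pt", truncation=True)
    with torch.no_grad():
        out = model(**inputs)
    # hidden_states: tuple of (num_layers + 1) tensors of shape [1, seq_len, d_m]
    last_four = torch.stack(out.hidden_states[-4:])   # [4, 1, seq_len, d_m]
    cls_states = last_four[:, 0, 0, :]                # [CLS] state of each layer
    return cls_states.mean(dim=0)                     # c_i in R^{d_m}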
Based on these context-independent sentence representations, the next step acquires context-dependent representations. In a dialogue, the sentiment expressed by a sentence usually depends on the context of the whole conversation. Therefore, an LSTM models the sequential dependencies between sentences, taking the context-independent features $c_i$ as input. A bi-directional LSTM is not used because the task targeted here is real-time sentiment analysis in human-computer dialogue systems, where future context is not visible to the current sentence. The context-dependent feature representation $h_i$ of the sentence is computed as

$$h_i = \text{LSTM}(c_i, h_{i-1})$$

where $h_i \in \mathbb{R}^{d_h}$ denotes the hidden state at the $i$-th time step and $d_h$ is the output dimension of each LSTM cell.
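A minimal sketch of this unidirectional context encoder, with illustrative dimensions:

import torch
import torch.nn as nn

d_m, d_h = 768, 300                        # illustrative dimensions
context_lstm = nn.LSTM(input_size=d_m, hidden_size=d_h, batch_first=True)

c = torch.randn(1, 12, d_m)                # [batch, N utterances, d_m]
h, _ = context_lstm(c)                     # h[:, i, :] is h_i in R^{d_h}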
Inference path information acquisition. We use event-related reasoning from ATOMIC to address the lack of explicit reasoning paths in dialogues. Each sentence is treated as an event, and reasoning over historical sentences fills the gap in causal association with the target sentence. The emotional cause of a target sentence cannot stem from future context. As shown in Table 1, we explore six relationship types: xReact, xEffect, and xWant (own reasoning paths, reflecting the speaker's impact on themselves), and oReact, oEffect, and oWant (others' reasoning paths, showing the effect on others). By incorporating these, we can more accurately identify emotional causes in conversation.
In this section, we use the commonsense Transformer model COMET [37] to extract inferential commonsense information from ATOMIC [38], a structured machine commonsense graph. COMET is an encoder-decoder model built on the pre-trained autoregressive language model GPT and trained on multiple commonsense knowledge graphs to automatically construct commonsense knowledge. ATOMIC is a large-scale commonsense knowledge graph designed for if-then reasoning about everyday events; it focuses on social and inferential knowledge, allowing models to predict the likely causes, effects, and intents of human actions. It contains over 880 K tuples structured as $(s, r, o)$ triplets, where $s$ (subject) represents an event, $r$ (relation) denotes a commonsense relationship, and $o$ (object) is the inferred commonsense consequence. COMET takes triples $(s, r, o)$ from the graph and is trained to generate the object phrase $o$ from a subject phrase $s$ and a relation phrase $r$. The given event (i.e., sentence $u_i$ in the dialogue) and the selected relation type are concatenated with mask tokens $[mask]$ as input to COMET; following previous work [36], the hidden state of COMET's last encoder layer is used as the inferential commonsense representation for $u_i$:

$$h^{relation}_{COMET} = \text{COMET}(u_i, [mask], relation)$$

The three generated others'-relation representations are then concatenated and mapped to a feature vector of dimension $d_h$, which serves as the others' reasoning-path information $inf\_path_i^{inter}$ of sentence $u_i$, where $[;]$ denotes feature concatenation:

$$inf\_path_i^{inter} = f_{inter}(h^{oEffect}_{COMET};\ h^{oReact}_{COMET};\ h^{oWant}_{COMET})$$

The own reasoning-path information $inf\_path_i^{intra}$ is obtained analogously from the xEffect, xReact, and xWant representations.
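A sketch of the concatenate-and-project step in the equation above; comet_encode is a hypothetical stand-in for a trained COMET encoder, since only the projection is specified concretely here:

import torch
import torch.nn as nn

d_c, d_h = 768, 300                        # COMET hidden size, path feature size
f_inter = nn.Linear(3 * d_c, d_h)          # maps [h_oEffect; h_oReact; h_oWant]

def comet_encode(utterance: str, relation: str) -> torch.Tensor:
    """Hypothetical: last encoder hidden state of COMET for (u_i, [mask], rel)."""
    return torch.randn(d_c)                # placeholder output

def inter_path(utterance: str) -> torch.Tensor:
    feats = [comet_encode(utterance, r) for r in ("oEffect", "oReact", "oWant")]
    return f_inter(torch.cat(feats))       # inf_path^inter in R^{d_h}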
Affective Cause Detection. In this stage, we assume all historical sentences before the target sentence in the dialogue are potential emotional causes, aiming to measure their causal correlation with the target sentence. The output is a causal correlation matrix for the dialogue. To explicitly model emotional interactions between speakers, we categorize causal sentences into self-caused and other-caused sentences.
For self-caused sentences, we focus on the causal influence of historical sentences from the same speaker on the target sentence’s emotion. We combine features from the historical sentence and its inference path with the target sentence for similarity calculation, using α as the causal correlation value between the self-caused sentence and the target sentence, calculated as follows:
$$\alpha_{i,j}^{intra} = \frac{\left[\, l_q(h_i)\left(l_k(h_j) + l_v(inf\_path_j^{intra})\right)^{\top} \right] \cdot mask_{i,j}^{intra}}{\sqrt{d_h}}$$

where $l_q(\cdot)$, $l_k(\cdot)$, and $l_v(\cdot)$ are linear transformations and $d_h$ is the dimension of the key vector. The mask serves two purposes: first, it ensures that a detected historical causal sentence $h_j$ and the target sentence $h_i$ come from the same speaker, i.e., that the detected causal sentence $h_j$ is a self-caused sentence; second, it ensures that detected self-caused sentences come from the dialogue context before the target sentence, so that no self-caused sentence is drawn from future context, in line with the nature of causality. The mask is defined as

$$mask_{i,j}^{intra} = \begin{cases} 1, & \text{if } j \le i \text{ and } s(h_i) = s(h_j) \\ 0, & \text{otherwise} \end{cases}$$
where s is used to map the index of the sentence to the corresponding speaker index.
For other-caused sentences, we focus on the causal influence of historical sentences from different speakers on the emotion expressed in the target sentence. The historical sentence features and the others' reasoning-path features are combined with the target sentence for a similarity calculation, and the score is taken as the causal correlation value between the other-caused sentence and the target sentence:

$$\alpha_{i,j}^{inter} = \frac{\left[\, l_q(h_i)\left(l_k(h_j) + l_v(inf\_path_j^{inter})\right)^{\top} \right] \cdot mask_{i,j}^{inter}}{\sqrt{d_h}}$$

where $l_q(\cdot)$, $l_k(\cdot)$, $l_v(\cdot)$, and $d_h$ are as defined above. Again the mask serves two purposes: first, it ensures that the detected historical causal sentence $h_j$ comes from a different speaker than the target sentence $h_i$, i.e., that $h_j$ is an other-caused sentence; second, it ensures that detected other-caused sentences come from the dialogue context before the target sentence:

$$mask_{i,j}^{inter} = \begin{cases} 1, & \text{if } j < i \text{ and } s(h_i) \ne s(h_j) \\ 0, & \text{otherwise} \end{cases}$$
After obtaining the causal-influence values of both self-caused and other-caused sentences on the target sentence, these values must be compared under a common criterion. The values in the final causal correlation matrix are computed as

$$\alpha_{i,j}^{joint} = \text{softmax}\left(\alpha_{i,j}^{intra} + \alpha_{i,j}^{inter}\right)$$
Ultimately, we obtain the $m$ other-caused sentences and $n$ self-caused sentences in the dialogue context that have the highest causal association with the target sentence.
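The masked score computation in the equations above can be sketched as follows; this is a simplified single-head version with illustrative tensors, and a full implementation would mask with -inf rather than zeros before the softmax:

import math
import torch
import torch.nn as nn

d_h = 300
l_q, l_k, l_v = nn.Linear(d_h, d_h), nn.Linear(d_h, d_h), nn.Linear(d_h, d_h)

def causal_scores(h, path_intra, path_inter, speakers):
    # h, path_intra, path_inter: [N, d_h]; speakers: [N] integer speaker ids
    N = h.size(0)
    idx = torch.arange(N)
    same = speakers.unsqueeze(1) == speakers.unsqueeze(0)   # s(h_i) == s(h_j)
    m_intra = (idx.unsqueeze(1) >= idx) & same              # j <= i, same speaker
    m_inter = (idx.unsqueeze(1) > idx) & ~same              # j <  i, other speaker
    q = l_q(h)
    a_intra = (q @ (l_k(h) + l_v(path_intra)).T) * m_intra / math.sqrt(d_h)
    a_inter = (q @ (l_k(h) + l_v(path_inter)).T) * m_inter / math.sqrt(d_h)
    return torch.softmax(a_intra + a_inter, dim=-1)         # alpha^joint rows

scores = causal_scores(torch.randn(6, d_h), torch.randn(6, d_h),
                       torch.randn(6, d_h), torch.tensor([0, 1, 0, 1, 0, 1]))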

3.3. Dynamic Retrieval of Experience Examples

For the LLM, we use GPT-4, which has demonstrated a strong ability to learn from few examples, excelling at adapting to new tasks with minimal context. However, its performance depends on the selection of demonstration examples [39]. To leverage GPT-4's powerful generalization capabilities while mitigating biases introduced by manual example selection, this study dynamically retrieves examples tailored to each input query, improving learning efficiency and contextual adaptation.
We begin by constructing an experience database, $DB_{exp}$, for conversational sentiment analysis, based on the EmoryNLP dataset [40]. The EmoryNLP dataset is a widely used benchmark for conversational sentiment analysis, originally derived from the TV show Friends. It contains 897 dialogues and 12,606 utterances, where each utterance is labeled with one of seven sentiment categories: Neutral, Joyful, Peaceful, Powerful, Sad, Mad, and Scared. These labels provide a fine-grained emotional understanding of conversational exchanges.
In this study, we preprocess the dataset in the following ways to ensure effective example retrieval:
  • Speaker information removal: To prevent speaker identity bias, all speaker metadata are removed, ensuring that the retrieved examples are selected purely based on textual content rather than specific speakers’ emotional tendencies.
  • Sentiment category balancing: Since some emotion categories (e.g., Neutral) are more frequent than others, we apply category balancing techniques to ensure that all sentiment classes have a uniform distribution within $DB_{exp}$. This prevents the model from over-relying on dominant categories during retrieval.
  • Text normalization: To reduce variability in sentence structure, we perform basic text preprocessing, such as lowercasing, punctuation normalization, and stop-word removal.
For a target sentence $u_i$, the most relevant sentiment analysis examples are retrieved from $DB_{exp}$ in two steps, based on semantic similarity, to serve as empirical examples $d_{exp}$ for context learning in the LLM.
First, to ensure semantic similarity between the experience example $d_{exp}$ and the target sentence $u_i$, we use BERTScore [41] to compute the semantic similarity between $u_i$ and each sentence $u_{db}$ in the experience database $DB_{exp}$. The $k$ most similar sentences are selected as the candidate experience example set $D_{exp}^{cand}$. The similarity is defined as

$$sim(u_i, u_{db}) = \text{BERTScore}(u_i, u_{db})$$
BERTScore is used here because it provides fine-grained semantic similarity by comparing token-level contextual embeddings of sentences. This allows it to capture synonymy and paraphrasing effects, making it well suited for identifying semantically similar sentences regardless of word choice.
Then, since the same sentence can express different emotions in different dialogue contexts, we use cosine similarity to calculate the contextual semantic similarity between the target sentence $u_i$ and the candidate experience examples $u_{exp}^{cand}$ from $D_{exp}^{cand}$. The sentence with the highest similarity score is selected as the final empirical example $d_{exp}$ for the target sentence $u_i$:

$$d_{exp} = \underset{u_{exp}^{cand} \in D_{exp}^{cand}}{\arg\max}\ \text{SIM}(u_i, u_{exp}^{cand})$$
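A sketch of the two-step retrieval, assuming the bert-score package for step one and, as an illustrative choice, a sentence-transformers encoder for the contextual cosine similarity in step two (the paper does not specify the encoder):

from bert_score import score as bertscore
from sentence_transformers import SentenceTransformer, util

encoder = SentenceTransformer("all-MiniLM-L6-v2")   # illustrative encoder

def retrieve_example(target: str, db: list, k: int = 5) -> str:
    # Step 1: top-k candidates by BERTScore F1 against the target sentence.
    _, _, f1 = bertscore([target] * len(db), db, lang="en")
    candidates = [db[i] for i in f1.topk(k).indices.tolist()]
    # Step 2: pick the candidate with the highest contextual cosine similarity.
    emb_t = encoder.encode(target, convert_to_tensor=True)
    emb_c = encoder.encode(candidates, convert_to_tensor=True)
    best = util.cos_sim(emb_t, emb_c).argmax().item()
    return candidates[best]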

3.4. Prompt Instruction Construction

To better leverage the rich commonsense knowledge in the LLM for dialogue sentiment analysis based on sentiment reasons, this section reconstructs the task in a generative framework by fine-tuning the LLM. As shown in Figure 3, we design a five-part prompt template—comprising an instruction statement, dialogue history, sentiment reasons, task statement, and experience example—to guide the LLM in analyzing the sentiment category of a sentence based on sentiment reasons, contextual dialogue, and relevant commonsense knowledge. The prompt components, except for the instruction statement, are tailored around the target sentence u i .
Instruction Statement: Defines the model’s role, details the dialogue sentiment analysis task, and standardizes the input format.
Dialogue History: Accurate emotion detection relies on context. Unlike studies that use future dialogue [41,42], only prior dialogue is used here, limited to a history window of $w$ sentences, where the hyperparameter $w$ denotes the number of dialogue history sentences considered. For the target sentence $u_i$, the details of its dialogue history $u_{(i,H)}$ are shown in Figure 3.
Emotional Reasons: This study focuses on sentiment analysis based on emotional reasons. Previous methods lacked commonsense support, making it challenging to analyze sentiment categories from emotional cues. To address this, the LLM is guided to better recognize the emotion of $u_i$ by including $m$ relevant others' reasons and $n$ self-reasons in the prompt, totaling $m + n$ reason statements. However, retrieved reasons may be incomplete or lack explicit causal connections between historical utterances and the target sentence. To refine and enhance these reasons, BART is employed to generate an augmented explanation for each retrieved reason, leveraging its pre-trained generative capability to fill in missing causal links and improve reasoning coherence. The enhanced reason is formulated as follows:
$$r_j^{BART} = \text{BART}(r_j, u_i, h_j)$$
where $r_j$ is the original retrieved reason and $h_j$ represents its corresponding historical context. This augmentation ensures that the provided reasons are more interpretable, logically structured, and contextually aligned, improving sentiment classification. Details are shown in Figure 3.
Task Statement: The task statement reconstructs the sentiment analysis task by combining emotional reasons with the LLM's generative capability. It confines the LLM's output to a predefined set of sentiment categories $L = \{l_1, l_2, \dots, l_\lambda\}$, facilitating statistical analysis. The task statement $u_{(i,T)}$ focuses the LLM on categorizing the target sentence's sentiment. Details are shown in Figure 3.
Example of Experience: To enhance the LLM's emotional understanding through context learning, the prompt provides a dynamically retrieved experience example $u_{(i,E)}$ from the experience database. This example, similar in contextual semantics to the target sentence $u_i$, enables the LLM to analyze emotion more accurately using commonsense knowledge. Additionally, BART is leveraged to reconstruct and refine the retrieved experience example, ensuring that it aligns more effectively with the specific contextual and emotional aspects of $u_i$. This further enhances the LLM's ability to perform nuanced sentiment classification. Details are shown in Figure 3.
When analyzing the sentiment category of the target sentence $u_i$ by combining sentiment reasons with the LLM, we construct the input $x_i$ according to the prompt instruction template by splicing the instruction statement $u_{i,I}$, the dialogue history $u_{i,H}$, the BART-refined sentiment reason texts $r_j^{BART}$, the task statement $u_{i,T}$, and the experience example $u_{i,E}$ with the concatenation symbol $[;]$:

$$x_i = [u_{i,I};\ u_{i,H};\ r_1^{BART}, \dots, r_{m+n}^{BART};\ u_{i,T};\ u_{i,E}^{BART}]$$
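A minimal sketch of this splicing step; the section headers are illustrative placeholders rather than the paper's exact prompt wording:

def build_prompt(instruction: str, history: list,
                 reasons: list, task: str, example: str) -> str:
    parts = [
        instruction,
        "Dialogue history:\n" + "\n".join(history),
        "Emotional reasons:\n" + "\n".join(reasons),   # m + n BART-refined reasons
        task,
        "Example:\n" + example,
    ]
    return "\n\n".join(parts)                          # x_i = [u_I; u_H; r; u_T; u_E]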
To help the LLM better understand and adapt to dialogue sentiment analysis, we introduce an auxiliary task during fine-tuning: semantic interpretation of sentiment categories. This deepens the LLM's understanding of sentiment categories and helps it distinguish between similar ones, such as “happy” vs. “excited” or “sad” vs. “frustrated”. First, the most common semantic interpretations of the sentiment categories are obtained. For the set of sentiment categories in the dataset $L = \{l_1, l_2, \dots, l_\lambda\}$, the set of most common semantic interpretations $SI = \{si_1, si_2, \dots, si_\lambda\}$ is retrieved from the sentiment lexicon SentiWordNet 3.0 [43]:

$$SI = \text{SentiWordNet}(L)$$

During fine-tuning, after the LLM generates the sentiment category $e_i$ for the target sentence $u_i$, it also generates the most common semantic interpretation $si_{K(e_i)}$ for that category. This step helps the LLM differentiate sentiment categories. For example, if $e_i$ is “frustrated”, the LLM generates the corresponding semantic interpretation “disappointingly unsuccessful”. To incorporate this semantic task into fine-tuning, the task statement $u_{i,T}$ in the prompt instruction is replaced with $u_{i,T}^{train}$, instructing the LLM: “Please first select the sentiment category of the sentence $u_i$ from $\langle l_1, l_2, \dots, l_\lambda \rangle$, followed by the semantic interpretation corresponding to that sentiment category from $\langle si_1, si_2, \dots, si_\lambda \rangle$”.
Therefore, during fine-tuning, the input x i t r a i n for the target sentence u i is constructed as follows:
$$x_i^{train} = [u_{i,I};\ u_{i,H};\ u_{i,EC};\ u_{i,T}^{train};\ u_{i,E}]$$
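The retrieval of a category's most common semantic interpretation from SentiWordNet can be sketched with NLTK (assuming the wordnet and sentiwordnet corpora are downloaded; taking the gloss of the first, most frequent synset is our reading of “most common”):

from nltk.corpus import sentiwordnet as swn

def label_gloss(label: str) -> str:
    senti_synsets = list(swn.senti_synsets(label))
    # Gloss of the first (most frequent) synset, or the label itself as fallback.
    return senti_synsets[0].synset.definition() if senti_synsets else label

# e.g. label_gloss("frustrated") returns a gloss like "disappointingly unsuccessful"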

3.5. Training and Loss Functions

In the LLM fine-tuning phase, given an input sentence $u_i$, the instruction $x_i^{train}$ constructed from the prompt template is taken as input, and the LLM inference framework for dialogue sentiment analysis returns the logits $g_i$ of the complete generated sentence and the corresponding generated text $y_i$:

$$y_i, g_i = \text{LLM}(x_i^{train}, \theta)$$

where $\theta$ represents all trainable parameters of the LLM, including the Transformer layers, word embeddings, and output layers, controlling how the model processes input and generates output. The LLM optimizes $\theta$ to learn language patterns and improve the quality and accuracy of the generated text. Here $g_i \in \mathbb{R}^{L \times V}$, where $L$ and $V$ denote the length of the generated sentence and the size of the LLM's vocabulary, respectively. The LLM predicts the conditional probability $p(t_i \mid x_i^{train}, \theta)$ of each token $t_i$ in the generated text $y_i$ until the end token $\langle eos \rangle$ is output. Consistent with the LLM's original training objective, we use the next-token prediction loss to train the model. The loss function for the main dialogue sentiment analysis task is thus defined as

$$\mathcal{L}_{main} = -\frac{1}{N} \sum_{i=1}^{N} \log P(e_i \mid x_i^{train}, \theta)$$

where $e_i$ denotes the sentiment category label generated by the LLM for the target sentence $u_i$, and $N$ denotes the number of sentences in the dataset. The loss function for the semantic interpretation generation task is defined as

$$\mathcal{L}_{aux} = -\frac{1}{N} \sum_{i=1}^{N} \sum_{r=1}^{|SI_i|} \log P(si_{r+1} \mid si_{\le r}, e_i, \theta)$$

where $si_r$ denotes the $r$-th token of the semantic interpretation corresponding to the sentiment category token $e_i$. The overall loss function for model training is therefore

$$\mathcal{L} = \mathcal{L}_{main} + \alpha \mathcal{L}_{aux}$$

where the hyperparameter $\alpha$ weights the semantic interpretation generation loss within the overall fine-tuning loss of the LLM.
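A sketch of the combined objective, with token-level cross-entropy losses standing in for the LLM's next-token prediction losses (shapes are illustrative):

import torch
import torch.nn.functional as F

def combined_loss(main_logits, main_targets, aux_logits, aux_targets,
                  alpha: float = 0.2) -> torch.Tensor:
    # L_main: NLL of the emotion-label tokens; L_aux: NLL of the gloss tokens.
    l_main = F.cross_entropy(main_logits, main_targets)
    l_aux = F.cross_entropy(aux_logits, aux_targets)
    return l_main + alpha * l_aux

loss = combined_loss(torch.randn(4, 32000), torch.randint(0, 32000, (4,)),
                     torch.randn(9, 32000), torch.randint(0, 32000, (9,)))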

4. Experiments

4.1. Setup

4.1.1. Models and Datasets

This study uses the datasets IEMOCAP [44], MELD [45], and DailyDialog [46] for the experiments, and the following is a detailed description of these three benchmark datasets.
(1) The IEMOCAP dataset includes 151 conversations, 7433 sentences, 10 conversational roles, and six emotion categories, with 77% non-neutral emotions. Created by the SAIL Lab at USC, it contains two-person conversations between 10 professional actors, spanning five sessions and 12 h of multimodal audio and video data. The dialogues consist of both fixed scripts and improvised scenarios. IEMOCAP is widely used in dialogue sentiment analysis due to its rich multimodal data and high-quality annotations.
(2) The MELD dataset includes 1433 conversations, 13,708 sentences, and seven sentiment categories, with 53% non-neutral sentiment. This dataset, a multimodal extension of the EmotionLines dataset based on parts of the show Friends, contains both text and video. MELD is commonly used for conversation sentiment analysis due to its high-quality data and multimodal content.
(3) The DailyDialog dataset comprises 13,118 conversations across seven sentiment categories, four dialogue behavior types, and 10 topics, representing various daily life scenarios without fixed speaker roles. It is suitable for sentiment analysis, dialogue behavior analysis, and sentiment dialogue generation. The dataset’s main strength is its large volume and low noise, but it has a significant drawback—83% of the data is neutral sentiment. Only textual information was used in the experiments in this section. Detailed statistics are shown in Table 2 and Table 3.
Sentiment analysis of conversations based on emotional reasons is a novel and advanced research area with limited related work. To demonstrate the effectiveness of the large-model-based explicit inference of emotional reasons for conversational sentiment analysis, this section compares it with traditional deep network methods, small pre-trained language model-based methods, and emotional reason-based methods. The LLM used in this study is Llama2-7B [47], fine-tuned with the LoRA approach [48]. LoRA is a parameter-efficient fine-tuning method. Instead of updating all model parameters, LoRA injects low-rank adaptation matrices into the Transformer layers while keeping the pre-trained model weights frozen. This significantly reduces computational and memory costs while maintaining the model’s performance. By using LoRA, Llama2-7B can be effectively adapted for conversational sentiment analysis while requiring fewer trainable parameters compared to full fine-tuning. The baseline models for comparison are as follows:
  • COSMIC [36]: the first model that takes into account different categories of commonsense knowledge in a conversational sentiment analysis task and utilizes them to update conversational states.
  • DAG-ERC [34]: models the conversation structure as a directed acyclic graph, modeling both distant and proximate information interactions in a conversation.
  • DialogueCRN [9]: attempts to model intuitive retrieval and conscious reasoning processes by designing a multi-round reasoning module that iteratively performs the process of extracting and integrating emotional cues.
  • SKAIG [49]: uses the structure of a connectivity graph to enrich the representation of edges in the graph with commonsense knowledge, and enriches the representation of target utterances with past and future contextual information in the context.
  • CauAIN [10]: takes commonsense knowledge as the cause of emotion generation in dialogs and utilizes attentional mechanisms to update deeper representations of the target utterance in relation to emotion.
  • ERCMC [50]: uses the generated pseudo-future contexts in combination with historical contexts to improve emotion recognition in conversation.
  • UniMSE [51]: a unified framework that bridges multimodal sentiment analysis and emotion recognition in conversation, sharing complementary knowledge between the two tasks to improve both.
  • InstructERC [52]: is a model for dialogue emotion recognition that uses large-scale language models to improve the accuracy of emotion recognition. The model enhances its understanding of emotions with two auxiliary tasks—speaker identification and emotion prediction.
  • Ref. [53]: uses commonsense knowledge to complement the contextual information contained in utterances and enrich the extracted conversation information.
  • CKERC [54]: is a novel emotion recognition in conversation (ERC) model that improves the accuracy of emotion recognition by combining large-scale language models (LLMs) and commonsense knowledge.

4.1.2. Implementation Details

In this paper, we use the Llama-2-7b model from the model library provided by Hugging Face and fine-tune it with the LoRA method, with the learning rate set to $2 \times 10^{-2}$ and the hyperparameter $\alpha$ set to 0.2. The size of the history window $w$ is set to 1, 5, 10, 15, and 20 in turn. The number of candidate examples retrieved from the experience database, $k$, is set to 5. The numbers of others' reason statements ($m$) and own reason statements ($n$) vary by dataset: for IEMOCAP, both are set to 2; for MELD, $m$ is set to 1 and $n$ to 3; and for DailyDialog, $m$ remains 1 while $n$ is set to 3. These settings reflect the characteristics of each dataset and optimize sentiment reasoning.
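A sketch of the LoRA setup with the Hugging Face peft library; the rank and target modules shown are illustrative assumptions, as the paper does not report them:

from transformers import AutoModelForCausalLM
from peft import LoraConfig, get_peft_model

model = AutoModelForCausalLM.from_pretrained("meta-llama/Llama-2-7b-hf")
lora = LoraConfig(r=8, lora_alpha=16, lora_dropout=0.05,
                  target_modules=["q_proj", "v_proj"], task_type="CAUSAL_LM")
model = get_peft_model(model, lora)   # base weights frozen, adapters trainable
model.print_trainable_parameters()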
As shown in Table 4, referring to previous work [34], this section also selects the weighted average F1 scores as the evaluation metrics for the datasets IEMOCAP and MELD, and for the dataset DailyDialog, the micro-averaged F1 scores are used as the evaluation metrics, but statements labeled neutral are excluded from the calculation of the results.

4.2. Results and Analysis

4.2.1. Overall Results

The sentiment classification performance of the different models on the three publicly available benchmark test datasets is shown in Table 5. From the data in the table, it can be observed that the sentiment classification performance of the proposed dialogue sentiment analysis method based on combining sentiment reasons with the LLM is better than that of the comparative baseline models on all three datasets.
In the IEMOCAP dataset, dialogues contain numerous turns, rich contextual information, and frequent emotional interactions. Graph Neural Network (GNN)-based approaches perform well due to their ability to model these interactions. DAG-ERC uses a directed acyclic graph, closely matching the conversation patterns and achieving strong classification results. However, SKAIG, which incorporates external knowledge, performs worse due to the introduction of noisy information. Small pre-trained models struggle on this dataset due to limited input windows, while the method in this section mitigates this by leveraging structured external knowledge to identify emotional reasons. However, its performance is slightly behind GNN-based methods, though using an LLM with richer knowledge and generalization improves classification performance.
In the MELD dataset, conversations are shorter with many speakers, so methods using small-scale pre-trained models or external knowledge perform better. LLMs handle the dataset's complexity well, achieving top results. DialogueCRN underperforms, while CauAIN models speaker interactions using external commonsense knowledge, improving results. CDEA, building on this, uses the BART model to enhance classification performance further.
In the DailyDialog dataset, real-world conversations make emotional category analysis more challenging, requiring precise speaker interaction modeling. Previous methods struggle here due to insufficient modeling of emotional interactions and lack of knowledge. SKAIG performs better by using a graph structure with external knowledge, but CDEA improves slightly by using the BART model for explicit reasoning. CDEA+Llama, leveraging the rich knowledge and generalization capabilities of LLMs, significantly boosts classification performance, demonstrating better reliability and generalization.

4.2.2. Ablation Study

  • w/o Inter-Path: in the emotion cause detection module, the others' reasoning-path information provided by the structured machine commonsense graph is not used, and other-caused sentences are recognized from semantic similarity alone; the own reasoning-path information is still used to recognize self-caused sentences consistent with causality.
  • w/o Intra-Path: the own reasoning-path information provided by the structured machine commonsense graph is not used, and self-caused sentences are recognized from semantic similarity alone; the others' reasoning-path information is still used to recognize other-caused sentences consistent with causality.
  • w/o Inf-Path: no reasoning-path information from the structured machine commonsense graph is used; emotion cause sentences are identified only by the semantic similarity between historical and target sentences.
To study the effect of the different reasoning-path information in the structured machine commonsense graph on detecting emotion cause sentences and recognizing emotion categories, the parts of the emotion cause detection module that use others' and own reasoning-path information are removed in turn. Specifically, the corresponding reasoning-path information from ATOMIC is discarded, and emotion cause sentences are detected only by the semantic similarity between historical and target sentences. The corresponding rows of Table 6 show that the results on all three datasets decrease. This suggests that both others' and own reasoning-path information is crucial for recognizing causal cause sentences, and further illustrates the importance of accounting for sentiment cause sentences in improving the model's sentiment classification performance.
These results also demonstrate the importance of explicitly and comprehensively modeling both the speaker's own and inter-speaker dependencies. On the MELD dataset, the performance drop is particularly noticeable when the own reasoning-path information is removed rather than the others' reasoning-path information, which matches the fact that MELD conversations contain fewer utterances and more speakers; it also demonstrates the generalization of the proposed method across datasets. Removing the sentiment cause detection module entirely means that no reasoning-path information generated by COMET from ATOMIC is introduced, and sentiment cause utterances are identified only by semantic similarity between historical and target utterances. The resulting decrease demonstrates the importance of identifying sentiment causes that are consistent with causality for the dialogue sentiment analysis task.
Further, in order to verify the validity of the LLM, this paper further develops the ablation experimental study, the results of which are shown in Table 7.
  • w/o Exper Demonstration: removing the empirical examples from the LLM input, i.e., the examples dynamically selected from the experience database based on the contextual semantics of the target utterance when constructing the instruction.
  • w/o Label Paraphrasing: removing the auxiliary task of generating semantic interpretations of sentiment categories, fine-tuning the LLM with the main dialogue sentiment analysis task only.
  • w/o LoRA: fine-tuning the LLM with full-parameter fine-tuning instead of LoRA.
Several conclusions can be drawn from the ablation results. First, each module in the approach contributes to the final performance: removing any module degrades the LLM's sentiment analysis ability. After removing the empirical example retrieval module, performance drops sharply on all datasets, demonstrating the important role of retrieved examples in stimulating the LLM's sentiment understanding. Second, removing the auxiliary semantic interpretation task also decreases performance significantly, consistent with our conjecture: the auxiliary task not only deepens the LLM's understanding of each emotion category but also strengthens its ability to differentiate similar emotions; without it, the LLM's notion of the emotion categories becomes vaguer and recognition deteriorates. Finally, fine-tuning without LoRA also reduces performance, indicating that LoRA effectively prevents the LLM from overfitting.

4.2.3. Hyperparametric Study

Due to the limitation of the input size of the pre-trained language model, the number of historical statements, i.e., the window w, cannot be infinitely large when constructing the input for the LLM. To explore its impact, we examine different history window sizes. The Llama2-7B LLM supports 20 rounds of conversation, whereas small-scale models are limited to five rounds. As shown in Figure 4, on the IEMOCAP dataset, the best performance is achieved with 15 historical utterances, while MELD and DailyDialog perform optimally with 10 utterances. Performance improves as the history window expands, particularly in IEMOCAP, where dialogues are longer. However, beyond a certain point, excessive context introduces noise, reducing classification accuracy—most notably in MELD and DailyDialog.
Another key factor in conversational sentiment analysis is whether the distance of historical sentences within the window impacts sentiment classification. The results suggest that while earlier utterances in a conversation contribute to understanding the evolving emotional trajectory, their impact weakens as their distance from the target utterance increases. Specifically, recent utterances tend to have a stronger influence on sentiment prediction, while distant historical sentences may have diminished relevance. This is particularly evident in datasets such as MELD and DailyDialog, where shorter conversational structures mean that distant utterances are often less contextually relevant. In contrast, IEMOCAP, which features longer and more contextually connected conversations, benefits more from a longer history window before reaching its optimal performance.
Beyond historical context, the speaker’s own previous utterances may also play a crucial role in determining sentiment. Sentiment is not only shaped by contextual interactions but also by the internal emotional consistency of a speaker. If a speaker maintains a consistent emotional tone over multiple turns, the model can leverage this self-consistency to make more accurate predictions. However, in conversations where the speaker’s emotions shift frequently—such as emotionally intense discussions or conflict-driven dialogues—relying on self-referential utterances may introduce ambiguity. Datasets like IEMOCAP, which contain expressive dialogues with emotional transitions, highlight cases where both the speaker’s and the interlocutor’s utterances must be jointly considered for optimal classification.
Overall, these findings suggest that both the recency and relevance of historical sentences, as well as the emotional consistency of a speaker, impact the performance of conversational sentiment analysis models. While increasing the history window generally improves sentiment classification, careful balance is needed to prevent excessive noise and irrelevant information from degrading model performance.

4.2.4. Comparative Experiments with Different LLM in Different Supervised Scenarios

To gain a deeper understanding of how different large models perform on the three benchmark datasets under different supervision scenarios, this section evaluates the proposed method on the mainstream models ChatGLM-6B and ChatGLM2-6B [55], as well as Llama-7B and Llama2-7B, under both zero-shot and LoRA settings. The experimental results are shown in Figure 5.
The classification performance of the different large models under the zero-shot setting and under the method of this section is shown in Figure 5. Even with the instructions designed in this paper, which include sentiment reason sentences and experience examples, the LLMs perform mediocrely in the zero-shot scenario, further confirming that a large model applied directly to the dialogue sentiment analysis task cannot exploit its rich commonsense knowledge and powerful generalization ability.
Compared to the zero-shot in-context learning strategy, fine-tuning the LLM with LoRA not only preserves the rich commonsense knowledge inherent in the LLM but also significantly improves performance on the dialogue sentiment analysis task. This demonstrates the effectiveness of LoRA fine-tuning in enhancing the adaptability of large pre-trained models to specific tasks.
Finally, by applying the methodology proposed in this study to the LLMs under the LoRA setting, the performance of all four LLMs is significantly improved, especially on the IEMOCAP dataset. This demonstrates the effectiveness and generalization of the dialogue sentiment analysis framework based on the combination of sentiment reasons and LLMs in this section, which greatly enhances the ability of LLMs to understand the sentiments in long texts.

4.3. Case Study

In Figure 6, a case from the IEMOCAP test set illustrates the importance of accurately identifying cause sentences consistent with causality when detecting the sentiment of a target sentence. The scenario involves a man who has failed to make a credit card payment and seeks help from the official human customer service. Notice that there is no direct sentiment descriptor in the target sentence #16; the emotion “happy” must therefore be inferred from the dialogue context. The emotion cause detection module identifies the emotion cause sentences. Self-caused sentence #1 expresses the initial emotional state of anger, while other-caused sentence #13 asks the man to call back if there is any problem with the bill after checking it, which causes the man's worry. Self-caused sentence #14 expresses that the man does not want to be tortured by the intelligent customer service anymore. However, other-caused sentence #15 completely allays the man's concerns: the human customer service agent will give the man their number so that he can call back directly next time. The self-caused and other-caused sentences consistent with the causal relationship provide an important discriminative basis for the model.
Table 8 presents the results of our dialogue sentiment analysis method, which integrates affective reasons with the LLM, compared to three baseline models. The results demonstrate how LLM’s commonsense knowledge and generalization ability enhance sentiment classification reliability. Bolded sentences indicate target sentences whose sentiment categories are to be identified.
DialogueCRN struggles with causal reasoning, focusing too much on surrounding statements. CKERC and ECERN improve in this aspect but still lack sufficient commonsense support. In contrast, our method effectively leverages LLM’s commonsense knowledge to accurately identify emotion causes, leading to more precise sentiment classification.

4.4. Module Time Consumption Analysis

To evaluate the efficiency of our proposed model, we conducted a module-level time consumption analysis on the DailyDialog dataset, which consists of 13,118 conversations covering seven sentiment categories, four dialogue behavior types, and 10 topics. This dataset represents diverse real-life conversations and is widely used in sentiment analysis, dialogue behavior analysis, and sentiment dialogue generation.
For this experiment, we randomly sampled 5000 sentences from the validation set of DailyDialog and measured the execution time of each system module, including sentiment cause sentence acquisition, dynamic retrieval of experience examples, and prompt instruction construction with model fine-tuning. The average processing time per sentence (ms/sentence) for each module is reported in Table 9.
The total system runtime averages 4209 ms/sentence, with the overall system efficiency meeting practical application requirements. Future optimizations in model inference speed could further enhance real-time performance, making the system more suitable for large-scale deployment.
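A sketch of how such per-module averages can be measured (module_fn stands in for any of the three pipeline stages):

import time

def avg_ms_per_sentence(module_fn, sentences):
    # Average wall-clock time per sentence for one pipeline module.
    start = time.perf_counter()
    for s in sentences:
        module_fn(s)
    return 1000 * (time.perf_counter() - start) / len(sentences)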

5. Conclusions

In this paper, we propose an emotion–cause-based method for dialogue sentiment analysis that improves the accuracy and reliability of sentiment classification in dialogue systems. First, we propose a model built on explicit inference of emotional causes, which extracts inference-path information from structured commonsense graphs and combines it with the causal associations between historical and target sentences. Second, we further combine this with a large pre-trained language model, accurately analyzing emotional causes by constructing an experience database, applying prompt engineering, and exploiting the LLM's commonsense knowledge and generalization ability. Experimental results show that the proposed method outperforms existing methods on multiple benchmark datasets, significantly improving performance on the dialogue sentiment analysis task and validating its effectiveness.
Although the proposed emotion–cause-based dialogue sentiment analysis method achieves significant performance improvements on several benchmark datasets, it still has limitations. First, it is restricted to textual data, whereas human emotions are expressed in many forms: besides text, multimodal signals such as vocal tone, facial expressions, and body movements are also important carriers of emotion. Purely text-based sentiment analysis therefore cannot fully capture the richness of human emotional expression. Future work should develop a multimodal dialogue sentiment analysis model that combines data sources such as speech, images, and text. This would not only improve a machine's ability to understand emotions but also suit complex real-life applications, such as emotional chatbots and intelligent assistants, enabling a higher level of emotional intelligence.

Author Contributions

Conceptualization, X.Z. (Xue Zhang) and M.W.; methodology, X.Z. (Xue Zhang); software, X.Z. (Xue Zhang), X.Z. (Xuyi Zhuang) and Q.L.; validation, X.Z. (Xue Zhang), X.Z. (Xuyi Zhuang) and X.Z. (Xiao Zeng); formal analysis, Q.L.; investigation, M.W.; resources, M.W.; data curation, X.Z. (Xue Zhang); writing—original draft preparation, X.Z. (Xue Zhang); writing—review and editing, X.Z. (Xue Zhang) and M.W.; visualization, X.Z. (Xue Zhang); supervision, M.W.; project administration, M.W.; funding acquisition, M.W. All authors have read and agreed to the published version of the manuscript.

Funding

This research received no external funding.

Data Availability Statement

Data will be made available on request.

Acknowledgments

During this research and the preparation of the manuscript, the open-source large language model LLaMA2-7B was employed primarily to support emotion cause identification, experience example retrieval, prompt template construction, and model fine-tuning for the sentiment classification task. By combining instruction tuning with structured prompt engineering, the large language model significantly enhanced the causal reasoning capability and classification accuracy of the dialogue sentiment analysis. All generated content was reviewed and revised by the authors, who assume full responsibility for the content of this publication.

Conflicts of Interest

The authors declare no conflicts of interest.

References

  1. Zhou, H.; Huang, M.; Zhang, T.; Zhu, X.; Liu, B. Emotional chatting machine: Emotional conversation generation with internal and external memory. In Proceedings of the AAAI Conference on Artificial Intelligence, New Orleans, LA, USA, 2–7 February 2018; Volume 32. [Google Scholar]
  2. Kumar, A.; Dogra, P.; Dabas, V. Emotion analysis of Twitter using opinion mining. In Proceedings of the 2015 Eighth International Conference on Contemporary Computing (IC3), Noida, India, 20–22 August 2015; IEEE: New York, NY, USA, 2015; pp. 285–290. [Google Scholar]
  3. Pujol, F.A.; Mora, H.; Martínez, A. Emotion Recognition to Improve E-Healthcare Systems in Smart Cities. In Proceedings of the Research & Innovation Forum 2019: Technology, Innovation, Education, and Their Social Impact, Athens, Greece, 10–12 April 2019; Springer: Berlin/Heidelberg, Germany, 2019; pp. 245–254. [Google Scholar]
  4. Poria, S.; Majumder, N.; Mihalcea, R.; Hovy, E. Emotion recognition in conversation: Research challenges, datasets, and recent advances. IEEE Access 2019, 7, 100943–100953. [Google Scholar]
  5. König, A.; Francis, L.E.; Malhotra, A.; Hoey, J. Defining affective identities in elderly nursing home residents for the design of an emotionally intelligent cognitive assistant. In Proceedings of the 10th EAI International Conference on Pervasive Computing Technologies for Healthcare, Cancun, Mexico, 16–19 May 2016; pp. 206–210. [Google Scholar]
  6. Strapparava, C. WordNet-Affect: An affective extension of WordNet. In Proceedings of the 4th International Conference on Language Resources and Evaluation (LREC 2004), Lisbon, Portugal, 26–28 May 2004. [Google Scholar]
  7. Mohammad, S.M.; Turney, P.D. Crowdsourcing a word-emotion association lexicon. Comput. Intell. 2013, 29, 436–465. [Google Scholar]
  8. Lian, Z.; Sun, L.; Xu, M.; Sun, H.; Xu, K.; Wen, Z.; Chen, S.; Liu, B.; Tao, J. Explainable multimodal emotion reasoning. arXiv 2023, arXiv:2306.15401. [Google Scholar]
  9. Hu, D.; Wei, L.; Huai, X. Dialoguecrn: Contextual reasoning networks for emotion recognition in conversations. arXiv 2021, arXiv:2106.01978. [Google Scholar]
  10. Zhao, W.; Zhao, Y.; Lu, X. CauAIN: Causal Aware Interaction Network for Emotion Recognition in Conversations. In Proceedings of the IJCAI, Vienna, Austria, 23–29 July 2022; pp. 4524–4530. [Google Scholar]
  11. Schachter, S.; Singer, J. Cognitive, Social, and Physiological Determinants of Emotional State. Psychol. Rev. 1962, 69, 379–399. [Google Scholar] [CrossRef]
  12. Scherer, K.R. Appraisal Processes in Emotion: Theory, Methods, Research; Oxford University Press: New York, NY, USA, 2001. [Google Scholar]
  13. Majumder, N.; Poria, S.; Hazarika, D.; Mihalcea, R.; Gelbukh, A.; Cambria, E. Dialoguernn: An attentive rnn for emotion detection in conversations. In Proceedings of the AAAI Conference on Artificial Intelligence, Honolulu, HI, USA, 27 January–1 February 2019; Volume 33, pp. 6818–6825. [Google Scholar]
  14. Zhang, D.; Chen, X.; Xu, S.; Xu, B. Knowledge aware emotion recognition in textual conversations via multi-task incremental transformer. In Proceedings of the 28th International Conference on Computational Linguistics, Barcelona, Spain, 13–18 September 2020; pp. 4429–4440. [Google Scholar]
  15. Jiao, W.; Yang, H.; King, I.; Lyu, M.R. Higru: Hierarchical gated recurrent units for utterance-level emotion recognition. arXiv 2019, arXiv:1904.04446. [Google Scholar]
  16. Ma, H.; Wang, J.; Qian, L.; Lin, H. HAN-ReGRU: Hierarchical attention network with residual gated recurrent unit for emotion recognition in conversation. Neural Comput. Appl. 2021, 33, 2685–2703. [Google Scholar]
  17. Achiam, J.; Adler, S.; Agarwal, S.; Ahmad, L.; Akkaya, I.; Aleman, F.L.; Almeida, D.; Altenschmidt, J.; Altman, S.; Anadkat, S.; et al. Gpt-4 technical report. arXiv 2023, arXiv:2303.08774. [Google Scholar]
  18. Bhaumik, A.; Strzalkowski, T. Towards a Generative Approach for Emotion Detection and Reasoning. arXiv 2024, arXiv:2408.04906. [Google Scholar]
  19. Xie, S.M.; Raghunathan, A.; Liang, P.; Ma, T. An explanation of in-context learning as implicit bayesian inference. arXiv 2021, arXiv:2111.02080. [Google Scholar]
  20. Reddy, G.R.; Reddy, M.S.; Stanlywit, M.; Khaleel, S. Emotion detection from text and analysis of future work: A survey. Riv. Ital. Filos. Anal. Jr. 2023, 14, 59–73. [Google Scholar]
  21. Zhou, Z.H. A brief introduction to weakly supervised learning. Natl. Sci. Rev. 2018, 5, 44–53. [Google Scholar] [CrossRef]
  22. Sujadi, C.C.; Sibaroni, Y.; Ihsan, A.F. Analysis content type and emotion of the presidential election users tweets using agglomerative hierarchical clustering. Sink. J. Dan Penelit. Tek. Inform. 2023, 7, 1230–1237. [Google Scholar] [CrossRef]
  23. Mahesh, B. Machine Learning Algorithms—A Review. Int. J. Sci. Res. (IJSR) 2020, 9, 381–386. [Google Scholar] [CrossRef]
  24. Rafath, M.A.H.; Mim, F.T.Z.; Rahman, M.S. An analytical study on music listener emotion through logistic regression. World Acad. J. Eng. Sci. 2021, 8, 15–20. [Google Scholar]
  25. Bengio, Y.; Ducharme, R.; Vincent, P.; Janvin, C. A Neural Probabilistic Language Model. J. Mach. Learn. Res. 2003, 3, 1137–1155. [Google Scholar]
  26. Sarzyńska-Wawer, J.; Wawer, A.; Pawlak, A.; Szymanowska, J.; Stefaniak, I.; Jarkiewicz, M.; Okruszek, L. Detecting formal thought disorder by deep contextualized word representations. Psychiatry Res. 2021, 304, 114135. [Google Scholar] [CrossRef]
  27. Radford, A.; Narasimhan, K. Improving Language Understanding by Generative Pre-Training. OpenAI Technical Report. 2018. Available online: https://cdn.openai.com/research-covers/language-unsupervised/language_understanding_paper.pdf (accessed on 19 January 2025).
  28. Kenton, J.D.M.W.C.; Toutanova, L.K. BERT: Pre-training of deep bidirectional transformers for language understanding. In Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies (NAACL-HLT), Minneapolis, MN, USA, 2–7 June 2019; Volume 1, pp. 4171–4186. [Google Scholar]
  29. Wan, B.; Wu, P.; Yeo, C.K.; Li, G. Emotion-cognitive reasoning integrated BERT for sentiment analysis of online public opinions on emergencies. Inf. Process. Manag. 2024, 61, 103609. [Google Scholar] [CrossRef]
  30. Abu Farha, I.; Magdy, W. A Comparative Study of Effective Approaches for Arabic Sentiment Analysis. Inf. Process. Manag. 2021, 58, 102438. [Google Scholar] [CrossRef]
  31. Bello, A.; Ng, S.C.; Leung, M.F. A BERT framework to sentiment analysis of tweets. Sensors 2023, 23, 506. [Google Scholar] [CrossRef]
  32. Ghosal, D.; Majumder, N.; Poria, S.; Chhaya, N.; Gelbukh, A. Dialoguegcn: A graph convolutional neural network for emotion recognition in conversation. arXiv 2019, arXiv:1908.11540. [Google Scholar]
  33. Ishiwatari, T.; Yasuda, Y.; Miyazaki, T.; Goto, J. Relation-Aware Graph Attention Networks with Relational Position Encodings for Emotion Recognition in Conversations. In Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing (EMNLP), Online, 16–20 November 2020; pp. 7360–7370. [Google Scholar]
  34. Shen, W.; Wu, S.; Yang, Y.; Quan, X. Directed acyclic graph network for conversational emotion recognition. arXiv 2021, arXiv:2105.12907. [Google Scholar]
  35. Zhong, P.; Wang, D.; Miao, C. Knowledge-enriched transformer for emotion detection in textual conversations. arXiv 2019, arXiv:1909.10681. [Google Scholar]
  36. Ghosal, D.; Majumder, N.; Gelbukh, A.; Mihalcea, R.; Poria, S. Cosmic: Commonsense knowledge for emotion identification in conversations. arXiv 2020, arXiv:2010.02795. [Google Scholar]
  37. Bosselut, A.; Rashkin, H.; Sap, M.; Malaviya, C.; Celikyilmaz, A.; Choi, Y. COMET: Commonsense transformers for automatic knowledge graph construction. arXiv 2019, arXiv:1906.05317. [Google Scholar]
  38. Sap, M.; Le Bras, R.; Allaway, E.; Bhagavatula, C.; Lourie, N.; Rashkin, H.; Roof, B.; Smith, N.A.; Choi, Y. ATOMIC: An Atlas of Machine Commonsense for If-Then Reasoning. In Proceedings of the AAAI Conference on Artificial Intelligence, Honolulu, HI, USA, 27 January–1 February 2019; AAAI Press: Honolulu, HI, USA, 2019; Volume 33, pp. 3027–3035. [Google Scholar]
  39. Luo, M.; Xu, X.; Liu, Y.; Pasupat, P.; Kazemi, M. In-context learning with retrieved demonstrations for language models: A survey. arXiv 2024, arXiv:2401.11624. [Google Scholar]
  40. Zahiri, S.M.; Choi, J.D. Emotion detection on TV show transcripts with sequence-based convolutional neural networks. In Proceedings of the AAAI Workshops, New Orleans, LA, USA, 2–7 February 2018; Volume 18, pp. 44–52. [Google Scholar]
  41. Zhang, T.; Kishore, V.; Wu, F.; Weinberger, K.Q.; Artzi, Y. Bertscore: Evaluating text generation with bert. arXiv 2019, arXiv:1904.09675. [Google Scholar]
  42. Lian, Z.; Liu, B.; Tao, J. CTNet: Conversational Transformer Network for Emotion Recognition. IEEE/ACM Trans. Audio Speech Lang. Process. 2021, 29, 985–1000. [Google Scholar]
  43. Baccianella, S.; Esuli, A.; Sebastiani, F. SentiWordNet 3.0: An Enhanced Lexical Resource for Sentiment Analysis and Opinion Mining. In Proceedings of the 7th International Conference on Language Resources and Evaluation (LREC 2010), Valletta, Malta, 17–23 May 2010; European Language Resources Association (ELRA): Valletta, Malta, 2010; pp. 2200–2204. [Google Scholar]
  44. Busso, C.; Bulut, M.; Lee, C.C.; Kazemzadeh, A.; Mower, E.; Kim, S.; Chang, J.N.; Lee, S.; Narayanan, S.S. IEMOCAP: Interactive Emotional Dyadic Motion Capture Database. Lang. Resour. Eval. 2008, 42, 335–359. [Google Scholar]
  45. Poria, S.; Hazarika, D.; Majumder, N.; Naik, G.; Cambria, E.; Mihalcea, R. MELD: A Multimodal Multi-Party Dataset for Emotion Recognition in Conversations. arXiv 2018, arXiv:1810.02508. [Google Scholar]
  46. Li, Y.; Su, H.; Shen, X.; Li, W.; Cao, Z.; Niu, S. DailyDialog: A Manually Labelled Multi-Turn Dialogue Dataset. arXiv 2017, arXiv:1710.03957. [Google Scholar]
  47. Touvron, H.; Lavril, T.; Izacard, G.; Martinet, X.; Lachaux, M.A.; Lacroix, T.; Rozière, B.; Goyal, N.; Hambro, E.; Azhar, F.; et al. LLaMA: Open and Efficient Foundation Language Models. arXiv 2023, arXiv:2302.13971. [Google Scholar]
  48. Hu, E.J.; Shen, Y.; Wallis, P.; Allen-Zhu, Z.; Li, Y.; Wang, S.; Wang, L.; Chen, W. LoRA: Low-Rank Adaptation of Large Language Models. arXiv 2021, arXiv:2106.09685. [Google Scholar]
  49. Li, J.; Lin, Z.; Fu, P.; Wang, W. Past, Present, and Future: Conversational Emotion Recognition through Structural Modeling of Psychological Knowledge. In Proceedings of the Findings of the Association for Computational Linguistics: EMNLP 2021, Punta Cana, Dominican Republic, 16–20 November 2021; pp. 1204–1214. [Google Scholar]
  50. Wei, Y.; Liu, S.; Yan, H.; Ye, W.; Mo, T.; Wan, G. Exploiting Pseudo Future Contexts for Emotion Recognition in Conversations. arXiv 2023, arXiv:2306.15376. [Google Scholar]
  51. Lei, S.; Dong, G.; Wang, X.; Wang, K.; Wang, S. InstructERC: Reforming Emotion Recognition in Conversation with a Retrieval Multi-Task LLMs Framework. arXiv 2023, arXiv:2309.11911. [Google Scholar]
  52. Hu, D.; Bao, Y.; Wei, L.; Zhou, W.; Hu, S. Supervised Adversarial Contrastive Learning for Emotion Recognition in Conversations. arXiv 2023, arXiv:2306.01505. [Google Scholar]
  53. Yang, Z.; Li, X.; Cheng, Y.; Zhang, T.; Wang, X. Emotion Recognition in Conversation Based on a Dynamic Complementary Graph Convolutional Network. IEEE Trans. Affect. Comput. 2024, 15, 1567–1579. [Google Scholar]
  54. Fu, Y. CKERC: Joint Large Language Models with Commonsense Knowledge for Emotion Recognition in Conversation. arXiv 2024, arXiv:2403.07260. [Google Scholar]
  55. Du, Z.; Qian, Y.; Liu, X.; Ding, M.; Qiu, J.; Yang, Z.; Tang, J. GLM: General Language Model Pretraining with Autoregressive Blank Infilling. arXiv 2021, arXiv:2103.10360. [Google Scholar]
Figure 1. During dataset annotation, annotators use commonsense to identify emotional triggers in the dialogue and then integrate these cues to infer an objective sentiment category, minimizing subjective bias.
Figure 2. The model operates in three phases: (1) sentiment cause sentence acquisition, which identifies emotional causes from the dialogue history and distinguishes self-reasons from other-reasons; (2) dynamic retrieval of experience examples, which retrieves relevant examples from an experience database and refines them with BART to enhance contextual alignment; and (3) prompt instruction construction and model fine-tuning, which builds structured prompts integrating sentiment causes, historical context, and retrieved examples to optimize LLM-based sentiment classification.
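As an illustration of the dynamic retrieval step in phase (2), the sketch below ranks experience-database entries by cosine similarity between sentence embeddings. The embedding dimensionality and the random vectors are placeholders for the output of a real sentence encoder.

```python
import numpy as np

def retrieve_top_k(query_vec, bank_vecs, k=3):
    """Indices of the k most similar experience examples by cosine similarity."""
    q = query_vec / np.linalg.norm(query_vec)
    b = bank_vecs / np.linalg.norm(bank_vecs, axis=1, keepdims=True)
    return np.argsort(-(b @ q))[:k]

# Toy embeddings; in practice these would come from a sentence encoder,
# and the bank would hold embeddings of the experience database.
rng = np.random.default_rng(0)
bank = rng.standard_normal((1000, 768)).astype(np.float32)
query = rng.standard_normal(768).astype(np.float32)
print(retrieve_top_k(query, bank))  # indices of the 3 nearest examples
```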
Figure 3. The prompt consists of six components: (1) the target sentence u_i; (2) an instruction statement defining the model's role and input format; (3) the dialogue history, providing contextual information within a history window w; (4) emotional reasons, incorporating m other-reasons and n self-reasons refined by BART; (5) a task statement defining sentiment classification over the predefined categories L; and (6) an experience example retrieved and refined by BART for contextual alignment.
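The following sketch assembles the six components of Figure 3 into a single prompt string; the wording of each section is an illustrative assumption, not our exact template.

```python
def build_prompt(target, history, other_reasons, self_reasons, labels, example):
    """Assemble the six prompt components sketched in Figure 3.
    The section wording is illustrative, not the paper's exact template."""
    parts = [
        "You are an assistant that classifies the emotion of a "
        "dialogue utterance.",                                      # (2) instruction
        "Dialogue history:\n" + "\n".join(history),                 # (3) history window w
        "Other-cause reasons: " + "; ".join(other_reasons),         # (4) emotional reasons
        "Self-cause reasons: " + "; ".join(self_reasons),
        "Choose exactly one label from: " + ", ".join(labels) + ".",  # (5) task statement
        "Worked example:\n" + example,                              # (6) retrieved example
        "Target sentence: " + target,                               # (1) target u_i
        "Emotion:",
    ]
    return "\n\n".join(parts)

prompt = build_prompt(
    target="Oh, sure, they love us over there.",
    history=["Joey: We live in the building by the sidewalk.",
             "Chandler: You know it?"],
    other_reasons=["Joey proposes getting together for a drink"],
    self_reasons=["Chandler feels at ease with the neighbor"],
    labels=["anger", "disgust", "fear", "joy", "neutral", "sadness", "surprise"],
    example="(retrieved and refined by BART)",
)
print(prompt)
```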
Figure 4. Results of the history window exploration experiment.
Figure 5. (a) Comparison of zero-shot LLM classification performance with that under our method's setup; (b) comparison of LLM classification performance under the LoRA setup; (c) comparison of LLM classification performance under LoRA combined with our method's setup.
Figure 6. Case study 1: real examples of successful model predictions.
Table 1. Examples of different types of inference path information.

Sentence (Event): X pays Y a compliment
Self-Reasoning Path     xEffect   be acknowledged
                        xReact    feel good
                        xWant     chat with Y
Other-Reasoning Path    oEffect   smile
                        oReact    feel flattered
                        oWant     compliment X back
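The inference paths in Table 1 can be represented as a simple nested structure keyed by ATOMIC-style relation names [38], as in this illustrative sketch (the representation is ours for exposition, not the model's internal format):

```python
# Illustrative representation of the Table 1 inference paths; keys follow
# ATOMIC-style relation names (xEffect, xReact, xWant, oEffect, ...).
inference_paths = {
    "event": "X pays Y a compliment",
    "self_reasoning": {   # effects on the speaker (X)
        "xEffect": "be acknowledged",
        "xReact": "feel good",
        "xWant": "chat with Y",
    },
    "other_reasoning": {  # effects on the listener (Y)
        "oEffect": "smile",
        "oReact": "feel flattered",
        "oWant": "compliment X back",
    },
}
```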
Table 2. Statistical information on datasets.

Dataset        Number of Dialogs           Number of Sentences
               Train    Dev    Test        Train    Dev    Test
IEMOCAP        108      12     31          5163     647    1623
MELD           1039     114    280         9989     1109   2610
DailyDialog    11,118   1000   1000        87,170   8069   7740
Table 3. Dataset sentiment category information.

Dataset        Classes   Sentiment Categories
IEMOCAP        6         happy, sad, neutral, angry, excited, frustrated
MELD           7         anger, disgust, fear, joy, neutral, sadness, surprise
DailyDialog    7         anger, disgust, fear, joy, neutral, sadness, surprise
Table 4. Evaluation metric settings for the benchmark datasets.

Setting        Metric        Addition
COSMIC         weighted F1   -
DAG-ERC        weighted F1   -
DialogueCRN    weighted F1   w/o neutral category sentences
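For reference, the weighted, micro, and macro F1 scores used in Tables 4 and 5 can be computed with scikit-learn; the labels below are toy values, and the neutral-class index is an assumption for illustration only.

```python
from sklearn.metrics import f1_score

# Toy predictions; indices 0-6 stand for the seven DailyDialog emotion classes.
y_true = [0, 1, 2, 3, 4, 5, 6, 4]
y_pred = [0, 1, 2, 3, 4, 5, 6, 3]

print("weighted-F1:", f1_score(y_true, y_pred, average="weighted"))
print("micro-F1:", f1_score(y_true, y_pred, average="micro"))
print("macro-F1:", f1_score(y_true, y_pred, average="macro"))

# DailyDialog is commonly scored excluding the dominant neutral class;
# here index 4 is assumed to be "neutral" purely for illustration.
non_neutral = [0, 1, 2, 3, 5, 6]
print("micro-F1 w/o neutral:",
      f1_score(y_true, y_pred, average="micro", labels=non_neutral))
```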
Table 5. Comparison of sentiment classification performance of different models on the benchmark datasets.

Model              IEMOCAP               MELD                   DailyDialog
                   Weighted-F1   Acc     Weighted-F1  Micro-F1  Macro-F1  Micro-F1
COSMIC [36]        65.28         64.25   65.21        65.13     51.05     58.48
DAG-ERC [34]       67.10         66.47   63.37        -         -         58.25
DialogueCRN [9]    66.20         67.01   58.39        58.26     -         55.46
SKAIG [49]         66.96         -       65.18        -         51.95     59.75
CauAIN [10]        64.29         63.84   65.15        64.85     53.85     58.21
ERCMC [50]         66.07         65.58   65.64        -         52.11     59.92
UniMSE [51]        70.66         70.56   65.51        65.09     -         -
InstructERC [52]   71.39         71.43   69.15        68.96     -         -
Yang et al. [53]   68.31         -       66.25        -         -         60.21
CKERC [54]         72.40         -       69.27        -         -         -
CDEA               66.92         66.44   65.73        66.59     54.29     60.44
CDEA + llama       73.26         72.25   69.34        69.61     63.25     64.59
Table 6. CDEA ablation experiment results.

                 IEMOCAP   MELD    DailyDialog
CDEA             66.92     65.73   60.44
w/o Inter-Path   65.91     65.24   59.69
w/o Intra-Path   65.96     63.53   59.33
w/o Inf-Path     65.17     65.38   59.04
Table 7. CDEA+llama ablation experiment results.

                          IEMOCAP   MELD    DailyDialog
CDEA+llama                73.26     69.34   64.59
w/o exper demonstration   70.65     67.29   63.68
w/o label paraphrasing    70.55     67.41   63.13
w/o LoRA                  70.23     63.88   63.52
Table 8. Case Study 2: The table below provides a dialogue sample in which Chandler's emotions are predicted, illustrating the impact of LLM-driven reasoning. Incorrect predictions are marked ✕, while correct predictions are marked ✓. Target sentences (i.e., Chandler's utterances that require sentiment classification) are highlighted in bold.

Contents of the Dialog
Joey: Oh, yeah, yeah, sure. We live in the building by the uh sidewalk. (neutral)
Chandler: You know it? (surprise)
Joey: Hey, look, since we are neighbors and all, what do you say we uh, get together for a drink? (neutral)
Chandler: Oh, sure, they love us over there. (neutral)
Joey: Ben! Ben! Ben! (neutral)

Model            Prediction
DialogueCRN      surprise ✕
CKERC            joy ✕
ECERN            joy ✕
ECERN + llama    neutral ✓
Table 9. Module-level time consumption analysis on DailyDialog.

Module                                       Phase                                             Average Time (ms/sentence)
Dialogue History Preprocessing               Sentiment Cause Sentence Acquisition              42
Sentiment Cause Detection (Self and Other)   Sentiment Cause Sentence Acquisition              185
Experience Example Retrieval                 Dynamic Retrieval of Experience Examples          160
BART-Based Experience Refinement             Dynamic Retrieval of Experience Examples          225
Prompt Construction                          Prompt Instruction Construction and Fine-Tuning   105
LLM Inference                                Prompt Instruction Construction and Fine-Tuning   3492
Total System Runtime                         -                                                 4209