DialogCIN: Contextual Inference Networks for Emotional Dialogue Generation

Lou, Wenzhe; Yang, Wenzhong; Wei, Fuyuan

doi:10.3390/app13158629

Open AccessArticle

DialogCIN: Contextual Inference Networks for Emotional Dialogue Generation

by

Wenzhe Lou

¹,

Wenzhong Yang

^1,2,* and

Fuyuan Wei

¹

School of Information Science and Engineering, Xinjiang University, Urumqi 830017, China

²

Xinjiang Key Laboratory of Multilingual Information Technology, Xinjiang University, Urumqi 830017, China

^*

Author to whom correspondence should be addressed.

Appl. Sci. 2023, 13(15), 8629; https://doi.org/10.3390/app13158629

Submission received: 25 May 2023 / Revised: 23 July 2023 / Accepted: 24 July 2023 / Published: 26 July 2023

(This article belongs to the Section Computing and Artificial Intelligence)

Download

Browse Figures

Versions Notes

Abstract

:

In recent years, emotional dialogue generation garnered widespread attention and made significant progress in the English-speaking domain. However, research on emotional dialogue generation in Chinese still faces two critical issues: firstly, the lack of high-quality datasets with emotional characteristics makes it difficult for models to fully utilize emotional information for emotional intervention; secondly, there is a lack of effective neural network models for extracting and integrating inherent logical information in the context to fully understand dialogues. To address these issues, this paper presented a Chinese dialogue dataset called LifeDialog, which was annotated with sentiment features. Additionally, it proposed DialogCIN, a contextual inference network that aims to understand dialogues based on a cognitive perspective. Firstly, the proposed model acquired contextual representations at both the global and speaker levels. Secondly, different levels of contextual vectors were separately inputted into the understanding unit, which consists of multiple inference modules. These modules iteratively performed reasoning and retrieval to delve into the inherent logical information of the dialogue context. Subsequently, appropriate emotions were predicted for feedback. Finally, an emotion-aware decoder was employed to generate a response. Experimental results on our manually annotated dataset, LifeDialog, demonstrated that DialogCIN can effectively simulate human cognitive inference processes, enabling a better understanding of dialogue context and improving the quality of generated dialogues.

Keywords:

emotional dialogue; response generation; multi-turn dialogues; inference module

1. Introduction

Recent studies demonstrated that equipping dialogue systems with emotional capabilities yields a substantial enhancement in user experience and customer satisfaction [1,2]. The realm of emotional dialogue systems encompasses a wide array of application scenarios, including, but not limited to, psychological counseling, intelligent customer service, virtual emotional companions, and various others. Diverging from conventional dialogue generation, emotional dialogue generation not only encompasses the logical cohesion of conversations but also embraces the expression and regulation of emotional states, thus rendering the generated responses more attuned to the emotional requirements of users. Given the multifaceted technical challenges inherent in emotional dialogue generation tasks, which necessitate the amalgamation of natural language processing, deep learning, and other technical methodologies, this research area emerged as a pivotal direction within the realm of natural language generation.

Early research on emotional dialogue generation primarily relied on template and rule-based methods [3,4], which generated emotional statements through pattern matching. However, the construction of such methods presented challenges and lacked scalability, thus limiting their practical applicability. In recent years, spurred by the continuous advancement of dialogue generation theory and technology, the paradigm of emotional dialogue generation transitioned towards end-to-end models based on deep learning. Zhou et al. [5] advanced the expression of emotions by integrating emotion categories, internal memory networks, and external memory dictionaries. Wei et al. [6] introduced a unified end-to-end neural network architecture capable of encoding both semantics and emotions in a post, thereby generating responses with appropriate emotional expression. To address the generation of emotionally aware responses in dialogues, Li et al. [7] proposed a multi-resolution interactive model that leveraged emotional information at both the dialogue-level and word-level, incorporating user feedback. Liang et al. [8] developed a heterogeneous graph neural network-based model capable of perceiving emotions from multiple sources of knowledge, including dialogue history, emotion flow, facial expressions, audio, and speaker personality. Mao et al. [9] employed a hierarchical recursive encoder-decoder framework, leveraging two enhanced self-attention encoders to capture semantic and emotional signals independently. Subsequently, these signals were fused in the decoder to generate responses. Li et al. [10] employed a dual-view conditional variational autoencoder framework, allowing for the separate extraction of emotional information and content-related information from the latent variable distribution of the response. These extracted features were then combined in the decoder to generate emotionally infused responses. Although the aforementioned research achieved impressive results in specific tasks pertaining to the generation of emotional dialogues, it often entailed static learning from dialogue contexts. However, this approach imposes limitations on the exploration of inherent semantic information within the context and constrains the ability to capture more comprehensive contextual cues. As illustrated in Figure 1, Speaker A initially expressed their emotion as “sadness”, while Speaker B inquired about the reason behind A’s sadness, as evident in Utterances u2 and u4. Additionally, Speaker B also conveyed the cause of their own sadness, as reflected in Utterances u5 and u7. At this stage, Speaker B’s emotion remains “sadness”. Subsequently, Speaker A provided suggestions to Speaker B, as seen in Utterances u8 and u10, which were derived from an understanding and inferential learning of the context. Towards the end of the conversation, Speaker A’s emotion transitioned from the initial “sadness” to “happiness”. This highlights the significant role of context-based learning and inference in generating appropriate responses.

Presently, a number of Chinese dialogue datasets emerged. Wang et al. [11] constructed a short-text dialogue dataset based on Sina Weibo, encompassing a substantial volume of dialogue instances alongside manually annotated data and candidate response pools. Wang et al. [12] curated the Chinese Short Text Conversation Corpus (LCCC) through a meticulous data-cleaning process on Weibo dialogue data. Qian et al. [13] introduced a large-scale dialogue dataset named Pchatbot, comprising two subsets sourced from Weibo and a judicial forum. This dataset provides anonymized user IDs and timestamps for each post and reply. However, it is important to note that the majority of existing Chinese dialogue datasets are primarily acquired through crawling from social media platforms such as Weibo and Zhihu. Consequently, these datasets tend to exhibit higher levels of noise, lower corpus quality, and deviate from the characteristics of real-life conversations. Moreover, these datasets lack annotations pertaining to emotional features, which significantly impedes the progress of research in the domain of generating emotionally driven dialogue in the Chinese language.

The technical challenges associated with Chinese emotional dialogue generation can be summarized as follows. Firstly, there is a concern regarding the availability of Chinese corpora. Presently, much of the dialogue generation research is based on English corpora, while existing Chinese datasets lack annotations for emotional features. This poses a significant challenge for models to effectively leverage emotional information for generating emotionally driven responses. Furthermore, the majority of available corpora are derived from social media platforms, introducing substantial noise and deviating from the characteristics of everyday conversations. Thus, constructing a high-quality Chinese emotional dialogue dataset becomes a critical endeavor. Secondly, most existing methods predominantly rely on static memory to learn dialogue context and generate responses. However, this approach often results in insufficient assimilation of the inherent logical information. In human communication, individuals typically engage in reasoning and retrieval processes within the context to deliver contextually appropriate responses. Hence, exploring dynamic reasoning and retrieval mechanisms in dialogue models becomes imperative.

To address the aforementioned research problem, this article proposes DialogCIN, a cognitive perspective-based emotional dialogue generation model designed to generate responses by comprehensively understanding the context. Firstly, we integrated dialogue information with emotional cues and employ a multi-head self-attention mechanism to obtain contextual representations at both the global and speaker levels. Secondly, we introduced a multi-turn inference module that iteratively extracts and combines the intrinsic logical cues embedded in the context. This module comprises two processes: the reasoning process and the retrieval process, which simulate the cognitive reasoning process observed in humans. The reasoning process utilizes a bidirectional long short-term memory (BiLSTM) network [14] to learn the intrinsic logical information of the dialogue context and extract contextual cues. The retrieval process leverages an attention mechanism to match relevant contextual cues through the retrieval of static global memories. Finally, utilizing the acquired contextual cue vectors, we employed an emotional classifier to ascertain the emotion label of the response and guide the response generation process. In comparison to other dialogue generation models, the proposed model outlined in this paper exhibited enhanced performance on our annotated dialogue dataset, LifeDialog.

This paper made the following main contributions:

We constructed a Chinese dialogue dataset named LifeDialog, encompassing annotations for emotional, sentence type, and topic features of the dialogues. The dataset underwent meticulous manual annotation to mitigate the influence of noise on the data and to provide a high-quality resource for Chinese dialogue generation research.
We presented DialogCIN, a contextual inference network devised to generate emotionally enriched dialogues by comprehending conversations from a cognitive standpoint, encompassing both the global and speaker levels.
We proposed an “Inference Module” that emulates the human cognitive reasoning process by iteratively performing the reasoning process on the acquired dialogue context at various levels. This iterative approach facilitates a comprehensive comprehension of the underlying logical information embedded within the dialogue.
We conducted a comparative evaluation of the proposed DialogCIN model and the baseline model using the LifeDialog dataset, thereby highlighting the advantages of the former.

2. Related Work

2.1. Open Domain Dialogue Generation

In recent years, remarkable advancements in deep learning and natural language processing technologies propelled substantial advancements in open-domain dialogue systems. Prominent industrial-level applications such as Microsoft XiaoIce, Alibaba Xiaomi, Baidu Xiaodu, and others emerged as notable examples of these advancements. Simultaneously, researchers are actively exploring methodologies to augment the quality, diversity, consistency, and personalization of open-domain dialogue systems. At present, open-domain dialogue systems can be broadly classified into two categories: retrieval-based systems and generative systems.

Retrieval-based dialogue systems operate by selecting the most appropriate response from a predefined set of candidate responses. Zhou et al. [15] introduced an approach where multi-turn question-answer statements are merged into a single column, treating the entire dialogue history as a “sentence” for sentence-level matching. They utilized gated recurrent units to extract lexical features and matched them with candidate responses. However, this approach lacks robust control over the logical relationships within the context. In addressing this limitation, Zhang et al. [16] proposed a deep dialogue integration model that employs attention mechanisms to extract crucial information from dialogues and responses, enabling the calculation of matching scores for each dialogue round. This approach primarily addresses challenges arising from noise and redundancy when directly concatenating multiple dialogue rounds as context information. Tao et al. [17] proposed a multi-representation fusion network that considers matching with context responses of multiple granular representation forms to perform multi-turn response selection, further reducing the complexity of multi-turn retrieval models.

The retrieval-based approach in open-domain dialogue systems requires a substantial amount of high-quality data to support its performance. However, it is prone to limitations such as limited response diversity and flexibility. In contrast, the generative-based approach overcomes the need for an excessively large or precise response corpus by leveraging the ability of models to generate diverse responses through learning from a language corpus. Serban et al. [18] introduced the Hierarchical Recurrent Encoder-Decoder Model (HRED), which utilizes a recurrent neural network (RNN) [19] on the encoding side to map dialogues into vector representations. To incorporate contextual information effectively, a higher-level contextual RNN is employed to iteratively track changes in the RNN information on the encoding side. Lee et al. [20] adopted a standard encoder-decoder framework and introduced a stochastic connection layer between the encoder and decoder. This layer randomly selects feature vectors from the encoder’s output as input to the decoder, enabling the generation of dialogue responses. Shen et al. [21] employed a hierarchical self-attention mechanism and a distant supervision approach. These techniques are utilized to select relevant words and sentences from the dialogue history, guiding the decoder’s generation process from a global perspective. Zhang et al. [22] proposed a multi-round dialogue generation model called ReCoSa, which is based on a self-attentive mechanism. This model detects the contexts associated with responses and generates appropriate responses based on these contexts, enhancing the quality of the generated dialogues.

2.2. Emotional Dialog Generation

In recent years, research highlighted the significance of incorporating emotional factors in the development of human-like conversation generation models [1,2]. The existing approaches to emotional dialogue generation can be broadly classified into two categories. The first category involves controllable emotion-based dialogue systems that rely on user-input emotions. Zhou et al. [5] first proposed an Emotional Chatting Machine (ECM) based on the Sequence to Sequence (Seq2Seq) model [23], which uses emotion category embeddings to simulate emotional expressions, encodes emotional vectors, and improves the decoder to generate emotional statements. Shen et al. [24] proposed a new framework called Curriculum Dual Learning (CDL), which extends controllable emotion-based reply generation to a dual-task approach, alternating between generating emotional replies and emotional queries. Song et al. [25] explicitly or implicitly integrated specific emotions into coherent phrases and proposed a semi-supervised approach to creating an emotion lexicon to generate coherent and meaningful replies. These approaches aim to improve emotional expression in dialogue generation by leveraging specific emotion vectors and enhancing the overall emotional quality of the generated responses. However, they rely on user-input emotions or require additional emotion input, which may restrict their practical application. The other type of emotional dialogue generation involves models that autonomously learn emotional states from the dialogue in order to generate emotional responses. Li et al. [7] addressed this by considering both sentence-level and word-level granularity to capture subtle emotional differences more accurately. They introduced an adversarial learning framework that facilitates the generation of appropriate responses. Liang et al. [8] proposed a heterogeneous graph-based framework that leverages multi-source knowledge, including dialogue history, emotion flow, facial expressions, and speaker personality. This framework constructs a heterogeneous graph based on the dialogue content and employs an encoder to learn the graph’s representation. By doing so, it comprehensively understands the dialogue content, perceives emotions, predicts appropriate emotional states, and ultimately generates coherent emotional replies.

2.3. Datasets for Dailogue Generation

Currently, commonly used datasets in the field of dialogue generation include Ubuntu [26], DailyDialog [27], Open subtitles [28], Weibo [11], and others. Datasets are a critical component of dialogue generation tasks, directly affecting the model’s performance during training and testing, and have a significant impact on the model’s adaptability and effectiveness in solving real-world problems. To advance research progress in Chinese dialogue generation, some researchers took the initiative to construct their own datasets. For instance, Wang et al. [12] developed the large-scale Chinese short-text dialogue dataset LCCC, providing both base and large versions. Qian et al. [13] collected Chinese dialogue data from Weibo, including valuable personal information such as timestamps and user IDs. However, current Chinese dialogue generation datasets either suffer from low-quality issues due to data crawling from social media platforms such as Weibo and Douban, or lack emotional features. In contrast, our dataset was sourced from real dialogues and constructed using reliable manual annotations, adding richer content to the Chinese dialogue generation corpus and is suitable for training and evaluating Chinese emotional dialogue generation tasks.

3. Proposed Method

This paper presents DialogCIN, a cognitive-based model for generating emotionally intelligent responses in dialogues. The core concept of this model revolves around employing multi-head self-attention to acquire contextual representations at both the global and speaker levels. Subsequently, by incorporating an understand unit consisting of multiple inference modules, the model progressively conducts reasoning and retrieval operations, emulating human cognitive processes and comprehensively grasping the conversational context. Moreover, an emotion predictor is employed to determine the emotion label associated with the generated response. Finally, the response is generated using a gate recurrent unit (GRU) [29] as the decoder, guided by the predicted emotion label, thereby facilitating effective communication with human counterparts. The overall framework of the model is shown in Figure 2.

3.1. Task Formulation

Given a tuple

〈U, E〉

as input, where

U = \{u_{1}, u_{2}, \dots, u_{N}\}

is the dialogue context of the previous

N

turns between the two speakers, and

E = \{e_{1}, e_{2}, \dots, e_{N}\}

is the emotion sequence of the two speakers’ utterances, the task is to generate a target response

Y = \{y_{1}, y_{2}, \dots, y_{k}\}

for the (N + 1)th turn (i.e.,

u_{N + 1}

), where

k

represents the number of tokens in the response utterance, and the response is consistent with the dialogue context and appropriate emotion. The dialogue history

U

consists of N utterances, where the i-th utterance

u_{i} = \{x_{i, 1}, x_{i, 2}, \dots, x_{i, M}\}

is a sequence of M tokens. Thus, the task of emotion-driven dialogue generation is to compute the probability

P (Y | U, E)

of generating a response

Y

based on the given conversation context

U

and the emotion sequence

E

.

3.2. Inference-Based Encoder

3.2.1. Utterance Features

This layer serves as the data input layer, responsible for converting the input text data into numerical vector representations. In order to achieve this, we begin by tokenizing each round of dialogue, represented by the utterance

u_{i}

, using the jieba segmentation tool. Subsequently, an embedding process is applied to obtain the vector representation

k_{i}

for each utterance. Simultaneously, for each emotion label

e_{i}

associated with each round, its vector representation

l_{i}

is derived through One-Hot encoding. To obtain a comprehensive representation, the utterance vector is fused with the emotion vector, resulting in

d_{i} = [k_{i}; l_{i}]

, where

[.; .]

signifies the concatenation operation. To encode the dialogue turn feature vector

d_{i}

, we utilize Bi-GRU, which captures the contextual information from both directions. The last hidden state of the Bi-GRU is extracted as the representation

h_{i}

for the respective utterance. Ultimately, we obtain the vector representations

\{h_{1}, h_{2}, \dots, h_{N}\}

corresponding to the N dialogue turns.

3.2.2. Representation Unit

Multi-Head Self-Attention (MHSA) [30] can capture important information in a sequence by aggregating features from other utterances in the dialogue history to generate contextual representations for each utterance. We also use Position Embedding (PE) to distinguish between different positions of utterances. Specifically, we concatenate them with the utterance representations as

H_{u} = {[h_{1}; P E_{1}], [h_{2}; P E_{2}], \dots, [h_{N}; P E_{N}]}

. Subsequently, we obtain the global-level contextual representation

X_{u}

, as follows:

X_{u} = M H S A^{G} (H_{u}, H_{u}, H_{u})

(1)

To acquire the contextual representation at the speaker level, we initiate the process by obtaining vector representations of all utterances from each speaker, denoted as

\{h_{s}, h_{s + 2}, \dots, h_{d}\}

, where s takes the values 1 or 2, and d corresponds to N or N − 1. Following this, we apply position embedding to each vector representation, resulting in

H_{A}

and

H_{B}

. Ultimately, we employ diverse multi-head self-attentions to capture the self-correlations among utterances originating from the same speaker, as depicted below:

X_{A} = M H S A^{A} (H_{A}, H_{A}, H_{A})

(2)

X_{B} = M H S A^{B} (H_{B}, H_{B}, H_{B})

(3)

3.2.3. Understand Unit

To enable our model to better understand the conversational context, we designed understand units that mimic the human brain’s understanding process. In the understand unit, we employed multi-round inference modules to progressively mine and integrate contextual information from the dialogue. The architecture of the inference module is illustrated in Figure 3.

The inference module encompasses two primary processes: reasoning and retrieval. Within the t-th round, the reasoning process is executed, involving the utilization of a BiLSTM network. This network enables the model to grasp the internal logical order and consolidate contextual cues, thus facilitating the extraction of pertinent features from the sequence. By leveraging the capabilities of the BiLSTM network, the model becomes adept at capturing intricate sequence structures and dependencies, thereby emulating the cognitive reasoning process observed in human cognition.

{\tilde{q}}^{(t - 1)} = \overset{\leftrightarrow}{L S T M} (q^{(t - 1)})

(4)

where

{\tilde{q}}^{(t - 1)}

is the output vector.

q^{(t - 1)}

is initialized with the global-level contextual representation

X_{u}

, that is:

q^{(0)} = W X_{u} + b

, where

W

and

b

are learnable parameters. In the continuous working process of the reasoning, we can learn the inherent logical order between utterances, which is similar to the conscious thinking process of humans. t indicates the number of “processing steps” performed to compute the final vector representation, i.e., the number of Inference Modules.

Within the t-th round, we employed an attention mechanism to facilitate the matching of the contextual representations of global-level utterances. This approach emulates the human retrieval process, enabling the model to effectively concentrate on salient features. By leveraging the attention mechanism, the model acquires the capability to precisely discern and emphasize crucial elements, thereby enhancing its discernment and decision-making prowess, particularly in pivotal segments. The detailed calculation is as follows:

e^{(t - 1)} = f (q^{(t - 1)}, {\tilde{q}}^{(t - 1)}),

(5)

α^{(t - 1)} = S o f t m a x (e^{(t - 1)}),

(6)

r^{(t - 1)} = f (q^{(t - 1)}, α^{(t - 1)})

(7)

where

f

is the function that computes the matrix multiplication from

q^{(t - 1)}

and

{\tilde{q}}^{(t - 1)}

.

Next, we fused the output

{\tilde{q}}^{(t - 1)}

of the reasoning process with the output

r^{(t - 1)}

of the retrieval process to obtain the vector

q^{(t)}

, which is then input to the next Inference Module to continue the process of reasoning and retrieval:

q^{(t)} = L a y e r N o r m (α ⊙ {\tilde{q}}^{(t - 1)} + (1 - α) ⊙ r^{(t - 1)})

(8)

where

α

is the weighting factor and

⊙

is the matrix number product. In summary, given the global contextual representation

X_{u}

and the number of inference rounds

T

, the entire Understand Unit can be represented as:

q^{G} = U n d e r s t a n d^{G} (X_{u}, T)

(9)

In the Understand Unit, we designed two levels of cognitive processes, namely the global-level cognition and the speaker-level cognition, which provide different perspectives to understand the dialogue context. Therefore, the output of the speaker-level cognition can be represented as:

q^{A} = U n d e r s t a n d^{A} (X_{A}, T)

(10)

q^{B} = U n d e r s t a n d^{B} (X_{B}, T)

(11)

Finally, the representation

X

of the utterance after passing through the Understand Unit is the concatenation of the aforementioned output vectors, as shown below:

X = [q^{G}; q^{A}; q^{B}]

(12)

3.2.4. Emotion Predictor

After comprehensively understanding the dialogue context in the Understand Unit, we transformed the linear transformation of the utterance representation

X

into a fixed-size vector using a Maxpooling layer. Subsequently, we employed a Softmax layer to predict the appropriate emotion:

H^{m a x} = M a x P o o l i n g (F N N (X))

(13)

P = S o f t m a x (W^{p} H^{m a x})

(14)

where

W^{p}

is the trainable parameter, FNN is a feedforward neural network layer.

3.3. Emotion-Aware Decoder

To generate the t-th word

y_{t}

, we first used the previous t − 1 generated tokens

y_{1 : t - 1}

as input to obtain a representation with multi-head self-attention (future masked):

H^{r} = M H S A (R, R, R)

(15)

where

R

is the embedding vector of generated tokens for the target response

Y

, i.e.,

y_{1 : t - 1}

.

We employed another multi-head attention to attend to the historical representation

H^{r}

as the query, with

X

as the keys and values, and then output the representation

O

through a FNN layer:

O = F N N (M u l t i H e a d (H^{r}, X, X))

(16)

Then, in order to effectively incorporate the predicted emotion into the generation process, we performed feature fusion:

O^{e s} = [O; E]

(17)

where

E

is obtained from

P

by linear transformation.

We adopted the GRU model as the decoder to generate the response, and the decoding process of the decoder is represented by the following equation:

z_{t} = G R U (O_{t - 1}^{e s}, z_{t - 1})

(18)

where

z_{t}

is the hidden state of the GRU at moment t.

Finally, we used a Softmax layer to obtain word probabilities by combining the emotion-aware representation

O^{e s}

with the decoder’s hidden state representation

z_{t}

at time step t, which can be expressed as follows:

P = S o f t m a x (W^{s} [O^{e s}; z_{t}])

(19)

where

W^{s}

is the trainable parameter. The log-likelihood of the response sequence

Y = \{y_{1}, y_{2}, \dots, y_{k}\}

is given by:

P (Y | U, E) = \prod_{t} P (y_{k} | y_{1 : t - 1}; U, E)

(20)

3.4. Training Objective

Our model can be trained end-to-end, and the overall training objective

J

includes the response generation loss

L_{M L L}

and the emotion classification loss

L_{C L S}

, which are formulated as follows:

J = (1 - λ) L_{M L L} + λ L_{C L S}

(21)

L_{M L L} = \min (- \sum_{t = 0}^{J} \log (y_{t}))

(22)

L_{C L S} = \min (- \log (P [e_{N + 1}]))

(23)

where

λ

is the balance coefficient between generation loss and classification loss, and

e_{N + 1}

is the target sentiment label.

4. Proposed Dataset

4.1. Data Collection

To construct a high-quality conversational corpus, we curated dialogue content from various Chinese language learning websites. The resulting dataset possesses distinct characteristics that contribute to its value and effectiveness. Firstly, it closely resembles real-life dialogue scenarios, offering a faithful reflection of people’s emotional expressions and conversational patterns in everyday communication. Consequently, it enhances the model’s ability to capture and express emotions accurately. Secondly, this dataset mitigates the presence of noise and language style biases commonly found in social media data, such as typos, abbreviations, and internet slang, which can adversely affect model training and performance. The dialogue data are sourced exclusively from diverse real-life contexts, encompassing workplace, travel, family, campus settings, and more. In light of these characteristics, we gave this dataset the name “LifeDialog”. We carefully selected dialogues that align with the task requirements and performed thorough manual annotations for the emotional and sentence type features of each sentence in the dialogues, as well as the topic features of each dialogue. This meticulous annotation process ensured the high quality of the dataset. In total, we curated 10,045 dialogues, all consisting of exchanges between two speakers. Each dialogue comprises a minimum of two utterances. To facilitate effective model training and evaluation, we partitioned the data into training, validation, and testing sets, following an 8:1:1 ratio. Table 1 provides detailed statistical information regarding the dataset.

4.2. Data Processing

In this study, we conducted data cleaning on the collected raw dialogue corpus. Specifically, we applied filters to remove text segments that exceeded 512 tokens in length. To eliminate dialogues containing offensive language, we employed a method that involved creating a profanity word library and combining it with regular expression matching and manual review for the purpose of cleaning. Additionally, we utilized regular expression matching to eliminate special symbols, such as “￥,” “@,” “$,” and others, from the dialogues. This was carried out to facilitate subsequent data analysis and processing. Furthermore, considering the research objectives and dataset characteristics, we selectively removed dialogue texts involving three or more speakers. This choice was guided by the observation that two-party dialogues are more prevalent and easier to model, while dialogues with three or more speakers may introduce additional complexities in terms of interactions and language dynamics that differ from two-party conversations. By maintaining consistency and control within the dataset, we ensured a more focused and coherent analysis. Therefore, we retained data exclusively from two-party dialogues in order to maintain the desired research scope and objectives.

We conducted statistical analysis on the dialogues to gather insights into their structure and content. Specifically, we calculated the average number of utterances and tokens in the dialogues and present the detailed results in Table 2. In this context, an “utterance” refers to an individual linguistic unit within a dialogue, typically representing a single speaker’s turn, be it a user or a system’s utterance. Subsequently, we manually annotated the data based on the emotional, syntactic, and thematic features of the dialogues. To guarantee the reliability and accuracy of the annotated data, a meticulous labeling process was implemented, involving two annotators and one reviewer, all with backgrounds in natural language-processing research and equipped with professional annotation training and guidance. The annotators were entrusted with the task of labeling the data, adhering to predetermined standards and guidelines that included the use of provided feature labels. Simultaneously, the reviewer assumed the responsibility of assessing the quality of the labeled data. This labeling process followed an independent approach, aiming to minimize subjective bias and promote consistency across annotations. In instances where discrepancies emerged between the annotators, discussions were facilitated between the annotators and the reviewer, ultimately leading to a collaborative decision. By employing this rigorous annotation workflow, we ensured the high quality and credibility of our dataset, establishing a solid foundation for subsequent research and analysis in the field of natural language processing.

Furthermore, we conducted an analysis of the number of utterances in each dialogue within the LifeDialog dataset, as depicted in Figure 4. It was observed that the dialogues in the dataset tend to conclude after a reasonable number of dialogue turns, with an average of approximately seven turns per dialogue. This characteristic makes the dataset well suited for training dialogue models. By examining the number of utterances in each dialogue, valuable insights can be gained regarding the length and complexity of the conversations. This information plays a vital role in enabling the model to handle interactions of varying lengths and hierarchies. Some dialogues may exhibit a relatively simplistic structure, consisting mainly of direct question–answer or simple instruction exchanges. On the other hand, certain dialogues may entail greater intricacy, demanding in-depth reasoning, incorporation of background knowledge, and contextual comprehension. Consequently, the diversity of utterance counts in the dialogue corpus contributes to the system’s robustness and its ability to generalize effectively across different conversational scenarios. The dataset’s inclusion of dialogues with varying utterance counts ensures that the model acquires the necessary adaptability to address a wide range of conversational complexities and ultimately enhances its performance in real-world applications.

The emotional labels in the dataset were defined based on the “Big Six Theory” [31], which posits that there are six primary emotions in humans, namely, anger, disgust, fear, happiness, sadness, surprise. However, during the annotation process, it was recognized that the existing emotion labels needed to be expanded to encompass a broader spectrum of emotions. Consequently, the label ‘Neutral’ was introduced to account for instances of emotional neutrality or the absence of expressed emotions. As a result, the LifeDialog dataset now comprises a comprehensive set of seven emotion labels: {Anger, Disgust, Fear, Happiness, Sadness, Surprise, Neutral}. The distribution of emotions within the LifeDialog dataset is depicted in Figure 5. The emotion label “Happiness” exhibited the highest count, while the counts of the “Sadness” and “Anger” labels were relatively balanced. In contrast, the counts of the “Surprise,” “Disgust,” and “Fear” labels were lower. These observations provide several insights: happiness is a commonly expressed emotion in everyday conversations, and individuals tend to convey negative emotions such as sadness and anger within their dialogues. Conversely, the occurrence of surprise, disgust, and fear emotions is less frequent in daily conversations, potentially indicating their association with specific contexts. Additionally, in public dialogues, individuals often exhibit neutral or mild emotions influenced by social norms. The presence of multiple emotion labels reflects the diverse range of emotional expressions and variations in the dialogue, thereby facilitating the dialogue system in better comprehending user emotions and generating accurate emotional responses.

During communication, conversational behavior often reflects the underlying intentions individuals aim to achieve. In light of the various types of Chinese sentences, we categorized the sentence class labels into four distinct categories: declarative sentences, interrogative sentences, imperative sentences, and exclamatory sentences. Declarative sentences primarily convey information and describe situations, including both positive and negative scenarios. Interrogative sentences are primarily used to inquire about situations. Imperative sentences are mainly employed to make requests or suggest that others take certain actions or refrain from doing so. Exclamatory sentences can express the speaker’s opinions or convey intense emotions. The distribution of sentence types within the LifeDialog dataset is illustrated in Figure 6. By examining this distribution, we gained insights into the frequency and relative importance of different sentence types in dialogues. This analysis aids in understanding the varying degrees of emphasis placed on different sentence types during conversations, shedding light on their respective roles within the dialogue context.

The LifeDialog dataset is an extensive collection of real-life conversations, encompassing a diverse range of daily scenarios. These scenarios include casual chats among family members in domestic settings, question-and-answer sessions between teachers and students in educational environments, and more. To ensure organization and clarity, we categorized all conversation topics into eight distinct categories. Figure 7 presents a visual representation of the distribution of topics across the training, development, and testing sets of the dataset. Notably, the highest number of instances within the dataset can be found in the daily life category (2485), interpersonal relationships category (1505), and workplace category (1461). These figures align closely with the distribution of communication needs observed in real-life situations. Specifically, the daily life category encompasses informal conversations among family members or friends, the interpersonal relationships category involves communication during social activities, and the workplace category involves work-related communication. The collection and distribution of this data will provide valuable resources and support for natural language processing tasks.

4.3. Data Features

In this section, we conducted a thorough analysis of the LifeDialog dataset and summarized several notable characteristics of the dataset, including:

We enriched the corpus with a diverse set of features, thereby enhancing the semantic representation of the dialogues. These features encompass emotional attributes of utterances, sentence type characteristics of utterances, and topic-related attributes of dialogues.
We manually annotated seven different and common emotions, such as happiness, sadness, and anger, in the LifeDialog dataset to ensure the quality of the annotations, and provided sufficient data support for other related tasks, such as emotion recognition in dialogues.
The LifeDialog dataset encompassed dialogues from a diverse range of topics, including, but not limited to, daily life, workplace, travel, and more. This broad coverage of topics distinguished it from domain-specific dialogue datasets that focused on particular domains like medical or legal conversations. As a result, the LifeDialog dataset exhibited a high level of versatility and applicability across a wide range of domains and applications in the field of natural language processing.
In contrast to datasets like the LCCC dataset and other Chinese dialogue datasets that are primarily derived from posts and replies on social media platforms, the LifeDialog dataset is exclusively composed of manually written and carefully selected data. This meticulous data curation process guarantees that our dataset maintains higher grammatical standards and more accurately captures the characteristics of real-life conversations.

5. Experiments

5.1. Experimental Setting

The experimental data employed in this study comprise the manually annotated dialogue dataset, LifeDialog. The word-embedding dimension and hidden-layer dimension in the experiment were both set to 512, the batch size was set to 64, and Adam [32] was used as the optimizer. The initial learning rate of the model was set to 1 × 10⁻⁴ with dynamic learning rate decay, the dropout rate was set to 0.3, the attention head number was set to 8, and the number of Inference Modules in the Understand Unit was set to 2. Based on experimental verification, we set the weight coefficient λ to 0.3. The experimental environment and parameter settings for the comparison model were identical to those of DialogCIN.

5.2. Comparison Experiment and Result Analysis

To make comparisons, we employed the following models as baselines:

(1) HRED: The Hierarchical Recurrent Encoder-Decoder Model: this model divides the dialogue into multiple hierarchical representations using a hierarchical design and models them using an RNN. By introducing a context encoder and a generator, the model can capture the semantic information and contextual dependencies of the dialogue, resulting in the generation of dialogue responses.

(2) WSeq [33]: This model initially computes the cosine similarity between the vector representations of the conversation context and the current message, employing these values as weights to effectively manage the significance of response generation. Subsequently, the conversation context representation is integrated using this weighting scheme. Lastly, a hierarchical sequence-to-sequence model is employed to generate a response.

(3) DSHRED [34]: A method is proposed herein for generating context-aware dialogue responses, employing a combination of static and dynamic attention mechanisms. The static attention mechanism assigns weights to individual sentences within the dialogue history, whereas the dynamic attention mechanism assigns weights to individual words within each sentence.

(4) ReCoSa: The core idea of this model is to automatically learn the correlations between contexts using a self-attention mechanism. It utilizes the encoded representations from the dialogue history and computes weights for each context representation through self-attention. This allows the model to focus on relevant contextual information and generate responses by understanding the semantics and context of the conversation.

5.3. Evaluation Metrics

In our experimental evaluation, we employed both automated and human assessments to validate the efficacy of our model. Automatic evaluation metrics, namely Perplexity (PPL), BLEU [35], and Distinct [36], were utilized to compare the performance of our proposed model against the baseline model. Furthermore, human evaluations were conducted to gauge the semantic coherence and emotional appropriateness of both the baseline and proposed models. Through a comprehensive analysis of these evaluation metrics, we can discern the strengths and weaknesses of our proposed model in comparison to the baseline model.

5.3.1. Automatic Metrics

(1) Perplexity (PPL) is a metric utilized in evaluating the predictive capability of a language model, providing insights into the level of uncertainty exhibited by the model when generating sentences from a designated test set. PPL values range from 1 and have no defined upper limit. A lower perplexity value indicates a superior alignment of the language model with the test set, suggesting that the model can more precisely estimate the probability of the next word given the contextual history. Conversely, a higher perplexity value signifies a weaker alignment of the language model with the test set, indicating a diminished ability to accurately predict subsequent words. The mathematical definition of PPL is as follows:

P P L = \sqrt[N]{\prod_{i = 1}^{N} \frac{1}{p (w_{i} ∣ w_{1} w_{2} \dots w_{i - 1})}}

(24)

where

p (w_{i})

is the probability of generating the i-th word.

(2) The BLEU metric is used to measure the similarity between generated outputs and reference answers. It calculates a score by comparing the matching of n-grams (contiguous sequences of words or characters) in the generated output and the reference answer. The score typically ranges between 0 and 1, where a higher value indicates a closer quality match between the generated output and the reference answer. The mathematical definition of BLEU is as follows:

B P = \{\begin{matrix} 1 & i f c > r \\ e^{1 - \frac{r}{c}} & i f c \leq r \end{matrix}

(25)

B L E U = B P \cdot e x p (\sum_{n = 1}^{N} w_{n} l o g P_{n})

(26)

where

P_{n}

denotes the accuracy of the N-gram on the dataset,

w_{n}

denotes the weights of different N-grams,

B P

is the over-shortening penalty factor, and N is the length of the N-gram, and BLEU-4 is used for the experiments in this paper.

(3) Distinct-1 and Distinct-2 metrics serve as measures to assess the diversity present in generated text. Distinct-1 evaluates the ratio of distinct individual words or phrases found within the generated text, while Distinct-2 measures the ratio of unique bigrams, which are adjacent pairs of words or phrases. The range of the Distinct metric typically spans from 0 to 1, with a higher value closer to 1 indicating a greater abundance of diverse and unique words or bigrams within the generated text. Consequently, such high values signify a more varied and distinct set of responses produced by the model. The mathematical definition of the Distinct metric is provided as follows:

D i s t i n c t - 1 = \frac{c o u n t (d i s t i n c t_{w_{i} \in R} (w_{i}))}{c o u n t (a l l_{w_{i} \in R} (w_{i}))}

(27)

D i s t i n c t - 2 = \frac{c o u n t (d i s t i n c t_{w_{i} w_{i + 1} \in R} (w_{i} w_{i + 1}))}{c o u n t (a l l_{w_{i} w_{i + 1} \in R} (w_{i} w_{i + 1}))}

(28)

where

R

represents all the generated results on the test set,

d i s t i n c t ()

indicates the removal of all repetitions,

a l l ()

represents all the results, and

c o u n t

represents the number of statistics.

5.3.2. Human Evaluation

To conduct human evaluation, we enlisted the participation of three evaluators who were tasked with assessing the semantic coherence and emotional appropriateness of 100 randomly selected contexts and their corresponding generated replies. The evaluation of semantic coherence primarily scrutinized the logicality of the responses, the consistency of information presented, and the fluency in transitioning between sentences. Evaluators evaluated whether the response effectively addressed the input dialogue and seamlessly connected with the preceding conversation. The assessment of emotional appropriateness centered around determining whether the emotional expression in the generated response aligned with the dialogue context. Evaluators examined whether the expressed emotions in the response harmonized with the background of the conversation and the intended emotions of the user. Evaluators assigned ratings to the evaluation results using a 5-level scale, where each level corresponded to different scores: perfectly appropriate (5), basically appropriate (4), neutral (3), inappropriate (2), and completely inappropriate (1). Finally, we computed the average score across all samples for each model, providing a comprehensive reflection of the results obtained through human evaluation.

5.4. Results and Analysis

5.4.1. Comparison Experiment and Result Analysis

Based on the experimental settings in Section 5.1, we trained our proposed DialogCIN model, which is a context inference network for dialogue generation, and compared it with other classic dialogue generation models under the same experimental conditions. Through the analysis of the experimental results, we drew valuable conclusions.

Table 3 presents the results of different models for dialogue generation on the LifeDialog test set. It can be observed that DialogCIN outperformed other models in all four evaluation metrics. HRED is a hierarchical recurrent encoder-decoder model, and WSeq is a sequence-to-sequence model. Overall, their performance was relatively poor. Compared to the DSHRED model, the DialogCIN model showed improvements of 7.5% in PPL, 4.0% in BLEU, 4.2% in Distinct-1, and 11.5% in Distinct-2. This indicates that our model had more comprehensive and in-depth modeling of dialogue history, allowing it to capture long-term dependencies and complex contextual information in conversations. Compared to the ReCoSa model, the DialogCIN model showed improvements in all four evaluation metrics: PPL, BLEU, Distinct-1, and Distinct-2. Particularly, it achieved significant enhancements in the Distinct-1 and Distinct-2 metrics. This was attributed to our proposed Inference Module, which helped the model to better comprehend the context of the dialogue and incorporated contextual information into the generated responses. Based on the experimental findings presented in Table 3, it is evident that DialogCIN exhibited a comparatively modest enhancement in the BLEU metric when juxtaposed with the baseline models. This outcome can be ascribed to the inherent openness and diversity prevalent in the dialogue scenarios of this particular task, consequently giving rise to substantial linguistic and grammatical disparities between the reference answers and the generated responses. Such disparities impose limitations on the effectiveness of the BLEU metric. However, notable advancements were observed in the Distinct metrics, indicating that the DialogCIN model excels in circumventing redundant content in its responses, comprehensively grasping the underlying logical information conveyed within the given context, and generating responses that are both fluent and varied. These conclusions were further substantiated through subsequent human evaluation.

5.4.2. Human Evaluation Results

We compared the differences between DialogCIN and the baseline model in terms of semantic coherence and emotional appropriateness, and the results are presented in Table 4. The results indicate that the DialogCIN model achieved the highest scores in both of these metrics. This suggests that, compared to the baseline model, the DialogCIN model generated responses that were more consistent with the semantic and emotional context, particularly with a more prominent performance in the Semantic Coherence metric. This demonstrates that our DialogCIN model was able to better understand the semantic context of the conversation and generate appropriate responses, outperforming the baseline model in this regard.

5.4.3. Ablation Experiment and Result Analysis

To investigate the impact of different modules in DialogCIN on its performance, we conducted several ablation studies on the LifeDialog dataset. In these studies, we removed different modules that modeled the global and speaker-level information in the Representation Unit and Understand Unit. We found that the model’s performance decreased to varying degrees when we ablated the Representation Unit and Understand Unit separately.

Effect of Understand Unit

When we removed all the Understand Units at the Context and Speaker levels, as shown in Part 3 of Table 5, DialogCIN’s PPL, BLEU, Distinct-1, and Distinct-2 metrics on the LifeDialog dataset decreased by 3.8%, 12.5%, 8.1%, and 8.3%, respectively. This demonstrates the effectiveness of the designed Understand Units, which can iteratively perform the reasoning and retrieval process to improve the model’s ability to understand context. In addition, as shown in Part 2, when we separately removed the Understand Units at the Context and Speaker levels, DialogCIN’s performance experienced varying degrees of decline, indicating that both the global-level and speaker-level Understand Units play a positive role in the model.

Effect of Representation Unit

As demonstrated in the final row of Table 5, a noteworthy decline in DialogCIN’s performance was observed upon removing the Representation Unit. This outcome serves as a compelling indication of the pivotal role played by the extraction of correlations among utterances in acquiring an effective representation of the said utterances. Such correlation extraction stands as a crucial requirement for achieving optimal performance within the DialogCIN model.

Effect of Different Level

Upon individually removing the global-level and speaker-level context within both the Representation Unit and the Understand Unit, we observed distinct degrees of performance degradation within the model. This observation suggests that both the global-level and speaker-level context exert a positive influence on both the Representation Unit and the Understand Unit. Notably, our findings revealed that the impact of the global-level context surpassed that of the speaker-level context, regardless of whether it was within the Representation Unit or the Understand Unit. This disparity can be attributed to the fact that the global-level context inherently encapsulates a richer set of logical information, whereas the speaker-level context predominantly comprises logical information pertaining to the utterances of a specific speaker. Consequently, the speaker-level context possesses limited potential in enhancing the model’s contextual understanding and generation effectiveness.

5.4.4. Parameter Experiment and Result Analysis

To further analyze the effectiveness of the Inference Module in the Understand Unit under different numbers of settings, we tested the impact of the Inference Module on model performance for different numbers of settings. The experimental results are shown in Table 6.

Through our investigations, we discovered that the model attained its optimal overall performance when the number of settings in the Inference Module was set to 2. Notably, as the number of settings in the Inference Module increased, we observed a decrease in the model’s overall performance, which was particularly evident in the Distinct-1 and Distinct-2 metrics. The observed decline in these metrics indicates a higher occurrence of text repetition and redundancy within the generated output, which may suggest a reduced generalization ability of the model. We attributed this phenomenon to the fact that a limited number of Inference Modules failed to fully extract the intrinsic logical information embedded within the context. Conversely, an excessive number of Inference Modules can lead to the loss of contextual information or the acquisition of irrelevant details during the reasoning and retrieval process. This behavior resembles the human tendency to forget earlier information during dialogues. Therefore, by setting the number of settings for the Inference Module to 2, the DialogCIN model effectively extracted the inherent logical information within the context without suffering from information loss or learning superfluous details, thereby yielding an optimal model performance.

5.4.5. Case Study

In order to provide additional evidence supporting the efficacy of our approach, we performed a detailed instance analysis on three selected examples extracted from the LifeDialog dataset employed in our experimental evaluation. The results of this analysis are presented in Table 7.

Case 1

In this instance, the conversation topic was a normal everyday conversation without obvious emotional tones. From the generated results, the response produced by WSeq can be considered a safe response without substantive meaning, while the response generated by HRED hardly established any connection with the conversation context. Therefore, only the responses generated by DSHRED, ReCoSa, and DialogCIN establish reasonable semantic connections with the conversation context. In terms of the emotional aspect of the response, the responses generated by DSHRED and ReCoSa did not contain any emotional bias, while only DialogCIN’s generated responses aligned with the semantic and emotional context of the conversation. This observation substantiates the notion that our approach possesses a profound comprehension of the conversational context, facilitating the generation of emotionally nuanced responses.

Case 2

In this instance, it is evident that, apart from WSeq and HRED, the responses generated by the remaining methods aligned closely with the semantic context. Specifically, the response generated by DSHRED exhibited a lack of emotional bias, whereas ReCoSa conveyed emotional resonance. Nevertheless, in comparison to ReCoSa, we posited that the response generated by DialogCIN was more fitting for the given context.

Case 3

In this instance, it was discernible that Speaker B exhibited negative emotions. Upon careful examination, it became evident that, with the exception of WSeq, the responses generated by the other methods aptly corresponded to the contextual cues. However, the response generated by HRED failed to elicit a positive impact on the emotional state of the speaker. In contrast, although the responses generated by DSHRED and ReCoSa were able to partially assuage the speaker’s emotions, their emotional expressions remained relatively superficial. Consequently, we contend that the response generated by DialogCIN was more fitting and possessed the capacity to effectively alleviate Speaker B’s negative emotions, thereby carrying significant practical implications.

6. Conclusions

This paper presented the construction of a meticulously crafted Chinese dialogue dataset named LifeDialog, which encompassed diverse attributes such as emotions and topics. Furthermore, we proposed DialogCIN, an emotional dialogue generation model that leverages human cognitive patterns to perceive emotions within dialogues and generate contextually appropriate responses. Drawing inspiration from the cognitive processes observed in human dialogues, we devised a representation unit and an understand unit. The representation unit enabled the acquisition of a comprehensive dialogue representation at both global and speaker levels. Within the understand unit, we designed an Inference Module that iteratively executed reasoning and retrieval procedures, facilitating a profound comprehension of the dialogue context and enabling the generation of coherent responses. Our empirical findings validated the efficacy of the DialogCIN model, which garnered favorable feedback pertaining to both semantic coherence and emotional aptness. In future endeavors, we plan to integrate external knowledge graphs, such as C3KG and KdConv, into the existing model, as we firmly believe that this incorporation will significantly enhance the dialogue generation capabilities of our model.

Author Contributions

Conceptualization, W.L. and W.Y.; methodology, W.L. and F.W.; validation, W.L.; data curation, W.L., F.W.; writing—original draft preparation, W.L.; writing—review and editing, W.L., W.Y. and F.W.; supervision, W.L. and W.Y.; funding acquisition, W.Y. All authors have read and agreed to the published version of the manuscript.

Funding

This research was funded by the Natural Science Foundation of China, grant number 202204120017, the Autonomous Region Science and Technology Program, grant number 2022B01008-2, and Autonomous Region Science and Technology Program, grant number 2020A02001-1.

Institutional Review Board Statement

Not applicable.

Informed Consent Statement

Not applicable.

Data Availability Statement

The data presented in this study are available on request from the corresponding author. The data are not publicly available due to copyright.

Conflicts of Interest

The authors declare no conflict of interest.

References

Prendinger, H.; Ishizuka, M. The Empathic Companion: A Character-Based Interface that Addresses Users’ Affective states. Appl. Artif. Intell. 2005, 19, 267–285. [Google Scholar] [CrossRef] [Green Version]
Partala, T.; Surakka, V. The effects of affective interventions in human–computer interaction. Interact. Comput. 2004, 16, 295–309. [Google Scholar] [CrossRef]
Keshtkar, F.; Inkpen, D. A pattern-based model for generating text to express emotion. In Proceedings of the Affective Computing and Intelligent Interaction: Fourth International Conference, ACII 2011, Memphis, TN, USA, 9–12 October 2011; Springer: Berlin/Heidelberg, Germany, 2011; pp. 11–21. [Google Scholar]
Skowron, M. Affect listeners: Acquisition of affective states by means of conversational systems. In Revised Selected Papers, Proceedings of the Development of Multimodal Interfaces: Active Listening and Synchrony: Second COST 2102 International Training School, Dublin, Ireland, 23–27 March 2009; Springer: Berlin/Heidelberg, Germany, 2010; pp. 169–181. [Google Scholar]
Zhou, H.; Huang, M.; Zhang, T.; Zhang, T.; Zhu, X.; Liu, B. Emotional chatting machine: Emotional conversation generation with internal and external memory. In Proceedings of the AAAI Conference on Artificial Intelligence, Orleans, LA, USA, 2–7 February 2018; p. 32. [Google Scholar]
Wei, W.; Liu, J.; Mao, X.; Guo, G.; Zhu, F.; Zhou, P.; Hu, Y. Emotion-aware chat machine: Automatic emotional response generation for human-like emotional interaction. In Proceedings of the 28th ACM International Conference on Information and Knowledge Management, Beijing, China, 3–7 November 2019; pp. 1401–1410. [Google Scholar]
Li, Q.; Chen, H.; Ren, Z.; Chen, Z.; Tu, Z.; Ma, J.E. Multi-resolution Interactive Empathetic Dialogue Generation. arXiv 2019, arXiv:1911.08698. [Google Scholar]
Liang, Y.; Meng, F.; Zhang, Y.; Chen, Y.; Xu, J.; Zhou, J. Infusing multi-source knowledge with heterogeneous graph neural network for emotional conversation generation. In Proceedings of the AAAI Conference on Artificial Intelligence, Vancouver, BC, Canada, 2–9 February 2021; Volume 35, pp. 13343–13352. [Google Scholar]
Mao, Y.; Cai, F.; Guo, Y.; Chen, H. Incorporating emotion for response generation in multi-turn dialogues. Appl. Intell. 2022, 52, 7218–7229. [Google Scholar] [CrossRef]
Li, M.; Zhang, J.; Lu, X.; Zong, C. Dual-View Conditional Variational Auto-Encoder for Emotional Dialogue Generation. Trans. Asian Low-Resour. Lang. Inf. Process. 2021, 21, 1–18. [Google Scholar] [CrossRef]
Wang, H.; Lu, Z.; Li, H.; Chen, E. A dataset for research on short-text conversations. In Proceedings of the 2013 Conference on Empirical Methods in Natural Language Processing, Seattle, WA, USA, 18–21 October 2013; pp. 935–945. [Google Scholar]
Wang, Y.; Ke, P.; Zheng, Y.; Chen, E. A large-scale chinese short-text conversation dataset. In Part I 9, Proceedings of the Natural Language Processing and Chinese Computing: 9th CCF International Conference, NLPCC 2020, Zhengzhou, China, 14–18 October 2020; Springer International Publishing: Berlin/Heidelberg, Germany, 2020; pp. 91–103. [Google Scholar]
Qian, H.; Li, X.; Zhong, H.; Guo, Y.; Ma, Y.; Zhu, Y.; Liu, Z.; Dou, Z.; Wen, J.R. Pchatbot: A large-scale dataset for personalized chatbot. In Proceedings of the 44th International ACM SIGIR Conference on Research and Development in Information Retrieval, Virtual Event, Canada, 11–15 July 2021; pp. 2470–2477. [Google Scholar]
Schuster, M.; Paliwal, K.K. Bidirectional recurrent neural networks. IEEE Trans. Signal Process. 1997, 45, 2673–2681. [Google Scholar] [CrossRef] [Green Version]
Zhou, X.; Dong, D.; Wu, H.; Zhao, S.; Yu, D.; Tian, H.; Liu, X.; Yan, R. Multi-view response selection for human-computer conversation. In Proceedings of the 2016 Conference on Empirical Methods in Natural Language Processing, Austin, TX, USA, 1–5 November 2016; pp. 372–381. [Google Scholar]
Zhang, Z.; Li, J.; Zhu, P.; Zhao, H.; Liu, G. Modeling multi-turn conversation with deep utterance aggregation. arXiv 2018, arXiv:1806.09102. [Google Scholar]
Tao, C.; Wu, W.; Xu, C.; Hu, W.; Zhao, D.; Yan, R. Multi-representation fusion network for multi-turn response selection in retrieval-based chatbots. In Proceedings of the Twelfth ACM International Conference on Web Search and Data Mining, Melbourne, Australia, 11–15 February 2019; pp. 267–275. [Google Scholar]
Serban, I.; Sordoni, A.; Bengio, Y.; Courville, A.; Pineau, J. Building end-to-end dialogue systems using generative hierarchical neural network models. In Proceedings of the AAAI Conference on Artificial Intelligence, Phoenix, AZ, USA, 12–17 February 2016; p. 30. [Google Scholar]
Elman, J.L. Finding structure in time. Cogn. Sci. 1990, 14, 179–211. [Google Scholar] [CrossRef]
Lee, J.Y.; Lee, K.A.; Gan, W.S. A Randomized Link Transformer for Diverse Open-Domain Dialogue Generation. In Proceedings of the 4th Workshop on NLP for Conversational AI, Dublin, Ireland, 27 May 2022; pp. 1–11. [Google Scholar]
Shen, L.; Zhan, H.; Shen, X.; Feng, Y. Learning to select context in a hierarchical and global perspective for open-domain dialogue generation. In Proceedings of the ICASSP 2021–2021 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), Toronto, ON, Canada, 6–11 June 2021; pp. 7438–7442. [Google Scholar]
Zhang, H.; Lan, Y.; Pang, L.; Guo, J.; Cheng, X. Recosa: Detecting the relevant contexts with self-attention for multi-turn dialogue generation. arXiv 2019, arXiv:1907.05339. [Google Scholar]
Sutskever, I.; Vinyals, O.; Le, Q.V. Sequence to sequence learning with neural networks. In Proceedings of the NIPS’14: Proceedings of the 27th International Conference on Neural Information Processing Systems, Montreal, QC, Canada, 8–13 December 2014. [Google Scholar]
Shen, L.; Feng, Y. CDL: Curriculum dual learning for emotion-controllable response generation. arXiv 2020, arXiv:2005.00329. [Google Scholar]
Song, Z.; Zheng, X.; Liu, L.; Xu, M.; Huang, X.J. Generating responses with a specific emotion in dialog. In Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics, Florence, Italy, 28 July–2 August 2019; pp. 3685–3695. [Google Scholar]
Uthus, D.C.; Aha, D.W. The ubuntu chat corpus for multiparticipant chat analysis. In Proceedings of the 2013 AAAI Spring Symposium Series, Palo Alto, CA, USA, 25–27 March 2013. [Google Scholar]
Li, Y.; Su, H.; Shen, X.; Li, W.; Cao, Z.; Niu, S. Dailydialog: A manually labelled multi-turn dialogue dataset. arXiv 2017, arXiv:1710.03957. [Google Scholar]
Tiedemann, J. News from OPUS-A collection of multilingual parallel corpora with tools and interfaces. Recent Adv. Nat. Lang. Process. 2009, 5, 237–248. [Google Scholar]
Chung, J.; Gulcehre, C.; Cho, K.H.; Bengio, Y. Empirical evaluation of gated recurrent neural networks on sequence modeling. arXiv 2014, arXiv:1412.3555. [Google Scholar]
Vaswani, A.; Shazeer, N.; Parmar, N.; Uszkoreit, J.; Jones, L.; Gomez, A.N.; Kaiser, Ł.; Polosukhin, I. Attention is all you need. In Proceedings of the NIPS 2017, Long Beach, CA, USA, 4–7 December 2017. Advances in Neural Information Processing Systems 30. [Google Scholar]
Ekman, P. An argument for basic emotions. Cogn. Emot. 1992, 6, 169–200. [Google Scholar] [CrossRef]
Kingma, D.P.; Ba, J. Adam: A method for stochastic optimization. arXiv 2014, arXiv:1412.6980. [Google Scholar]
Tian, Z.; Yan, R.; Mou, L.; Song, Y.; Feng, Y.; Zhao, D. How to Make Context More Useful? An Empirical Study on Context-Aware Neural Conversational Models. In Proceedings of the 55th Annual Meeting of the Association for Computational Linguistics, Vancouver, BC, Canada, 30 July–4 August 2017; pp. 231–236. [Google Scholar]
Zhang, W.; Cui, Y.; Wang, Y.; Zhu, Q.; Li, L.; Zhou, L.; Liu, T. Context-sensitive generation of open-domain conversational responses. In Proceedings of the 27th International Conference on Computational Linguistics, Santa Fe, NM, USA, 21–25 August 2018; pp. 2437–2447. [Google Scholar]
Xing, C.; Wu, W.; Wu, Y.; Liu, J.; Huang, Y.; Zhou, M.; Ma, W. Topic aware neural response generation. In Proceedings of the AAAI Conference on Artificial Intelligence, San Francisco, CA, USA, 4–9 February 2017; p. 31. [Google Scholar]
Li, J.; Monroe, W.; Ritter, A.; Galley, M.; Gao, J.; Jurafsky, D. Deep reinforcement learning for dialogue generation. arXiv 2016, arXiv:1606.01541. [Google Scholar]

Figure 1. Example of a conversation.

Figure 2. Overview of the proposed method—DialogCIN.

Figure 3. Structure diagram of the Inference Module.

Figure 4. Distribution of the number of utterances in LifeDialog.

Figure 5. Distribution of Emotion in LifeDialog.

Figure 6. Distribution of sentence classes in LifeDialog.

Figure 7. Distribution of topics in LifeDialog.

Table 1. Dialogue dataset statistics.

Spilt	Quantity
Train	8039
Dev	1026
Test	980
Total	10,045

Table 2. Statistical information on the LifeDialog dataset.

Total Dialogues	10,045
Average Utterances Per Dialogue	7.1
Average Tokens Per Dialogue	112.7
Average Tokens Per Utterance	16.0

Table 3. Results of different models.

Models	PPL	BLEU	Distinct-1	Distinct-2
WSeq	95.78	0.0096	0.0366	0.179
HRED	101.53	0.0084	0.0379	0.1871
DSHRED	97.70	0.01	0.0379	0.1929
ReCoSa	98.41	0.0099	0.0346	0.1868
DialogCIN	91.36	0.0104	0.0395	0.2151

Table 4. Results of human evaluation.

Model	Semantic Coherence	Emotional Appropriateness
WSeq	2.78	2.81
HRED	2.97	3.05
DSHRED	3.13	3.17
ReCoSa	3.32	3.41
DialogCIN	3.78	3.65

Table 5. Result of ablation experiment.

Part	Representation		Understand		PPL	BLEU	D-1	D-2
Part	Context	Speaker	Context	Speaker	PPL	BLEU	D-1	D-2
1	√	√	√	√	91.36	0.0104	0.0395	0.2151
2	√	√	√	×	90.43	0.0094	0.0384	0.2094
2	√	√	×	√	93.21	0.0093	0.0379	0.2063
3	√	√	×	×	94.93	0.0091	0.0363	0.1972
	√	×	×	×	99.91	0.0087	0.0349	0.1926
	×	√	×	×	98.79	0.0089	0.0344	0.1898
	×	×	×	×	103.15	0.0079	0.0310	0.1644

Table 6. Model performance for different number of Inference Module settings.

N_Modules	PPL	BLEU	Distinct-1	Distinct-2
1	93.42	0.0098	0.0367	0.2068
2	91.36	0.0104	0.0395	0.2151
3	98.85	0.0095	0.0332	0.1942
4	106.25	0.0097	0.0313	0.1931

Table 7. Case study.

Case 1	Context	A:我听说你得到了一个去欧洲培训半年的机会，是真的吗？ (A: Is it true that you have been offered an opportunity to undergo a six-month training trip in Europe?)
	Context	B:是真的，我下个月14号就要动身。 (B: It’s true, I’m moving on the 14th of next month.)
	WSeq	好的。(OK.)
	HRED	这个好。(This is good.)
	DSHRED	不错不错。(Nice nice.)
	ReCoSa	我也想有这种机会。(I would like to have that opportunity.)
	DialogCIN	祝贺你！得到这么好的机会。 (Congratulations! Getting such a great opportunity.)
Case 2	Context	A:生日快乐，林小姐。这是我送给您的礼物。 (A: Happy Birthday, this is my gift to you.)
	Context	B:哦，谢谢！您太好了。啊！是一个瓷瓶。 (B: Oh, thank you! You are too kind. Ah! It’s a porcelain vase.)
	WSeq	是吗？(Is it?)
	HRED	我也是这么想的。(That’s what I thought.)
	DSHRED	这和玉瓶哪个好看？(Which looks better, this or the jade bottle?)
	ReCoSa	我也喜欢，它很漂亮。(I like it too, it’s beautiful.)
	DialogCIN	这是我的荣幸，很高兴你喜欢。 (It was my pleasure and I’m glad you enjoyed it.)
Case 3	Context	A:怎么一副垂头丧气的样子？ (A: Why do you look so downcast?)
	Context	B:我正郁闷着呢，考试考砸了。 (B: I was depressed, having failed my exams.)
	WSeq	我想是的。(I think so.)
	HRED	你怎么知道的？(How do you know?)
	DSHRED	没事的。(It’s all right.)
	ReCoSa	不要伤心。(Don’t be sad.)
	DialogCIN	别这么沮丧，你可以重考嘛。 (Don’t be so discouraged. You can retake the exam.)

Disclaimer/Publisher’s Note: The statements, opinions and data contained in all publications are solely those of the individual author(s) and contributor(s) and not of MDPI and/or the editor(s). MDPI and/or the editor(s) disclaim responsibility for any injury to people or property resulting from any ideas, methods, instructions or products referred to in the content.

© 2023 by the authors. Licensee MDPI, Basel, Switzerland. This article is an open access article distributed under the terms and conditions of the Creative Commons Attribution (CC BY) license (https://creativecommons.org/licenses/by/4.0/).

Share and Cite

MDPI and ACS Style

Lou, W.; Yang, W.; Wei, F. DialogCIN: Contextual Inference Networks for Emotional Dialogue Generation. Appl. Sci. 2023, 13, 8629. https://doi.org/10.3390/app13158629

AMA Style

Lou W, Yang W, Wei F. DialogCIN: Contextual Inference Networks for Emotional Dialogue Generation. Applied Sciences. 2023; 13(15):8629. https://doi.org/10.3390/app13158629

Chicago/Turabian Style

Lou, Wenzhe, Wenzhong Yang, and Fuyuan Wei. 2023. "DialogCIN: Contextual Inference Networks for Emotional Dialogue Generation" Applied Sciences 13, no. 15: 8629. https://doi.org/10.3390/app13158629

Note that from the first issue of 2016, this journal uses article numbers instead of page numbers. See further details here.

Article Menu

DialogCIN: Contextual Inference Networks for Emotional Dialogue Generation

Abstract

1. Introduction

2. Related Work

2.1. Open Domain Dialogue Generation

2.2. Emotional Dialog Generation

2.3. Datasets for Dailogue Generation

3. Proposed Method

3.1. Task Formulation

3.2. Inference-Based Encoder

3.2.1. Utterance Features

3.2.2. Representation Unit

3.2.3. Understand Unit

3.2.4. Emotion Predictor

3.3. Emotion-Aware Decoder

3.4. Training Objective

4. Proposed Dataset

4.1. Data Collection

4.2. Data Processing

4.3. Data Features

5. Experiments

5.1. Experimental Setting

5.2. Comparison Experiment and Result Analysis

5.3. Evaluation Metrics

5.3.1. Automatic Metrics

5.3.2. Human Evaluation

5.4. Results and Analysis

5.4.1. Comparison Experiment and Result Analysis

5.4.2. Human Evaluation Results

5.4.3. Ablation Experiment and Result Analysis

5.4.4. Parameter Experiment and Result Analysis

5.4.5. Case Study

6. Conclusions

Author Contributions

Funding

Institutional Review Board Statement

Informed Consent Statement

Data Availability Statement

Conflicts of Interest

References

Share and Cite

Article Metrics

Article Access Statistics

Further Information

Guidelines

MDPI Initiatives

Follow MDPI