Article

Exploring Bi-Directional Context for Improved Chatbot Response Generation Using Deep Reinforcement Learning

by Quoc-Dai Luong Tran *,† and Anh-Cuong Le †
Natural Language Processing and Knowledge Discovery Laboratory, Faculty of Information Technology, Ton Duc Thang University, Ho Chi Minh City 700000, Vietnam
* Author to whom correspondence should be addressed.
† These authors contributed equally to this work.
Appl. Sci. 2023, 13(8), 5041; https://doi.org/10.3390/app13085041
Submission received: 8 March 2023 / Revised: 8 April 2023 / Accepted: 13 April 2023 / Published: 17 April 2023
(This article belongs to the Section Computing and Artificial Intelligence)

Abstract: The development of conversational agents that can generate relevant and meaningful replies is a challenging task in natural language processing. Context and predictive capability are crucial factors that humans rely on for effective communication. Prior studies have a significant limitation in that they do not adequately consider the relationship between utterances in a conversation when generating responses: a commonly used approach relies only on the current utterance to generate the corresponding response, and as such does not take advantage of the context of a multi-turn conversation. This study addresses this limitation by proposing a novel method that comprehensively models the contextual information of the current utterance for response generation. Unlike other studies, we use a bi-directional context in which the historical direction helps the model remember information from earlier in the conversation, while the future direction enables the model to anticipate its impact afterward. We combine a Transformer-based sequence-to-sequence model with a reinforcement learning algorithm to achieve this goal. Experimental results demonstrate the effectiveness of the proposed model through qualitative evaluation of generated samples and quantitative metrics: the proposed model increases the average BLEU score by 24% and the average ROUGE score by 29% compared to the baseline model, and improves the average BLEU score by 5% to 151% compared with previous related studies.

1. Introduction

Conversational agents aim to enable computers to communicate with humans by automatically generating a response to each user input. Researchers have been working on developing robust, scalable, and context-aware systems for a long time. One of the key challenges they are focused on is generating meaningful and consistent dialogue responses that align with the conversation history, which this paper also addresses. While previous studies have indicated that incorporating additional information can enhance the accuracy of models, it is still uncertain how contextual information impacts a conversation as a whole, including its connection to future outcomes and how different contextual factors affect the conversation. Our goal is to describe and understand how contextual factors contribute to model performance improvement. In this study, we examine, explore, and exploit the contexts humans use in everyday decision making. Based on the principle that the more information the better, we have also presented different generation-based models that utilize context information in a conversation. This study proposes a Deep Reinforcement Learning model for analyzing the influence of different contextual information on responses in a conversation and how to combine them to improve a conversation’s coherence and consistency.
Existing chatbot systems can be classified into rule-based, information retrieval-based, and generation-based systems. Rule-based methods rely on a knowledge base of pre-defined question–answer pairs. When an input is received, the system matches it to the most similar question pattern in the knowledge base and retrieves the corresponding answer [1]. However, these predefined rules take time to expand and maintain. Additionally, these systems respond to user questions in a stereotyped manner without utilizing the conversation history.
Another method based on the Information Retrieval (IR) approach is available [2,3,4]. This approach is suitable when a large set of conversations is available. The principle of IR-based models is to select the corresponding response based on a given input from a dataset (containing a set of conversation or question–answer pairs). However, a disadvantage of IR-based models is their limited ability to understand semantic differences between input contexts. Additionally, IR-based models can only respond within the dataset, making them suitable for query-type lookups rather than forming a conversation.
Recently, generation-based approaches have attracted much attention in NLP, especially in machine translation and chatbots. These approaches treat a conversation as a source-to-target problem. Seq2Seq models based on LSTM have shown excellent performance in generating responses, such as in [5,6,7]. The strength of this approach lies in its ability to generate answers in a generalized way that is not dependent on predetermined rules. The mechanism of this method mimics human thinking, including two stages: the first stage of encoding (i.e., understanding) the question and the second stage of generating the corresponding answer. Therefore, most current research continues this approach and focuses on developing deep neural models to improve the quality of conversations, which is also the approach in our work. However, most Seq2Seq-based models are trained on single-turn conversations, making them incapable of handling long-context conversations. Additionally, due to numerous generic responses in the training dataset, Seq2Seq models tend to generate generic responses regardless of the input [8,9,10].
In reality, humans use contextual reasoning to make daily decisions. Context refers to the collection of texts surrounding a word used in a sentence or phrase of interest. Some studies have used previous utterances to generate current ones or reinforcement learning methods to keep track of the conversation’s history, such as [11,12]. However, these studies typically only considered a partially observed context when generating responses at the current position. For a chatbot model to be effective, it must consider the logical relationship between turns in a conversation. Therefore, the chatbot must consider the surrounding contexts in the ongoing conversation when generating a response. For example, in Table 1, there is a dialogue between a boy and his mother. In this conversation, the answer in turn 7 is based on the information “bus” mentioned in an earlier turn (turn 3). This example highlights the importance of a chatbot managing the flow of information and connecting utterances coherently and meaningfully throughout the entire conversation.
This study aims to use contextual information to generate responses effectively. At position i in a multi-turn conversation, the task of conversation modeling is to generate a response for the following position i+1. In the proposed conversation model, we use two contexts: the left and right contexts of the current utterance. The left-context stores information from previous utterances, providing historical context for the current one. The right-context, on the other hand, indicates how the current utterance may affect future utterances. We use the left-context by integrating multiple preceding utterances, from position i backward, into the Seq2Seq structure to gather contextual information and generate the corresponding response for position i+1. The more challenging task is how to use the right-context: when the model generates a response at position i+1, no utterances are yet available at subsequent positions. So how can we exploit the right-context during the training process? To do so, we treat the conversation as a Markov decision process, as in [13,14,15], and apply a reinforcement learning technique to manage it. Instead of stopping after generating the utterance at position i+1, we use that result to generate utterances at subsequent positions such as i+2, i+3, and so on. The model is designed to simultaneously optimize the entire utterance sequence, including positions i+1, i+2, and beyond. This is achieved using a combination of a deep learning model and a reinforcement learning mechanism, known as a deep reinforcement learning model. Using this model, we incorporate both the left-context and the right-context of the current utterance into the model’s training process.
Unlike recent studies that often focus on designing the architecture of models for generating responses, our goal is to describe and understand contextual factors that can improve the model’s performance in generating responses. In this study, we take a more general approach, using both the left-context and right-context of the current utterance when generating the subsequent response. Our model aims to generate responses in a multi-turn conversation and is a potent combination of Seq2Seq architecture and reinforcement learning algorithm. Seq2Seq is capable of leveraging historical information as a left-context. In contrast, the reinforcement learning algorithm leverages the future impact of the current response. This optimization process helps the model generate appropriate responses that align with the previous conversation and guide its future impact. The contributions of this paper are summarized as follows:
  • Using historical information in the conversation for the Seq2Seq model can improve consistency in the conversation. When combined with future contextual information, the Seq2Seq model can evaluate the current response based on long-term objectives that the Reinforcement Learning (RL) algorithm will accumulate. This method enables the chatbot to maintain a conversation that adheres to a specific objective. The primary objective of utilizing bi-directional context is to ensure the conversation stays on track toward the desired goal. Our aim is to develop a new conversational model that takes advantage of effective context to improve the coherence and consistency of a conversation.
  • RL techniques require a forward-looking function, which is an essential component that scores the quality of each response. In order to enhance the training of models for achieving the desired goal, we have introduced two additional forward-looking functions.
  • Our proposed method for training conversational agents involves using deep reinforcement learning algorithms that leverage action spaces obtained from simulating conversations between two pre-trained virtual agents.

2. Related Work

In this section, we will provide a brief overview of the models proposed in recent years, including their respective strengths and limitations. Pattern matching and machine learning are two main approaches to developing a conversational model based on the applied techniques.
Weizenbaum from the Massachusetts Institute of Technology (MIT) developed ELIZA, one of the earliest rule-based chatbots [16]. ELIZA is a simple program that uses pre-defined rules to communicate with users in natural language. It searches for keywords in the user’s input text and analyzes them using predefined rules to generate a response. PARRY is a chatbot extension of ELIZA, created by Stanford [17], and has many improvements over ELIZA. PARRY simulates a patient with schizophrenia and generates responses based on the user’s emotional state and the previous response. From 1995 to 2000, the Artificial Intelligence Markup Language (AIML) was developed to build a knowledge base for chatbot systems using pattern-matching techniques. ALICE was the first chatbot to use the AIML language, and its knowledge base contains around 41,000 patterns, in contrast to ELIZA’s 200 rules. However, even with this massive knowledge base, ALICE is still not intelligent enough to generate human-like responses. The weakness of pattern-matching methods is that they often produce robotic and repetitive responses, lacking the naturalness of actual human interactions. Therefore, this approach is better suited to a single-turn conversation where the response can be selected from a database, and the user is only interested in the final response. In practice, we need chatbots to generate appropriate and goal-oriented responses, ensuring the conversation remains coherent and meaningful within the given context.
Unlike chatbot systems based on the pattern matching method, chatbots that use Machine Learning can extract information from user input using NLP techniques and learn from conversations. They are not limited to using pre-defined rules set by the user. The core idea of Machine Learning is to train a model with labeled question–answer pairs provided by humans, which maps the relationships between inputs and responses from the training data. One of the earliest generations of chatbots was developed using Statistical Machine Translation (SMT) by [18]. The SMT-based approach treats conversational responses as a language translation problem, where the mapping rules between input and output are learned from training data. Based on this idea, some previous studies proposed SMT models for building chatbots [18]. They utilized SMT models by taking the user’s utterance X as input and generating the response Y using the translation method. The language model  P ( Y )  was constructed by counting the frequency of n-grams in the training data. The probability of a response sentence given an input sentence denoted as  P ( Y | X )  forms the basis for generating reasonable phrases in a conversation. A translation table was also created based on the training data during the training process. The Moses phrase-based decoder used the translation table to select the best response for the input sentence. In their experiments, the input and response were in the same language, and the data consisted of status updates on Twitter. However, the researchers found that mapping responses in a conversation was more complex than translating between two languages and unsuitable for multi-turn conversations.
Recognizing user messages and generating appropriate responses can enrich communication in chatbot systems. Recently, text generation tasks based on the Seq2Seq neural network model have attracted much attention in the field of neural dialogue generation [6]. Their models use two recurrent neural networks (RNNs). The first network encodes the input sentence to a context vector, and the second decodes the vector to generate the desired response. In recent studies, generating responses based on Seq2Seq models has achieved significant improvements in various applications, from machine translation [19,20,21], text summarization [22,23,24], and chatbots [5,6,25]. Seq2Seq-based models are widely used in response generation because they usually achieve a better performance than earlier models. However, they have several drawbacks to multi-turn conversations. They learn by maximizing the probability of generating a response based on the previous dialogue turn using maximum likelihood estimation (MLE). Moreover, MLE may have difficulty estimating the specific targets of chatbot systems. Furthermore, the training dataset includes numerous generic responses, which can cause Seq2Seq models to generate common and boring responses like “I don’t know” regardless of the input [8]. These types of responses can lead to the termination of the conversation or fall into an endless loop after three turns [26]. Serban et al. recently introduced a hierarchical neural model that captures relationships between turns in a conversation [5,11]. They improved their model by training it on a dataset of question–answer pairs and pre-trained embeddings. Additionally, Li et al. [27] proposed using maximum mutual information (MMI) instead of the MLE objective function for response generation tasks to increase response diversity. Although this improvement may generate more appropriate responses, it is still challenging to achieve the goal of a chatbot: simulating human-to-human interactions by providing informative responses to engage users.
Another research direction considers dialogues as a Markov Decision Process (MDP) [13,14,15] and uses reinforcement learning techniques to address related issues. RL is a general-purpose framework for sequential decision making and is typically described as an agent that interacts with unknown environments. Many tasks such as generation, reasoning, information extraction, and dialogue can be formulated as sequential decision making. Therefore, in recent years, deep reinforcement learning has garnered a lot of attention in NLP [28]. In chatbot systems, RL-based models consider a conversation as a sequential decision process that operates over its state space, action set, and strategy to address the challenges of multi-turn dialogues in previous models. These models define a dialogue as either a Markov Decision Process (MDP) [13,14] or a Partially Observable Markov Decision Process (POMDP) [29,30,31]. Using reinforcement learning (RL), they monitor the state transition process, take appropriate actions (utterances), and obtain information from the user [28]. The authors of [26] simulated a conversation between two virtual agents, evaluated action sequences using policy gradient methods, and presented rewards for three useful dialogue attributes: informativeness, coherence, and ease of answering. Another study suggested a method based on reinforcement learning to create a chatbot using a generation model that generates sequences for a task-oriented model [32]. The experiments showed that this method results in more natural conversations that more efficiently accomplish task objectives.
In a recent study [15], the authors developed a conversational system that learns a policy by incorporating three reward functions. The first reward function evaluates the similarity between previous utterances and the topic presentation. The second one measures semantic coherence using mutual information between the generated response and previous utterances. The final reward function encourages the model to produce grammatically correct and fluent responses. Chen et al. [33] proposed an actor-critic model to implement deep reinforcement learning (DRL). The model was trained in parallel using data collected from various dialogue tasks and tested on 18 tasks from PyDial [34]. The results showed that the model achieved robust learning efficiency. Another RL-based approach [35] proposed Offline RL, which can train dialogue agents using static datasets. Their experiments showed that this method generated conversations that could help complete tasks better for specific purposes.
Unlike other works focusing on improving chatbot systems’ model architecture based on deep learning, this study uses reinforcement learning to exploit bi-directional context in a multi-turn conversation. Our goal is to learn contextual features from human-to-human conversations by combining the strengths of Seq2Seq models with the advantages of RL. We construct the model using RL to achieve various goals, such as avoiding overfitting by integrating multiple context constraints in RL and enhancing the conversation’s long-turn coherence and consistency.

3. The Proposed Model

Firstly, we provide a formal description of the problem. Most existing work on chatbots studies response generation for single-turn conversations, which only considers the immediately preceding utterance. However, this is not how people typically converse: in practice, human conversations often consist of multiple turns rather than just a single turn. In this research, we treat the problem in a multi-turn scenario where the dataset consists of multiple conversations. Each conversation comprises a sequence of turns, denoted as $s_0, s_1, s_2, \dots, s_{n-1}, s_n$, where $s_t$ and $s_{t+1}$ represent successive turns between two agents. The task of building a chatbot model can be viewed as a source-to-target mapping problem, where the model learns mapping rules between source utterances and their corresponding suitable target responses from massive training data. We split the training dataset into n pairs $\{(s_t, s_{t+1})\}_{t=1}^{n}$, where $(s_t, s_{t+1})$ represents the $t$-th pair consisting of an input and its corresponding target. In a conversation system, $s_t$ and $s_{t+1}$ denote two consecutive turns in the dialogue. Each user utterance $s_t = \{w_1^t, w_2^t, \dots, w_{|s_t|}^t\}$ is paired with a sequence of outputs $s_{t+1} = \{w_1^{t+1}, w_2^{t+1}, \dots, w_{|s_{t+1}|}^{t+1}\}$ that needs to be predicted, where $w_k^t$ represents the $k$-th word in the utterance $s_t$.
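For illustration, the following minimal Python sketch (ours, not from the authors' implementation; the function name make_training_pairs and the context size k are assumptions) shows how a multi-turn conversation can be split into (c_L, s_t, s_{t+1}) training triples:

def make_training_pairs(conversation, k=3):
    """conversation: list of utterance strings [s_0, s_1, ..., s_n]."""
    pairs = []
    for t in range(len(conversation) - 1):
        s_t, s_next = conversation[t], conversation[t + 1]
        left_context = conversation[max(0, t - k):t]  # c_L: up to k preceding turns
        pairs.append((left_context, s_t, s_next))
    return pairs

# A 4-turn dialogue yields 3 (context, input, target) triples.
dialog = ["Hi, how are you?", "Fine, thanks. And you?",
          "Great. Any plans today?", "Just work, I am afraid."]
print(make_training_pairs(dialog, k=2))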

3.1. Left-Context with BERT-Based Model

3.1.1. BERT Pre-Trained Model

BERT (Bi-directional Encoder Representations from Transformers) is a neural network model based on the Transformer architecture, which was introduced in [36]. It is designed to model sequences of data, such as natural language text, and has been used in a variety of natural language processing tasks, including machine translation [37,38], language modeling [39], and chatbots [40]. To understand the relationship between two sentences, the BERT training process also utilizes next sentence prediction. A pre-trained model with this understanding is relevant for tasks such as question answering. During training, the model receives pairs of sentences as input and learns to predict whether the second sentence is the next one in the original text.
Figure 1 shows an overview of the BERT architecture. The input text is first tokenized and embedded using BERT’s token embedding layer. The position embedding layer adds positional information to the input tokens, allowing the model to distinguish the order of the tokens. At a high level, the BERT block is a stack of Transformer encoder blocks, each composed of a multi-head attention layer and a position-wise feed-forward layer (shown in the right part of the figure). The multi-head attention layer allows the model to weigh the importance of different words in the input text based on their relevance to each other, while the feed-forward layer transforms the weighted input into a fixed-length vector representation. This process is repeated for each Transformer block in the stack, and the output embedding layer generates the final vector representation of the input text, which is then used for downstream tasks such as text classification or question answering. In this study, we use the pre-trained BERT model for our downstream task.
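As a concrete illustration, contextualized token representations can be obtained from a pre-trained BERT checkpoint, for example with the Hugging Face Transformers library; the checkpoint name and tooling below are assumptions on our part, since the paper does not prescribe a specific implementation:

import torch
from transformers import BertTokenizerFast, BertModel

tokenizer = BertTokenizerFast.from_pretrained("bert-base-uncased")
bert = BertModel.from_pretrained("bert-base-uncased")

inputs = tokenizer("Why not take a bus to school?", return_tensors="pt")
with torch.no_grad():
    outputs = bert(**inputs)
token_vectors = outputs.last_hidden_state  # one contextual vector per input token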

3.1.2. Integrating Left-Context with BERT2BERT

Humans can control the flow of information during a long conversation. However, most current models are incapable of handling the history of a conversation, which is why they usually generate repetitive and generic responses. Our model addresses this problem by using a left-context, denoted as $c_L$, which serves as contextual information from the conversation history. This context is defined as the sequence of k consecutive previous utterances $s_{t-k}, \dots, s_{t-2}, s_{t-1}$ at turn t. Based on the principle that more information is better, this left-context provides additional information during the training process.
Our proposed model for generating historical contextual responses is a Transformer-based encoder–decoder model. Encoder–decoder models have been shown to improve performance on many tasks [41,42]. However, such models require a massive dataset for pre-training before fine-tuning for a desired task. Recently, it was shown that skipping the costly pre-training process and warm-starting from a pre-trained encoder allows a Transformer-based encoder–decoder model to achieve competitive results in text generation tasks [43,44,45]. Inspired by these studies, we use the BERT2BERT architecture [43] and warm-start the encoder and decoder with a BERT-based checkpoint [46]. The proposed model takes an input message $s_t = \{w_1^t, w_2^t, \dots, w_{|s_t|}^t\}$ and its left-context $c_L$ as input. This sequence is then fed into an encoder, which is a stack of BERT-based encoder blocks. Each block consists of a self-attention layer and two feed-forward layers, as shown in Figure 2.
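A minimal sketch of warm-starting such a BERT2BERT model, assuming the Hugging Face EncoderDecoderModel API and a bert-base-uncased checkpoint (both assumptions on our part, not details fixed by the paper), is:

from transformers import BertTokenizerFast, EncoderDecoderModel

tokenizer = BertTokenizerFast.from_pretrained("bert-base-uncased")
# Warm-start both the encoder and the decoder from a BERT checkpoint (BERT2BERT).
model = EncoderDecoderModel.from_encoder_decoder_pretrained(
    "bert-base-uncased", "bert-base-uncased")
model.config.decoder_start_token_id = tokenizer.cls_token_id
model.config.eos_token_id = tokenizer.sep_token_id
model.config.pad_token_id = tokenizer.pad_token_id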
The encoder maps the input sequence $[c_L, s_t]$ to a contextualized encoded vector $\bar{X}_{BERT}$ as follows:
$$\bar{X}_{BERT} = f_{\theta_{enc}}([c_L, s_t]) \quad (1)$$
The architecture of the decoder block is similar to that of the encoder block (as shown in Figure 3). However, the decoder block is conditioned on the contextualized encoded vector $\bar{X}_{BERT}$ and additionally includes a cross-attention layer. Each decoder block is therefore larger than an encoder block, consisting of a self-attention layer, two feed-forward layers, and a cross-attention layer that obtains contextual information from the vector $\bar{X}_{BERT}$. In addition, a linear layer called the LM Head is placed on top of the last decoder block, mapping the output vectors to the logit vectors L.
During the training process, the decoder maps the contextualized encoded vector $\bar{X}_{BERT}$ and a target sequence $s_{t+1}$ to logit vectors L. The probability distribution of the target sequence $s_{t+1}$ is factorized into conditional distributions of the next word using the chain rule of probability:
$$p_{Bert2Bert}(s_{t+1} \mid [c_L, s_t]) = p_{\theta_{dec}}(s_{t+1} \mid \bar{X}_{BERT}) = \prod_{i=1}^{|s_{t+1}|} p_{\theta_{dec}}(w_i^{t+1} \mid w_{0:i-1}^{t+1}, \bar{X}_{BERT}) \quad (2)$$
The logits define the distribution of the target sequence $s_{t+1}$ conditioned on the input sequence $s_t$ through a softmax operation. As a result, each generated token is defined by the softmax of its logit vector as follows:
$$p_{\theta_{dec}}(w_i^{t+1} \mid w_{0:i-1}^{t+1}, \bar{X}_{BERT}) = \mathrm{Softmax}(l_i) \quad (3)$$
where $l_i$ is the logit vector of the $i$-th token in L. We define $s_{t+1} = \{w_1^{t+1}, w_2^{t+1}, \dots, w_{|s_{t+1}|}^{t+1}\}$ as the ground-truth output for a given input sequence $s_t$. The training objective is to minimize the following cross-entropy (CE) loss:
$$\mathcal{L}_{Bert2Bert} = -\sum_{i=1}^{|s_{t+1}|} \log p_{\theta_{dec}}(w_i^{t+1} \mid w_{0:i-1}^{t+1}, \bar{X}_{BERT}) \quad (4)$$
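Continuing the BERT2BERT sketch above, one supervised training step on a single (c_L, s_t, s_{t+1}) triple might look as follows; the example utterances and the whitespace concatenation of the context with the current utterance are illustrative assumptions:

# One supervised training step: encoder input = left-context + current utterance,
# decoder target = ground-truth reply; outputs.loss is the cross-entropy of Equation (4).
left_context = ["Mum, could you drive me to school?", "Why not take a bus?"]
s_t = "The bus is too slow; I would be late."
s_next = "Then you should get up earlier tomorrow."

enc = tokenizer(" ".join(left_context + [s_t]), return_tensors="pt", truncation=True)
labels = tokenizer(s_next, return_tensors="pt", truncation=True).input_ids
outputs = model(input_ids=enc.input_ids, attention_mask=enc.attention_mask, labels=labels)
outputs.loss.backward()  # gradient of the cross-entropy loss w.r.t. the model parameters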
As discussed, although encoder–decoder-based models can generate meaningful responses, they still fall short of the objectives of a chatbot. Most of these models are trained to generate the best response from a single utterance, making them suitable only for single-turn conversation systems. Our architecture addresses this problem by integrating the left-context into the encoder–decoder model, allowing the system to consider the previous utterances in the dialogue. However, such models are still short-sighted in predicting responses in multi-turn conversations because they ignore the potential impact on the future of the dialogue.

3.2. Bi-Directional Context Using Deep Reinforcement Learning

To solve the above problems, we propose formulating the conversation as a reinforcement learning problem and using long-term rewards to optimize response generation. Moreover, to capture what makes a conversation successful over a long context, the proposed model conditions utterance generation on the impact of a generated response in the ongoing dialogue. We define the right-context as capturing the influence of utterances in the future. We first predict the best response corresponding to the historical context and then fine-tune the model toward a desirable goal conditioned on the future context $c_R$.
That goal can be achieved with a reinforcement learning algorithm through a Markov Decision Process (MDP). An MDP is a machine learning framework for solving decision-making problems sequentially by interacting with the environment to reach desired goals [47]. It consists of an agent, comprising a learner and a decision-maker, and the environment, encompassing all external factors. An MDP comprises a collection of states S, a set of actions A, a transition function P, and a reward function R. Given an MDP $(S, A, P, R)$, the model is trained to find a policy $\pi$ that solves the problem. From an algorithmic perspective, a policy is a conditional probability distribution over the set of actions A. During the interaction, the agent takes an action a according to a policy $\pi$, and the environment updates to a new state based on the agent’s action. More specifically, the agent interacts with the environment at each of a sequence of discrete time steps $t = 0, 1, 2, \dots$. At time t, each pair of the current state $s_t \in S$ and action $a_t \in A(s)$ creates a transition tuple $(s_t, a_t, r_{t+1}, s_{t+1})$, where $s_{t+1}$ is the next state and the agent also receives a numerical reward $r_{t+1} \in R$. The environment and agent thus produce a sequence $s_0, a_0, r_1, s_1, a_1, r_2, s_2, a_2, r_3, \dots$, where $r_t$ and $s_t$ follow discrete probability distributions conditioned on the previous state and action. At time t, the transition probability to the next state $s_{t+1}$, given particular values of the preceding state $s_t$ and action $a_t$, is defined by
$$p(s_{t+1}, r_{t+1} \mid s_t, a_t) \quad (5)$$
for all $s_t, s_{t+1} \in S$, $r_t \in R$, and $a_t \in A(s)$.
In the dialogue task, given an input $s_t$, we need to find a response $s_{t+1}$ that optimizes a measure of semantic relevance to $s_t$. RL can be used to learn as parts of the conversation are generated. We define a conversation as an MDP $(S, A, P, R)$ and solve it using RL. The corresponding MDP $(S, A, P, R)$ is defined as follows [28]: the set of states S is the conversation history, A is the set of responses the system can generate to reply to the user, and P is the transition probability function trained using the BERT-based encoding $p_{RL}(s_{t+1} \mid s_t, c_L)$, where $c_L$ is the left-context at time t. The reward function R represents the forward-looking reward received for each chosen action and plays a crucial role in achieving a successful conversation. Operationalizing R well can guide the model toward the desired goal.
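To make the mapping concrete, a single dialogue transition could be pictured as follows; the utterances and the reward value are made up for illustration only and do not come from the paper:

# One dialogue transition (s_t, a_t, r_{t+1}, s_{t+1}) in the conversation MDP.
transition = {
    "state s_t": ["Mum, could you drive me to school?", "Why not take a bus?"],  # history so far
    "action a_t": "The bus is too slow; I would be late.",  # response chosen by the policy
    "reward r_t+1": 0.62,                                    # forward-looking score of that response
    "state s_t+1": ["Mum, could you drive me to school?", "Why not take a bus?",
                    "The bus is too slow; I would be late."],  # history after the action
}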

3.2.1. Reward Definition

Our work proposes two rewards that encourage desired responses and ensure the conversation’s responses serve a specific goal. An emerging problem with encoder–decoder-based models is that they often generate highly probable responses that are nonetheless incoherent or irrelevant to the dialogue history. To avoid inappropriate responses in a conversation, we use consecutive turns to define our forward-looking functions. By utilizing the mutual information between the action a and the preceding utterance in the conversation, we can ensure the appropriateness of the generated responses. Let $r_1$ denote the reward obtained for each action, and let $h_t$ and $h_{t+1}$ denote the representations obtained from the agents for two consecutive turns $s_t$ and $s_{t+1}$. The cosine similarity between $h_t$ and $h_{t+1}$ gives the first reward at the current state $s_t$:
$$r_1 = \cos(h_t, h_{t+1}) = \frac{h_t \cdot h_{t+1}}{\|h_t\| \, \|h_{t+1}\|} \quad (6)$$
In addition to the first reward, we introduce a second reward to encourage the model to contribute new information in each turn. This reward leverages the intralinguistic relations within the sentence to detect changes in content, thereby helping to maintain coherence and sustainability in the conversation. To this end, we define the second reward, denoted as $r_2$, from the minimum cosine similarity between each word in turn $s_t$ and all the words in the subsequent turn $s_{t+1}$:
$$r_2 = \frac{1}{N_{s_t}} \sum_{w_p \in s_t} \min_{w_q \in s_{t+1}} \frac{w_p \cdot w_q}{\|w_p\| \, \|w_q\|} \quad (7)$$
where $N_{s_t}$ denotes the number of tokens in sentence $s_t$, and $w_p$, $w_q$ are the embedding vectors of the words in sentences $s_t$ and $s_{t+1}$, respectively. The final reward for an action $a_t$ at the current state $s_t$ is a weighted sum of the two rewards:
$$r_t = \lambda_1 r_1 + \lambda_2 r_2 \quad (8)$$
where $\lambda_1 + \lambda_2 = 1$. We set $\lambda_1 = 0.5$ and $\lambda_2 = 0.5$.
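A minimal NumPy sketch of the two rewards and their weighted combination is given below; the sentence representations h_t, h_next and the word-embedding lookup emb are placeholders rather than components specified by the paper:

import numpy as np

def cosine(u, v):
    return float(np.dot(u, v) / (np.linalg.norm(u) * np.linalg.norm(v) + 1e-8))

def reward_1(h_t, h_next):
    # Semantic coherence between two consecutive turns (Equation (6)).
    return cosine(h_t, h_next)

def reward_2(words_t, words_next, emb):
    # For each word of s_t, take its minimum cosine similarity to the words of s_{t+1},
    # then average over s_t (Equation (7)).
    total = sum(min(cosine(emb[w_p], emb[w_q]) for w_q in words_next) for w_p in words_t)
    return total / len(words_t)

def total_reward(h_t, h_next, words_t, words_next, emb, lambda_1=0.5, lambda_2=0.5):
    # Weighted combination of the two rewards (Equation (8)).
    return lambda_1 * reward_1(h_t, h_next) + lambda_2 * reward_2(words_t, words_next, emb)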

3.2.2. Conversation Simulation

The main idea of the proposed model is to simulate a conversation by allowing two chatbots to communicate with each other, as shown in Figure 4. While the pre-trained encoder–decoder allows the model to generate responses coherent with the conversation history, using RL enhances the model’s ability to generate responses optimized for long-term goals. We simulate the conversation as follows. In the first step, we obtain an input sentence together with the conversation history as contextual information $c_L$ from the training dataset and feed it to the first agent. The first agent encodes the inputs into a vector representation and decodes it to generate a response $s_t$ for the next turn. The second agent updates the state of the simulation by combining the conversation history with the output $s_t$. It then encodes this new state into a representation and decodes it into a new response, which is fed back to the first agent, and the process repeats. At the end of the simulation, the right-context $c_R$ is the sequence of k consecutive utterances $\{s_{t+1}, s_{t+2}, \dots, s_{t+k}\}$ to the right of the generated response $s_t$.
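A simplified sketch of this self-play loop, reusing the warm-started model and tokenizer from the earlier sketches, is shown below; the decoding parameters passed to generate() are assumptions, as the paper does not fix them:

def simulate(model, tokenizer, history, k=3, max_len=40):
    """Let the two agents exchange k turns and return the right-context c_R."""
    c_R = []
    for _ in range(k):
        enc = tokenizer(" ".join(history), return_tensors="pt", truncation=True)
        out_ids = model.generate(enc.input_ids, max_length=max_len)
        reply = tokenizer.decode(out_ids[0], skip_special_tokens=True)
        c_R.append(reply)            # the generated turn becomes part of the right-context
        history = history + [reply]  # ...and of the other agent's conversation history
    return c_R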
The transition probability distribution $\pi$ is initialized from the pre-trained BERT-based model and represents the policy P:
$$\pi = p_{Bert2Bert}(a_t \mid [s_t, c_L]) \quad (9)$$
where $s_t$ is the current state of the conversation and $c_L$ is its left-context. We generate a list of candidate responses A as follows:
$$A = \{a \mid a \sim \pi\} \quad (10)$$
In reinforcement learning, the agent and environment interact over a sequence of actions in a conversation. The goal of the agent is to maximize the expected reward from its actions through this interaction [48]:
$$\underset{\theta}{\mathrm{maximize}} \sum_{[a_t, \dots, a_{t+k}] \in c_R} \pi_\theta(a_t, \dots, a_{t+k}) \, r(a_t, \dots, a_{t+k}) \quad (11)$$
where $a_t$ is the generated response in turn t, $\theta$ is the set of parameters of the model, $c_R$ is the right-context obtained in the simulation process, and $r(a_t, \dots, a_{t+k})$ is the cumulative discounted reward associated with the sequence of utterances $a_t, \dots, a_{t+k}$. When the simulator reaches the end of the conversation, it estimates the reward based on the specific goals, computed as
$$r(a_t, \dots, a_{t+k}) = \sum_{\tau=t}^{t+k} \gamma^{\tau - t} \, r(a_\tau) \quad (12)$$
where $\gamma$ is the discount factor, which adjusts the importance of rewards over time in the reinforcement learning algorithm. $\gamma$ is a real value in $[0, 1]$ and indicates how important future rewards are relative to the current state. We begin with a curriculum learning strategy, simulating the dialogue for k turns and using the policy gradient method to find parameters that maximize the expected future reward. The objective we aim to maximize is the expected cumulative reward:
$$L_\theta = \mathbb{E}_{a_t, \dots, a_{t+k} \sim \pi_\theta}\left[\sum_{\tau=t}^{t+k} \gamma^{\tau-t} r(a_\tau)\right] = \sum_{[a_t, \dots, a_{t+k}] \in c_R} \pi_\theta(a_t, \dots, a_{t+k}) \, r(a_t, \dots, a_{t+k}) \quad (13)$$
$$= \sum_{a_{t:t+k} \in c_R} \pi_\theta(a_{t:t+k}) \, r(a_{t:t+k}) \quad (14)$$
The reinforcement learning algorithm maximizes $L_\theta$ by following its gradient [49]:
$$\nabla_\theta L_\theta = \sum_{a_{t:t+k} \in c_R} \nabla_\theta \pi_\theta(a_{t:t+k}) \, r(a_{t:t+k}) \quad (15)$$
Using the chain rule and the identity $\nabla_\theta f(\theta) = f(\theta) \frac{\nabla_\theta f(\theta)}{f(\theta)} = f(\theta) \nabla_\theta \log f(\theta)$, the above equation can be rewritten as follows:
$$\nabla_\theta L_\theta = \sum_{a_{t:t+k} \in c_R} \pi_\theta(a_{t:t+k}) \, \nabla_\theta \log \pi_\theta(a_{t:t+k}) \, r(a_{t:t+k}) \quad (16)$$
Finally, the derivative of the objective can be written as an expectation as follows:
$$\nabla_\theta L_\theta = \mathbb{E}_{a_{t:t+k} \sim \pi_\theta}\left[\nabla_\theta \log \pi_\theta(a_{t:t+k}) \, r(a_{t:t+k})\right] \quad (17)$$
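A compact sketch of the resulting REINFORCE-style update is shown below; using a single sampled trajectory and a fixed discount factor are simplifications on our part rather than details taken from the paper:

import torch

def discounted_return(step_rewards, gamma=0.9):
    # r(a_t, ..., a_{t+k}) = sum_tau gamma^(tau-t) * r(a_tau), as in Equation (12).
    return sum((gamma ** i) * r for i, r in enumerate(step_rewards))

def reinforce_step(log_probs, step_rewards, optimizer, gamma=0.9):
    # log_probs: list of torch scalars log pi_theta(a_tau), one per simulated turn.
    R = discounted_return(step_rewards, gamma)
    loss = -torch.stack(log_probs).sum() * R  # minimize the negative of Equation (17)'s estimate
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()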
Algorithm 1 summarizes the method used to model the bi-directional context chatbot:
Algorithm 1 DRL-Chat
Require: Input sequence X, ground-truth output sequence Y, and conversation history $C_L$.
Pre-training Policy with Left-context:
1: Initialize the policy model $\pi_\theta$ from pre-trained BERT.
2: for number of training iterations do
3:    Run encoding on X, Y, and $C_L$ and obtain a contextualized encoded vector $\bar{X}_{BERT}$.
4:    Run decoding by feeding $\bar{X}_{BERT}$ to the decoder and obtain a response $\hat{Y}$.
5:    Calculate the loss according to Equation (4) and update the parameters.
6: end for
Fine-tuning Policy with Right-context:
7: for number of training iterations do
8:    Run policy $\pi$ and obtain a response $s_t$.
9:    Run the simulator to obtain the sequence of utterances $s_t, \dots, s_{t+k}$, with $s_t \sim \pi_\theta$.
10:   Observe the sequence and calculate the reward according to Equation (12).
11:   Calculate the loss according to Equation (17) and update the parameters of the model.
12: end for

4. Experiment and Discussion

4.1. Description of Dataset

Our experiments utilized the DailyDialog dataset [50] to evaluate our models. The dataset consists of a wide variety of dialogues from daily communications and is divided into three main categories: Work (14.49%), Ordinary Life (28.26%), and Relationship (33.33%). The dataset was built to cover various topics in our daily lives, and it contains 13,118 multi-turn dialogues. To ensure the dataset’s consistency with real-life experiences, Li et al. [50] invited individuals to engage in social activities (Relationship), discuss recent events (Ordinary Life), and talk about work-related topics (Work).
The DailyDialog dataset serves multiple purposes, including enhancing social bonding. The dataset contains rich emotions and is manually labeled to ensure high quality. Additionally, the dialogues cover various daily scenarios such as holidays, shopping, restaurants, and so on. In contrast to social media datasets such as Twitter Dialog Corpus [18] and Chinese Weibo [51], the language used in DailyDialog is written by humans and often focuses on a specific topic. So, we prefer the conversations in this dataset because of its formal writing style. The primary objective of this dataset is to develop a high-quality multi-turn dialogue dataset, which distinguishes it from most existing dialogue datasets.
As discussed before, reinforcement learning is applied to each given multi-turn dialogue as a Markov Decision Process, in which the agent learns to determine the following action to take in the environment to complete the conversation based on specific criteria. Based on these characteristics, we have chosen the dataset for the proposed model.

4.2. Quantitative Evaluation

It is widely recognized that evaluation plays an essential role in developing the conversational agent. We evaluate dialogue generation systems based on two criteria. One criterion demonstrates a reasonable correlation between human judgment and the response generation task. In contrast, the other criterion measures the consistency and coherence of the utterances in a conversation.
For the first metric, we use the BLEU (Bilingual Evaluation Understudy) score [52]. This metric is based on the string-matching algorithm, which compares consecutive n-grams of the generated response with the consecutive n-grams in the reference sentence and counts the number of matches with weighted scores. The BLEU score measures how many words overlap in a generated response compared to a reference response, and it is widely used to evaluate dialogue quality [27,53]. A higher BLEU score indicates that the generated response is more similar to the reference response and is more likely to be rated as human-like.
For the second metric, we use the ROUGE (Recall-Oriented Understudy for Gisting Evaluation) score, which has recently been employed to evaluate dialogue quality as well [54]. It encompasses multiple metrics for evaluating the quality of a response by comparing it to other human-generated reference versions. This score utilizes n-gram count, word sequence, and word pair to measure the similarity of the chatbot-generated output to the reference text.
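For reference, both metrics can be computed at the sentence level with standard toolkits, for example NLTK’s sentence_bleu and the rouge-score package; these particular tools are assumptions on our part, as the paper does not name its scoring implementation:

from nltk.translate.bleu_score import sentence_bleu, SmoothingFunction
from rouge_score import rouge_scorer

reference = "then you should get up earlier tomorrow"
candidate = "you should get up earlier tomorrow"

bleu = sentence_bleu([reference.split()], candidate.split(),
                     smoothing_function=SmoothingFunction().method1)
rouge = rouge_scorer.RougeScorer(["rouge1", "rougeL"]).score(reference, candidate)
print(bleu, rouge["rougeL"].fmeasure)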

4.3. Designing and Evaluating Experimental Models

Firstly, we deployed a baseline model based on BERT2BERT, which is widely chosen as a baseline model in text generation tasks [43]. We then designed different experimental models to demonstrate the effectiveness of using different contexts for response generation in the conversational model. To this end, we designed three models as follows:
  • BERT LC (only left-context based on BERT) This model addresses the shortcomings of traditional generative models that only generate responses based on the current sentence without considering information from previous conversation turns. It does so by reusing historical information to predict responses, exploiting contextual information from prior turns in the conversation.
  • BERT RC (only right-context with BERT and reinforcement learning) Encoder–decoder architectures have been proven to yield good results in text generation tasks. However, they still lack the ability to process information by considering its impact on the future. To address this limitation, this approach leverages the strengths of the BERT2BERT method to capture the semantic information of a previous conversation and combines it with the power of reinforcement learning algorithms to manage the information flow for future conversations.
  • BERT FC (full context with BERT and reinforcement learning) This model combines the left-context and right-context to take advantage of them. The left-context allows the model to leverage the information already passed in the conversation. At the same time, the right-context helps the model capture the influence of the information flow on future conversations.
We experimented with different context sizes (i.e., different numbers of utterances on the left and right sides of the context) to determine the most appropriate context length for these data. Our best model will then be compared to the latest relevant studies.
Results are presented in Table 2. From these results, we have the following observations:
  • We present the BLEU and ROUGE scores to measure how well conversation generation correlates with human judgment. Three of our models improved the BLEU score compared to the baseline model. However, the difference in BLEU score between the baseline and the proposed model with only the left-context is slight. BERT is a state-of-the-art language model that uses a Transformer-based architecture to learn contextual relationships between words in a text corpus, and one of its main advantages is its ability to integrate contextual information into its understanding of text. Therefore, adding more contextual information on top of it may yield only limited further gains in the model’s effectiveness.
  • When comparing the use of a single context, we found that the left-context-based model improved the average BLEU score by 5% and the average ROUGE score by 7% compared to the baseline, while the right-context-based model achieved corresponding improvements of 8% in the average BLEU score and 10% in the average ROUGE score. When using both the left-context and the right-context, we find that the RL-based model generates more coherent outputs than the baseline model; specifically, it improved the average BLEU score by 12% and the average ROUGE score by 13%.
Figure 5 visually represents the scores for the four experimental models we trained. In general, using contextual information, either the left-context or the right-context, helps the model better capture the mutual information among utterances in a conversation. Among the four experiments, the proposed model with full context achieved the highest BLEU and ROUGE scores because it not only considers historical information but also captures information from future utterances.
In addition, we evaluated our proposed model based on the length of the simulated conversation with 1, 3, 5, and 7 turns (as shown in Table 3). These results are visually represented in Figure 6 and Figure 7. When compared with the baseline, our best model still increased the average BLEU score by 24% and the average ROUGE score by 29%. The simulated conversation was used for model training with reinforcement learning, which performs better as the length of the conversation increases. However, the model quickly converged at turn 5. Although BLEU and ROUGE scores are widely used for evaluating dialogue quality, they are mainly designed for comparing individual sentences rather than entire dialogues, and can become noisy if the data comparison is too long. Moreover, in reinforcement learning, the reward function reflects the success of a model in achieving its goals. Our two reward functions are based only on word embeddings and not on the sentence level. Thus, they may not be strong enough to capture contextual information in long conversations. We recognize this as a reason why RL algorithms may not be effective beyond a certain threshold, which is determined by the length of the simulated dialogue (MDP).
We also compare the proposed method with relevant current methods (shown in Table 4). HRED [55] is a hierarchical model based on the encoder–decoder architecture; it used previous responses and encoded all past information to generate a probable next token, and its experiments demonstrated improvements over earlier models. COHA [56] builds its model using emotion states and captures expressions from predefined emotions, showing that it can generate contextually and emotionally appropriate responses. More recently, PLATO [57], a pre-trained dialogue generation model, used hidden vectors to determine inherent features and attention mechanisms to combine contextual information with characteristic features, showing an improvement over previous studies.
These current studies are shown in Table 4. Our proposed model showed a significant improvement, especially in higher BLEU score indices such as BLEU2, BLEU3, and BLEU4. This indicates that the use of context in our model helped generate more accurate and coherent responses. Thus, our proposed model can potentially improve the quality of automated conversation systems. Moreover, these results demonstrate superior performance compared to those of previous studies. While the BLEU index decreases significantly for earlier models in these studies as we compare BLEU2, BLEU3, and BLEU4 to BLEU1, our model still produces results that are not too far off. This finding confirms that our model generates accurate responses with highly natural, human-like tendencies appropriate to the conversation content.
Upon concluding our analysis, we present a comprehensive summary and thorough comparison of the proposed model’s efficacy against our experimental models and relevant studies when compared to the baseline model (as shown in Table 5).
The results presented in Table 5 clearly demonstrate that models constructed within a contextualization framework exhibit a significant enhancement in the coherence and consistency of a conversation. The extensive experimental results also provide compelling evidence for the effectiveness of the proposed methods in generating responses that are both reasonable and coherent. Thus, it can be inferred that building models using a context-based approach is a promising strategy for improving conversational coherence and consistency. These experimental findings provide strong support for this claim and highlight the potential benefits of adopting a context-driven approach in natural language processing.

5. Conclusions and Future Work

In this study, we developed a novel conversational model that leverages effective context to improve the coherence and consistency of conversations. Our model is based on a Transformer-based sequence-to-sequence model, which utilizes BERT to encode the current utterance’s left-context. By employing a reinforcement learning strategy and building the corresponding reward function, we can incorporate the right-context of conversations during the training process to enhance the generation model. The proposed model can capture the flow of conversation and the relationships between utterances more effectively by utilizing both left-context and right-context. The left-context helps the system keep track of the conversation’s history, including the topics discussed and any pertinent information mentioned earlier. This feature assists the system in generating more appropriate responses that consider the current state of the conversation. On the other hand, the right-context enables the system to anticipate the conversation’s direction and generate more forward-looking responses.
Experimental results have shown that our proposals effectively improve the quality of generated responses. In our comparison, we discovered that the left-context-based model increased the average BLEU score by 5% and the average ROUGE score by 7% compared to the baseline when utilizing a single context. Meanwhile, the right-context-based model achieved an 8% BLEU score and a 10% ROUGE score improvement. However, when we employed both left and right contexts, we observed that the RL-based models generated more coherent results than the baseline model. Specifically, our best model increased the average BLEU score by 24% and the average ROUGE score by 29%. We compared the outcomes of our proposed models with those of other studies in the literature. Our model consistently outperformed them by an average of 5% to 151% based on the BLEU score.

5.1. Theoretical Implications

Our proposal is a novel approach that considers contextual factors to improve the conversational model. Humans employ contextual information in everyday decision making, so mimicking this in chatbot systems is a sensible idea. We performed multiple experiments to show that incorporating both the left- and right-contexts is more effective than using either one separately, and significantly more advantageous than not utilizing context at all. Based on our experiments, having more context can be beneficial when training a chatbot system: it can help the system better understand the user’s intent and provide more accurate and relevant responses.
Moreover, we also proposed an approach that resolves issues with traditional neural network models in conversation response generation by integrating Transformer-based Seq2Seq models and RL. The experimental results demonstrate that our proposed model improves significantly over current relevant studies. From an academic perspective, this direction still offers considerable room for further research and is well worth deeper investigation by interested researchers.

5.2. Practical Implications

Building conversational agents to generate appropriate and meaningful responses is a challenging problem in the field of natural language processing. Moreover, consistency is indeed important in chatbot conversations. Consistency helps ensure the chatbot’s responses are in alignment with the user’s purpose. One of the critical factors humans use in daily communication is context and the ability to anticipate.
Unlike the earlier studies, we provide additional information for a chatbot by exploring contextual factors in the conversation. These factors help the chatbot generate appropriate responses that have a clear purpose and align with the context of the conversation. A consistent chatbot experience helps users feel comfortable and confident when using the chatbot. This leads to a positive user experience, increasing the likelihood of the user returning to the chatbot.

5.3. Future Work

The use of BLEU and ROUGE scores for sentence comparison in chatbot systems has been widely debated due to uncertainty about their correlation with human response quality. Although BLEU and ROUGE scores have been extensively used for evaluating dialogue quality, they are primarily designed for comparing sentences rather than dialogues. Therefore, if the data comparison is too lengthy, these scores can become noisy. It is necessary to identify better automatic evaluation metrics for the future development of dialogue systems. Developing chatbots with human-like thinking capabilities is still challenging.
Based on the improved results of this study, we will design additional rewards that are based on the characteristics of human decision making. These rewards can guide chatbot behavior to be more in line with human expectations, ultimately improving the quality of the chatbot and making conversations feel more natural. We will also enrich our model with a Large Language Model (LLM), such as GPT, or plan to incorporate popular RL techniques to build self-learning conversational agents. With these enhancements, chatbots can become more effective tools for communication, customer service, and other purposes.

Author Contributions

Conceptualization, Q.-D.L.T.; Data curation, Q.-D.L.T.; Formal analysis, Q.-D.L.T. and A.-C.L.; Investigation, Q.-D.L.T. and A.-C.L.; Methodology, Q.-D.L.T. and A.-C.L.; Supervision, A.-C.L.; Writing—original draft, Q.-D.L.T.; Writing—review and editing, Q.-D.L.T. and A.-C.L. All authors have read and agreed to the published version of the manuscript.

Funding

This research received no external funding.

Institutional Review Board Statement

Not applicable.

Informed Consent Statement

Not applicable.

Data Availability Statement

The datasets analyzed during the current study are available from the DailyDialog web repository (http://yanran.li/dailydialog) [50], accessed on 2 February 2023.

Conflicts of Interest

The authors declare no conflict of interest.

Abbreviations

The following abbreviations are used in this manuscript:
AIML: Artificial Intelligence Markup Language
ALICE: Artificial Linguistic Internet Computer Entity
BERT: Bi-directional Encoder Representations from Transformers
BLEU: Bilingual Evaluation Understudy
CNN: Convolutional Neural Network
DRL: Deep reinforcement learning
LLM: Large Language Model
LSTM: Long Short Term Memory
MLE: Maximum-likelihood estimation
MDP: Markov Decision Process
NLP: Natural Language Processing
POMDP: Partially Observable Markov Decision Process
ROUGE: Recall-Oriented Understudy for Gisting Evaluation
RL: Reinforcement learning
Seq2Seq: Sequence-to-sequence
SMT: Statistical Machine Translation

References

  1. Wallace, R. The Anatomy of A.L.I.C.E. In Parsing the Turing Test; Springer: Dordrecht, The Netherlands, 2009; pp. 181–210. [Google Scholar]
  2. Jafarpour, S.; Burges, C.; Ritter, A. Filter, Rank, and Transfer the Knowledge: Learning to Chat. Adv. Rank. 2010, 10, 2329–9290. [Google Scholar]
  3. Yan, Z.; Duan, N.; Bao, J.; Chen, P.; Zhou, M.; Li, Z.; Zhou, J. DocChat: An Information Retrieval Approach for Chatbot Engines Using Unstructured Documents. In Proceedings of the 54th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), Berlin, Germany, 7–12 August 2016; Association for Computational Linguistics: Berlin, Germany, 2016; pp. 516–525. [Google Scholar] [CrossRef]
  4. Zhong, H.; Dou, Z.; Zhu, Y.; Qian, H.; Wen, J.R. Less is More: Learning to Refine Dialogue History for Personalized Dialogue Generation. In Proceedings of the 2022 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Seattle, WA, USA, 10–15 July 2022; Association for Computational Linguistics: Seattle, WA, USA, 2022; pp. 5808–5820. [Google Scholar] [CrossRef]
  5. Serban, I.V.; Sordoni, A.; Bengio, Y.; Courville, A.; Pineau, J. Building End-to-End Dialogue Systems Using Generative Hierarchical Neural Network Models. In Proceedings of the Thirtieth AAAI Conference on Artificial Intelligence, Phoenix, AZ, USA, 12–17 February 2016; pp. 3776–3783. [Google Scholar]
  6. Mou, L.; Song, Y.; Yan, R.; Li, G.; Zhang, L.; Jin, Z. Sequence to Backward and Forward Sequences: A Content-Introducing Approach to Generative Short-Text Conversation. In Proceedings of the COLING 2016, the 26th International Conference on Computational Linguistics: Technical Papers, Osaka, Japan, 11–16 December 2016; The COLING 2016 Organizing Committee: Osaka, Japan, 2016; pp. 3349–3358. [Google Scholar]
  7. Sutskever, I.; Vinyals, O.; Le, Q.V. Sequence to Sequence Learning with Neural Networks. In Proceedings of the 27th International Conference on Neural Information Processing Systems, Bangkok, Thailand, 18–22 November 2020; MIT Press: Cambridge, MA, USA, 2014; Volume 2, NIPS’14. pp. 3104–3112. [Google Scholar]
  8. Sordoni, A.; Galley, M.; Auli, M.; Brockett, C.; Ji, Y.; Mitchell, M.; Nie, J.Y.; Gao, J.; Dolan, B. A Neural Network Approach to Context-Sensitive Generation of Conversational Responses. In Proceedings of the 2015 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Denver, CO, USA, 31 May–5 June 2015; Association for Computational Linguistics: Denver, CO, USA, 2015; pp. 196–205. [Google Scholar] [CrossRef]
  9. Xu, H.D.; Mao, X.L.; Chi, Z.; Sun, F.; Zhu, J.; Huang, H. Generating Informative Dialogue Responses with Keywords-Guided Networks. In Proceedings of the Natural Language Processing and Chinese Computing: 10th CCF International Conference, NLPCC 2021, Qingdao, China, 13–17 October 2021; Proceedings, Part II. Springer: Berlin/Heidelberg, Germany, 2021; pp. 179–192. [Google Scholar] [CrossRef]
  10. Ismail, J.; Ahmed, A.; Ouaazizi Aziza, E. Improving a Sequence-to-sequence NLP Model using a Reinforcement Learning Policy Algorithm. In Proceedings of the Artificial Intelligence, Soft Computing and Applications. Academy and Industry Research Collaboration Center (AIRCC), Copenhagen, Denmark, 29–30 January 2022. [Google Scholar] [CrossRef]
  11. Csaky, R. Deep Learning Based Chatbot Models. arXiv 2019, arXiv:1908.08835. [Google Scholar]
  12. Cai, P.; Wan, H.; Liu, F.; Yu, M.; Yu, H.; Joshi, S. Learning as Conversation: Dialogue Systems Reinforced for Information Acquisition. In Proceedings of the 2022 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Seattle, WA, USA, 10–15 July 2022; Association for Computational Linguistics: Seattle, WA, USA, 2022; pp. 4781–4796. [Google Scholar] [CrossRef]
  13. Levin, E.; Pieraccini, R.; Eckert, W. A stochastic model of human-machine interaction for learning dialog strategies. IEEE Trans. Speech Audio Process. 2000, 8, 11–23. [Google Scholar] [CrossRef]
  14. Pieraccini, R.; Suendermann, D.; Dayanidhi, K.; Liscombe, J. Are We There Yet? Research in Commercial Spoken Dialog Systems. In Text, Speech and Dialogue; Springer: Berlin/Heidelberg, Germany, 2009; pp. 3–13. [Google Scholar] [CrossRef]
  15. Yang, M.; Huang, W.; Tu, W.; Qu, Q.; Shen, Y.; Lei, K. Multitask Learning and Reinforcement Learning for Personalized Dialog Generation: An Empirical Study. IEEE Trans. Neural Netw. Learn. Syst. 2021, 32, 49–62. [Google Scholar] [CrossRef] [PubMed]
  16. Weizenbaum, J. ELIZA—A Computer Program for the Study of Natural Language Communication between Man and Machine. Commun. ACM 1966, 9, 36–45. [Google Scholar] [CrossRef]
  17. Parkison, R.C.; Colby, K.M.; Faught, W.S. Conversational Language Comprehension Using Integrated Pattern-Matching and Parsing. In Readings in Natural Language Processing; Morgan Kaufmann Publishers Inc.: San Francisco, CA, USA, 1986; pp. 551–562. [Google Scholar]
  18. Ritter, A.; Cherry, C.; Dolan, W.B. Data-Driven Response Generation in Social Media. In Proceedings of the 2011 Conference on Empirical Methods in Natural Language Processing, Edinburgh, UK, 27–31 July 2011; Association for Computational Linguistics: Edinburgh, UK, 2011; pp. 583–593. [Google Scholar]
  19. Wu, Y.; Schuster, M.; Chen, Z.; Le, Q.V.; Norouzi, M.; Macherey, W.; Krikun, M.; Cao, Y.; Gao, Q.; Macherey, K.; et al. Google’s Neural Machine Translation System: Bridging the Gap between Human and Machine Translation. arXiv 2016, arXiv:1609.08144. [Google Scholar]
  20. Shen, S.; Cheng, Y.; He, Z.; He, W.; Wu, H.; Sun, M.; Liu, Y. Minimum Risk Training for Neural Machine Translation. arXiv 2015, arXiv:1512.02433. [Google Scholar]
  21. Vaswani, A.; Bengio, S.; Brevdo, E.; Chollet, F.; Gomez, A.N.; Gouws, S.; Jones, L.; Kaiser, L.; Kalchbrenner, N.; Parmar, N.; et al. Tensor2Tensor for Neural Machine Translation. arXiv 2018, arXiv:1803.07416. [Google Scholar]
  22. Nallapati, R.; Xiang, B.; Zhou, B. Sequence-to-Sequence RNNs for Text Summarization. arXiv 2016, arXiv:1602.06023. [Google Scholar]
  23. Nallapati, R.; Zhai, F.; Zhou, B. SummaRuNNer: A Recurrent Neural Network based Sequence Model for Extractive Summarization of Documents. arXiv 2016, arXiv:1611.04230. [Google Scholar] [CrossRef]
  24. Paulus, R.; Xiong, C.; Socher, R. A Deep Reinforced Model for Abstractive Summarization. arXiv 2017, arXiv:1705.04304. [Google Scholar]
  25. Pamungkas, E.W. Emotionally-Aware Chatbots: A Survey. arXiv 2019, arXiv:1906.09774. [Google Scholar]
  26. Li, J.; Monroe, W.; Ritter, A.; Jurafsky, D.; Galley, M.; Gao, J. Deep Reinforcement Learning for Dialogue Generation. In Proceedings of the 2016 Conference on Empirical Methods in Natural Language Processing, Austin, TX, USA, 1–5 November 2016; Association for Computational Linguistics: Austin, TX, USA, 2016; pp. 1192–1202. [Google Scholar] [CrossRef]
  27. Li, J.; Galley, M.; Brockett, C.; Gao, J.; Dolan, B. A Diversity-Promoting Objective Function for Neural Conversation Models. In Proceedings of the 2016 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, San Diego, CA, USA, 12–17 June 2016; Association for Computational Linguistics: San Diego, CA, USA, 2016; pp. 110–119. [Google Scholar] [CrossRef]
  28. Uc-Cetina, V.; Navarro-Guerrero, N.; Martín-González, A.; Weber, C.; Wermter, S. Survey on reinforcement learning for language processing. Artif. Intell. Rev. 2023, 56, 1543–1573. [Google Scholar] [CrossRef]
  29. Gašić, M.; Breslin, C.; Henderson, M.; Kim, D.; Szummer, M.; Thomson, B.; Tsiakoulis, P.; Young, S. POMDP-based dialogue manager adaptation to extended domains. In Proceedings of the SIGDIAL 2013 Conference, Metz, France, 22–24 August 2013; Association for Computational Linguistics: Metz, France, 2013; pp. 214–222. [Google Scholar]
  30. Young, S.; Gašić, M.; Thomson, B.; Williams, J.D. POMDP-Based Statistical Spoken Dialog Systems: A Review. Proc. IEEE 2013, 101, 1160–1179. [Google Scholar] [CrossRef]
  31. Xiang, X.; Foo, S. Recent Advances in Deep Reinforcement Learning Applications for Solving Partially Observable Markov Decision Processes (POMDP) Problems: Part 1—Fundamentals and Applications in Games, Robotics and Natural Language Processing. Mach. Learn. Knowl. Extr. 2021, 3, 554–581. [Google Scholar] [CrossRef]
  32. Hsueh, Y.L.; Chou, T.L. A Task-Oriented Chatbot Based on LSTM and Reinforcement Learning. ACM Trans. Asian Low-Resour. Lang. Inf. Process. 2022, 22, 1–27. [Google Scholar] [CrossRef]
  33. Chen, Z.; Chen, L.; Liu, X.; Yu, K. Distributed Structured Actor-Critic Reinforcement Learning for Universal Dialogue Management. IEEE/ACM Trans. Audio Speech Lang. Process. 2020, 28, 2400–2411. [Google Scholar] [CrossRef]
  34. Ultes, S.; Rojas-Barahona, L.M.; Su, P.H.; Vandyke, D.; Kim, D.; Casanueva, I.; Budzianowski, P.; Mrkšić, N.; Wen, T.H.; Gašić, M.; et al. PyDial: A Multi-domain Statistical Dialogue System Toolkit. In Proceedings of the ACL 2017, System Demonstrations, Vancouver, BC, Canada, 30 July–4 August 2017; Association for Computational Linguistics: Vancouver, BC, Canada, 2017; pp. 73–78. [Google Scholar]
  35. Verma, S.; Fu, J.; Yang, S.; Levine, S. CHAI: A CHatbot AI for Task-Oriented Dialogue with Offline Reinforcement Learning. In Proceedings of the 2022 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Seattle, WA, USA, 10–15 July 2022; Association for Computational Linguistics: Seattle, WA, USA, 2022; pp. 4471–4491. [Google Scholar] [CrossRef]
  36. Devlin, J.; Chang, M.; Lee, K.; Toutanova, K. BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding. arXiv 2018, arXiv:1810.04805. [Google Scholar]
  37. De Coster, M.; Dambre, J. Leveraging Frozen Pretrained Written Language Models for Neural Sign Language Translation. Information 2022, 13, 220. [Google Scholar] [CrossRef]
  38. Yan, R.; Li, J.; Su, X.; Wang, X.; Gao, G. Boosting the Transformer with the BERT Supervision in Low-Resource Machine Translation. Appl. Sci. 2022, 12, 7195. [Google Scholar] [CrossRef]
  39. Kurtic, E.; Campos, D.; Nguyen, T.; Frantar, E.; Kurtz, M.; Fineran, B.; Goin, M.; Alistarh, D. The Optimal BERT Surgeon: Scalable and Accurate Second-Order Pruning for Large Language Models. In Proceedings of the 2022 Conference on Empirical Methods in Natural Language Processing, Abu Dhabi, United Arab Emirates, 7–11 December 2022; Association for Computational Linguistics: Abu Dhabi, United Arab Emirates, 2022; pp. 4163–4181. [Google Scholar]
  40. Shen, T.; Li, J.; Bouadjenek, M.R.; Mai, Z.; Sanner, S. Towards understanding and mitigating unintended biases in language model-driven conversational recommendation. Inf. Process. Manag. 2023, 60, 103139. [Google Scholar] [CrossRef]
  41. Lewis, M.; Liu, Y.; Goyal, N.; Ghazvininejad, M.; Mohamed, A.; Levy, O.; Stoyanov, V.; Zettlemoyer, L. BART: Denoising Sequence-to-Sequence Pre-training for Natural Language Generation, Translation, and Comprehension. In Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics, Association for Computational Linguistics, Online, 5–10 July 2020; pp. 7871–7880. [Google Scholar] [CrossRef]
  42. Raffel, C.; Shazeer, N.; Roberts, A.; Lee, K.; Narang, S.; Matena, M.; Zhou, Y.; Li, W.; Liu, P.J. Exploring the Limits of Transfer Learning with a Unified Text-to-Text Transformer. J. Mach. Learn. Res. 2020, 21, 5485–5551. [Google Scholar]
  43. Rothe, S.; Narayan, S.; Severyn, A. Leveraging Pre-trained Checkpoints for Sequence Generation Tasks. Trans. Assoc. Comput. Linguist. 2020, 8, 264–280. [Google Scholar] [CrossRef]
  44. Chen, C.; Yin, Y.; Shang, L.; Jiang, X.; Qin, Y.; Wang, F.; Wang, Z.; Chen, X.; Liu, Z.; Liu, Q. bert2BERT: Towards Reusable Pretrained Language Models. In Proceedings of the 60th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), Dublin, Ireland, 22–27 May 2022; Association for Computational Linguistics: Dublin, Ireland, 2022; pp. 2134–2148. [Google Scholar] [CrossRef]
  45. Naous, T.; Bassyouni, Z.; Mousi, B.; Hajj, H.; Hajj, W.E.; Shaban, K. Open-Domain Response Generation in Low-Resource Settings Using Self-Supervised Pre-Training of Warm-Started Transformers. ACM Trans. Asian Low-Resour. Lang. Inf. Process. 2023, 22, 1–12. [Google Scholar] [CrossRef]
  46. Devlin, J.; Chang, M.W.; Lee, K.; Toutanova, K. BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding. In Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Minneapolis, MN, USA, 2–7 June 2019; Association for Computational Linguistics: Minneapolis, MN, USA, 2019; Volume 1, (Long and Short Papers). pp. 4171–4186. [Google Scholar] [CrossRef]
  47. Sutton, R.S.; Barto, A.G. Reinforcement Learning: An Introduction; A Bradford Book: Cambridge, MA, USA, 2018. [Google Scholar]
  48. Williams, R.J. Simple Statistical Gradient-Following Algorithms for Connectionist Reinforcement Learning. Mach. Learn. 1992, 8, 229–256. [Google Scholar] [CrossRef]
  49. Zaremba, W.; Sutskever, I. Reinforcement Learning Neural Turing Machines. arXiv 2015, arXiv:1505.00521. [Google Scholar]
  50. Li, Y.; Su, H.; Shen, X.; Li, W.; Cao, Z.; Niu, S. DailyDialog: A Manually Labelled Multi-turn Dialogue Dataset. In Proceedings of the Eighth International Joint Conference on Natural Language Processing, Taipei, Taiwan, 27 November–1 December 2017; Asian Federation of Natural Language Processing: Taipei, Taiwan, 2017; Volume 1: Long Papers, pp. 986–995. [Google Scholar]
  51. Wang, H.; Lu, Z.; Li, H.; Chen, E. A Dataset for Research on Short-Text Conversations. In Proceedings of the 2013 Conference on Empirical Methods in Natural Language Processing, Seattle, WA, USA, 18–21 October 2013; Association for Computational Linguistics: Seattle, WA, USA, 2013; pp. 935–945. [Google Scholar]
  52. Papineni, K.; Roukos, S.; Ward, T.; Zhu, W.J. BLEU: A Method for Automatic Evaluation of Machine Translation. In Proceedings of the 40th Annual Meeting on Association for Computational Linguistics, Philadelphia, PA, USA, 7–12 July 2002; Association for Computational Linguistics: Philadelphia, PA, USA, 2002. ACL’02. pp. 311–318. [Google Scholar] [CrossRef]
  53. Vinyals, O.; Le, Q.V. A Neural Conversational Model. In Proceedings of the ICML, Lille, France, 6–11 July 2015. [Google Scholar]
  54. Kapočiūtė-Dzikienė, J. A Domain-Specific Generative Chatbot Trained from Little Data. Appl. Sci. 2020, 10, 2221. [Google Scholar] [CrossRef]
  55. Sordoni, A.; Bengio, Y.; Vahabi, H.; Lioma, C.; Grue Simonsen, J.; Nie, J.Y. A Hierarchical Recurrent Encoder-Decoder for Generative Context-Aware Query Suggestion. In Proceedings of the 24th ACM International on Conference on Information and Knowledge Management, New York, NY, USA, 18–23 October 2015; Association for Computing Machinery: New York, NY, USA, 2015. CIKM ’15. pp. 553–562. [Google Scholar] [CrossRef]
  56. Zhou, H.; Huang, M.; Zhang, T.; Zhu, X.; Liu, B. Emotional Chatting Machine: Emotional Conversation Generation with Internal and External Memory. In Proceedings of the Thirty-Second AAAI Conference on Artificial Intelligence and Thirtieth Innovative Applications of Artificial Intelligence Conference and Eighth AAAI Symposium on Educational Advances in Artificial Intelligence, New Orleans, LA, USA, 2–7 February 2018; AAAI Press: New Orleans, LA, USA, 2018. [Google Scholar]
  57. Bao, S.; He, H.; Wang, F.; Wu, H.; Wang, H. PLATO: Pre-trained Dialogue Generation Model with Discrete Latent Variable. In Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics, Association for Computational Linguistics, Online, 5–10 July 2020; pp. 85–96. [Google Scholar] [CrossRef]
  58. Fan, J.; Yuan, L.; Song, H.; Tang, H.; Yang, R. NLP Final Project: A Dialogue System; Hong Kong University of Science and Technology (HKUST): Hong Kong, China, 2020. [Google Scholar]
Figure 1. The architecture of the pre-trained BERT model.
Figure 2. The architecture of the encoder. The encoder is initialized from multiple BERT blocks, each composed of a multi-head attention layer and a feed-forward network.
Figure 3. The architecture of the decoder, in which each BERT block has cross-attention layers added between the multi-head attention and the feed-forward network.
Figure 4. Bi-directional context for the response generator using deep reinforcement learning.
Figure 5. Performance of the experiments measured by BLEU score (a) and ROUGE score (b).
Figure 6. Performance with different lengths of simulated conversation measured by BLEU score (a) and ROUGE score (b).
Figure 7. Average BLEU and ROUGE scores with different lengths of simulated conversation.
Table 1. A conversation between two people in which the last turn requires information from earlier in the conversation.
Turn 1: Mom, how can we get to the supermarket?
Turn 2: We can take a bus.
Turn 3: Does this bus go there?
Turn 4: I can’t see clearly.
Turn 5: Let’s step in.
Turn 6: No, it’s not right.
Turn 7: Mom! What’s bus number?
Table 2. Summarization results of different models.
Model | BLEU 1 | BLEU 2 | BLEU 3 | BLEU 4 | ROUGE Precision | ROUGE Recall | ROUGE F-measure
BERT (Baseline) | 0.452 | 0.328 | 0.269 | 0.207 | 0.315 | 0.279 | 0.279
BERT LC (only left-context) | 0.458 | 0.345 | 0.287 | 0.226 | 0.332 | 0.302 | 0.301
BERT RC (only right-context with 1 turn) | 0.466 | 0.357 | 0.297 | 0.235 | 0.337 | 0.313 | 0.309
BERT FC (bi-directional context with 1 turn) | 0.488 | 0.367 | 0.308 | 0.246 | 0.345 | 0.325 | 0.320
Table 3. Model performance using different lengths of simulated conversation.
Model | BLEU 1 | BLEU 2 | BLEU 3 | BLEU 4 | ROUGE Precision | ROUGE Recall | ROUGE F-measure
BERT FC (1 turn) | 0.488 | 0.367 | 0.308 | 0.246 | 0.345 | 0.325 | 0.320
BERT FC (3 turns) | 0.492 | 0.374 | 0.316 | 0.256 | 0.362 | 0.336 | 0.334
BERT FC (5 turns) | 0.518 | 0.403 | 0.348 | 0.289 | 0.389 | 0.371 | 0.366
BERT FC (7 turns) | 0.500 | 0.390 | 0.337 | 0.281 | 0.381 | 0.366 | 0.362
Table 4. Comparison of the proposed model with recent studies.
Model | BLEU 1 | BLEU 2 | BLEU 3 | BLEU 4
Best proposed model | 0.518 | 0.403 | 0.348 | 0.289
HRED [50] | 0.396 | 0.174 | 0.019 | 0.009
COHA [50] | 0.379 | 0.156 | 0.018 | 0.066
COHA + Attention [50] | 0.464 | 0.220 | 0.017 | 0.009
Plato [58] | 0.486 | 0.389 | - | -
Table 5. Rate of increase in BLEU score compared to our best proposed model.
Model | Average BLEU | Best Proposed Model (Rate of Increase in BLEU Score)
BERT (baseline) | 0.314 | 0.389 (+24%)
HRED | 0.15 | 0.389 (+151%)
COHA | 0.15 | 0.389 (+151%)
COHA + Attention | 0.178 | 0.389 (+118%)
Plato | 0.438 | 0.461 (+5%)
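As we read Table 5, the rate of increase is the relative difference between the average BLEU of the best proposed model and that of each other model, where the average is taken over whichever BLEU-n scores the other model reports in Table 4 (hence the Plato comparison uses BLEU-1/2 only). The sketch below illustrates this reading for the BERT baseline and Plato rows; the helper names are ours and the exact published figures may reflect additional rounding.

```python
# Minimal sketch of the rate-of-increase computation assumed for Table 5.
# BLEU-n scores are taken from Table 4; helper names are illustrative only.

def avg(scores):
    return sum(scores) / len(scores)

def rate_of_increase(best, other):
    """Relative gain of the best proposed model's average BLEU over another model's."""
    return (avg(best) - avg(other)) / avg(other)

best_bleu = [0.518, 0.403, 0.348, 0.289]   # BERT FC (5 turns), BLEU-1..4
baseline  = [0.452, 0.328, 0.269, 0.207]   # BERT (baseline),   BLEU-1..4
plato     = [0.486, 0.389]                 # Plato reports BLEU-1/2 only

print(f"vs. BERT baseline: +{rate_of_increase(best_bleu, baseline):.0%}")   # +24%
print(f"vs. Plato:         +{rate_of_increase(best_bleu[:2], plato):.0%}")  # +5%
```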
Disclaimer/Publisher’s Note: The statements, opinions and data contained in all publications are solely those of the individual author(s) and contributor(s) and not of MDPI and/or the editor(s). MDPI and/or the editor(s) disclaim responsibility for any injury to people or property resulting from any ideas, methods, instructions or products referred to in the content.
