This section presents the developed solution, beginning with a description of the data used in the study. The methodology is then outlined and discussed: first, the formal definition of the task; then, the data preprocessing methods used to construct the dialog subgraphs; and finally, the bidirectional reasoning model in detail.
3.1. Task Definition
We define the target-oriented response generation task as follows:

$\hat{y} = \arg\max_{y} P(y \mid c, t, G) \qquad (1)$

Here, c represents the conversation context, G is a knowledge subgraph associated with the context, t is the dialog target, and y is the transition response, i.e., the model's output that connects the conversation context c and the target t. Equation (1) signifies that the output $\hat{y}$ is determined by selecting the response y that maximizes the conditional probability; in other words, the argmax operation selects the response that is most likely to be a suitable transition given the context, the target, and the associated knowledge subgraph. Specifically, our method generates a prompt keyword set z for the transition based on the subgraph G, which is extracted from the knowledge graph ConceptNet, and then integrates the keywords z with the conversation context c to generate an appropriate transition response y.
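As a toy illustration of Equation (1), the following sketch (our own simplified example, not the paper's implementation) scores a small set of hypothetical candidate transitions with a stand-in scoring function and selects the argmax; in the actual method, the probability is produced by the model described below.

```python
# Minimal sketch of Equation (1): choose the candidate transition response y
# that maximizes a (here purely illustrative) conditional score P(y | c, t, G).

def score_response(y: str, context: str, target: str, subgraph: set) -> float:
    """Stand-in for the learned conditional probability P(y | c, t, G):
    here we simply count how many subgraph concepts the candidate mentions."""
    return sum(1.0 for concept in subgraph if concept in y.lower())

context = "I watched a great movie last night."
target = "travel"
subgraph = {"movie", "scene", "location", "travel"}   # toy knowledge subgraph G

candidates = [
    "That sounds fun! Was it filmed in a location you would like to travel to?",
    "I prefer reading books.",
]

# argmax over candidate responses, mirroring Equation (1)
best = max(candidates, key=lambda y: score_response(y, context, target, subgraph))
print(best)
```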
3.2. Method Overview
Three Steps Our method is divided into three steps: data preprocessing, training, and prediction. Data preprocessing involves dialog data cleaning, keyword extraction (the red font in Figure 1 shows an example of keyword extraction), and dialog subgraph construction. During the training phase, the model is trained in a supervised manner by minimizing the negative log-likelihood. In the prediction stage, the trained model is used together with the beam-search [37] algorithm to generate the response.
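The sketch below illustrates the negative log-likelihood objective used in the training phase, assuming a PyTorch-style model that outputs per-token logits over the vocabulary; the tensor shapes and the padding id are illustrative assumptions. At prediction time, beam search simply keeps the highest-scoring partial hypotheses at each decoding step.

```python
import torch
import torch.nn.functional as F

def nll_loss(logits: torch.Tensor, target_ids: torch.Tensor, pad_id: int = 0) -> torch.Tensor:
    """Negative log-likelihood over the gold response tokens.
    logits: (batch, seq_len, vocab_size); target_ids: (batch, seq_len)."""
    return F.cross_entropy(
        logits.reshape(-1, logits.size(-1)),  # flatten to (batch * seq_len, vocab)
        target_ids.reshape(-1),               # flatten to (batch * seq_len,)
        ignore_index=pad_id,                  # do not penalize padding positions
    )

# toy example: batch of 2, sequence length 4, vocabulary of 10 tokens
logits = torch.randn(2, 4, 10)
targets = torch.randint(1, 10, (2, 4))
print(nll_loss(logits, targets))
```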
Main Components Our model comprises three main components, as shown in Figure 1: an encoder and two decoders. The first component is the dialog encoder, which utilizes a Transformer encoder to encode both the source and target utterances into an embedding vector. The encoder plays a fundamental role in capturing the semantic information of the dialogs: its self-attention mechanism captures long-range dependencies, its positional encoding preserves word order, and its architecture scales well and parallelizes efficiently. The subsequent steps of our method (bidirectional reasoning and response generation) operate on the embedding produced by the dialog encoder. The second component is the bidirectional reasoning decoder. A graph neural network first encodes the dialog subgraph constructed during the data preprocessing stage; the dialog embedding obtained from the dialog encoder is then fed into the decoder, and an attention mechanism [38] is applied to focus on the dialog subgraph and generate prompt keywords. The third component is a response generator based on GPT, which takes the dialog context and the prompt keywords as input and generates the final response.
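The skeleton below sketches how the three components could be wired together in PyTorch. All module sizes, layer counts, and the attention wiring are illustrative assumptions rather than the paper's exact configuration, and the GPT-based generator is reduced to a plain output projection.

```python
import torch
import torch.nn as nn

class TargetOrientedResponder(nn.Module):
    """Illustrative three-component skeleton: dialog encoder, graph reasoning
    with attention over subgraph nodes, and a response-generation head."""

    def __init__(self, vocab_size: int = 30000, d_model: int = 256):
        super().__init__()
        # (1) dialog encoder: Transformer encoder over the concatenated utterances
        self.embed = nn.Embedding(vocab_size, d_model)
        layer = nn.TransformerEncoderLayer(d_model=d_model, nhead=4, batch_first=True)
        self.dialog_encoder = nn.TransformerEncoder(layer, num_layers=3)
        # (2) bidirectional reasoning decoder: project graph-node embeddings and
        #     attend to them, conditioned on the dialog embedding
        self.graph_proj = nn.Linear(d_model, d_model)
        self.graph_attn = nn.MultiheadAttention(d_model, num_heads=4, batch_first=True)
        # (3) response generator: a GPT-style decoder would be plugged in here;
        #     a linear projection stands in for it in this sketch
        self.response_head = nn.Linear(d_model, vocab_size)

    def forward(self, dialog_ids: torch.Tensor, node_embeddings: torch.Tensor):
        h = self.dialog_encoder(self.embed(dialog_ids))        # dialog embeddings
        g = self.graph_proj(node_embeddings)                   # encoded graph nodes
        ctx, attn = self.graph_attn(query=h, key=g, value=g)   # attend to the subgraph
        return self.response_head(ctx), attn                   # logits + node attention

model = TargetOrientedResponder()
logits, attn = model(torch.randint(0, 30000, (1, 12)), torch.randn(1, 8, 256))
print(logits.shape, attn.shape)
```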
3.4. Subgraph Construction
We describe the construction of a dialog subgraph from c and t, represented as G = (V, E). Here, V denotes the set of topic nodes, while E represents the edges connecting these topics. The specific details are provided in Figure 2, and the numbers in the nodes of Figure 2 indicate the order in which the keywords are added to the dialog subgraph.
Node Selection To determine the nodes in G, we employ a rule-based keyword extractor that combines TF-IDF [39] and Part-of-Speech [40] features to extract keywords from c and t. The keywords in c serve as the source topic nodes, denoted as $V^s = \{v_1^s, \ldots, v_p^s\}$, while the keywords in t serve as the target topic nodes, denoted as $V^t = \{v_1^t, \ldots, v_q^t\}$, where p and q represent the number of keywords in the source c and target t, respectively. Therefore, the initial node set is $V^s \cup V^t \subseteq V$. Afterward, we retrieve the neighboring nodes of the keywords from ConceptNet, choosing N of them to add to V, and establish edges among them. The appropriate value of N is determined in the ablation experiments presented later. Furthermore, for each keyword in the subgraph, we add a dedicated termination node as a neighbor; this termination node serves as the condition for terminating the decoding generation.
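A minimal sketch of the subgraph construction is given below. It assumes the keyword lists have already been extracted from c and t, and it replaces the real ConceptNet lookup with a toy adjacency dictionary; the node and graph representations (here via networkx) are our own illustrative choices.

```python
import networkx as nx

CONCEPTNET_NEIGHBORS = {                      # toy stand-in for ConceptNet lookups
    "movie": ["film", "cinema", "actor"],
    "travel": ["trip", "journey", "airport"],
}

def build_dialog_subgraph(source_keywords, target_keywords, n_neighbors=2,
                          stop_token="<STOP>"):
    """Build G = (V, E): keyword nodes from c and t, up to N ConceptNet
    neighbors per keyword, plus a dedicated termination node per keyword."""
    graph = nx.Graph()
    keywords = list(source_keywords) + list(target_keywords)
    graph.add_nodes_from(keywords)
    for kw in keywords:
        # attach up to N retrieved neighbors and connect them to the keyword
        for neighbor in CONCEPTNET_NEIGHBORS.get(kw, [])[:n_neighbors]:
            graph.add_edge(kw, neighbor)
        # attach the special termination node used to stop decoding
        graph.add_edge(kw, f"{stop_token}:{kw}")
    return graph

g = build_dialog_subgraph(["movie"], ["travel"], n_neighbors=2)
print(sorted(g.nodes()))
```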
Embedding Initialization After identifying the nodes, we utilize ConceptNet [41] to obtain node representations. Each topic node $v_i$ is initially aligned with the corresponding node in ConceptNet and represented as $h_{v_i} = \mathrm{NB}(v_i) \in \mathbb{R}^d$, where $h_{v_i}$ denotes the initial representation of the node $v_i$, $\mathrm{NB}(\cdot)$ refers to the Numberbatch (https://github.com/commonsense/conceptnet-numberbatch, accessed on 1 December 2023) embeddings, and d represents the dimension of each node representation. Numberbatch is an embedding space for word vectors that combines semantic information from diverse knowledge sources to enhance word representations. Developed by the ConceptNet team, it captures nuanced semantic relationships and yields improved performance in various natural language processing tasks.
Additionally, to capture topic relations effectively, $h_{v_i}$ is updated by incorporating the representations of its K-hop neighbors in ConceptNet:

$h_{v_i} = h_{v_i} + \sum_{k=1}^{K} \sum_{v_j \in N_k(v_i)} \left( W\, \mathrm{NB}(v_j) + b \right)$

Here, K represents the maximum number of hops considered, which is set to 2; $N_k(v_i)$ denotes the k-th hop neighboring nodes of $v_i$ in the ConceptNet graph; and W and b correspond to the weight matrix and bias vector, respectively.
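The snippet below is one possible reading of this initialization, with toy stand-ins for the Numberbatch lookup NB(·) and the ConceptNet K-hop neighbor lists; the shared W and b and the additive aggregation mirror the update written above, but the exact parameterization is an assumption.

```python
import numpy as np

D = 4                                                   # toy embedding dimension d
NB = {w: np.random.randn(D) for w in ["movie", "film", "cinema", "travel"]}
K_HOP_NEIGHBORS = {                                     # k-th hop neighbors per node, K = 2
    "movie": {1: ["film"], 2: ["cinema"]},
}

W = np.random.randn(D, D)                               # weight matrix
b = np.random.randn(D)                                  # bias vector

def init_node_embedding(node: str, K: int = 2) -> np.ndarray:
    """Numberbatch vector plus a transformed sum over the node's K-hop neighbors."""
    h = NB[node].copy()
    for k in range(1, K + 1):
        for neighbor in K_HOP_NEIGHBORS.get(node, {}).get(k, []):
            h += W @ NB[neighbor] + b
    return h

print(init_node_embedding("movie"))
```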
3.5. Dialog Encoder
The dialog encoder comprehends the dialog context and outputs an embedding of the dialog. To obtain this embedded representation, we concatenate the source and target utterances into a single sequence and feed it into the dialog encoder. Our approach utilizes a multi-layer Transformer to encode the dialog context. Previous work [42,43] has shown that a multi-layer structure is highly effective at capturing semantic information: it offers parameter efficiency and hierarchical feature abstraction, outperforms alternatives such as single-layer Transformers and sequential models (e.g., RNNs or LSTMs), and is well suited for tasks demanding comprehensive dialog comprehension.
Formally, given a dialog context $X = \{x_1, \ldots, x_n\}$, where each $x_i$ is a sequence of words, the Transformer encoder converts X into a sequence of hidden embeddings:

$H = \mathrm{Transformer}_{\theta}(X)$

In the above equation, $H = \{h_1, \ldots, h_n\}$ represents the sequence of hidden embeddings, where each output embedding corresponds to a specific input position. $\mathrm{Transformer}_{\theta}$ refers to the Transformer encoder with parameters $\theta$; it is the function that processes the input sequence. X is the input sequence to the Transformer encoder, where each $x_{i,j}$ is the embedding vector of the j-th word in the i-th utterance.

A further Transformer layer is then applied to obtain the utterance representation:

$U = \mathrm{Transformer}_{\phi}(H)$

Here, U is the utterance representation embedding that incorporates source and target awareness, $\mathrm{Transformer}_{\phi}$ refers to another Transformer layer with parameters $\phi$, and its input is the output H of the previous Transformer layer. The output embeddings can be further used for tasks such as guiding the generation of the keyword set in the bidirectional reasoning decoder.
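A compact sketch of this two-stage encoding with PyTorch's built-in Transformer modules is shown below; the layer counts, model dimension, and head count are illustrative assumptions rather than the paper's hyperparameters.

```python
import torch
import torch.nn as nn

d_model, vocab = 128, 1000
embed = nn.Embedding(vocab, d_model)
enc_layer = nn.TransformerEncoderLayer(d_model=d_model, nhead=4, batch_first=True)
first_stage = nn.TransformerEncoder(enc_layer, num_layers=2)                            # Transformer_theta
second_stage = nn.TransformerEncoderLayer(d_model=d_model, nhead=4, batch_first=True)   # Transformer_phi

# concatenate the source and target utterances into one token sequence X
x_ids = torch.randint(0, vocab, (1, 16))
H = first_stage(embed(x_ids))   # hidden embeddings H = Transformer_theta(X)
U = second_stage(H)             # source/target-aware utterance representation U
print(H.shape, U.shape)
```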
3.6. Bidirectional Reasoning Decoder
The bidirectional reasoning module generates a keyword sequence that serves as cue words for an "intermediate" utterance. Bidirectional graph scoring is used to fuse the graph node representations based on the dialog context representations, as shown in Figure 3.
Graph Encoding To encode the topic entities in the dialog graphs and obtain representations for concepts and relations, we employ multi-layer GCN encoders [44]. Inspired by the TransE [45] model, we update the concept embedding by subtracting the corresponding relation embedding from each neighboring concept embedding, thereby incorporating the relation representation. At the l-th layer, we update the embedding of each entity v by aggregating its neighbors N(v), which consist of pairs (u, r) of concepts and relations connected to v:

$h_v^{(l+1)} = \sigma\!\left( W^{(l)} \sum_{(u, r) \in N(v)} \left( h_u^{(l)} - h_r^{(l)} \right) + b^{(l)} \right)$

In the above equation, $h_u^{(l)}$, $h_v^{(l)}$, and $h_r^{(l)}$ represent the embeddings of node u, node v, and the relation between u and v at layer l; $W^{(l)}$ and $b^{(l)}$ are the trainable parameters of layer l; and $\sigma$ is a non-linear activation function. The relation embedding is also updated at each layer through a layer-specific linear transformation. After L iterations, we obtain a set of concept representations $\{h_v^{(L)}\}$ and a set of relation representations $\{h_r^{(L)}\}$.
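The function below sketches a single layer of this TransE-style aggregation: each entity collects (neighbor minus relation) differences and passes them through a linear map and a non-linearity. The adjacency format, the mean aggregation, and the use of ReLU as σ are illustrative assumptions.

```python
import torch
import torch.nn as nn

def gcn_layer(h_nodes, h_rels, neighbors, W, b):
    """One illustrative GCN update over (concept, relation) neighbor pairs.
    h_nodes: (num_nodes, d); h_rels: (num_rels, d);
    neighbors: {node_idx: [(neighbor_idx, relation_idx), ...]}."""
    updated = {}
    for v, pairs in neighbors.items():
        # TransE-style differences between neighbor and relation embeddings
        agg = torch.stack([h_nodes[u] - h_rels[r] for u, r in pairs]).mean(dim=0)
        updated[v] = torch.relu(W(agg) + b)     # sigma(W * aggregation + b)
    return updated

d = 8
W, b = nn.Linear(d, d, bias=False), torch.zeros(d)
h_nodes, h_rels = torch.randn(4, d), torch.randn(2, d)
neighbors = {0: [(1, 0), (2, 1)], 3: [(2, 0)]}
print({v: h.shape for v, h in gcn_layer(h_nodes, h_rels, neighbors, W, b).items()})
```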
Bidirectional Reasoning We employ bidirectional reasoning to compute concept distributions on graphs. This approach incorporates neighboring information and the current decoder state to adjust the weights of the bidirectional concepts (source and target entities) in the graph at each decoding step. Initially, a score of 1 is assigned to the source and target concepts, while all other concepts are assigned a score of 0. Subsequently, the information about the scored concepts is propagated through the graph in both directions to update the unvisited concepts. For an unvisited concept v, its score $s(v)$ is computed by aggregating the evidence from its visited neighboring concepts $N^{vis}(v)$:

$R(u, r, v) = \sigma\!\left( \left[ h_u^{(L)};\, h_v^{(L)} \right]^{\top} W_r\, s_t \right)$

$s(v) = \gamma \sum_{u \in N^{vis}(v)} R(u, r, v)\, s(u)$

Here, in the triple relevance $R(u, r, v)$, u and v denote nodes (concepts) in the graph, and $h_u^{(L)}$ and $h_v^{(L)}$ denote their representations at the L-th layer of the graph encoder. $W_r$ is the weight matrix relating the triple (u, r, v), $s_t$ is the current decoder state at decoding step t, and $\sigma$ is the sigmoid activation function.
In the concept score update function $s(v)$, v is a node (concept) in the graph, $N^{vis}(v)$ is the set of visited neighboring concepts of v, and $\gamma$ is a discount factor. This equation updates the concept score by aggregating information from the visited neighboring concepts. After L-hop interactions, the distribution over the concepts is as follows:

$P(c_t \mid s_t, G) = \mathrm{softmax}_{v \in G}\left( s(v) \right)$

Here, $P(c_t \mid s_t, G)$ represents the distribution over the concepts in the graph given the current decoder state $s_t$ at decoding step t; that is, it is the probability of each concept in the graph being the next element of the generated keyword sequence. The softmax function is applied to the scores assigned to the concepts in the graph, where the score of each concept v is obtained from the bidirectional reasoning mechanism described above.
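The sketch below illustrates the bidirectional score propagation in plain Python: source and target concepts start with score 1, and unvisited concepts aggregate discounted, relevance-weighted scores from their visited neighbors before a softmax produces the concept distribution. The fixed relevance values and the max-aggregation are illustrative assumptions.

```python
import math

def propagate_scores(scores, visited, edges, relevance, gamma=0.8):
    """One hop of score propagation: an unvisited concept v aggregates the
    discounted scores of its visited neighbors, weighted by a triple-relevance
    value; `relevance` is a toy stand-in for R(u, r, v)."""
    updated, newly_visited = dict(scores), set()
    for v in scores:
        if v in visited:
            continue
        evidence = [relevance.get((u, v), 0.0) * scores[u]
                    for u in edges.get(v, []) if u in visited]
        if evidence:
            updated[v] = gamma * max(evidence)   # aggregation choice is illustrative
            newly_visited.add(v)
    return updated, visited | newly_visited

def softmax(xs):
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]

# toy graph: source "movie" and target "travel" start with score 1, others with 0
scores = {"movie": 1.0, "travel": 1.0, "film": 0.0, "trip": 0.0}
edges = {"film": ["movie"], "trip": ["travel", "film"]}
relevance = {("movie", "film"): 0.9, ("travel", "trip"): 0.7, ("film", "trip"): 0.5}

visited = {"movie", "travel"}
for _ in range(2):                                # L-hop interactions, L = 2
    scores, visited = propagate_scores(scores, visited, edges, relevance)

concepts = list(scores)
dist = softmax([scores[c] for c in concepts])     # distribution over concepts
print({c: round(p, 3) for c, p in zip(concepts, dist)})
```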
The final generation distribution combines the distribution over the concepts in the graph and the distribution over the standard vocabulary through a soft gate as follows:

$P(y_t \mid y_{<t}, G, s_t) = g_t\, P(c_t \mid s_t, G) + (1 - g_t)\, P(w_t \mid s_t, y_{<t})$

Here, $P(y_t \mid y_{<t}, G, s_t)$ represents the probability distribution of the token at decoding step t given the previously generated tokens $y_{<t}$, the graph G, and the decoder state $s_t$. The scalar $g_t$ is a soft gate that determines whether to refer to the graph. $P(c_t \mid s_t, G)$ is the distribution over the concepts in the graph given the current decoder state, representing the likelihood of each concept being the next token, and $P(w_t \mid s_t, y_{<t})$ is the distribution over the standard vocabulary V given the current decoder state and the previously generated tokens, representing the likelihood of each vocabulary token being the next token.
Furthermore, the decoding stop conditions of the decoder are as follows: (1) the specially marked termination node added to the neighbors of each keyword is regarded as a legal candidate node, and decoding stops automatically when this termination node is selected; (2) the cue word set z exceeds a preset maximum length. Moreover, we set the probabilities of concepts already extracted in previous steps to 0 to avoid repeated extractions.
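The snippet below sketches the soft-gated mixture and the repetition mask on toy distributions. Mapping graph concepts into the shared output index space and renormalizing after masking are simplifications we assume for illustration.

```python
import torch

def mix_distributions(p_concept, p_vocab, gate):
    """Soft-gated mixture g * P_concept + (1 - g) * P_vocab over a shared
    output index space (the alignment of concepts to ids is assumed)."""
    return gate * p_concept + (1.0 - gate) * p_vocab

# toy distributions over a shared 5-token output space
p_concept = torch.tensor([0.6, 0.4, 0.0, 0.0, 0.0])  # graph concepts mapped to ids 0-1
p_vocab = torch.tensor([0.1, 0.1, 0.3, 0.3, 0.2])
gate = torch.tensor(0.7)                              # soft gate g_t from the decoder

p_final = mix_distributions(p_concept, p_vocab, gate)
p_final[0] = 0.0                                      # zero out an already-extracted concept
p_final = p_final / p_final.sum()                     # renormalize after masking
print(p_final)
```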
3.7. Response Decoder
As shown in Figure 1, the sequence of keywords z generated by bidirectional reasoning is sent to the response generator to produce a response containing the relevant keywords. Inspired by prompting with generative models [46], the explicit sequence of entity keywords z can be regarded as prior knowledge, or cue words, for the generation process [47].
Here, the initial input sequence $y_0$ is set equal to z, where z is the sequence of keywords generated by bidirectional reasoning. $E_t$ represents the set of word embedding vectors at decoding step t, defined as the list that contains the embedding vector $e_w$ for each word w in the sequence generated so far.

Subsequently, the decoder models a conditional probability distribution over the next word and a length parameter, given the context C and the previously generated sequence $y_{<t}$. This distribution is produced by a Transformer-based generative model; the output hidden state associated with the special token "[CLS]" summarizes the input, and $E_{t-1}$ denotes the set of word embedding vectors at the previous decoding step, i.e., at step t-1. At each decoding step t, the decoder combines $E_{t-1}$ with this hidden state to produce the probability distribution of the next token, where $E_t$ is the list of word embedding vectors of the tokens generated up to step t.
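The following sketch shows how a keyword sequence z can be used as cue words to prompt an off-the-shelf GPT-2 with beam-search decoding. The plain-text prompt format and the pretrained checkpoint are our own illustrative stand-ins for the paper's trained response generator.

```python
from transformers import GPT2LMHeadModel, GPT2Tokenizer

tokenizer = GPT2Tokenizer.from_pretrained("gpt2")
model = GPT2LMHeadModel.from_pretrained("gpt2")

keywords_z = ["film", "location", "travel"]            # cue words from the reasoning decoder
context_c = "A: I watched a great movie last night. B:"

# assumed prompt convention: prepend the cue words to the dialog context
prompt = f"keywords: {', '.join(keywords_z)}\ncontext: {context_c}"
inputs = tokenizer(prompt, return_tensors="pt")

outputs = model.generate(
    **inputs,
    max_new_tokens=40,                     # length budget for the transition response
    num_beams=4,                           # beam search, as in the prediction stage
    pad_token_id=tokenizer.eos_token_id,   # GPT-2 has no dedicated padding token
)
print(tokenizer.decode(outputs[0][inputs["input_ids"].shape[1]:], skip_special_tokens=True))
```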