Article

Enhancing Task-Oriented Dialogue Modeling through Coreference-Enhanced Contrastive Pre-Training

JIUTIAN Team, China Mobile Research Institute, Beijing 100053, China
* Authors to whom correspondence should be addressed.
Appl. Sci. 2024, 14(17), 7614; https://doi.org/10.3390/app14177614
Submission received: 17 May 2024 / Revised: 12 July 2024 / Accepted: 14 July 2024 / Published: 28 August 2024
(This article belongs to the Section Computing and Artificial Intelligence)

Abstract

Pre-trained language models (PLMs) are proficient at understanding context in plain text but often struggle with the nuanced linguistics of task-oriented dialogues. The information exchanges in dialogues and the dynamic role-shifting of speakers contribute to complex coreference and interlinking phenomena across multi-turn interactions. To address these challenges, we propose Coreference-Enhanced Contrastive Pre-training (CECPT), an innovative pre-training framework specifically designed to enhance dialogue modeling. CECPT utilizes unsupervised dialogue datasets to capture both semantic richness and structural coherence. Our experimental results demonstrate that the CECPT model significantly outperforms established baselines in three critical applications: intent recognition, dialogue act prediction, and dialogue state tracking. These findings suggest that CECPT is more adept at following the information flow within dialogues and accurately linking statuses to their respective references.

1. Introduction

Task-oriented dialogue systems [1,2,3,4] aim to assist users with basic tasks in multi-turn conversations. Neural approaches to dialogue modeling, particularly fine-tuning pre-trained language models (PLMs), have attracted significant attention. These models are first self-supervised on extensive free text and then transferred to downstream tasks. Fine-tuning the resulting representations has led to notable advances in a wide range of tasks and applications.
However, the inherent divergence in linguistic characteristics between plain text and conversational contexts diminishes the practical utility of existing pre-trained language models. Coreference occurs when two or more expressions or mentions in a session refer to the same entity; it is an important element for a coherent understanding of the whole dialogue. For example, the same day may refer to Saturday or Monday depending on the dialogue history. Such cross-links give rise to natural structures that are especially noticeable in multi-turn dialogues. Previous work has focused on modeling topics or dialogue acts [5], but the explicit utilization of coreference information has remained unexplored. In contrast, large-scale PLMs have primarily modeled linguistic knowledge implicitly, such as part-of-speech and syntactic structure [6].
In this study, we introduce Coreference-Enhanced Contrastive Pre-training (CECPT), a contrastive pre-training framework aimed at enhancing dialogue modeling using large amounts of unsupervised data with both semantic and structural information, explicitly incorporating coreference and graph information in pre-training to tackle the aforementioned challenges. CECPT incorporates a text encoder to capture informative dialogue semantics and a graph encoder to grasp dialogue flow structures. The text encoder employs self-supervised contrastive learning to bring coreference-related words closer in semantic representations, while the graph encoder utilizes graph-based structures for contrastive pre-training on dialogue structures. These complementary representations collectively enhance dialogue modeling, resulting in improved performance in downstream tasks. Our experimental findings demonstrate that our pre-trained dialogue model, CECPT, outperforms strong baseline models across three key dialogue applications. This underscores the importance of leveraging comprehensive information in dialogue modeling. Furthermore, our evaluation analysis indicates that coreference-aware models excel at tracing information flow among conversational participants and accurately associating statuses with corresponding mentions.
Our contributions can be summarized as follows:
  • We introduce an innovative dialogue modeling framework that leverages contrastive pre-training and explicitly integrates coreference-guided information.
  • Our approach encompasses both semantic and structural aspects by incorporating coreference information into the pre-training process within a unified framework.
  • To the best of our knowledge, this study is the first to model both the semantics and the structure of dialogues through graph contrastive training, yielding promising results across several downstream tasks.

2. Related Work

PLMs are popular across many NLP tasks because they greatly reduce the cost of human annotation on large datasets. Task-oriented dialogue systems therefore enhance model performance by employing pre-training techniques for various tasks. A recent work [7] introduces a pre-trained and fine-tuned Transformer-based model to tackle entity linking in task-oriented dialogues. PEFTTOD [8] adds extra trainable components or parameters that are not present in the original architecture. UBAR [9], a GPT-2-based model, is fine-tuned to enhance response generation, policy optimization, and end-to-end modeling.
Unlike plain texts or documents, task-oriented dialogues are rich in structural and role-sensitive information. However, few studies have successfully captured this information. In the realm of task-oriented dialogues, approaches [10,11,12,13] incorporate dialogue contexts during pre-training, which may cause the loss of dialogue structural information. There are a few studies that maintain dialogue structural information in downstream tasks. For example, HGAT [14] utilizes a model with a graph attention network (GAT) for a dialogue relation extraction task. A limited number of studies have been conducted on graph pre-training for task-oriented dialogues. There are some directions for graph pre-training. For example, Graph Contrastive Coding is a self-supervised graph neural network model for graph pre-training [15].
Instead of pre-training, some notable works focus on improving state-of-the-art performance on individual downstream tasks (e.g., intent recognition [16], dialogue relation extraction [17]). Our goal is not solely to achieve such results; rather, we avoid adding extra components during fine-tuning [18] in our experiments, which also distinguishes our approach from masked language modeling with a contrastive loss [19].

3. Methodology

As illustrated in Figure 1, our CECPT model comprises two components: coreference semantic pre-training and graph structure pre-training. In this section, we first introduce the required preprocessing work, including the usage of the AllenNLP toolkit (http://docs.allennlp.org/v0.9.0/api/allennlp.models.coreference_resolution.html, accessed on 10 February 2024) and the modifications of the parsed coreference pairs during our pre-training process. Then, we illustrate the details of our model architecture.

3.1. Preprocessing Setup

To build broad, role-sensitive graphs that capture dialogue knowledge from large-scale task-oriented datasets, CECPT uses an automatic coreference toolkit to parse the concatenated multi-turn utterances into a predefined graph structure. Because mentions in the dialogue history are latently related, pre-training on coreference relations naturally benefits coreference-enhanced relation modeling in dialogue text.
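For illustration, this preprocessing step could be implemented with the AllenNLP coreference predictor along the following lines; the specific model checkpoint and the post-processing of clusters are assumptions, not the exact pipeline used in this work.

```python
# Sketch of coreference preprocessing with the AllenNLP toolkit
# (requires the allennlp and allennlp-models packages).
from allennlp.predictors.predictor import Predictor

# Assumption: a public SpanBERT coreference checkpoint; the paper does not name the exact model file.
COREF_MODEL = (
    "https://storage.googleapis.com/allennlp-public-models/"
    "coref-spanbert-large-2021.03.10.tar.gz"
)

def parse_dialogue_coreference(utterances):
    """Concatenate multi-turn utterances and return coreference clusters
    as token spans plus their surface mentions."""
    predictor = Predictor.from_path(COREF_MODEL)
    document = " ".join(utterances)                 # flatten the multi-turn dialogue
    output = predictor.predict(document=document)
    tokens, clusters = output["document"], output["clusters"]
    # Map each cluster of [start, end] token spans back to surface mentions.
    mention_clusters = [
        [" ".join(tokens[s : e + 1]) for s, e in cluster] for cluster in clusters
    ]
    return tokens, clusters, mention_clusters

# Example input:
# ["I need a train on Monday.", "Sure, a train on the same day is available."]
```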

3.2. Dialogue Semantic Pre-Training

Generally, a dialogue $d$ is composed of a series of constituent utterances, denoted as $u_1, u_2, \ldots, u_N$. Each utterance typically consists of multiple tokens. We employ a multi-layer Transformer architecture and extract token representations from the hidden vectors of the final layer. Furthermore, to differentiate between the roles of the speakers in the dialogue, we prepend each utterance with one of two distinct tokens, [USR] and [SYS]. Additionally, we add a special [CLS] token at the beginning of the dialogue.
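A minimal sketch of this input construction, assuming a Hugging Face BERT tokenizer, is shown below; the helper name and the role labels passed in are illustrative.

```python
from transformers import BertTokenizer

tokenizer = BertTokenizer.from_pretrained("bert-base-uncased")
# Register the two role tokens so they are not split into word pieces.
# (The encoder's embedding matrix must be resized accordingly,
#  e.g. model.resize_token_embeddings(len(tokenizer)).)
tokenizer.add_special_tokens({"additional_special_tokens": ["[USR]", "[SYS]"]})

def build_dialogue_input(utterances, roles):
    """Prepend [CLS] once, and [USR]/[SYS] in front of each utterance."""
    tokens = ["[CLS]"]
    for role, utt in zip(roles, utterances):
        tokens.append("[USR]" if role == "user" else "[SYS]")
        tokens.extend(tokenizer.tokenize(utt))
    return tokenizer.convert_tokens_to_ids(tokens)

ids = build_dialogue_input(
    ["I need a train on Monday.", "There is a train at 9:00 on the same day."],
    ["user", "system"],
)
```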
As shown in Figure 2, the semantic contrastive model takes the current dialogue utterance, the previous dialogue utterances, and the parsed coreference results as the input, adopts a PLM as the text encoder, and is trained to discriminate the various coreference pairs mentioned in the dialogue. In detail, we employ a coreference discriminator as our semantic contrastive pre-training task, with the aim of fostering proximity in the representations of words or mentions that share coreference, as opposed to those with distinct coreference relationships. We leverage these coreference pairs as positive samples, training the text encoder to distinguish them from pairs exhibiting different coreference connections, which serve as negative samples. Consequently, this approach enables the encoder to capture dialogue coreference semantics without requiring human annotations.
Suppose a pair $(c, c')$ is a positive example when the two mentions share coreference, like Monday and the same day in Figure 2, and a negative example otherwise. In detail, we replace one mention in the original positive pair with a mention drawn from the non-coreferent mentions, labeled $\hat{c}$ (e.g., a train), by random sampling, and thereby obtain $m_c$ negative pairs.
To distinguish the positive coreference pair from the negative ones, we establish a training objective for each positive pair $(c, c')$ using the contrastive loss of classifying the positive pair correctly:

$$\mathcal{L}_{c,c'} = -x_{c} W x_{c'} + \log\!\Big( \exp\big(x_{c} W x_{c'}\big) + \sum_{i=1}^{m_c} \exp\big(x_{\hat{c}_i} W x_{c'}\big) + \sum_{j=1}^{m_{c'}} \exp\big(x_{c} W x_{\hat{c}_j}\big) \Big), \quad (1)$$

where $x$ is the representation of a coreference mention (e.g., $x_c$ for mention $c$) and $W$ is a matrix learning the similarity of the mention pair. $m_c$ and $m_{c'}$ are hyper-parameters for negative sampling.
Therefore, to obtain the overall training objective for coreference semantic pre-training, we sum the losses of all the positive pairs in the set $P_d$ over all the dialogues $d$ in the batch $B_d$:

$$\mathcal{L}(\theta) = \sum_{d \in B_d} \sum_{(c, c') \in P_d} \mathcal{L}_{c,c'}, \quad (2)$$

where $\theta$ denotes the trainable parameters.
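A minimal sketch of the coreference discriminator loss in Equations (1) and (2) is given below, assuming mention representations have already been pooled from the text encoder; the class name and tensor shapes are illustrative.

```python
import torch
import torch.nn as nn

class CorefContrastiveLoss(nn.Module):
    """Sketch of the coreference contrastive loss in Equations (1)-(2)."""

    def __init__(self, hidden_size: int = 768):
        super().__init__()
        # Bilinear similarity matrix W from Equation (1).
        self.W = nn.Parameter(torch.empty(hidden_size, hidden_size))
        nn.init.xavier_uniform_(self.W)

    def score(self, x_a, x_b):
        # x_a W x_b for a batch of mention pairs: (B, H), (H, H), (B, H) -> (B,)
        return torch.einsum("bh,hk,bk->b", x_a, self.W, x_b)

    def forward(self, x_c, x_cp, x_neg_c, x_neg_cp):
        """
        x_c, x_cp:  (B, H) representations of a positive pair (c, c').
        x_neg_c:    (B, m_c, H)  negatives replacing c.
        x_neg_cp:   (B, m_c', H) negatives replacing c'.
        """
        pos = self.score(x_c, x_cp)                                    # (B,)
        neg_c = torch.einsum("bmh,hk,bk->bm", x_neg_c, self.W, x_cp)   # (B, m_c)
        neg_cp = torch.einsum("bh,hk,bmk->bm", x_c, self.W, x_neg_cp)  # (B, m_c')
        # -pos + log( exp(pos) + sum exp(negatives) ), as in Equation (1).
        logits = torch.cat([pos.unsqueeze(1), neg_c, neg_cp], dim=1)
        loss = -pos + torch.logsumexp(logits, dim=1)
        return loss.sum()   # summed over positive pairs, as in Equation (2)
```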

3.3. Dialogue Structure Pre-Training

As coreference-related entities can be interconnected, a graphical representation could efficiently capture the structured information, facilitating the modeling of the interconnected mentions in multi-turn dialogues. In this section, we first introduce the construction of the dialogue graph, and then illustrate how to learn the dialogue structure utilizing the graph contrastive mechanism.

3.3.1. Dialogue Graph Construction

Each type of node or edge encodes a type of information or information flow in the dialogue, as shown in the left two blocks in Figure 3. For the nodes, we initialize the embeddings of the word and utterance nodes with the text encoder introduced in Section 3.2, and we take the average of the word-node embeddings of each mention to obtain the embeddings of the mention nodes. For the edges, each word node in an utterance is connected to that utterance node, and it is also connected to its neighboring word nodes through context edges (dashed lines in Figure 3). Each utterance node is connected to the immediately preceding $p$ utterances and the following $f$ utterances, where $p$ and $f$ are set to 5 for all the experiments in this paper.
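A minimal sketch of this graph construction is shown below; the edge-type labels and the whitespace tokenization are illustrative, and mention nodes (averaged word embeddings) are omitted for brevity.

```python
def build_dialogue_graph(utterances, p: int = 5, f: int = 5):
    """Sketch of the dialogue graph in Section 3.3.1: word nodes are linked to
    their utterance node and to neighboring word nodes; each utterance node is
    linked to the p previous and f future utterance nodes."""
    nodes, edges = [], []
    utt_node_ids = []
    for i, utt in enumerate(utterances):
        utt_id = len(nodes)
        nodes.append(("utterance", i))
        utt_node_ids.append(utt_id)
        prev_word_id = None
        for word in utt.split():
            word_id = len(nodes)
            nodes.append(("word", word))
            edges.append((word_id, utt_id, "word-utterance"))
            if prev_word_id is not None:              # neighboring word nodes
                edges.append((prev_word_id, word_id, "context"))
            prev_word_id = word_id
    # Directed utterance-utterance edges to the p previous and f future utterances.
    for i, u_i in enumerate(utt_node_ids):
        lo, hi = max(0, i - p), min(len(utt_node_ids), i + f + 1)
        for j in range(lo, hi):
            if j != i:
                edges.append((u_i, utt_node_ids[j], "utterance-utterance"))
    return nodes, edges
```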

3.3.2. Graph Contrastive Mechanism

As shown in Figure 3, the structure contrastive model takes predefined dialogue graphs as the input, initializes the node embeddings with the text encoder described in Section 3.2, and maximizes the mutual information (MI) between the representations of individual nodes and the whole graph via contrastive learning. To learn the dialogue structure without losing too much dialogue semantic information and to reduce the cost of manual annotation, we use Infograph [20] to perform unsupervised pre-training on the dialogue graphs. Infograph takes MI as its core indicator and learns the graph structure by maximizing the MI between node and graph representations through contrastive learning.
Suppose we have a batch of dialogue graph samples that contains $N$ dialogue graphs $G_1, G_2, \ldots, G_N$. Take $G_1$ as an example, and let $\nu \in G_1$ be a node. The graph encoder is a $K$-layer graph isomorphism network (GIN) [21]. When graph $G_1$ is fed into the graph encoder, the $k$-th layer representation of node $\nu$ is obtained as Equation (3) shows:

$$h_{\nu}^{(k)} = f^{(k)}\Big( \sum_{\mu \in N(\nu)} h_{\mu}^{(k-1)} + \big(1 + \epsilon^{(k)}\big) \cdot h_{\nu}^{(k-1)} \Big), \quad (3)$$

where $h_{\nu}^{(k)}$ represents the embedding of node $\nu$ in the $k$-th layer, $N(\nu)$ represents the neighbor nodes of node $\nu$, and $h_{\mu}^{(k-1)}$ represents the $(k{-}1)$-th layer representation of node $\mu$. $f^{(k)}(\cdot)$ represents the multi-layer perceptron of the graph encoder in the $k$-th layer, and $\epsilon^{(k)}$ represents the learnable combination parameter of the graph encoder in the $k$-th layer.
By concatenating the representations from all layers, the representation of node $\nu$ is obtained as

$$h_{\nu} = g\big( \{ h_{\nu}^{(k)} \mid k = 0, 1, \ldots, K-1 \} \big), \quad (4)$$

where $h_{\nu}$ stands for the representation of node $\nu$ and $g(\cdot)$ stands for the concatenation operation.
The graph representation can be obtained by concatenating and sum-pooling the node representations within each graph.
$$H_{G_1} = g\Big( \big\{ \mathrm{sum}\big( \{ h_{\nu}^{(k)} \mid \nu \in G_1 \} \big) \;\big|\; k = 0, 1, \ldots, K-1 \big\} \Big), \quad (5)$$

where sum denotes the sum-pooling operation. After obtaining the representations, we construct positive and negative sample pairs and feed them to the discriminator to continuously learn the graph representations. Within a batch of samples, we take a graph representation and its own node representations as positive sample pairs, and we pair the graph representation with the node representations of the other graphs in the same batch as negative samples.
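To make the graph encoder concrete, the following is a minimal sketch of the $K$-layer GIN update and the layer-wise readout described above, written in plain PyTorch against a dense adjacency matrix; the two-layer MLP inside each GIN layer is an implementation assumption.

```python
import torch
import torch.nn as nn

class DenseGINEncoder(nn.Module):
    """Sketch of the K-layer GIN graph encoder and readout of Equations (3)-(5)."""

    def __init__(self, hidden_size: int = 768, num_layers: int = 5):
        super().__init__()
        self.eps = nn.Parameter(torch.zeros(num_layers))        # epsilon^(k)
        self.mlps = nn.ModuleList(
            nn.Sequential(nn.Linear(hidden_size, hidden_size), nn.ReLU(),
                          nn.Linear(hidden_size, hidden_size))
            for _ in range(num_layers)
        )

    def forward(self, x, adj):
        """
        x:   (V, H) initial node embeddings from the text encoder.
        adj: (V, V) binary adjacency matrix of one dialogue graph.
        Returns node representations h_v and the graph representation H_G.
        """
        layer_outputs = []
        h = x
        for k, mlp in enumerate(self.mlps):
            neighbor_sum = adj @ h                               # sum over N(v)
            h = mlp(neighbor_sum + (1.0 + self.eps[k]) * h)      # Equation (3)
            layer_outputs.append(h)
        h_nodes = torch.cat(layer_outputs, dim=-1)               # Equation (4)
        h_graph = torch.cat([layer.sum(dim=0) for layer in layer_outputs], dim=-1)  # Equation (5)
        return h_nodes, h_graph
```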
Take graphs $G_A$ and $G_B$ in Figure 3 as an example. Through the graph encoder, the node representations of $G_A$, the global representation of $G_A$, and the node representations of $G_B$ are obtained, marked in yellow, purple, and cyan, respectively. The global representation of $G_A$ and its own node representations then constitute positive sample pairs, while the global representation of $G_A$ and the node representations of $G_B$ constitute negative sample pairs. Subsequently, both positive and negative samples are fed into the discriminator, and the MI is computed and maximized, following a similar approach to the Infograph algorithm. The structure pre-training loss is calculated as
$$\mathcal{L}_{\phi,\varphi} = -\left( \frac{1}{M} \sum_{i=1}^{N} \sum_{j=1}^{v_i} \big({-}\mathrm{sp}\big({-}D_{\phi}( H_{\varphi}^{i}(x), h_{\varphi}^{ij}(x) )\big)\big) - \frac{1}{(N-1)M} \sum_{i=1}^{N} \sum_{j=1,\, j \neq i}^{N} \sum_{t=1}^{v_j} \mathrm{sp}\big( D_{\phi}( H_{\varphi}^{i}(x), h_{\varphi}^{jt}(x') ) \big) \right), \quad (6)$$

where $\phi$ and $\varphi$ are the parameters of the discriminator $D$ and the GIN, respectively, $x'$ is a negative sample corresponding to an input sample $x$, $N$ is the number of graphs, $M = \sum_{i=1}^{N} v_i$ is the total number of nodes, and $v_i$ is the number of nodes in graph $i$. $H_{\varphi}^{i}(x)$ represents the global representation of graph $i$, $h_{\varphi}^{ij}(x)$ stands for the local representation of node $j$ of graph $i$, and $\mathrm{sp}(z) = \log(1 + e^{z})$ is the softplus function. The first half in parentheses represents the mean MI of the positive pairs, and the second half represents the mean MI of the negative pairs.
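A minimal sketch of this batch-level local-global objective is given below; the bilinear discriminator and the tensor layout are assumptions, and the softplus terms follow the Jensen-Shannon estimator used by Infograph [20].

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class GraphLocalGlobalLoss(nn.Module):
    """Sketch of the structure contrastive loss: a discriminator scores
    (graph, node) pairs; positive pairs come from the same graph, negatives
    from other graphs in the batch."""

    def __init__(self, hidden_size: int):
        super().__init__()
        # D_phi: a simple bilinear discriminator (an assumption).
        self.discriminator = nn.Bilinear(hidden_size, hidden_size, 1)

    def forward(self, graph_reprs, node_reprs, node_graph_ids):
        """
        graph_reprs:    (N, H) one global representation per graph in the batch.
        node_reprs:     (M, H) all node representations in the batch.
        node_graph_ids: (M,)  index of the graph each node belongs to.
        """
        N, M = graph_reprs.size(0), node_reprs.size(0)
        # Score every (graph i, node j) combination: shape (N, M).
        scores = self.discriminator(
            graph_reprs.unsqueeze(1).expand(N, M, -1).reshape(N * M, -1),
            node_reprs.unsqueeze(0).expand(N, M, -1).reshape(N * M, -1),
        ).view(N, M)
        graph_index = torch.arange(N, device=node_graph_ids.device).unsqueeze(1)
        pos_mask = node_graph_ids.unsqueeze(0) == graph_index   # nodes of graph i
        # Mean softplus-based MI estimates over positive and negative pairs.
        pos_term = -F.softplus(-scores[pos_mask]).mean()
        neg_term = F.softplus(scores[~pos_mask]).mean()
        return -(pos_term - neg_term)      # minimize the negative MI estimate
```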

4. Experiment

We initialize our contrastive pre-training from the bert-base-uncased checkpoint [22] from Hugging Face and select nine dialogue datasets as our pre-training corpora. The pre-training dataset statistics are shown in Table 1.

4.1. Downstream Dataset

We select the datasets OOS [32], MultiWOZ2.1 [26], DSTC2 [33], and GSIM [34] for the downstream tasks as in TOD-BERT. To prevent information leakage, the MultiWOZ2.1 test set is excluded from the model’s pre-training.

4.2. Experimental Setup

All benchmarks are conducted in PyTorch on servers featuring Tesla V100 GPUs equipped with 32 GB of memory. We use the Adam optimizer for model training and employ a learning rate schedule that gradually decays from $2 \times 10^{-5}$ to $1 \times 10^{-5}$ for downstream tasks. In detail, the learning rate is set to $1 \times 10^{-5}$. The pre-training batch size is set to 32, and we use a graph encoder with 5 GIN layers and 768 hidden units.
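As a concrete reference, the optimizer and schedule described above could be configured as in the following sketch; the linear-decay scheduler choice and the helper name are assumptions.

```python
import torch

def make_optimizer(model, steps, lr_start=2e-5, lr_end=1e-5):
    """Sketch of the optimization setup: Adam with a learning rate decayed
    linearly from lr_start to lr_end over `steps` updates."""
    optimizer = torch.optim.Adam(model.parameters(), lr=lr_start)
    scheduler = torch.optim.lr_scheduler.LambdaLR(
        optimizer,
        lr_lambda=lambda step: 1.0 - (1.0 - lr_end / lr_start) * min(step, steps) / steps,
    )
    return optimizer, scheduler
```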
When selecting metrics for various downstream tasks, we tailor our choices based on the task’s nature and dataset characteristics. For intent recognition, we evaluate using out-of-scope intent recall and accuracy (Acc) on all data, in-domain intents, and out-of-scope intents. The accuracy provides a general performance overview, while recall focuses on identifying positive instances, crucial for imbalanced datasets. This ensures the model performs well overall and effectively identifies minority class instances.
For dialogue act prediction, we use micro-F1 and macro-F1 scores across three datasets. Micro-F1 averages the precision and recall scores for each dialogue act class, emphasizing the overall performance. Macro-F1 averages the F1 scores for each class individually, ensuring the performance on the less frequent classes is not overshadowed by the more common ones.
For dialogue state tracking, we use joint accuracy and slot accuracy. Joint accuracy evaluates the alignment between the predicted and actual dialogue states at every turn, requiring precise correspondence for all domain–slot combinations. The slot accuracy assesses the agreement of each domain–slot–value triplet with its ground truth label.
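The two dialogue state tracking metrics can be computed as in the following sketch; the per-turn dictionary representation of dialogue states is an illustrative assumption.

```python
def dst_metrics(predictions, labels):
    """Sketch of joint accuracy and slot accuracy. Both arguments are lists of
    per-turn dicts mapping (domain, slot) pairs to their values."""
    joint_hits, slot_hits, slot_total = 0, 0, 0
    for pred, gold in zip(predictions, labels):
        # Joint accuracy: every (domain, slot) value must match at this turn.
        joint_hits += int(pred == gold)
        # Slot accuracy: each (domain, slot, value) triplet is scored individually.
        for ds, value in gold.items():
            slot_total += 1
            slot_hits += int(pred.get(ds) == value)
    return joint_hits / len(labels), slot_hits / slot_total
```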

4.3. Downstream Tasks and Evaluation

The intent recognition task involves predicting a single intent label from multiple categories for each utterance.
$$P_{\mathrm{int}} = \mathrm{Softmax}\big( W_c \, G(U) \big) \in \mathbb{R}^{I}, \quad (7)$$

where $G$ encodes the utterance $U$ and $I$ represents the number of possible intents. $W_c \in \mathbb{R}^{I \times d_b}$ is a linear mapping, and $d_b$ denotes the hidden size (typically 768).
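A minimal sketch of this intent classification head is shown below, assuming a Hugging Face-style encoder whose output exposes last_hidden_state; the class and argument names are illustrative.

```python
import torch
import torch.nn as nn

class IntentClassifier(nn.Module):
    """Sketch of the intent head: a linear layer over the encoder output
    followed by a softmax over I intent classes."""

    def __init__(self, encoder, num_intents: int, hidden_size: int = 768):
        super().__init__()
        self.encoder = encoder                       # G(.), e.g. the pre-trained text encoder
        self.classifier = nn.Linear(hidden_size, num_intents)   # W_c

    def forward(self, input_ids, attention_mask):
        outputs = self.encoder(input_ids=input_ids, attention_mask=attention_mask)
        cls_repr = outputs.last_hidden_state[:, 0]               # [CLS] representation
        return torch.softmax(self.classifier(cls_repr), dim=-1)  # P_int
```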
The dialogue act prediction task is the prediction of multiple dialogue acts of the next system response.
$$P_{\mathrm{da}} = \mathrm{Sigmoid}\big( W \, G(x) \big) \in \mathbb{R}^{N}, \quad (8)$$

where $W \in \mathbb{R}^{d_b \times N}$ denotes a weight matrix and $N$ is the number of candidate dialogue acts. We train the model using binary cross-entropy loss and set a threshold (typically 0.5) to predict act triggers.
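The multi-label act prediction head can be sketched as follows, again assuming a Hugging Face-style encoder; the 0.5 threshold mirrors the description above, while the class name is illustrative.

```python
import torch
import torch.nn as nn

class DialogueActPredictor(nn.Module):
    """Sketch of the multi-label dialogue act head: a sigmoid layer trained with
    binary cross-entropy and thresholded at inference time."""

    def __init__(self, encoder, num_acts: int, hidden_size: int = 768, threshold: float = 0.5):
        super().__init__()
        self.encoder = encoder
        self.head = nn.Linear(hidden_size, num_acts)             # W
        self.threshold = threshold

    def forward(self, input_ids, attention_mask, labels=None):
        cls_repr = self.encoder(input_ids=input_ids,
                                attention_mask=attention_mask).last_hidden_state[:, 0]
        probs = torch.sigmoid(self.head(cls_repr))               # P_da
        if labels is not None:                                   # multi-hot act labels
            return nn.functional.binary_cross_entropy(probs, labels.float())
        return (probs > self.threshold).long()                   # predicted act triggers
```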
The dialogue state tracking task aims to predict slot values from a predefined ontology for each (domain, slot) pair with high scores.
$$P_{\mathrm{dst}} = \mathrm{Cosine}\big( F_j(G(x)), \, G(v_i^{j}) \big) \in \mathbb{R}^{I}, \quad (9)$$

where $x$ represents the dialogue history and $v_i^{j}$ denotes the $i$-th slot value for the $j$-th (domain, slot) pair. $G$ encodes the input utterances to obtain the representation. The Cosine similarity function computes the probability of potential values for the $j$-th (domain, slot) pair. $F_j$ serves as the slot projection layer for the $j$-th slot, and the number of projection layers in $F$ matches the count of the (domain, slot) pairs. The model is trained using the cross-entropy loss summed across all pairs.
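A minimal sketch of this slot-value scoring scheme is given below, assuming one linear projection per (domain, slot) pair and a Hugging Face-style encoder; the names and batching layout are illustrative.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class SlotValueScorer(nn.Module):
    """Sketch of DST scoring: each (domain, slot) pair j has its own projection
    F_j, and candidate ontology values are ranked by cosine similarity."""

    def __init__(self, encoder, num_slots: int, hidden_size: int = 768):
        super().__init__()
        self.encoder = encoder
        self.slot_projections = nn.ModuleList(
            nn.Linear(hidden_size, hidden_size) for _ in range(num_slots)
        )

    def encode(self, input_ids, attention_mask):
        # Use the [CLS] vector as the sequence representation.
        return self.encoder(input_ids=input_ids,
                            attention_mask=attention_mask).last_hidden_state[:, 0]

    def forward(self, history_ids, history_mask, value_ids, value_mask, slot_index):
        """Score the candidate values of one (domain, slot) pair against the history."""
        context = self.slot_projections[slot_index](self.encode(history_ids, history_mask))
        values = self.encode(value_ids, value_mask)           # one row per ontology value
        scores = F.cosine_similarity(context.unsqueeze(1), values.unsqueeze(0), dim=-1)
        return scores                                         # (batch, num_values)
```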

4.4. Main Results and Analysis

In this section, we conduct a detailed performance analysis for each downstream task.

4.4.1. Intent Recognition

Table 2 displays the results for the OOS dataset, revealing our model's superior performance across all metrics. Compared to TOD-BERT, our model CECPT achieves a noteworthy 6.5%, 1.2%, and 0.7% improvement in recall on the out-of-scope intents, accuracy on all the data, and accuracy on the out-of-scope intents, respectively. We attribute this improvement to two main reasons. First, with contrastive learning on semantics, CECPT can effectively distinguish the out-of-scope intents from the in-domain intents. Second, by characterizing the underlying structured information, CECPT distills the important information from the intent-aware dialogue context of 150 in-domain intents and achieves a better performance.

4.4.2. Dialogue Act Prediction

We extend the experiments to three datasets, reporting the micro-F1 and macro-F1 scores for the multi-label classification task in Table 3. Our model CECPT achieves a 0.3% and 2.3% improvement in macro-F1 on the MWOZ corpus and DSTC2 corpus, respectively. In most datasets, when the micro-F1 scores are considered, both TOD-BERT-mlm and TOD-BERT-jnt perform closely to their upper bounds, limiting the scope for improvement with our model.

4.4.3. Dialogue State Tracking

Dialogue state tracking tasks commonly employ the joint accuracy and slot accuracy as evaluation metrics. Table 4 provides a comparison between the mainstream models, including GPT2, BERT, TOD-BERT, and our proposed model, on the MultiWOZ2.1 dataset, showcasing our model’s superior performance. In contrast to TOD-BERT-jnt, our model CECPT achieves a 3.1% and 0.3% improvement in the joint accuracy and slot accuracy, respectively.

4.5. More Details

Intuitively, both semantics and structure are pivotal in understanding the natural dynamics of dialogues. We design ablation studies in Table 2, Table 3 and Table 4 to probe these factors. Neglecting either semantic or structural pre-training leads to a substantial performance decrease compared to the full CECPT setting. This indicates that dialogue semantics and graph structure are both essential for characterizing the underlying information and for modeling the naturally connected mentions across multi-turn dialogues. Regarding the impact of the number of layers in the GIN, Figure 4 shows the convergence time of our pre-trained model and the performance on the dialogue state tracking task under different numbers of layers in the graph encoder. The convergence time of the model is highly correlated with the number of layers in the GIN: the more layers, the longer the convergence time. When the number of layers is less than or equal to five, the performance improves as the number of layers increases; when it exceeds five, a noticeable decline in performance is observed. Therefore, we use five GIN layers with 768 hidden units as our graph encoder. All the experiments above were conducted on servers featuring Tesla V100 GPUs equipped with 32 GB of memory.

5. Discussion

The performance of our model is significantly influenced by the quality of the dialogue datasets used, particularly in relation to the coreference resolution task. Since the output of the coreference resolution task serves as the input for our model, any inaccuracies or limitations in the coreference resolution can propagate through and adversely affect the overall performance. Below, we outline specific limitations related to dataset dependency.
Noisy Texts and Interrupts: Dialogue datasets often contain noisy texts, such as typographical errors, informal language, and interruptions, that do not contribute to the main conversation. These elements can complicate the coreference resolution task, leading to incorrect or unclear coreference chains. Consequently, our model may struggle to accurately interpret and process such dialogues, resulting in suboptimal performance.
Domain-Specific Context: Some datasets may require additional domain-specific profiles, knowledge, or other materials to fully understand the context of certain dialogues. For example, dialogues in specialized fields like medical consultations or technical support may contain jargon or context-specific references that are not well represented in general dialogue datasets. Current coreference resolution parsers may not perform well in these contexts, thereby limiting the effectiveness of our model.
Limitations of Current Coreference Resolution Parsers: The state of the art in coreference resolution, while advanced, still faces challenges in achieving high performance, especially with low-quality dialogue datasets. If the coreference resolution parser fails to accurately identify and link references within the text, our model’s input will be flawed, leading to potential errors in the subsequent processing and analysis.

6. Conclusions and Future Work

In this paper, we introduce CECPT, a contrastive pre-training framework for dialogue-oriented downstream tasks that utilizes the rich dialogue knowledge encoded in the graph structure. Our experimental results demonstrate that our pre-trained dialogue model, CECPT, surpasses strong baseline models across three critical dialogue applications: intent recognition, dialogue act prediction, and dialogue state tracking. In future work, we intend to further explore the combination of coreference and graph contrastive learning in the design of pre-training tasks and to investigate its internal mechanism in more depth. Additionally, given the limitations discussed in Section 5, it is crucial to ensure the quality and appropriateness of the dialogue datasets used. Future work could focus on improving coreference resolution in noisy and domain-specific contexts, as well as developing methods to better handle interruptions and informal language within dialogues. Incorporating domain-specific knowledge bases and ontologies could also enhance the model's ability to understand and process specialized dialogues.
For broader applicability in industrial contexts, such as customer service scenarios, significant challenges persist with colloquial dialogue datasets. These datasets contain many speech pauses and frequent topic shifts, which significantly affect the performance of coreference resolution and, consequently, the accuracy of the coreference features used as input in our method. Therefore, for industrial-scale customer service environments, our method is better suited to text dialogue datasets with clear contextual semantics than to speech-to-text colloquial dialogue datasets. In future work, we will provide insights into how our method can be adapted and implemented in such challenging contexts.

Author Contributions

Methodology, Y.H., S.C. and Y.C.; software, S.C. and Y.C.; validation, Y.H., J.F. and C.D.; formal analysis, S.C. and Y.C.; investigation, Y.H., S.C. and Y.C.; data curation, S.C. and Y.C.; writing—original draft preparation, Y.H., S.C. and Y.C.; writing—review and editing, Y.H., J.F. and C.D.; visualization, S.C.; supervision, J.F. and C.D.; project administration, Y.H. All authors have read and agreed to the published version of the manuscript.

Funding

This research was funded by Beijing Natural Science Foundation (L222006), National Key R&D Program of China (2021ZD0140408) and China Mobile Holistic Artificial Intelligence Major Project Funding (R22105ZS, R22105ZSC01).

Institutional Review Board Statement

Not applicable.

Informed Consent Statement

Not applicable.

Data Availability Statement

The original contributions presented in this study are included in the article; further inquiries can be directed to the corresponding authors.

Conflicts of Interest

All authors were employed by the JIUTIAN Team, China Mobile Research Institute. The authors declare that the research was conducted in the absence of any commercial or financial relationships that could be construed as a potential conflict of interest.

References

  1. Wang, H.; Wang, L.; Du, Y.; Chen, L.; Zhou, J.; Wang, Y.; Wong, K.F. A survey of the evolution of language model-based dialogue systems. arXiv 2023, arXiv:2311.16789. [Google Scholar]
  2. Yang, X.; Sheng, X.; Gu, J. A Task-Oriented Multi-turn Dialogue Mechanism for the Smart Cockpit. In Proceedings of the Asia-Pacific Web (APWeb) and Web-Age Information Management (WAIM) Joint International Conference on Web and Big Data, Wuhan, China, 6–8 October 2023; Springer: Berlin/Heidelberg, Germany, 2023; pp. 316–330. [Google Scholar]
  3. Thakkar, M.; Pise, N.N. Leveraging Transformer-based Pretrained Language model for Task-oriented dialogue system. Int. J. Comput. 2023, 8, 1–4. [Google Scholar]
  4. Yi, Z.; Ouyang, J.; Liu, Y.; Liao, T.; Xu, Z.; Shen, Y. A Survey on Recent Advances in LLM-Based Multi-turn Dialogue Systems. arXiv 2024, arXiv:2402.18013. [Google Scholar]
  5. Ni, J.; Young, T.; Pandelea, V.; Xue, F.; Cambria, E. Recent advances in deep learning based dialogue systems: A systematic survey. Artif. Intell. Rev. 2023, 56, 3055–3155. [Google Scholar] [CrossRef]
  6. Wu, H.; Zhou, H.; Lan, M.; Wu, Y.; Zhang, Y. Connective Prediction for Implicit Discourse Relation Recognition via Knowledge Distillation. In Proceedings of the 61st Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), Toronto, ON, Canada, 9–14 July 2023; pp. 5908–5923. [Google Scholar]
  7. Jayanthi, S.M.; Embar, V.; Raghunathan, K. Evaluating Pretrained Transformer Models for Entity Linking in Task-Oriented Dialog. arXiv 2021, arXiv:2112.08327. [Google Scholar]
  8. Mo, Y.; Yoo, J.; Kang, S. Parameter-Efficient Fine-Tuning Method for Task-Oriented Dialogue Systems. Mathematics 2023, 11, 3048. [Google Scholar] [CrossRef]
  9. Yang, Y.; Li, Y.; Quan, X. Ubar: Towards fully end-to-end task-oriented dialog system with gpt-2. In Proceedings of the AAAI Conference on Artificial Intelligence, Virtually, 2–9 February 2021; Volume 35, pp. 14230–14238. [Google Scholar]
  10. Wu, C.S.; Hoi, S.; Socher, R.; Xiong, C. TOD-BERT: Pre-trained natural language understanding for task-oriented dialogue. arXiv 2020, arXiv:2004.06871. [Google Scholar]
  11. Yan, S.; Song, S.; Li, J.; Meng, S.; Hu, G. TITAN: Task-oriented dialogues with mixed-initiative interactions. In Proceedings of the Thirty-Second International Joint Conference on Artificial Intelligence, Macao, China, 19–25 August 2023; pp. 5251–5259. [Google Scholar]
  12. Zhang, Z.; Shen, L.; Zhao, Y.; Chen, M.; He, X. Dialog-Post: Multi-Level Self-Supervised Objectives and Hierarchical Model for Dialogue Post-Training. In Proceedings of the 61st Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), Toronto, ON, Canada, 9–14 July 2023; pp. 10134–10148. [Google Scholar]
  13. Siro, C.; Aliannejadi, M.; de Rijke, M. Context Does Matter: Implications for Crowdsourced Evaluation Labels in Task-Oriented Dialogue Systems. arXiv 2024, arXiv:2404.09980. [Google Scholar]
  14. Chen, H.; Hong, P.; Han, W.; Majumder, N.; Poria, S. Dialogue relation extraction with document-level heterogeneous graph attention networks. Cogn. Comput. 2023, 15, 793–802. [Google Scholar] [CrossRef]
  15. Qiu, J.; Chen, Q.; Dong, Y.; Zhang, J.; Yang, H.; Ding, M.; Wang, K.; Tang, J. Gcc: Graph contrastive coding for graph neural network pre-training. In Proceedings of the 26th ACM SIGKDD International Conference on Knowledge Discovery & Data Mining, Virtual Event, 6–10 July 2020; pp. 1150–1160. [Google Scholar]
  16. Zhang, J.; Bui, T.; Yoon, S.; Chen, X.; Liu, Z.; Xia, C.; Tran, Q.H.; Chang, W.; Yu, P. Few-shot intent detection via contrastive pre-training and fine-tuning. arXiv 2021, arXiv:2109.06349. [Google Scholar]
  17. Bai, X.; Chen, Y.; Song, L.; Zhang, Y. Semantic representation for dialogue modeling. arXiv 2021, arXiv:2105.10188. [Google Scholar]
  18. Vulić, I.; Su, P.H.; Coope, S.; Gerz, D.; Budzianowski, P.; Casanueva, I.; Mrkšić, N.; Wen, T.H. ConvFiT: Conversational fine-tuning of pretrained language models. arXiv 2021, arXiv:2109.10126. [Google Scholar]
  19. Meng, Y.; Xiong, C.; Bajaj, P.; Bennett, P.; Han, J.; Song, X. Coco-lm: Correcting and contrasting text sequences for language model pretraining. Adv. Neural Inf. Process. Syst. 2021, 34, 23102–23114. [Google Scholar]
  20. Sun, F.Y.; Hoffmann, J.; Verma, V.; Tang, J. Infograph: Unsupervised and semi-supervised graph-level representation learning via mutual information maximization. arXiv 2019, arXiv:1908.01000. [Google Scholar]
  21. Xu, K.; Hu, W.; Leskovec, J.; Jegelka, S. How powerful are graph neural networks? arXiv 2018, arXiv:1810.00826. [Google Scholar]
  22. Devlin, J.; Chang, M.W.; Lee, K.; Toutanova, K. Bert: Pre-training of deep bidirectional transformers for language understanding. arXiv 2018, arXiv:1810.04805. [Google Scholar]
  23. Li, J.; Zhu, Q.; Luo, L.; Liden, L.; Huang, K.; Shayandeh, S.; Liang, R.; Peng, B.; Zhang, Z.; Shukla, S.; et al. Multi-domain task completion dialog challenge ii at dstc9. In Proceedings of the 35th AAAI Conference on Artificial Intelligence, Ninth Dialog System Technology Challenge Workshop, Virtually, 2–9 February 2021. [Google Scholar]
  24. Rastogi, A.; Zang, X.; Sunkara, S.; Gupta, R.; Khaitan, P. Towards scalable multi-domain conversational agents: The schema-guided dialogue dataset. In Proceedings of the AAAI Conference on Artificial Intelligence, New York, NY, USA, 7–12 February 2020; Volume 34, pp. 8689–8696. [Google Scholar]
  25. Byrne, B.; Krishnamoorthi, K.; Sankar, C.; Neelakantan, A.; Duckworth, D.; Yavuz, S.; Goodrich, B.; Dubey, A.; Cedilnik, A.; Kim, K.Y. Taskmaster-1: Toward a realistic and diverse dialog dataset. arXiv 2019, arXiv:1909.05358. [Google Scholar]
  26. Budzianowski, P.; Wen, T.H.; Tseng, B.H.; Casanueva, I.; Ultes, S.; Ramadan, O.; Gašić, M. MultiWOZ–a large-scale multi-domain wizard-of-oz dataset for task-oriented dialogue modelling. arXiv 2018, arXiv:1810.00278. [Google Scholar]
  27. Li, X.; Wang, Y.; Sun, S.; Panda, S.; Liu, J.; Gao, J. Microsoft dialogue challenge: Building end-to-end task-completion dialogue systems. arXiv 2018, arXiv:1807.11125. [Google Scholar]
  28. Eric, M.; Manning, C.D. Key-value retrieval networks for task-oriented dialogue. arXiv 2017, arXiv:1705.05414. [Google Scholar]
  29. Asri, L.E.; Schulz, H.; Sharma, S.; Zumer, J.; Harris, J.; Fine, E.; Mehrotra, R.; Suleman, K. Frames: A corpus for adding memory to goal-oriented dialogue systems. arXiv 2017, arXiv:1704.00057. [Google Scholar]
  30. Mrkšić, N.; Séaghdha, D.O.; Wen, T.H.; Thomson, B.; Young, S. Neural belief tracker: Data-driven dialogue state tracking. arXiv 2016, arXiv:1606.03777. [Google Scholar]
  31. Wen, T.H.; Vandyke, D.; Mrksic, N.; Gasic, M.; Rojas-Barahona, L.M.; Su, P.H.; Ultes, S.; Young, S. A network-based end-to-end trainable task-oriented dialogue system. arXiv 2016, arXiv:1604.04562. [Google Scholar]
  32. Larson, S.; Mahendran, A.; Peper, J.J.; Clarke, C.; Lee, A.; Hill, P.; Kummerfeld, J.K.; Leach, K.; Laurenzano, M.A.; Tang, L.; et al. An Evaluation Dataset for Intent Classification and Out-of-Scope Prediction. arXiv 2019, arXiv:1909.02027. [Google Scholar]
  33. Henderson, M.; Thomson, B.; Williams, J.D. The second dialog state tracking challenge. In Proceedings of the 15th Annual Meeting of the Special Interest Group on Discourse and Dialogue (SIGDIAL), Philadelphia, PA, USA, 18–20 June 2014; pp. 263–272. [Google Scholar]
  34. Shah, P.; Hakkani-Tür, D.; Liu, B.; Tür, G. Bootstrapping a Neural Conversational Agent with Dialogue Self-Play, Crowdsourcing and On-Line Reinforcement Learning. In Proceedings of the 2018 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 3 (Industry Papers), New Orleans, LA, USA, 1–6 June 2018; pp. 41–51. [Google Scholar] [CrossRef]
  35. Radford, A.; Wu, J.; Child, R.; Luan, D.; Amodei, D.; Sutskever, I. Language models are unsupervised multitask learners. OpenAI Blog 2019, 1, 9. [Google Scholar]
  36. Zhang, Y.; Sun, S.; Galley, M.; Chen, Y.C.; Brockett, C.; Gao, X.; Gao, J.; Liu, J.; Dolan, B. Dialogpt: Large-scale generative pre-training for conversational response generation. arXiv 2019, arXiv:1911.00536. [Google Scholar]
Figure 1. An overview of the proposed Coreference-Enhanced Contrastive Pre-training (CECPT).
Figure 2. An overview of our proposed semantic contrastive model.
Figure 3. An overview of the proposed structure contrastive model.
Figure 4. Training time and performance on DST task with different numbers of layers in the GIN.
Table 1. Statistics of task-oriented dialogue datasets used in the pre-training phase.
Dataset | #Dialogue | #Utterance
MetaLWOZ [23] | 37,884 | 432,036
Schema [24] | 22,825 | 463,284
Taskmaster [25] | 13,215 | 303,066
MultiWOZ2.1 [26] | 10,420 | 71,410
MSR-E2E [27] | 10,087 | 74,686
SMD [28] | 3031 | 15,928
Frames [29] | 1369 | 19,986
WOZ [30] | 1200 | 5012
CamRest676 [31] | 676 | 2744
Table 2. Test results on the OOS dataset, one of the largest intent recognition corpora. Models marked with * are sourced from [10].
Model | Recall (out) | Acc (all) | Acc (in) | Acc (out)
GPT2 * [35] | 32.0% | 83.0% | 94.1% | 87.7%
DialoGPT * [36] | 32.1% | 83.9% | 95.5% | 87.6%
BERT * [22] | 35.6% | 84.9% | 95.8% | 88.1%
TOD-BERT-mlm * [10] | 46.3% | 85.9% | 96.1% | 89.5%
TOD-BERT-jnt * [10] | 43.6% | 86.6% | 96.2% | 89.9%
CECPT | 50.1% | 87.8% | 96.2% | 90.6%
CECPT w/o semantic | 46.2% | 86.7% | 95.8% | 90.0%
CECPT w/o structure | 48.6% | 86.7% | 95.3% | 90.3%
Table 3. Test results on three Dialogue act prediction datasets. We choose the micro and macro F1 scores as the task metric. * denotes the experiment performance from [10].
Model | MultiWOZ2.1 Micro-F1 | MultiWOZ2.1 Macro-F1 | DSTC2 Micro-F1 | DSTC2 Macro-F1 | GSIM Micro-F1 | GSIM Macro-F1
GPT2 * | 90.8% | 79.8% | 92.5% | 39.4% | 99.1% | 45.6%
DialoGPT * | 91.2% | 79.7% | 93.8% | 42.1% | 99.2% | 45.6%
BERT * | 91.4% | 79.7% | 92.3% | 40.1% | 98.7% | 45.2%
TOD-BERT-mlm * | 91.7% | 79.9% | 90.9% | 39.9% | 99.4% | 45.8%
TOD-BERT-jnt * | 91.7% | 80.6% | 93.8% | 41.3% | 99.5% | 45.8%
CECPT | 91.9% | 80.9% | 93.9% | 43.6% | 99.6% | 45.9%
CECPT w/o semantic | 91.5% | 80.9% | 93.8% | 41.2% | 99.5% | 45.8%
CECPT w/o structure | 91.8% | 80.7% | 93.4% | 40.4% | 99.0% | 45.4%
Table 4. Experiment results on the dialogue state tracking dataset, MultiWOZ2.1. We present joint accuracy and slot accuracy for the full data setting. Results marked with * are sourced from [10].
Model | Joint Acc | Slot Acc
GPT2 * | 46.2% | 96.6%
DialoGPT * | 45.2% | 96.5%
BERT * | 45.6% | 96.6%
TOD-BERT-mlm * | 47.7% | 96.8%
TOD-BERT-jnt * | 48.0% | 96.9%
CECPT | 51.1% | 97.2%
CECPT w/o semantic | 48.4% | 96.9%
CECPT w/o structure | 49.2% | 96.9%
Disclaimer/Publisher’s Note: The statements, opinions and data contained in all publications are solely those of the individual author(s) and contributor(s) and not of MDPI and/or the editor(s). MDPI and/or the editor(s) disclaim responsibility for any injury to people or property resulting from any ideas, methods, instructions or products referred to in the content.
