Article

External Slot Relationship Memory for Multi-Domain Dialogue State Tracking

School of Artificial Intelligence, Chongqing University of Technology, Chongqing 401135, China
* Author to whom correspondence should be addressed.
Appl. Sci. 2023, 13(15), 8943; https://doi.org/10.3390/app13158943
Submission received: 9 July 2023 / Revised: 27 July 2023 / Accepted: 1 August 2023 / Published: 3 August 2023
(This article belongs to the Special Issue Applications of Artificial Intelligence on Social Media)

Abstract

Dialogue state tracking is an essential component of multi-domain dialogue systems; it aims to accurately determine the current dialogue state from the dialogue history. Existing research addresses the issue of multiple mappings in dialogues by employing slot self-attention as a data-driven approach. However, learning the relationships between slots from a single sample is often limited, may introduce noise, and incurs high time complexity. In this paper, we propose an external slot relation memory-based dialogue state tracking model (ER-DST). Using external memory storage, we learn the relationships between slots as a dictionary of multi-domain slot relations. Additionally, we employ a small filter to discard slot information irrelevant to the current dialogue state. Our method is evaluated on MultiWOZ 2.0 and MultiWOZ 2.1, achieving improvements of 0.23% and 0.39% over the baseline models, respectively, while reducing the complexity of the slot relationship learning component to O(n).

1. Introduction

Dialogue state tracking (DST) is a critical component in task-oriented dialogue systems. It is responsible for inferring the user’s goals and intentions by tracking the slots and their corresponding values in the dialogue. DST aims to provide a compact representation of the dialogue, known as the dialogue state, which consists of triples <domain, slot, value>, used by the system to determine the next action. Therefore, the accuracy of dialogue state tracking is crucial for the performance of the system.
DST methods typically rely on predefined slots and their possible values, known as an ontology, to guide the tracking of dialogue states [1]. These methods assume that all possible slot–value pairs are known, allowing predefined slots and values to be matched directly against the dialogue to update the state. The slots are often represented as abbreviations, such as “train-leaveat” and “hotel-internet”, indicating the target information to track in the task domain. In this approach, scores are typically assigned to all possible slot–value pairs in the ontology, and the highest-scoring pair is selected as the predicted dialogue state. In recent years, dialogue state tracking has gained increasing attention, leading to the development of various classical models [2,3,4].
Despite the emergence of dialogue state tracking models based on neural networks, these approaches often overlook the correlations between slots when predicting each slot. For instance, in the human–computer dialogue shown in Table 1, the user's last request is “A 3 star hotel in the same area and price range as my restaurant”. This indicates that the hotel should have the same area and price range as the restaurant, so the predicted values should be <hotel, price range, expensive> and <hotel, area, south>. However, since the words “expensive” and “south” are not explicitly mentioned in this utterance, they cannot be extracted from it directly. Some researchers have recognized the importance of modeling slot correlations to some extent [5,6]. In these works, the correlations between slot names have been considered [7], or prior knowledge has been incorporated [5], where manually specified slot correlation coefficients are set to 1. However, these approaches only consider the association between slot names.
To address this issue, Ye [8] proposes the Slot Self-Attentive Dialogue State Tracking (STAR) model, which utilizes both the slot names and the context related to the slots to capture slot relationships more accurately through slot self-attention. It learns the correlations between slots in a fully data-driven manner without any manual effort or prior knowledge. However, it learns the relationships between multi-domain slots from a single sample, which may introduce bias during prediction and potentially involve redundant information from other slots.
In this work, a dialogue state tracking model, ER-DST, is proposed to enable the learning of relationships between multi-domain slots from global samples. The main contributions of this paper are the following:
  • The model utilizes an external slot relationship memory, consisting of keys and values, to capture the relationship characteristics between slots and learn a dictionary of relationships for multi-domain slots;
  • A small-scale filter is employed to reduce the weight of irrelevant slots, thereby excluding interference from other slots;
  • By leveraging the external slot relationship memory to learn the most discriminative features of slot relationships, the proposed approach achieves a joint goal accuracy of 54.76% on MultiWOZ 2.0 [9] and 56.75% on MultiWOZ 2.1 [10], while maintaining linear complexity.
The rest of this paper is organized as follows: Section 2 reviews related work; Section 3 introduces the proposed model; Section 4 describes the experimental setup; Section 5 reports and analyzes the experimental results; and Section 6 concludes the paper.

2. Related Work

Traditional dialogue state tracking methods heavily rely on extracting triplet information from natural language understanding to predict the current dialogue state [11,12,13]. However, these methods rely on manually crafted rules and domain-specific delexicalization, which incurs significant manual effort and limits the models’ generalization ability in multi-domain conversations.
With the rise of deep learning, researchers have applied various deep neural networks to dialogue state tracking. Henderson [14] uses a deep fully connected neural network to compute the probabilities of all candidate values for each slot. Mrkšić [1] proposes the Neural Belief Tracker (NBT), which uses representation learning to embed candidate slot–value pairs and the dialogue into dense word vectors and infers from these representations during decoding whether a slot–value pair appears in the dialogue. However, the scalability of these methods is limited and their performance is suboptimal. TRADE [15] combines a copy mechanism and a generation mechanism, weighting the probability distributions obtained from both at each decoding step of a slot value, and uses recurrent neural networks to capture semantic correlations across the dialogue context. Refs. [16,17,18] treat the dialogue state as a selectively rewritable memory, a table, and a state with momentum, respectively, to improve the model's ability to update and correct slot values. TripPy [6] uses three copy mechanisms simultaneously to fill slots, addressing the open-vocabulary problem. Gao [19] adopts a reading-comprehension approach for state updates. In [20], an attention-based pointer network is used to copy slot values from the dialogue context. Guo [21] proposes the DiCoS-DST model, which dynamically selects the relevant dialogue history. Although the aforementioned methods partially address scalability, they model each slot independently without considering the interdependencies among slots in different domains.
In recent years, pre-trained language models [22,23] have demonstrated excellent semantic encoding capabilities in downstream tasks and have received significant attention. For example, the SUMBT model proposed by Lee [3] encodes slot and dialogue representations with BERT and models the relationship between slots and dialogues through attention mechanisms. Zhu [24] uses a fusion network that combines context and pattern graphs to model dialogue states. Bebensee [25] proposes a novel span-selective prediction layer with linear attention encoding. Although [3,24,25] model the relationship between slots and dialogue history, they overlook the correlations among slots. Li [26] explores the hierarchical semantics of ontologies and enhances the relationships between slots through masked hierarchical attention. The LUNA model proposed by Wang [27] explicitly aligns each slot with its most relevant utterances and designs a slot-ranking auxiliary task to learn the temporal correlations between slots. Feng [28] proposes a method that dynamically integrates previous slot–domain membership relations and dialogue-aware dynamic slot relations to generate pattern graphs. Jiao [29] addresses insufficient contextual understanding in dialogues by introducing a hierarchical DST framework that models second-order slot interactions. The STAR model proposed by Ye [8] uses slot names and slot-related context together to capture the relationships between slots more accurately with slot self-attention. Although these methods partially consider the relationships between slots, such self-attention-based approaches suffer from quadratic computational complexity and learn slot relationships from a single sample, which may introduce interference.
To address the issue of feature explosion and computation associated with attention-based approaches, we propose ER-DST as a solution, which mitigates the noise introduced by slot relationships. The external slot relationship memory, independent of individual samples, captures long-term dependencies and learns the essential connections between slots across different domains from the global sample. Additionally, we employ a compact filter to reduce irrelevant connections between slots.

3. Model

As shown in Figure 1, the DST process involves several steps: input data preparation, feature extraction, modeling, and evaluation.
Figure 2 illustrates an overview of the model, which consists of four main parts. First, the context encoder encodes the input dialogue and outputs a semantic vector representation. The second part is the slot attention layer, which extracts slot-related information from the encoded context. The third part is the external slot relationship memory layer, which models the relationships between slots, allowing the model to learn important multi-domain information from the dialogue while excluding interference from other slots. The final part matches the resulting slot representations against candidate values to produce the dialogue state (Section 3.5).

3.1. Problem Definition

In the dialogue domain, the important information to be tracked is defined as a set of slots $S = \{S_1, S_2, \ldots, S_J\}$, where $J$ is the number of slots. $M_t = \{(U_1, R_1), (U_2, R_2), \ldots, (U_{t-1}, R_{t-1})\}$ denotes the dialogue history up to the $t$-th round, and $D_t = (U_t, R_t)$ denotes the dialogue utterances of the $t$-th round, where $U_t$ and $R_t$ are the user and system utterances, respectively. $B_t = \{s_1^t, s_2^t, \ldots, s_J^t\}$ denotes the dialogue state of the $t$-th round, where $s_j^t = (domain, slot, value)$ is the $j$-th slot and its corresponding value in round $t$; in particular, $B_0$ is empty in the first round. The goal of dialogue state tracking is to predict the current dialogue state $B_t$ from the dialogue context composed of $M_t$, $D_t$, and $B_{t-1}$:
$B_t = \mathrm{DST}(M_t, D_t, B_{t-1})$
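To make this notation concrete, the following minimal Python sketch shows one possible way to represent a dialogue state as a mapping from (domain, slot) pairs to values; the example entries are taken from the dialogue in Table 1, and the names are illustrative rather than part of the original implementation.

from typing import Dict, Tuple

# A dialogue state B_t maps every tracked (domain, slot) pair to a value.
DialogueState = Dict[Tuple[str, str], str]

b_t: DialogueState = {
    ("restaurant", "pricerange"): "expensive",
    ("restaurant", "area"): "south",
    ("hotel", "stars"): "3",
}

def dst(history, current_turn, prev_state: DialogueState) -> DialogueState:
    """Placeholder for B_t = DST(M_t, D_t, B_{t-1}): a tracker predicts a value
    for every slot given the dialogue context and the previous state."""
    raise NotImplementedError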

3.2. Context Encoder

In recent years, BERT [22], a pre-trained language model, has demonstrated powerful contextual semantic representation capabilities in various downstream tasks. Therefore, in this paper, we adopt the pre-trained BERT model to encode the context. At dialogue turn $T$, the dialogue history is defined as $D_T = \{R_1, U_1, \ldots, R_T, U_T\}$, a collection of system responses $R = \{R_t\}_{t=1}^{T}$ and user utterances $U = \{U_t\}_{t=1}^{T}$, $1 \le t \le T$. We define $E_T = \{B_1, \ldots, B_T\}$ as the dialogue states of all turns, where each $E_t = \{(S_1, V_1), \ldots, (S_J, V_J)\}$ is a collection of $J$ slot–value pairs. The context encoder takes the dialogue history up to turn $t$, represented as $X_t = \{D_1, \ldots, D_t, E_{t-1}\}$, as input and outputs the context vector representation $H_t$. Each slot $S_j$ and its corresponding value $V_j$ are encoded with $\mathrm{BERT}_{sv}$, where the output vector of the special token [CLS] is used as the slot (value) representation; the parameters of $\mathrm{BERT}_{sv}$ are kept fixed during training.
$H_t = \mathrm{BERT}_{context}(X_t)$
$h^{S_j} = \mathrm{BERT}_{sv}(S_j)$
$h^{V_j} = \mathrm{BERT}_{sv}(V_j)$
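A minimal sketch of this encoding step is given below, assuming the HuggingFace Transformers library and bert-base-uncased; the frozen slot/value encoder mirrors the description above, while the [SYS]/[USR] formatting of the input string and the variable names are illustrative assumptions.

import torch
from transformers import BertModel, BertTokenizerFast

tokenizer = BertTokenizerFast.from_pretrained("bert-base-uncased")
bert_context = BertModel.from_pretrained("bert-base-uncased")    # fine-tuned with the model
bert_sv = BertModel.from_pretrained("bert-base-uncased").eval()  # frozen slot/value encoder
for p in bert_sv.parameters():
    p.requires_grad = False

# H_t: token-level representation of the dialogue history X_t
x_t = "[SYS] what about the cambridge chop house ? [USR] can you book for 2 people ?"
h_t = bert_context(**tokenizer(x_t, return_tensors="pt")).last_hidden_state  # [1, L, 768]

# h^{S_j}: [CLS] vector of a slot name, e.g. "hotel-pricerange"
with torch.no_grad():
    h_sj = bert_sv(**tokenizer("hotel-pricerange", return_tensors="pt")).last_hidden_state[:, 0]  # [1, 768]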

3.3. Slot Contextual Information Extraction Layer

In the context of a dialogue, each slot pays attention to different parts of the dialogue context. To predict the state of a specific slot, it is necessary to extract slot-specific feature information from the dialogue history. To achieve this, we employ a slot attention mechanism based on multi-head attention [30]. This mechanism allows the model to focus on relevant parts of the dialogue utterances that are informative for predicting a specific slot, thereby improving prediction accuracy. It enables the subsequent layers of the model to capture the semantic information and contextual information specific to each slot, thus capturing the interrelationships between slots effectively.
$Q_t^{S_j} = h^{S_j} W^Q + b^Q$
$K_t^{S_j} = H_t W^K + b^K$
$V_t^{S_j} = H_t W^V + b^V$
$\alpha_t^{S_j} = \mathrm{Softmax}\left(\frac{Q_t^{S_j} (K_t^{S_j})^\top}{\sqrt{d_k}}\right) V_t^{S_j}$
$C_t^{S_j} = W_2\, \mathrm{ReLU}\left(W_1 \left[h^{S_j}; \alpha_t^{S_j}\right] + b_1\right) + b_2$
The parameters $W^Q$, $b^Q$, $W^K$, $b^K$, $W^V$, and $b^V$ are the parameters of the linear layers that map the queries, keys, and values, respectively. Here, $d_k = d_h / N$, where $d_h$ is the hidden size of the model and $N$ is the number of heads in the multi-head attention mechanism.
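The following PyTorch sketch illustrates one way to implement this slot attention layer under the equations above; the head count, layer names, and multi-head split are illustrative assumptions rather than the authors' exact implementation.

import torch
import torch.nn as nn

class SlotAttention(nn.Module):
    """Each slot queries the encoded dialogue context and produces C_t^{S_j}."""
    def __init__(self, dim: int, heads: int = 4):
        super().__init__()
        assert dim % heads == 0
        self.h, self.dk = heads, dim // heads
        self.wq = nn.Linear(dim, dim)
        self.wk = nn.Linear(dim, dim)
        self.wv = nn.Linear(dim, dim)
        self.ffn = nn.Sequential(nn.Linear(2 * dim, dim), nn.ReLU(), nn.Linear(dim, dim))

    def forward(self, h_sj, h_t):
        # h_sj: [B, dim] slot representation; h_t: [B, L, dim] context encoding H_t
        b, l, d = h_t.shape
        q = self.wq(h_sj).view(b, self.h, 1, self.dk)
        k = self.wk(h_t).view(b, l, self.h, self.dk).transpose(1, 2)
        v = self.wv(h_t).view(b, l, self.h, self.dk).transpose(1, 2)
        scores = q @ k.transpose(-2, -1) / self.dk ** 0.5   # [B, H, 1, L]
        alpha = (scores.softmax(dim=-1) @ v).reshape(b, d)  # slot-specific context summary
        return self.ffn(torch.cat([h_sj, alpha], dim=-1))   # C_t^{S_j}, shape [B, dim]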

3.4. External Slot Relationship Memory Layer

Although the slot contextual attention layer extracts contextual information for each slot, the diverse expressions in natural language dialogues still make it difficult for the model to capture the contextual information of slots that are co-referential. In addition, the slot contextual attention layer computes contextual relevance for each slot individually, without considering the relationships between slots. Inspired by [21], in this work an external slot relationship memory is constructed to capture the correlations between different slots; Algorithm 1 summarizes its pseudo-code. Two external memory units are treated as a dictionary of relationships between slots. Given a feature map $F_t^l = [C_t^{S_1}, C_t^{S_2}, \ldots, C_t^{S_J}] \in \mathbb{R}^{d \times J}$, the external layer combines the input features with the external memory unit $M \in \mathbb{R}^{d \times J}$ using the following equations:
$\tilde{F}_t^l = (\alpha)_{i,j} = \mathrm{Norm}\left(F_t^l M_k^\top\right)$
$F_{out} = \tilde{F}_t^l M_v$
where $(\alpha)_{i,j}$ is the correlation weight between slots $i$ and $j$. $M$ is a learnable parameter independent of the input, serving as a memory that captures the relationships between multi-domain slot-related information. $\tilde{F}_t^l$ is the correlation weight map inferred from multi-domain data, and $F_{out}$ is the most relevant feature information of the output slots. We use two separate memory units, $M_k$ and $M_v$, as keys and values. The computational complexity of the external attention is $O(dJ)$, where $d$ is the hidden size and $J$ is the number of slots, so the cost grows linearly with the number of slots.
Algorithm 1 Python-style pseudo-code for the external slot relationship memory
Input: F, an array of shape [B, J, dim] (batch size, number of slots, embedding dimension)
Parameters: M_k, a linear layer (key memory); M_v, a linear layer (value memory); H, the number of heads
Output: out, an array of shape [B, J, dim]
  F = query_linear(F)
  F = F.view(B, J, H, dim // H)
  F = F.permute(0, 2, 1, 3)                  # [B, H, J, dim // H]
  attn = M_k(F)                              # slot-to-memory affinities
  attn = softmax(attn, dim=2)                # normalize over the slot axis
  attn = norm(attn, dim=3)                   # normalize over the memory axis
  out = M_v(attn)
  out = out.permute(0, 2, 1, 3).contiguous()
  out = out.view(B, J, dim)
  out = W_o(out)                             # output projection
To filter out redundant information in the slot relations, we adopt the double-normalization technique proposed in [31], which normalizes both the rows and the columns of the attention map. A single external layer is computed as follows:
$\tilde{\alpha}_{i,j} = \left(F_t^l M_k^\top\right)_{i,j}$
$\hat{\alpha}_{i,j} = \frac{\exp(\tilde{\alpha}_{i,j})}{\sum_{k} \exp(\tilde{\alpha}_{k,j})}$
$\alpha_{i,j} = \frac{\hat{\alpha}_{i,j}}{\sum_{k} \hat{\alpha}_{i,k}}$
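A short sketch of this double-normalization step (a softmax over the slot axis followed by an L1 normalization over the memory axis) is given below, assuming affinity scores of shape [B, H, J, M]; the epsilon is added only for numerical safety.

import torch

def double_norm(scores: torch.Tensor, eps: float = 1e-9) -> torch.Tensor:
    # scores: [B, H, J, M] raw slot-to-memory affinities F_t^l M_k^T
    attn = torch.softmax(scores, dim=-2)                   # softmax over the slot axis
    attn = attn / (attn.sum(dim=-1, keepdim=True) + eps)   # L1-normalize over memory entries
    return attn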
We further employ a multi-head external layer, which is computed as follows:
$h_i = \mathrm{ExternalLayer}(F_t^l, M_k, M_v)$
$G_t^l = \mathrm{MultiHead}(F_t^l, M_k, M_v) = \mathrm{Concat}(h_1, \ldots, h_H)\, W_o$
$F_t^{l+1} = \mathrm{FFN}(G_t^l) + G_t^l$
where $h_i$ is the output of the $i$-th head, $H$ is the total number of heads, and $W_o$ is a learnable linear projection matrix. $\mathrm{MultiHead}(\cdot)$ denotes the multi-head combination, $\mathrm{FFN}(\cdot)$ is a fully connected feed-forward layer, and $\mathrm{Concat}(\cdot)$ is the concatenation operation.
The final slot vector representation is denoted as $F_t^{L+1} = [f_t^{S_1}, \ldots, f_t^{S_J}]$, where $f_t^{S_j}$ is the representation of slot $S_j$.
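Putting the pieces of this section together, the sketch below shows a self-contained PyTorch module combining the multi-head external slot relationship memory, double normalization, an output projection, and the residual feed-forward layer. The memory size and head count are illustrative hyperparameters, not necessarily the settings used in the paper.

import torch
import torch.nn as nn

class ExternalSlotMemory(nn.Module):
    """Multi-head external memory: M_k and M_v are learnable, input-independent
    linear memories shared across all samples."""
    def __init__(self, dim: int, mem_size: int = 64, heads: int = 4):
        super().__init__()
        assert dim % heads == 0
        self.h = heads
        self.query = nn.Linear(dim, dim)
        self.m_k = nn.Linear(dim // heads, mem_size, bias=False)   # key memory
        self.m_v = nn.Linear(mem_size, dim // heads, bias=False)   # value memory
        self.w_o = nn.Linear(dim, dim)
        self.ffn = nn.Sequential(nn.Linear(dim, dim), nn.ReLU(), nn.Linear(dim, dim))

    def forward(self, f):                                    # f = F_t^l: [B, J, dim]
        b, j, d = f.shape
        q = self.query(f).view(b, j, self.h, d // self.h).permute(0, 2, 1, 3)  # [B, H, J, d/H]
        attn = self.m_k(q)                                       # [B, H, J, M] affinities
        attn = torch.softmax(attn, dim=2)                        # normalize over slots
        attn = attn / (attn.sum(dim=-1, keepdim=True) + 1e-9)    # normalize over memory entries
        out = self.m_v(attn)                                     # [B, H, J, d/H]
        out = out.permute(0, 2, 1, 3).reshape(b, j, d)
        g = self.w_o(out)                                        # G_t^l
        return self.ffn(g) + g                                   # F_t^{l+1}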

3.5. Slot Value Matching

To predict the value of each slot, we project and normalize the slot feature vectors, compute the Euclidean distance between each slot representation and the embeddings of its candidate values, and select the closest value as the prediction:
$r_t^{S_j} = \mathrm{LayerNorm}\left(\mathrm{Linear}\left(f_t^{S_j}\right)\right)$
$p\left(V_t^j \mid X_t, S_j\right) = \frac{\exp\left(-d\left(h^{V_j}, r_t^{S_j}\right)\right)}{\sum_{V_j' \in \mathcal{V}_j} \exp\left(-d\left(h^{V_j'}, r_t^{S_j}\right)\right)}$
where $d(\cdot)$ is the Euclidean distance function and $\mathcal{V}_j$ is the value space of slot $S_j$. The model is trained to maximize the joint goal accuracy of all slots; the loss of each turn is the sum of the negative log-likelihoods:
$\mathcal{L}_t = -\sum_{j=1}^{J} \log p\left(V_t^j \mid X_t, S_j\right)$
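A minimal sketch of this distance-based value matching and the per-turn loss is shown below, assuming precomputed candidate value embeddings for each slot; the tensor names mirror the symbols above, but the helper itself is illustrative.

import torch
import torch.nn.functional as F

def value_matching_loss(r_t, value_embs, gold_idx):
    # r_t:        list of J tensors [dim], the projected slot representations r_t^{S_j}
    # value_embs: list of J tensors [|V_j|, dim], candidate value embeddings h^{V_j}
    # gold_idx:   list of J ints, index of the gold value for each slot
    loss = 0.0
    for r, values, gold in zip(r_t, value_embs, gold_idx):
        dist = torch.cdist(r.unsqueeze(0), values).squeeze(0)  # Euclidean distances, [|V_j|]
        log_p = F.log_softmax(-dist, dim=-1)                   # closer value -> higher probability
        loss = loss - log_p[gold]                              # negative log-likelihood term
    return loss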

4. Experiments

4.1. Datasets

MultiWOZ is a task-oriented, multi-domain, manually annotated multi-turn dialogue dataset and is currently the largest annotated dialogue dataset for task-oriented dialogue systems. It covers seven domains: attraction, hospital, police, hotel, restaurant, taxi, and train, and contains 10,438 dialogues in total. Approximately 70% of the dialogues consist of more than ten turns, showcasing the complexity of the corpus. As shown in Figure 3, the average number of turns in single-domain and multi-domain dialogues is 8.93 and 15.39, respectively, for a total of 115,434 turns. MultiWOZ decomposes dialogue modeling into three subtasks and evaluates the quality of a dialogue model on each of them: dialogue state tracking, dialogue act-to-text generation, and dialogue context-to-text generation. In this paper, we evaluate the proposed ER-DST model on MultiWOZ 2.0 [9] and MultiWOZ 2.1 [10] as the testing benchmarks.

4.2. Experimental Settings

The experiments used BERT-base-uncased to encode the dialogue context. Another BERT-base-uncased model, which was not fine-tuned during training, was used to encode the slots and slot values. The Adam optimizer was employed with a learning rate of $10^{-4}$, a dropout rate of 0.1, a batch size of 64, and 16 epochs. The experimental code was developed and executed with Python 3.7.0 and PyTorch 1.7.1, with CUDA 11.1.0 used to accelerate training. We trained the model on the training set and employed early stopping based on validation-set performance; after training, we performed the final evaluation on the test set.
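The training loop corresponding to this setup can be sketched as follows. This is a schematic outline under the stated hyperparameters; ERDST, train_loader, dev_loader, evaluate, and the early-stopping patience are hypothetical names and values, not the released code.

import torch

model = ERDST(hidden_dropout=0.1)                        # hypothetical model class
optimizer = torch.optim.Adam(model.parameters(), lr=1e-4)

best_jga, patience, bad_epochs = 0.0, 3, 0               # patience value is an assumption
for epoch in range(16):
    model.train()
    for batch in train_loader:                           # batch size 64
        optimizer.zero_grad()
        loss = model(batch)                              # sum of per-slot NLL terms
        loss.backward()
        optimizer.step()
    jga = evaluate(model, dev_loader)                    # early stopping on validation JGA
    if jga > best_jga:
        best_jga, bad_epochs = jga, 0
        torch.save(model.state_dict(), "best.pt")
    else:
        bad_epochs += 1
        if bad_epochs >= patience:
            break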

4.3. Evaluation Metrics

For evaluation, we use joint goal accuracy (JGA). JGA measures the model's accuracy in predicting all goals simultaneously, making it suitable for evaluating the ability to track multiple user goals in multi-turn dialogues. Specifically, for a dialogue with multiple goals, the model must predict the slot values of every goal (e.g., date, time, location); a turn is counted as a success only when all predictions for all goals are correct, and as a failure otherwise. Therefore, a higher JGA value indicates better dialogue state tracking. JGA is considered a stringent metric because all the <domain, slot, value> triples in a turn must be predicted correctly for the dialogue state prediction of that turn to count as correct.
$JGA = \frac{TP + TN}{P + N} = \frac{1}{n_{turn}} \sum_{t=1}^{n_{turn}} \mathbb{1}\left(\hat{B}_t = B_t\right)$
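In code, JGA can be computed exactly as described above: a turn counts as correct only when the full predicted state matches the gold state. A small sketch, using per-turn state dictionaries as in the problem-definition example:

def joint_goal_accuracy(pred_states, gold_states):
    # pred_states, gold_states: lists of per-turn dialogue states,
    # each a dict mapping (domain, slot) -> value
    correct = sum(1 for pred, gold in zip(pred_states, gold_states) if pred == gold)
    return correct / len(gold_states)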

4.4. Baseline Models

To demonstrate the effectiveness of the proposed method, we primarily compare it with four existing methods on the MultiWOZ dataset. CSFN-DST [24] constructs a pattern graph to model the dependency relationships between slots and uses BERT to encode dialogue information. SOM-DST [16] treats the dialogue state as a fixed-size memory that is selectively overwritten at each turn. TripPy [6] uses three copy mechanisms to obtain slot–value information. STAR [8] uses self-attention to capture the relationships between slots. Table 2 summarizes these methods, against which we examine the effectiveness of our approach in modeling slot relationships for dialogue state tracking.

5. Experimental Results

5.1. Main Results

We present the main results of our approach in Table 3, where ER-DST is compared with several other state-of-the-art methods. The evaluation of our model’s performance is conducted using joint goal accuracy, which is consistent with the evaluation approach used for these methods. ER-DST specifically refers to our proposed model equipped with the external slot relation memory.
Table 3 clearly demonstrates the significant performance improvement achieved by ER-DST on both datasets. Specifically, on MultiWOZ 2.0 our method achieves a joint goal accuracy of 54.76%, and on the improved MultiWOZ 2.1 it achieves 56.75%.
Table 4 presents the joint goal accuracy of ER-DST and the baseline models on specific domains of MultiWOZ 2.1. The results indicate that ER-DST outperforms STAR in the attraction, hotel, and taxi domains. Notably, ER-DST achieves strong JGA in the hotel and taxi domains, where data is relatively limited, with accuracies of 53.21% and 66.85%, respectively. We attribute this to ER-DST's ability to learn the most stable relationships between slots from the entire dataset, which benefits domains with fewer training samples.

5.2. Ablation Study

The purpose of this ablation study was to evaluate the impact of the external slot relationship memory component on the model’s performance and observe its effect on capturing slot relations. We designed four methods to evaluate the model’s performance: (1) Removing the slot relationship component. (2) Using self-attention alone to capture slot relations. (3) Incorporating layer normalization in the external slot relationship memory component. (4) Incorporating double-normalization in the external slot relationship memory component. We used the MultiWOZ 2.1 for our experiments and employed the same train–validation–test split. The model architecture and hyperparameter settings remained consistent.
As shown in Table 5, we observed the effectiveness of the external slot relationship memory in our model. On the other hand, the use of double-normalization to filter out irrelevant slot information had a significant impact on the external slot relation memory component.

5.3. Results Analysis

First, Table 3 compares our method with the baseline models across multiple domains, evaluated using JGA. It can be observed that our method improves the overall performance of the model across multiple domains. Second, Table 4 compares our method with the baseline models in single domains. The results indicate that ER-DST outperforms the STAR baseline in the attraction, hotel, and taxi domains, and surpasses all baselines in the hotel and taxi domains, achieving strong single-domain JGA. Furthermore, we conducted a detailed analysis of our method's performance on the 30 specific slots in MultiWOZ 2.1, comparing it with STAR and TripPy in Figure 4. In total, 17 slots achieved better results than the baselines. For instance, slots such as “hotel-area”, “restaurant-area”, “taxi-departure”, “taxi-destination”, “train-departure”, and “train-destination” take geographical locations as values and often exhibit relationships with one another. Since our method strengthens the ability to identify relationships between slots, the performance on these slots improved.
The above results indicate that ER-DST enhances the perception of slot relationships. However, although our model reduces the error rates of several name-related slots, such as “restaurant-name” and “hotel-name”, these slots still exhibit relatively high error rates. For slots with numerical values, such as time, there is also room for improvement, especially for non-enumerable slots of this type.

6. Conclusions

Existing multi-domain DST models face two problems: (1) learning the relationships between multi-domain slots from a single sample limits the model's learning capacity and fails to capture the most important features between slots, and (2) it is difficult to exclude irrelevant slot information, which leads to redundancy. In this paper, we propose ER-DST, which leverages an external slot relation memory. Our approach uses an additional key-value memory, learned from the entire dataset, as a dictionary of relationships between multi-domain slots. We then introduce a small constraint mechanism that reduces the weight of unrelated slot information. To evaluate the method, we conducted experiments on two large-scale, multi-domain, task-oriented dialogue datasets, MultiWOZ 2.0 and MultiWOZ 2.1. The experimental results show that the external slot relationship memory improves the model's ability to predict slot values to some extent.
However, the slot–value matching approach in our model is relatively limited, which can lead to missing slot values, especially for hard-to-enumerate value types such as time and for unseen domain–slot combinations. In future work, we plan to address this issue in two ways: (1) combining value selection from the system response with value selection from the slot description, increasing the flexibility of slot value selection and further enhancing the model; and (2) adopting slot value generation techniques to improve performance in few-shot and zero-shot scenarios, enabling the model to handle unseen domains and slots as well.

Author Contributions

Conceptualization, X.X.; methodology, C.Y.; software, D.L.; validation, C.Y., D.L., and P.C.; formal analysis, X.Z.; investigation, D.T.; resources, P.C.; data curation, D.T.; writing—original draft preparation, C.Y.; writing—review and editing, X.X.; visualization, D.L.; supervision, X.X.; project administration, X.Z. and Y.Y; funding acquisition X.Z. All authors have read and agreed to the published version of the manuscript.

Funding

This work was supported by the Key Research Program of Chongqing Science & Technology Commission (Grant No. cstc2021jscx-dxwtBX0019) and the Scientific Research Foundation of Chongqing University of Technology (Grant No. 2021ZDZ025).

Institutional Review Board Statement

Not applicable.

Informed Consent Statement

Not applicable.

Data Availability Statement

Not applicable.

Conflicts of Interest

The authors declare no conflict of interest.

References

  1. Mrkšić, N.; Séaghdha, D.O.; Wen, T.H.; Thomson, B.; Young, S. Neural belief tracker: Data-driven dialogue state tracking. arXiv 2016, arXiv:1606.03777. [Google Scholar]
  2. Zhong, V.; Xiong, C.; Socher, R. Global-locally self-attentive encoder for dialogue state tracking. In Proceedings of the 56th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), Melbourne, Australia, 15–20 July 2018; pp. 1458–1467. [Google Scholar]
  3. Lee, H.; Lee, J.; Kim, T.Y. SUMBT: Slot-utterance matching for universal and scalable belief tracking. arXiv 2019, arXiv:1907.07421. [Google Scholar]
  4. Nouri, E.; Hosseini-Asl, E. Toward scalable neural dialogue state tracking model. arXiv 2018, arXiv:1812.00899. [Google Scholar]
  5. Hu, J.; Yang, Y.; Chen, C.; He, L.; Yu, Z. SAS: Dialogue state tracking via slot attention and slot information sharing. In Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics, Online, 5–10 July 2020; pp. 6366–6375. [Google Scholar]
  6. Heck, M.; van Niekerk, C.; Lubis, N.; Geishauser, C.; Lin, H.C.; Moresi, M.; Gašić, M. Trippy: A triple copy strategy for value independent neural dialog state tracking. arXiv 2020, arXiv:2005.02877. [Google Scholar]
  7. Ouyang, Y.; Chen, M.; Dai, X.; Zhao, Y.; Huang, S.; Chen, J. Dialogue state tracking with explicit slot connection modeling. In Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics, Online, 5–10 July 2020; pp. 34–40. [Google Scholar]
  8. Ye, F.; Manotumruksa, J.; Zhang, Q.; Li, S.; Yilmaz, E. Slot self-attentive dialogue state tracking. In Proceedings of the Web Conference 2021, Ljubljana, Slovenia, 19–23 April 2021; pp. 1598–1608. [Google Scholar]
  9. Budzianowski, P.; Wen, T.H.; Tseng, B.H.; Casanueva, I.; Ultes, S.; Ramadan, O.; Gašić, M. MultiWOZ–a large-scale multi-domain wizard-of-oz dataset for task-oriented dialogue modelling. arXiv 2018, arXiv:1810.00278. [Google Scholar]
  10. Eric, M.; Goel, R.; Paul, S.; Kumar, A.; Sethi, A.; Ku, P.; Goyal, A.K.; Agarwal, S.; Gao, S.; Hakkani-Tur, D. MultiWOZ 2.1: A consolidated multi-domain dialogue dataset with state corrections and state tracking baselines. arXiv 2019, arXiv:1907.01669. [Google Scholar]
  11. Henderson, M.; Thomson, B.; Williams, J.D. The second dialog state tracking challenge. In Proceedings of the 15th Annual Meeting of the Special Interest Group on Discourse and Dialogue (SIGDIAL), Philadelphia, PA, USA, 18–20 June 2014; pp. 263–272. [Google Scholar]
  12. Henderson, M.; Thomson, B.; Williams, J.D. The third dialog state tracking challenge. In Proceedings of the 2014 IEEE Spoken Language Technology Workshop (SLT), South Lake Tahoe, NV, USA, 7–10 December 2014; pp. 324–329. [Google Scholar]
  13. Sun, K.; Chen, L.; Zhu, S.; Yu, K. The SJTU system for dialog state tracking challenge 2. In Proceedings of the 15th Annual Meeting of the Special Interest Group on Discourse and Dialogue (SIGDIAL), Philadelphia, PA, USA, 18–20 June 2014; pp. 318–326. [Google Scholar]
  14. Henderson, M.; Thomson, B.; Young, S. Deep neural network approach for the dialog state tracking challenge. In Proceedings of the SIGDIAL 2013 Conference, Metz, France, 22–24 August 2013; pp. 467–471. [Google Scholar]
  15. Wu, C.S.; Madotto, A.; Hosseini-Asl, E.; Xiong, C.; Socher, R.; Fung, P. Transferable multi-domain state generator for task-oriented dialogue systems. arXiv 2019, arXiv:1905.08743. [Google Scholar]
  16. Kim, S.; Yang, S.; Kim, G.; Lee, S.W. Efficient dialogue state tracking by selectively overwriting memory. arXiv 2019, arXiv:1911.03906. [Google Scholar]
  17. Lesci, P.; Fujinuma, Y.; Hardalov, M.; Shang, C.; Marquez, L. Diable: Efficient Dialogue State Tracking as Operations on Tables. arXiv 2023, arXiv:2305.17020. [Google Scholar]
  18. Zhang, H.; Bao, J.; Sun, H.; Wu, Y.; Li, W.; Cui, S.; He, X. MoNET: Tackle State Momentum via Noise-Enhanced Training for Dialogue State Tracking. arXiv 2022, arXiv:2211.05503. [Google Scholar]
  19. Gao, S.; Sethi, A.; Agarwal, S.; Chung, T.; Hakkani-Tur, D. Dialog state tracking: A neural reading comprehension approach. arXiv 2019, arXiv:1908.01946. [Google Scholar]
  20. Xu, P.; Hu, Q. An end-to-end approach for handling unknown slot values in dialogue state tracking. arXiv 2018, arXiv:1805.01555. [Google Scholar]
  21. Guo, J.; Shuang, K.; Li, J.; Wang, Z.; Liu, Y. Beyond the Granularity: Multi-Perspective Dialogue Collaborative Selection for Dialogue State Tracking. arXiv 2022, arXiv:2205.10059. [Google Scholar]
  22. Devlin, J.; Chang, M.W.; Lee, K.; Toutanova, K. Bert: Pre-training of deep bidirectional transformers for language understanding. arXiv 2018, arXiv:1810.04805. [Google Scholar]
  23. Radford, A.; Wu, J.; Child, R.; Luan, D.; Amodei, D.; Sutskever, I. Language models are unsupervised multitask learners. OpenAI Blog 2019, 1, 9. [Google Scholar]
  24. Zhu, S.; Li, J.; Chen, L.; Yu, K. Efficient context and schema fusion networks for multi-domain dialogue state tracking. arXiv 2020, arXiv:2004.03386. [Google Scholar]
  25. Bebensee, B.; Lee, H. Span-Selective Linear Attention Transformers for Effective and Robust Schema-Guided Dialogue State Tracking. arXiv 2023, arXiv:2306.09340. [Google Scholar]
  26. Li, X.; Li, Q.; Wu, W.; Yin, Q. Generation and Extraction Combined Dialogue State Tracking with Hierarchical Ontology Integration. In Proceedings of the 2021 Conference on Empirical Methods in Natural Language Processing, Punta Cana, Dominican Republic, 7–11 November 2021; pp. 2241–2249. [Google Scholar]
  27. Wang, Y.; Zhao, J.; Bao, J.; Duan, C.; Wu, Y.; He, X. Luna: Learning slot-turn alignment for dialogue state tracking. arXiv 2022, arXiv:2205.02550. [Google Scholar]
  28. Feng, Y.; Lipani, A.; Ye, F.; Zhang, Q.; Yilmaz, E. Dynamic schema graph fusion network for multi-domain dialogue state tracking. arXiv 2022, arXiv:2204.06677. [Google Scholar]
  29. Jiao, F.; Guo, Y.; Huang, M.; Nie, L. Enhanced Multi-Domain Dialogue State Tracker With Second-Order Slot Interactions. IEEE/ACM Trans. Audio Speech Lang. Process. 2022, 31, 265–276. [Google Scholar] [CrossRef]
  30. Vaswani, A.; Shazeer, N.; Parmar, N.; Uszkoreit, J.; Jones, L.; Gomez, A.N.; Kaiser, Ł.; Polosukhin, I. Attention is all you need. Adv. Neural Inf. Process. Syst. 2017, 30, 6000–6010. [Google Scholar]
  31. Wang, X.; Li, Y.; Zhang, H.; Shan, Y. Towards real-world blind face restoration with generative facial prior. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Nashville, TN, USA, 20–25 June 2021; pp. 9168–9178. [Google Scholar]
Figure 1. Block diagram for the proposed framework.
Figure 2. Our ER-DST architecture overview. The diagram on the right depicts the overall structure of ER-DST, while the diagram on the left illustrates the internal structure of the external slot relationship memory. The gray BERT encoder represents frozen parameters that are not fine-tuned, while the blue BERT encoder represents fine-tuned parameters.
Figure 3. The figure displays the distribution of the number of turns in single-domain and multi-domain dialogues in the MultiWOZ.
Figure 4. The accuracy of 30 specific slots in the 5 domains of the MultiWOZ 2.1.
Table 1. In a task-oriented multi-turn dialogue example, the values of the slots “hotel-pricerange” and “hotel-area” should correspond to the same values as the slots “restaurant-pricerange” and “restaurant-area”.
User: Can you help me find a place to eat in the expensive price range in the south part of town?
Sys: What about the Cambridge chop house. The address is: 1 kings Parade. Do you like me to book a table?
User: Yes! Can you book for 2 people at 14:15 on this Sunday?
Sys: I was able to book your party of 2 into The Chop House on Sunday at 14:15 Reference: MUWLBLLW.
User: Actually yes I will be needing somewhere to stay too please. A 3 star hotel in the same area and price range as my restaurant.
Sys: I have one matching result. It is the lensfield hotel. Do you want to reserve a room?
User: Yes please, two people for two nights on Sunday.
Table 2. Introduction of advanced methods.
Methods | Description
CSFN-DST | A pattern graph with context encoding
SOM-DST | A fixed-size memory that is selectively overwritten
TripPy | Three copy mechanisms
STAR | A self-attention network
Table 3. The joint goal accuracy (%) tested on the MultiWOZ 2.0 and MultiWOZ 2.1.
Model | MultiWOZ 2.0 | MultiWOZ 2.1
CSFN-DST | 51.57 | 52.88
SOM-DST | 51.72 | 53.01
TripPy | - | 55.29
STAR | 54.53 | 56.36
ER-DST (ours) | 54.76 | 56.75
Table 4. The joint goal accuracy (%) for specific domains in the MultiWOZ 2.1.
Domain | CSFN-DST | SOM-DST | TripPy | STAR | ER-DST (Ours)
Attraction | 64.78 | 69.83 | 73.37 | 70.95 | 71.16
Hotel | 46.29 | 49.53 | 50.21 | 52.99 | 53.21
Restaurant | 64.64 | 65.72 | 70.47 | 69.17 | 67.03
Taxi | 47.35 | 59.96 | 37.64 | 66.67 | 66.95
Train | 69.79 | 70.36 | 72.51 | 75.10 | 74.64
Table 5. Ablation study on the MultiWOZ 2.1.
Model | Accuracy
No Slot Relation | 46.25%
Self-attention | 56.36%
Layer norm | 55.19%
Our Full | 56.75%
Disclaimer/Publisher’s Note: The statements, opinions and data contained in all publications are solely those of the individual author(s) and contributor(s) and not of MDPI and/or the editor(s). MDPI and/or the editor(s) disclaim responsibility for any injury to people or property resulting from any ideas, methods, instructions or products referred to in the content.

Share and Cite

Xing, X.; Yang, C.; Lin, D.; Teng, D.; Chen, P.; Zhang, X. External Slot Relationship Memory for Multi-Domain Dialogue State Tracking. Appl. Sci. 2023, 13, 8943. https://doi.org/10.3390/app13158943
