Article

STOD: Towards Scalable Task-Oriented Dialogue System on MultiWOZ-API

School of Artificial Intelligence, Beijing University of Posts and Telecommunications, Beijing 100083, China
* Author to whom correspondence should be addressed.
Appl. Sci. 2024, 14(12), 5303; https://doi.org/10.3390/app14125303
Submission received: 8 May 2024 / Revised: 15 June 2024 / Accepted: 17 June 2024 / Published: 19 June 2024
(This article belongs to the Special Issue Natural Language Processing (NLP) and Applications—2nd Edition)

Abstract
Task-oriented dialogue systems (TODs) enable users to complete specific goals and are widely used in practice. Although existing models achieve strong performance on single-domain dialogues, their scalability to new domains remains underexplored. Traditional dialogue systems rely on domain-specific information such as the dialogue state and the database (DB), which limits their scalability. In this paper, we propose a Scalable Task-Oriented Dialogue modeling framework (STOD). Instead of labeling multiple dialogue components, as adopted by previous work, we only predict structured API queries to interact with the DB and generate responses based on the complete DB results. Further, we construct a new API-schema-based TOD dataset, MultiWOZ-API, with API query and DB result annotations based on MultiWOZ 2.1. We then propose MSTOD and CSTOD for multi-domain and cross-domain TOD systems, respectively. We perform extensive experiments to verify the effectiveness of the proposed framework and find the following. (1) Scalability across multiple domains: MSTOD achieves a 2% improvement over the previous state of the art in multi-domain TOD. (2) Scalability to new domains: our framework generalizes well to new domains, outperforming existing baselines by a significant margin of 10%.

1. Introduction

Task-oriented dialogue (TOD) systems are playing an increasingly important role in our daily lives, such as personal assistants and customer service. With the increasing number of dialogue tasks, the scalability of TOD models is becoming a key factor hindering the widespread adoption of TOD systems.
Existing work falls mainly into two categories. The first focuses on the scalability of a single module in a pipeline system, but composing an effective scalable TOD system from multiple scalable modules is difficult due to error propagation. The second is end-to-end TOD modeling. Compared with pipeline systems, an end-to-end system generates dialogue states, dialogue actions, and dialogue responses in a cascaded order, which is easier to scale. However, while existing end-to-end dialogue systems scale well across slots and dialogue actions, they scale poorly in dialogue policy, which is more important. This weakness has the following causes.
Firstly, the dialogue state consists of a set of domain-specific slots and their values. When scaling to a new domain, existing models can predict correct values for slots similar to those in the source domains, but the remaining parts of the dialogue state are difficult to predict correctly. Secondly, existing models simplify the DB results (they only input the number of entities that meet the user’s requirements). Incomplete DB results lead the model to learn a database-specific dialogue policy, which is biased when scaled to new domains. Finally, existing models delexicalize the response to learn value-independent parameters [1], making it difficult to learn the grounding ability based on DB results in the dialogue policy.
To address the aforementioned problems, we propose a new scalable TOD modeling framework, shown in Figure 1. Firstly, we predict an API query that can be applied directly to interact with the database, instead of a dialogue state. Unlike the domain-specific dialogue state, the API query is API-specific, and different domains share similar APIs, which makes it more scalable across domains. Secondly, we use the complete DB results to learn the dialogue policy, so that dialogue policy learning is decoupled from the specific database. Finally, we predict the complete response based on the API query and the complete DB results, so that the model can learn grounding capabilities in the dialogue policy.
To verify the effectiveness of the proposed framework, we first annotated API queries on the multi-domain TOD dataset MultiWOZ 2.1 [2]. We interacted with the database according to the annotated API queries and obtained the complete DB results. Secondly, we built a multi-domain scalable end-to-end TOD system, which casts natural language utterances and structured API queries into a unified text-to-text format. Finally, to verify the scalability of our model in cross-domain settings, we built a cross-domain end-to-end TOD system.
Our main contributions are as follows.
First, we propose a Scalable Task-Oriented Dialogue modeling framework (STOD). Compared with existing work, it is better suited to improving the scalability of TOD systems.
Second, we build a new API-schema-based dataset, MultiWOZ-API, by re-annotating MultiWOZ 2.1 [2] to evaluate the scalability of task-oriented dialogue systems.
Third, based on STOD, we propose a scalable multi-domain task-oriented dialogue model (MSTOD) and a cross-domain task-oriented dialogue model (CSTOD). The experimental results on MultiWOZ-API show that MSTOD and CSTOD outperform the SOTA baseline models in multi-domain and cross-domain settings, respectively.

2. Related Work

2.1. Task-Oriented Dialogue System

The task-oriented dialogue system aims to assist the user in completing certain tasks in a specific domain, such as restaurant booking and flight booking [3]. Existing studies can be broadly classified into two categories: pipeline and end-to-end methods. The former is typically divided into several modules, including natural language understanding (NLU) [4,5,6], dialogue state tracking (DST) [7], dialogue policy (Policy) [8,9], and natural language generation (NLG) [10]. In contrast, end-to-end methods [11,12,13,14,15] build the system with a single model, which directly takes a natural language context as input and outputs a natural language response. Recent advances in large-scale pre-trained language models (PLMs) [16,17] have boosted the development of end-to-end dialogue models. Lin et al. [11], Peng et al. [12], and Su et al. [13] follow a similar multi-task learning paradigm: build a unified model by simultaneously learning multiple pipeline subtasks on top of PLMs. Sun et al. [14] and Yang et al. [15] further mitigate the exposure bias of cascaded generation. However, these methods rely heavily on a pre-defined domain schema; e.g., the slots of the restaurant domain include address, area, cuisine type, price range, etc. They do not consider the pre-defined ontology in the model design and therefore suffer from poor generalization and scalability to new domains. They also depend on heuristic rules to convert dialogue states (belief spans) into database queries, so they are not fully end-to-end.

2.2. Scalability and Transferability of TOD

Most related domain adaptation models focus on one or several subtasks of a pipeline TOD system. He et al. [18] and Wang et al. [19] propose cross-domain NLU models using contrastive learning [20]. Lin et al. [21] encode the slot description and history to generate the corresponding slot values for DST. Mo et al. [22] transfer the dialogue policy model between domains by learning act and state transfer functions. Peng et al. [10] use a large set of annotated NLG corpora for supervised pre-training. However, these methods only address part of a TOD system and lack practical usage in real scenarios. Further, Sun et al. [14] and Yang et al. [15] directly apply end-to-end models to the domain adaptation of TOD but hardly disentangle the dialogue model from the domain schema, resulting in poor scalability.

3. Construction of MultiWOZ-API

To verify the effectiveness of the proposed framework, we investigated existing task-oriented dialogue datasets and found that only TicketTalk [23] annotates API query information. However, that dataset is limited to a single domain, which is insufficient for scalability research. Additionally, without a database against which to execute the API queries, obtaining the complete results needed to improve a scalable dialogue policy is not feasible. To this end, we build an API-schema-based multi-domain dataset, MultiWOZ-API, by re-annotating MultiWOZ 2.1 [2], a multi-domain task-oriented dialogue dataset that provides a complete database. Specifically, we first define the API schema information of each domain and then re-annotate the dataset according to the existing dialogue state information.

3.1. API Schema Definition

Since the hospital and police domains have no validation or test data, we only annotate the API schema for the attraction, hotel, restaurant, taxi, and train domains of MultiWOZ 2.1. First, we define the API categories of each domain. There are three categories of API: find, book, and get_attr. The find APIs query entities that meet the user’s requirements from the database; the book APIs execute a booking action for a predetermined entity and return whether the booking succeeds, along with the failure reason if it fails; and the get_attr APIs retrieve attribute values of specific entities. In total, 12 different APIs are defined.
Second, we specify the input parameters required by each API query and the output parameters returned by the database. The number of input parameters ranges from 2 to 6, with an average of 3.3. Compared with the existing multi-domain dialogue state tracking task, which tracks all 30 slots at every turn, this reduces the difficulty of understanding and increases scalability across domains. APIs of the same type share the same output parameters, which makes learning a dialogue policy based on the returned results more scalable across domains. The API schema details are given in Appendix A.
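To make the schema concrete, the following sketch encodes two of the twelve APIs as plain Python dictionaries. The parameter names come from Table A1 in Appendix A; the dictionary layout itself is our illustrative choice, not the dataset’s storage format.

```python
# Illustrative encoding of part of the MultiWOZ-API schema.
# Parameter names follow Table A1; the dict layout is a hypothetical choice.
API_SCHEMA = {
    "find-restaurant": {
        "type": "find",
        "input": ["restaurant-area", "restaurant-food", "restaurant-pricerange"],
        # All find APIs share the same output parameters.
        "output": ["choice", "kb_matched_results", "kb_recommended_results"],
    },
    "book-restaurant": {
        "type": "book",
        "input": ["restaurant-name", "restaurant-book_day",
                  "restaurant-book_time", "restaurant-book_people"],
        # All book APIs share the same output parameters.
        "output": ["book_status", "book_reference_number", "book_fail_reason"],
    },
}
```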

3.2. Data Re-Annotation Process

We re-annotated the entire dataset based on the defined API schema. The specific process is as follows.
We first determine whether each user turn involves an API and, if so, annotate the API names for that turn; each turn involves 0 to 3 APIs. The choice of API depends mainly on the intent of the current user turn and on whether the next system turn introduces new information from the database, such as the number of queried entities, the booking status, or entity attribute values. If there is no new information, we do not annotate any API for the turn.
Secondly, we extract the value of each input parameter of each API from the turn’s dialogue state, and extract part of the output parameter values from the turn’s dialogue act information, such as the number of entities for find APIs, the reservation status for book APIs, and the entity attribute values for get_attr APIs. For find and get_attr APIs, we use the extracted input parameter values to query the database and obtain the entities that satisfy the input constraints. We then use the number of entities and the entity attribute values to verify that the annotations are correct.
Finally, we manually re-annotate the incorrect API annotations. The errors mainly fall into the following types: (1) the annotations of the API input parameter values are incorrect or missing; (2) the annotations of the dialogue actions are incorrect or missing; and (3) the slot value annotations are not standardized and cannot be matched against the database.
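As a rough illustration of the extraction and verification steps above, the sketch below derives the input parameters of a find API from a turn’s dialogue state and checks the result count against the annotated dialogue act. It reuses the API_SCHEMA dictionary sketched in Section 3.1; the function and field names are hypothetical simplifications, not our actual annotation scripts.

```python
def annotate_find_api(domain, dialogue_state, dialogue_act, db):
    """Hypothetical sketch: build a find-API annotation for one user turn.

    dialogue_state: dict mapping slot names (e.g. "restaurant-area") to values.
    dialogue_act:   dict with system-act info, e.g. {"choice": 3}.
    db:             object whose query(api_name, params) returns matching entities.
    """
    api_name = f"find-{domain}"
    # Keep only the slots that are input parameters of this API and have a value.
    params = {slot: value for slot, value in dialogue_state.items()
              if slot in API_SCHEMA[api_name]["input"] and value}
    results = db.query(api_name, params)
    # Cross-check against the entity count in the dialogue act;
    # mismatches are flagged for manual re-annotation.
    verified = ("choice" not in dialogue_act
                or dialogue_act["choice"] == len(results))
    return {"api": api_name, "input": params,
            "db_results": results, "verified": verified}
```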

3.3. Data Analysis

We perform an analysis of the annotated MultiWOZ-API. As shown in Table 1, there are 2877 single-domain dialogues comprising 13,254 turns, an average of 4.61 turns per dialogue, and 7025 multi-domain dialogues comprising 56,536 turns, an average of 8.05 turns per dialogue. In total, there are 9902 dialogues comprising 69,790 turns, an average of 7.05 turns per dialogue.
We then calculate the number of turns annotated with APIs and the proportion of turns for each API type among all dialogue turns. As shown in Table 1, about 57.25% of single-domain dialogue turns interact with the database, and the proportion in multi-domain dialogues is slightly higher. Among the three API categories, find APIs account for the largest share, followed by book APIs, while get_attr APIs have the smallest share.
A single turn may call from 1 to 4 APIs. We calculate the number of API calls per turn. As shown in Table 1, about 95.05% of the turns that call APIs involve exactly 1 API, about 4.92% involve 2 APIs, and about 0.03% involve 3 or more. In single-domain dialogues, each turn calls at most 2 APIs, whereas in multi-domain dialogues, each turn calls at most 4 APIs. The API complexity per turn is thus higher in multi-domain dialogues than in single-domain dialogues.

4. Methodology

4.1. Scalable Task-Oriented Dialogue Modeling Framework

The traditional pipeline-based TOD system consists of three modules, natural language understanding, dialogue management, and natural language generation, and optimizes them separately. Building on this definition, end-to-end TOD methods model dialogue processing as three sub-tasks, namely dialogue state tracking, dialogue action generation, and dialogue response generation, and optimize them jointly. In this paper, to make it easier to build highly scalable end-to-end TOD systems, we simplify the traditional end-to-end modeling structure and propose a scalable end-to-end TOD modeling framework consisting of two sub-tasks, dialogue API query generation and dialogue response generation, as shown in Figure 2.

4.1.1. API Query Generation

Traditional task-oriented dialogue systems follow a two-stage method to query the database: first, a dialogue state tracker (DST) models the dialogue state; then, the state is used to query the database with predefined handcrafted rules. This approach has two disadvantages: (1) dialogue states are inherently domain-specific, and (2) current dialogue systems only support predefined handcrafted query rules, which are also domain-specific. These problems make it difficult to transfer effectively to other domains.
Considering the above problems, we focus on domain-independent API call generation instead of domain-specific dialogue state tracking. The goal of this task is to build an API query generator, which takes the dialogue context as input and predicts an API query. The API query is used to interact with the database and consists of the API name and the corresponding input parameters.

4.1.2. DB-Grounded Response Generation

Unlike existing end-to-end TOD systems, which learn the dialogue policy from dialogue states and simplified database results, the goal of this task is to generate responses from the complete returned database results and the dialogue context, which decouples dialogue policy learning from domain-specific databases.

4.2. Multi-Domain Scalable End-to-End Task-Oriented Dialogue System

Based on the above dialogue modeling framework, we construct a multi-domain end-to-end TOD system (MSTOD). MSTOD models a task-oriented dialogue as follows: the model first predicts whether a database query is required based on the last user utterance $U_t$ and the dialogue context $\{U_1, Q_1, D_1, R_1, \ldots, Q_{t-1}, D_{t-1}, R_{t-1}\}$. If not, the model directly generates a response $R_t$. Otherwise, the model generates an API query $Q_t$, interacts with the database, and receives the DB result $D_t$. Whenever an API query is generated, both the query and the returned DB result are appended to the context, and generation continues until the next natural language response $R_t$.
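The following minimal sketch illustrates this inference loop. The `model.generate` wrapper (string in, string out) and the `db.execute` helper that runs a serialized API query are illustrative placeholders, not parts of an existing library; the [USER] and [SYS] markers are assumed stand-ins for the special tokens described in Appendix B, while [API_IN] follows the query serialization used in Table A2.

```python
def mstod_turn(model, db, context, user_utterance):
    """One MSTOD turn (illustrative names; model and db are simple wrappers)."""
    context += " [USER] " + user_utterance        # [USER]/[SYS] markers are assumed
    while True:
        output = model.generate(context)          # an API query Q_t or a response R_t
        if not output.startswith("[API_IN]"):     # natural language response: done
            return context + " [SYS] " + output, output
        db_result = db.execute(output)            # serialized DB result D_t
        context += " " + output + " " + db_result # append Q_t, D_t; keep generating
```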

4.2.1. Serialization of API Query and DB Result

Since the API query and the returned DB results are structured and cannot be directly input into the model, we serialize each of them as follows:

$Q = apn \, [\text{ASN}] \, ain_1 \, [\text{ASV}] \, aiv_1 \, \cdots \, [\text{ASN}] \, ain_n \, [\text{ASV}] \, aiv_n$

Here, $apn$ is the API name, and $ain_i$ and $aiv_i$ are the name and value of the $i$-th input parameter of the API query. [ASN] and [ASV] are special tokens prompting the parameter names and parameter values, respectively.
We apply the same serialization to the returned DB results:

$D = apn \, [\text{ASN}] \, aon_1 \, [\text{ASV}] \, aov_1 \, \cdots \, [\text{ASN}] \, aon_n \, [\text{ASV}] \, aov_n$

Here, $apn$ is the API name, and $aon_i$ and $aov_i$ are the name and value of the $i$-th output parameter.
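The two serializations can be implemented in a few lines. The sketch below assumes the query and result are held as an API name plus an ordered parameter dictionary, which is our illustrative representation:

```python
def serialize_api_query(apn, input_params):
    """Flatten an API query into 'apn [ASN] ain_1 [ASV] aiv_1 ...'."""
    parts = [apn]
    for name, value in input_params.items():
        parts += ["[ASN]", name, "[ASV]", str(value)]
    return " ".join(parts)

def serialize_db_result(apn, output_params):
    """Flatten a DB result the same way, over the output parameters."""
    parts = [apn]
    for name, value in output_params.items():
        parts += ["[ASN]", name, "[ASV]", str(value)]
    return " ".join(parts)

# Example:
# serialize_api_query("find-attraction", {"attraction-name": "williams art and antiques"})
# -> 'find-attraction [ASN] attraction-name [ASV] williams art and antiques'
```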

4.2.2. Architecture and Training Objective

T5 [24] is a large-scale pre-trained model based on the encoder–decoder architecture. It has been proven to achieve good performance on many structured and unstructured generation tasks. The proposed MSTOD model is built upon T5 and is fine-tuned with our re-annotated MultiWOZ-API using the modeling framework outlined above.
The training objective of MSTOD is to maximize the probability of the target sequence given the input sequence. Specifically, when the model predicts the API query at turn $t$, the loss is

$\mathcal{L}_Q = -\log p(Q_t \mid U_1, Q_1, D_1, R_1, \ldots, D_{t-1}, R_{t-1}, U_t)$

When the model predicts the $t$-th system response, the loss is

$\mathcal{L}_R = -\log p(R_t \mid U_1, Q_1, D_1, R_1, \ldots, D_{t-1}, R_{t-1}, U_t, Q_t, D_t)$

The overall loss is

$\mathcal{L}_{\text{MSTOD}} = \mathcal{L}_Q + \mathcal{L}_R$
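Both losses are standard teacher-forced negative log-likelihoods, so each turn yields up to two seq2seq training examples. A minimal sketch with HuggingFace’s T5 follows; the exact preprocessing and batching in our implementation may differ:

```python
from transformers import T5ForConditionalGeneration, T5Tokenizer

tokenizer = T5Tokenizer.from_pretrained("t5-base")
model = T5ForConditionalGeneration.from_pretrained("t5-base")

def turn_loss(context, api_query, db_result, response):
    """L_Q + L_R for one turn; all arguments are already-serialized strings."""
    # L_Q: predict the API query from the dialogue context.
    enc = tokenizer(context, return_tensors="pt", truncation=True)
    labels = tokenizer(api_query, return_tensors="pt").input_ids
    l_q = model(input_ids=enc.input_ids,
                attention_mask=enc.attention_mask, labels=labels).loss
    # L_R: predict the response from context + Q_t + D_t.
    enc = tokenizer(" ".join([context, api_query, db_result]),
                    return_tensors="pt", truncation=True)
    labels = tokenizer(response, return_tensors="pt").input_ids
    l_r = model(input_ids=enc.input_ids,
                attention_mask=enc.attention_mask, labels=labels).loss
    return l_q + l_r
```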

4.3. Cross-Domain Scalable End-to-End Task-Oriented Dialogue System

To validate the effectiveness of the proposed scalable TOD framework in cross-domain settings, we build a cross-domain end-to-end TOD system (CSTOD). Unlike multi-domain TOD, cross-domain TOD involves multiple source domains and a target domain that does not appear among the source domains. We train our model on the training data from the source domains and test it on the target domain. The model needs to decouple the learned policy from domain-specific API schemas. Therefore, in addition to user utterances (U), system responses (R), API queries, and the corresponding DB results, the API schema of each domain is included as part of the dialogue context; specifically, it is prepended to the dialogue history.

4.3.1. Serialization of API Schema

The API schema of each domain mainly includes the domain-specific API queries and their input and output parameters. We serialize an API schema as follows:

$S_d = [\text{API\_SCHEMA}] \, apn_1 \, apn_2 \cdots apn_{N_d} \, [\text{API\_ARGS}] \, [\text{API\_NAME}] \, apn_1 \, [\text{API\_IN}] \, [\text{ASN}] \, ain_1^1 \cdots [\text{ASN}] \, ain_1^{n_1^i} \, [\text{API\_OUT}] \, [\text{ASN}] \, aon_1^1 \cdots [\text{ASN}] \, aon_1^{n_1^o} \cdots [\text{API\_NAME}] \, apn_{N_d} \, [\text{API\_IN}] \cdots [\text{API\_OUT}] \cdots$

where [API_SCHEMA], [API_ARGS], [API_IN], [API_OUT], and [ASN] are special tokens representing the available APIs, the API parameters, the API input, the API output, and a parameter name, respectively; $apn_i$ is the $i$-th API query in the domain; $N_d$ is the number of available API queries in domain $d$; and $ain_i^j$ and $aon_i^j$ are the $j$-th input and output parameters of the $i$-th API query, respectively.
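Under the same dictionary representation sketched in Section 3.1, the schema serialization above can be written as follows (an illustrative sketch mirroring the equation, not the exact implementation):

```python
def serialize_schema(domain_apis):
    """Flatten a domain's API schema into the token sequence S_d.

    domain_apis maps API names to {'input': [...], 'output': [...]},
    as in the API_SCHEMA sketch of Section 3.1.
    """
    parts = ["[API_SCHEMA]"] + list(domain_apis) + ["[API_ARGS]"]
    for apn, spec in domain_apis.items():
        parts += ["[API_NAME]", apn, "[API_IN]"]
        for name in spec["input"]:
            parts += ["[ASN]", name]
        parts.append("[API_OUT]")
        for name in spec["output"]:
            parts += ["[ASN]", name]
    return " ".join(parts)
```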

4.3.2. Architecture and Training Objective

The training objective is the same as for MSTOD. Specifically,

$\mathcal{L}_Q = -\log p(Q_t \mid S_d, U_1, Q_1, D_1, R_1, \ldots, D_{t-1}, R_{t-1}, U_t)$

$\mathcal{L}_R = -\log p(R_t \mid S_d, U_1, Q_1, D_1, R_1, \ldots, D_{t-1}, R_{t-1}, U_t, Q_t, D_t)$

$\mathcal{L}_{\text{CSTOD}} = \mathcal{L}_Q + \mathcal{L}_R$

where $S_d$ is the serialized API schema of domain $d$. When constructing the training data, $S_d$ corresponds to the source domains represented in the dialogue; for the test data, $S_d$ corresponds to the target domain.

5. Experiment and Analysis

5.1. Multi-Domain Dialogue Modeling

5.1.1. Experiment Setting

Dataset

We evaluate our proposed multi-domain scalable task-oriented dialogue model MSTOD on MultiWOZ-API, described in Section 3. Note that we do not modify the raw dialogue text, only the annotations, so we can compare MSTOD with previous TOD models on MultiWOZ 2.1.

Evaluation Metrics

We use the official evaluation script proposed by Nekvinda and Dušek [25] to evaluate the dialogue models. (We noticed that related works use different scripts, which may lead to unfair comparisons, so we adopted the official benchmark results from https://github.com/budzianowski/multiwoz (accessed on 1 May 2024); these results are generally slightly lower than those in the original papers.) The evaluation includes the following metrics: Inform measures whether the system provides suitable entities according to the user requirements; Success checks whether the system offers the appropriate entity and the requested information; and BLEU assesses the fluency of the generated responses. The overall quality of the dialogue system is measured by the Combined score, calculated as (Inform + Success) × 0.5 + BLEU.
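For example, plugging MSTOD’s scores from Table 2 into this formula gives (89.0 + 78.6) × 0.5 + 18.4 = 83.8 + 18.4 = 102.2.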

Baseline

To set the stage for evaluating our approach against established methods, we include several strong baseline models that have demonstrated effectiveness in multi-domain task-oriented dialogues. Here is a brief overview of each:
DAMD [26] A multi-action data augmentation model that utilizes one-to-many relationships to generate diverse and appropriate dialogue responses.
AuGPT [27] A pre-trained model that enhances GPT-2 with a new dialogue consistency classification task, aiming to maintain dialogue coherence across turns by learning to classify consistent responses.
MinTL [28] An end-to-end model that uses pre-trained language models (PLMs) in a Seq2Seq manner. It introduces two different decoders to separately track belief states and generate responses, optimizing the task-oriented dialogue system.
SOLOIST [12] A model that introduces an auxiliary task, where the target belief state is replaced by the belief state from unrelated samples to predict consistency in dialogues, thereby improving the generalization of the model to various dialogue scenarios.
UBAR [29] An end-to-end model incorporating all belief states across all dialogue turns, which enriches the context available for generating responses.
PPTOD [13] An innovative model that recasts task-oriented dialogue subtasks into prompts, leveraging the multi-task transfer learning capabilities of T5 to handle diverse dialogue tasks effectively.
BORT [14] A robust model that adds a denoising reconstruction task to the encoder-decoder framework, which helps in reconstructing the original context from altered dialogue states, thus enhancing the model’s understanding of dialogue flow.
MTTOD [30] A pre-trained model that introduces a span prediction task during pre-training, helping the model to better identify and utilize relevant information within the dialogue for response generation.
GALAXY [31] A pre-trained model that focuses on pre-training tasks such as dialogue act prediction to optimize dialogue policies, aiming to improve the strategic decision-making of dialogue systems.
To draw a fair comparison, we only use the raw dialogue context as the model input and output the final natural language response for evaluation. Specifically, the baselines condition on their generated dialogue states, while our model conditions on its generated API queries.

5.1.2. Automatic End-to-End Evaluation

Table 2 displays the results of models on the multi-domain end-to-end dialogue modeling. We find our model MSTOD achieves state-of-the-art Inform, Success, and Combined scores, demonstrating the effectiveness of our proposed API-focused framework. MSTOD achieves slightly worse BLEU than Galaxy because Galaxy uses an auxiliary large-scale dialogue corpus UnPreDial for pre-training along with MultiWOZ. Further human evaluation results in Table 3 indicate that both systems have good fluency due to prior syntactic knowledge from pre-trained language models, but our MSTOD significantly outperforms Galaxy on the rate of task success and coherency. The task completion ability is essential for a task-oriented system, which we focus on in this paper.

5.1.3. Human-in-the-Loop Evaluation

Although the offline evaluation shows the superiority of our MSTOD model, we also perform an online interactive human evaluation [32,33] to estimate its practical ability. We compare our MSTOD with the strong baseline Galaxy by having them interact with real human users. In each dialogue session, we randomly sample a user goal to guide the dialogue. (We use the tools of ConvLab-2 (https://github.com/thu-coai/ConvLab-2 (accessed on 1 December 2023)) to construct new user goals. Note that these goals differ from the existing dialogues in the original MultiWOZ corpus.) A user is then instructed to converse with MSTOD or Galaxy to complete the task by following the sampled user goal. At the end of each session, the user gives explicit feedback from three perspectives: (1) Success measures whether the model completes the user goal (e.g., whether a booking was made with all the user constraints satisfied). (2) Coherency measures whether the model’s response is logically coherent with the dialogue context. (3) Fluency measures the fluency of the model’s response. Each user goal is independently evaluated by three real users on a 3-point Likert scale (0 for the worst, 2 for the best). We randomly select 50 user goals and report the average score of the three users for each goal.
Table 3 shows the results, with the first row indicating strong inter-annotator agreement as measured by the Fleiss kappa coefficient [34]. Compared with Galaxy, our MSTOD achieves better scores on all metrics. Moreover, MSTOD significantly outperforms Galaxy on the Success and Coherency metrics, suggesting that it generates more semantically coherent responses and achieves a higher task completion rate. On the Fluency metric, both systems perform well thanks to their pre-trained language models. Future research should pay more attention to contextual coherency and task success rate in practice.
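For reference, the agreement row can be computed from the raw ratings roughly as follows. The sketch uses the Fleiss kappa implementation in statsmodels; the `ratings` array here is a random placeholder standing in for the real (50 goals × 3 raters) ratings of one metric.

```python
import numpy as np
from statsmodels.stats.inter_rater import aggregate_raters, fleiss_kappa

# Placeholder ratings: 50 user goals x 3 raters, values in {0, 1, 2}.
ratings = np.random.randint(0, 3, size=(50, 3))
# Convert to a (goals x categories) count table, then compute Fleiss' kappa.
table, _ = aggregate_raters(ratings)
print(f"Fleiss kappa: {fleiss_kappa(table):.2f}")
```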

5.1.4. Further Analysis

API Query Accuracy

We evaluated the precision, recall, and F1 of the generated API queries along three dimensions: whether an API call is required (Whether Call API), whether the called API is correct (API Correctness), and whether the parameters of the called API are accurate (Args Accurateness). As shown in Table 4, the model tends to over-predict API queries: every turn that requires an API call is indeed predicted to call one (100% recall), but about 15% of the predicted calls invoke the wrong API, and about 30% invoke the right API with wrong parameters. We further analyzed the error types of API call prediction and of API parameter prediction for the three API types. As shown in Figure 3a, 67% of the erroneous turns predict fewer APIs than required. As shown in Figure 3b, for book and get_attr APIs, which have few parameters, the errors are mostly wrong parameter values, whereas find APIs, which have more parameters, are prone to generating too many or too few parameters.

Effect of Golden versus Generated API Information

Since MSTOD incorporates the API queries and DB results of previous and current turns to generate responses, in the end-to-end setting we replace the golden API queries and DB results with the generated ones. To measure the effect of API query accuracy, we test the model using golden API queries and DB results in previous turns only, and in all turns. As shown in Table 5, if MSTOD takes golden API information in previous turns, Inform, Success, and BLEU all increase consistently. If MSTOD takes golden API information in all turns, Inform and BLEU increase further, but Success drops slightly. Thus, the accuracy of API information in previous turns matters more for model performance.

5.2. Cross-Domain Dialogue Modeling

There are two settings for cross-domain evaluation. One is the zero-shot setting: select a domain as the target domain, train on the single-domain training sets of the remaining four domains, and test on the target domain’s test set. The other is the few-shot setting: after training on the single-domain training sets of the other four domains, 100 dialogues randomly selected from the target domain are used for further training, and testing is performed on the target domain’s test set.
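The two splits can be constructed as in the sketch below, where `dialogues` is a hypothetical mapping from a domain name to its list of single-domain training dialogues:

```python
import random

DOMAINS = ["attraction", "hotel", "restaurant", "taxi", "train"]

def make_cross_domain_splits(dialogues, target, n_few=100, seed=0):
    """Return (zero_shot_train, few_shot_train) for one target domain."""
    # Zero-shot: train only on the four remaining source domains.
    source = [d for dom in DOMAINS if dom != target for d in dialogues[dom]]
    # Few-shot: additionally sample 100 dialogues from the target domain.
    few = random.Random(seed).sample(dialogues[target], n_few)
    return source, source + few
```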

5.2.1. Experiment Setting

Dataset

To evaluate the cross-domain scalable task-oriented dialogue model CSTOD, we extract the single-domain dialogues of five domains, namely, attraction, hotel, restaurant, taxi, and train, from the MultiWOZ dataset for cross-domain evaluation.

Evaluation Metrics and Baseline

The evaluation metrics for the cross-domain task-oriented dialogue are the same as for the multi-domain task-oriented dialogue. For cross-domain task-oriented dialogue, we compare our CSTOD with several baseline models for cross-domain evaluation, mainly, DAMD [26], MinTL [28], UBAR [29], BORT [14], and UBARv2 [15].

5.2.2. Zero-Shot Cross-Domain Results

Table 6 shows the results of the models on cross-domain end-to-end dialogue modeling. To investigate the ability of CSTOD to generalize to unseen domains, we train models on four source domains and directly test on the target domain in a zero-shot learning setting, following BORT [14]. The results show that our CSTOD significantly outperforms all baselines on all target domains except Train, which proves that our framework can enhance the scalability of existing dialogue models. CSTOD achieves a lower Combined score than MinTL and BORT on the Train domain because the Taxi and Train domains share similar slots, such as departure and destination, but have different API types: the Taxi domain only has the book API, whereas the Train domain has find, get_attr, and book APIs. Due to this similarity, CSTOD often generates the book API but seldom the other APIs when transferring from Taxi to Train, which mainly hurts its Success. The few-shot cross-domain experiments in Section 5.2.3 confirm that a few labeled dialogues from the target domain resolve this inconsistency between API patterns.

5.2.3. Few-Shot Cross-Domain Results

We conduct further domain adaptation experiments in the few-shot setting in Table 7. In this setting, we use all source-domain data combined with 100 dialogues randomly sampled from the target domain’s training data for training, and test on the target domain’s test set. To ensure stable results, we repeat the random sampling three times and report the mean and variance. We report the results of the state-of-the-art baseline UBARv2 [15] and our CSTOD. Moreover, we add two model variants: CSTOD (zero), trained only on source-domain data (see Table 6), as a lower bound, and CSTOD (full), trained on all labeled data of the target domain, as an upper bound. We find that CSTOD outperforms UBARv2 on all domains by a large margin and closes the gap between the few-shot (low-resource) and full-data settings. Additionally, using a few annotated dialogues from the target domain significantly improves performance on complex target domains like Hotel and Res, demonstrating that our framework effectively alleviates the discrepancy between domains during adaptation with a few labeled examples.

5.2.4. Case Study

Table A2 in Appendix C presents two generated dialogue examples from our CSTOD model and BORT. The user starts the conversation by asking for an attraction named williams art and antiques, and the agent should provide detailed information about this attraction. BORT wrongly recognizes the domain as hotel due to the similar slots of the hotel and attraction domains; the generated state and action are therefore incorrect, leading to incoherent responses. In contrast, our CSTOD successfully calls a find-attraction API and then generates a grounded response. These results show that our model scales well instead of merely memorizing existing domain knowledge: it explicitly learns the domain schema and dialogue skills, which promotes the transferability of TOD systems. Moreover, feeding the DB results into the model makes the generated responses more informative rather than dull responses like “thank you”.

5.2.5. Limitation

In this paper, we propose a scalable task-oriented dialogue modeling framework (STOD) to improve the scalability of task-oriented dialogue systems. Although our proposed MSTOD and CSTOD, both based on STOD, achieve satisfying performance in multi-domain and cross-domain task-oriented dialogue systems, some issues remain unresolved. (1) More datasets: we need to evaluate our model on more datasets to test the generalization of the framework. (2) Effect of domain similarity: we want to explore how domain similarity affects the scalability of our model, which requires data from more domains. (3) More complex APIs: constrained by the dataset, we only defined three types of API; completing tasks in domains with more complex APIs is more challenging.

6. Conclusions

In this paper, we propose a scalable task-oriented dialogue modeling framework (STOD) to improve the scalability of task-oriented dialogue systems. Further, we construct MultiWOZ-API, which annotates API queries and DB results based on MultiWOZ 2.1. Based on STOD and MultiWOZ-API, we propose MSTOD and CSTOD to evaluate the effectiveness of the framework in multi-domain and cross-domain dialogue systems, respectively. Automatic and human evaluations demonstrate that STOD has better scalability than existing models.

Author Contributions

H.L.: conceptualization, methodology, coding, validation, investigation, data curation, writing—original draft preparation; C.Y. and X.W.: writing—review and editing. All authors have read and agreed to the published version of the manuscript.

Funding

This work was partially supported by the National Natural Science Foundation of China (NSFC62076032).

Institutional Review Board Statement

Not applicable.

Informed Consent Statement

Not applicable.

Data Availability Statement

The data presented in this study are openly available in the GitHub repository at [https://github.com/budzianowski/multiwoz, accessed on 1 May 2024].

Acknowledgments

The authors would like to thank Chenxu Lv and Keqing He for their contribution to the labels collection and valuable comments.

Conflicts of Interest

The authors declare no conflicts of interest.

Abbreviations

The following abbreviations are used in this manuscript:
API   Application Programming Interface
TOD   Task-oriented dialogue system
DB    Database

Appendix A. API Schema Details

Table A1 shows the input and output parameters of all APIs. APIs of the same type share the same output parameters.
Table A1. All API input and output parameters.

API Type  API Name             API Input Parameters                           API Output Parameters
find      find-attraction      attraction-area, attraction-type               choice,
          find-hotel           hotel-area, hotel-internet, hotel-parking,     kb_matched_results,
                               hotel-pricerange, hotel-stars, hotel-type      kb_recommended_results
          find-restaurant      restaurant-area, restaurant-food,
                               restaurant-pricerange
          find-train           train-arriveby, train-departure,
                               train-destination, train-leaveat, train-day
book      book-hotel           hotel-name, hotel-book_day,                    book_status,
                               hotel-book_stay, hotel-book_people             book_reference_number,
          book-restaurant      restaurant-name, restaurant-book_day,          book_fail_reason
                               restaurant-book_time, restaurant-book_people
          book-taxi            taxi-arriveby, taxi-departure,
                               taxi-destination, taxi-leaveat
          book-train           train-id, train-book_day, train-book_people
get_attr  get_attr-attraction  attraction-name, attribute_list                attribute_val_list
          get_attr-hotel       hotel-name, attribute_list
          get_attr-restaurant  restaurant-name, attribute_list
          get_attr-train       train-id, attribute_list

Appendix B. Implementation Details

We implemented the MSTOD and CSTOD models with HuggingFace’s Transformers library. The maximum input length is 1024 and the maximum output length is 256. To distinguish the different kinds of information in the input sequence, we add a corresponding special token in front of the user utterances, system responses, API calls, and returned query results. For MSTOD, we trained on an A100 GPU with a learning rate of 5 × 10−5 and a batch size of 12, validating every 1000 steps. For CSTOD, we used a smaller learning rate of 1 × 10−5 and a batch size of 12, trained on an A100 GPU, validating every 100 steps.
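A sketch of the corresponding MSTOD fine-tuning configuration with HuggingFace’s TrainingArguments is shown below; the reported hyperparameters are filled in, while the remaining arguments are illustrative defaults rather than our exact settings.

```python
from transformers import TrainingArguments

mstod_args = TrainingArguments(
    output_dir="mstod",                 # illustrative path
    learning_rate=5e-5,                 # 1e-5 for CSTOD
    per_device_train_batch_size=12,
    evaluation_strategy="steps",
    eval_steps=1000,                    # 100 for CSTOD
)
```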

Appendix C. Case Study of Cross-Domain TOD Systems

Table A2 presents two generated dialogue examples from our CSTOD model and BORT.
Table A2. Two generated dialogue examples from our CSTOD model and BORT based on the zero-shot cross-domain setting. Gen Dial State and Gen Dial Act denote the dialogue state and next dialogue action generated by BORT. Delexical Resp denotes the delexicalized response generated by BORT, where specific slot values are replaced by placeholders like value_price. Oracle Reference denotes the golden human response.
Dialogue SNG1147

Turn 1
  User   User Utterance:    i ’m looking for an attraction called williams art and antiques.
  BORT   Gen Dial State:    [hotel] name williams art and antiques
         Oracle Dial State: [attraction] name williams art and antiques
         DB Result:         attraction: 1
         Gen Dial Act:      [hotel] [inform] price address name
         Oracle Dial Act:   [attraction] [inform] name address [offerbook]
         Delexical Resp:    [value_name] is an [value_price] restaurant located at [value_address].
         Lexical Resp:      williams art and antiques is an [value_price] restaurant located at gwydir street, no. 5 dale’s brewery.
  CSTOD  Gen API Query:     [API_IN] find-attraction [ASN] attraction-name [ASV] williams art and antiques
         Oracle API Query:  [API_IN] find-attraction [ASN] attraction-name [ASV] williams art and antiques
         DB Result:         [API_OUT] find-attraction [CHOICE] 1 [RES_INFO] [ASN] attraction-name [ASV] williams art and antiques [ASN] attraction-address [ASV] gwydir street, no. 5 dale’s brewery [ASN] attraction-area [ASV] centre [ASN] attraction-entrancefee [ASV] free [ASN] attraction-phone [ASV] 01223311687 [ASN] attraction-postcode [ASV] cb12lj [ASN] attraction-type [ASV] museum
         Gen Resp:          williams art and antiques is a museum located at gwydir street, no. 5 dale’s brewery. the phone number is 01223311687.
  Human  Oracle Reference:  williams art and antiques is at gwydir street, no. 5 dale ’s brewery. would you like me to book tickets?

Turn 2
  User   User Utterance:    no thank you. just the area and postcode please.
  BORT   Gen Dial State:    [hotel] name williams art and antiques
         Oracle Dial State: [attraction] name williams art and antiques
         DB Result:         attraction: 1
         Gen Dial Act:      [general] [greet]
         Oracle Dial Act:   [attraction] [inform] postcode area [general] [reqmore]
         Delexical Resp:    thank you.
         Lexical Resp:      thank you.
  CSTOD  Gen API Query:     [API_IN] get_attr-attraction [ASN] attraction-name [ASV] williams art and antiques [ASN] attribute [ASV] attraction-postcode [AND] attraction-area
         Oracle API Query:  [API_IN] get_attr-attraction [ASN] attraction-name [ASV] williams art and antiques [ASN] attribute [ASV] attraction-postcode [AND] attraction-area
         DB Result:         [API_OUT] get_attr-attraction [ASN] attraction-postcode [ASV] cb12lj [ASN] attraction-area [ASV] centre
         Gen Resp:          the postcode is cb12lj and the area is centre.
  Human  Oracle Reference:  it is in the centre and the postcode is cb12lj. is there anything else you need help with today?

Turn 3
  User   User Utterance:    no that ’s all i needed. thank you!
  BORT   Gen Dial State:    [hotel] name williams art and antiques
         Oracle Dial State: [attraction] name williams art and antiques
         DB Result:         attraction: 1
         Gen Dial Act:      [general] [bye]
         Oracle Dial Act:   [general] [greet]
         Delexical Resp:    thank you for using our system!
         Lexical Resp:      thank you for using our system!
  CSTOD  Gen API Query:     None
         Oracle API Query:  None
         DB Result:         None
         Gen Resp:          you are welcome. goodbye.
  Human  Oracle Reference:  thank you and enjoy your visit to williams art and antiques!

References

  1. Wen, T.-H.; Gasic, M.; Mrkšić, N.; Su, P.-H.; Vandyke, D.; Young, S. Semantically Conditioned LSTM-Based Natural Language Generation for Spoken Dialogue Systems. In Proceedings of the 2015 Conference on Empirical Methods in Natural Language Processing, Lisbon, Portugal, 17–21 September 2015; pp. 1711–1721. [Google Scholar]
  2. Eric, M.; Goel, R.; Paul, S.; Sethi, A.; Agarwal, S.; Gao, S.; Kumar, A.; Goyal, A.; Ku, P.; Hakkani-Tur, D. Multiwoz 2.1: A Consolidated Multi-Domain Dialogue Dataset with State Corrections and State Tracking Baselines. In Proceedings of the 12th Language Resources and Evaluation Conference, Marseille, France, 11–16 May 2020; pp. 422–428. [Google Scholar]
  3. Zhang, Z.; Takanobu, R.; Zhu, Q.; Huang, M.; Zhu, X. Recent Advances and Challenges in Task-oriented Dialog System. arXiv 2020, arXiv:2003.07490. [Google Scholar] [CrossRef]
  4. He, K.; Lei, S.; Yang, Y.; Jiang, H.; Wang, Z. Syntactic Graph Convolutional Network for Spoken Language Understanding. In Proceedings of the 28th International Conference on Computational Linguistics, Barcelona, Spain, 8–13 December 2020. [Google Scholar]
  5. Qin, L.; Che, W.; Li, Y.; Wen, H.; Liu, T. A Stack-Propagation Framework with Token-Level Intent Detection for Spoken Language Understanding. In Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing, Hong Kong, China, 3–7 November 2019. [Google Scholar]
  6. Xu, H.; He, K.; Yan, Y.; Liu, S.; Liu, Z.; Xu, W. A Deep Generative Distance-Based Classifier for Out-of-Domain Detection with Mahalanobis Space. In Proceedings of the 28th International Conference on Computational Linguistics, Barcelona, Spain, 13–18 September 2020. [Google Scholar]
  7. Wu, C.-S.; Madotto, A.; Hosseini-Asl, E.; Xiong, C.; Socher, R.; Fung, P. Transferable Multi-Domain State Generator for Task-Oriented Dialogue Systems. In Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics, Florence, Italy, 28 July–2 August 2019; Association for Computational Linguistics: Seattle, WA, USA, 2019; Volume 1: Long Papers. [Google Scholar]
  8. Liu, S.; Zhang, J.; He, K.; Xu, W.; Zhou, J. Scheduled Dialog Policy Learning: An Automatic Curriculum Learning Framework for Task-Oriented Dialog System. In FINDINGS; Association for Computational Linguistics: Stroudsburg, PA, USA, 2021. [Google Scholar]
  9. Su, S.-Y.; Li, X.; Gao, J.; Liu, J.; Chen, Y.-N. Discriminative deep dyna-q: Robust planning for dialogue policy learning. arXiv 2018, arXiv:1808.09442. [Google Scholar]
  10. Peng, B.; Zhu, C.; Li, C.; Li, X.; Li, J.; Zeng, M.; Gao, J. Few-shot natural language generation for task-oriented dialog. arXiv 2020, arXiv:2002.12328. [Google Scholar]
  11. Lin, Z.; Madotto, A.; Winata, G.I.; Fung, P. Mintl: Minimalist transfer learning for task-oriented dialogue systems. arXiv 2020, arXiv:2009.12005. [Google Scholar]
  12. Peng, B.; Li, C.; Li, J.; Shayandeh, S.; Lidén, L.; Gao, J. Soloist: Building task bots at scale with transfer learning and machine teaching. Trans. Assoc. Comput. Linguist. 2021, 9, 807–824. [Google Scholar] [CrossRef]
  13. Su, Y.; Shu, L.; Mansimov, E.; Gupta, A.; Cai, D.; Lai, Y.-A.; Zhang, Y. Multi-Task Pre-Training for Plug-and-Play Task-Oriented Dialogue System. In Proceedings of the 60th Annual Meeting of the Association for Computational Linguistics, Dublin, Ireland, 22–27 May 2022. [Google Scholar]
  14. Sun, H.; Bao, J.; Wu, Y.; He, X. BORT: Back and Denoising Reconstruction for End-to-End Task-Oriented Dialog. In Findings of the Association for Computational Linguistics: NAACL 2022; Association for Computational Linguistics: Seattle, WA, USA, 2022; pp. 2156–2170. [Google Scholar] [CrossRef]
  15. Yang, Y.; Ding, H.; Liu, Q.; Quan, X. Ubarv2: Towards mitigating exposure bias in task-oriented dialogs. arXiv 2022, arXiv:2209.07239. [Google Scholar]
  16. Devlin, J.; Chang, M.-W.; Lee, K.; Toutanova, K. Bert: Pre-Training of Deep Bidirectional Transformers for Language Understanding. In Proceedings of the NAACL-HLT 2019, Minneapolis, MN, USA, 2–7 June 2019. [Google Scholar]
  17. Radford, A.; Wu, J.; Child, R.; Luan, D.; Amodei, D.; Sutskever, I. Language models are unsupervised multitask learners. OpenAI Blog 2019, 1, 9. [Google Scholar]
  18. He, K.; Zhang, J.; Yan, Y.; Xu, W.; Niu, C.; Zhou, J. Contrastive Zero-Shot Learning for Cross-Domain Slot Filling with Adversarial Attack. In Proceedings of the 28th International Conference on Computational Linguistics, Barcelona, Spain, 8–13 December 2020. [Google Scholar]
  19. Wang, L.; Li, X.; Liu, J.; He, K.; Yan, Y.; Xu, W. Bridge to Target Domain by Prototypical Contrastive Learning and Label Confusion: Re-Explore Zero-Shot Learning for Slot Filling. In Proceedings of the 2021 Conference on Empirical Methods in Natural Language Processing, Online, 7–11 November 2021. [Google Scholar]
  20. Chen, T.; Kornblith, S.; Norouzi, M.; Hinton, G.E. A simple framework for contrastive learning of visual representations. arXiv 2020, arXiv:2002.05709. [Google Scholar]
  21. Lin, Z.; Liu, B.; Moon, A.; Crook, P.A.; Zhou, Z.; Wang, Z.; Yu, Z.; Madotto, A.; Cho, E.; Subba, R. Leveraging Slot Descriptions for Zero-Shot Cross-Domain Dialogue Statetracking. In Proceedings of the 2021 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Online, 6–11 June 2021; pp. 5640–5648. [Google Scholar]
  22. Mo, K.; Zhang, Y.; Yang, Q.; Fung, P. Cross-domain dialogue policy transfer via simultaneous speech-act and slot alignment. arXiv 2018, arXiv:1804.07691. [Google Scholar]
  23. Byrne, B.; Krishnamoorthi, K.; Ganesh, S.; Kale, M. Tickettalk: Toward Human-Level Performance with End-to-End, Transaction-Based Dialog Systems. In Proceedings of the 59th Annual Meeting of the Association for Computational Linguistics and the 11th International Joint Conference on Natural Language Processing; Association for Computational Linguistics: Stroudsburg, PA, USA, 2021; Volume 1: Long Papers, pp. 671–680. [Google Scholar]
  24. Raffel, C.; Shazeer, N.; Roberts, A.; Lee, K.; Narang, S.; Matena, M.; Zhou, Y.; Li, W.; Liu, P.J. Exploring the limits of transfer learning with a unified text-to-text transformer. J. Mach. Learn. Res. 2020, 21, 1–67. [Google Scholar]
  25. Nekvinda, T.; Dušek, O. Shades of Bleu, Flavours of Success: The Case of Multiwoz. In Proceedings of the 1st Workshop on Natural Language Generation, Evaluation, and Metrics (GEM 2021), Bangkok, Thailand, 5–6 August 2021; pp. 34–46. [Google Scholar]
  26. Zhang, Y.; Ou, Z.; Yu, Z. Task-Oriented Dialog Systems that Consider Multiple Appropriate Responses under the Same Context. In Proceedings of the AAAI Conference on Artificial Intelligence, New York, NY, USA, 7–12 February 2020; Volume 34, pp. 9604–9611. [Google Scholar]
  27. Kulhánek, J.; Hudecek, V.; Nekvinda, T.; Dušek, O. Augpt: Dialogue with pre-trained language models and data augmentation. arXiv 2021, arXiv:2102.05126. [Google Scholar]
  28. Lin, Z.; Madotto, A.; Winata, G.I.; Fung, P.N. Mintl: Minimalist Transfer Learning for Task-Oriented Dialogue Systems. In Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing (EMNLP), Online, 16–20 November 2020. [Google Scholar]
  29. Yang, Y.; Li, Y.; Quan, X. Ubar: Towards Fully End-to-End Task-Oriented Dialog System with gpt-2. In Proceedings of the AAAI Conference on Artificial Intelligence, Virtually, 2–9 February 2021; Volume 35, pp. 14230–14238. [Google Scholar]
  30. Lee, Y. Improving End-to-end Task-Oriented Dialog System with a Simple Auxiliary Task. In Findings of the Association for Computational Linguistics: EMNLP 2021; Association for Computational Linguistics: Stroudsburg, PA, USA, 2021; pp. 1296–1303. [Google Scholar]
  31. He, W.; Dai, Y.; Zheng, Y.; Wu, Y.; Cao, Z.; Liu, D.; Jiang, P.; Yang, M.; Huang, F.; Si, L.; et al. Galaxy: A Generative Pre-Trained Model for Task-Oriented Dialog with Semi-Supervised Learning and Explicit Policy Injection. In Proceedings of the AAAI Conference on Artificial Intelligence, Virtual, 22 February–1 March 2022; Volume 36, pp. 10749–10757. [Google Scholar]
  32. Ou, Z.; Feng, J.; Li, J.; Li, Y.; Liu, H.; Peng, H.; Huang, Y.; Zhao, J. A challenge on semi-supervised and reinforced task-oriented dialog systems. arXiv 2022, arXiv:2207.02657. [Google Scholar]
  33. Peng, B.; Li, X.; Gao, J.; Liu, J.; Wong, K.-F.; Su, S.-Y. Deep Dyna-q: Integrating Planning for Task-Completion Dialogue Policy Learning. In Proceedings of the 56th Annual Meeting of the Association for Computational Linguistics, Melbourne, Australia, 15–20 July 2018. [Google Scholar]
  34. Fleiss, J.L. Measuring nominal scale agreement among many raters. Psychol. Bull. 1971, 76, 378–382. [Google Scholar] [CrossRef]
Figure 1. The comparison of our framework with the existing framework. The upper part indicates the interactions between the user, system, and database in the existing framework.
Figure 2. The overview of our proposed STOD. The black arrow shows the query generation process, and the red arrow shows the response generation process. The circled numbers indicate the order of generation.
Figure 3. (a) The ratio of API call prediction error types. (b) The ratio of API query parameter prediction error types for different API types.
Table 1. API call statistics in single-domain, multi-domain, and the total data.

                      Single    Multi     Total
dial_num              2877      7025      9902
turn_num              13,254    56,536    69,790
average_turn          4.61      8.05      7.05
API_turn ratio (%)
  all_API             57.25     63.13     62.01
  find                35.16     39.97     48.22
  book                15.61     15.08     18.74
  get_attr            9.41      11.2      13.4
API_num ratio (%)
  1                   94.89     95.09     95.05
  2                   5.11      4.88      4.92
  >=3                 0         0.04      0.03
Table 2. Multi-domain end-to-end evaluation. †: we cite these results from the official benchmark of MultiWOZ 2.1. ‡: we cite these results from Sun et al. [14]. Note that SOLOIST, PPTOD, and GALAXY use auxiliary dialogue corpora to perform pre-training.

Model       Inform  Success  BLEU  Combined
DAMD †      57.9    47.6     16.4  84.8
AuGPT †     76.6    60.5     16.8  85.4
MinTL †     73.7    65.4     19.4  89.9
SOLOIST †   82.3    72.4     13.6  90.9
UBAR †      83.7    70.3     17.6  94.4
PPTOD †     83.1    72.7     18.2  96.1
BORT ‡      85.5    77.4     17.9  99.4
MTTOD †     85.9    76.5     19.0  100.2
GALAXY †    85.4    75.7     19.6  100.2
MSTOD       89.0    78.6     18.4  102.2
Table 3. Human evaluation results.

              Success  Coherency  Fluency  Average
Agreement     0.65     0.72       0.82     0.73
Galaxy        1.1      1.3        1.7      1.37
MSTOD (ours)  1.5      1.6        1.8      1.63
Table 4. API query accuracy evaluation.

                   Precision  Recall  F1
Whether Call API   86.45      100     92.73
API Correctness    84.45      85.13   84.79
Args Accurateness  78.13      56.00   65.24
Table 5. Different API information evaluation result.

Prev  Cur  Inf.  Suc.  BLEU  Comb.
Gen   Gen  89.0  78.6  18.4  102.2
GT    Gen  91.7  80.5  19.5  105.6
GT    GT   92.2  79.1  20.8  106.4
Table 6. Zero-shot cross-domain end-to-end evaluation. †: we cite these results from the original paper [14]. Att means we use the other four domains as training data and test on the attraction domain. We only report the Combined scores for brevity.

Model    Att   Hotel  Res   Taxi  Train  Avg
DAMD †   28.7  26.9   24.4  52.3  51.4   36.7
UBAR †   28.3  29.5   23.5  59.5  53.9   38.9
MinTL †  33.4  37.3   31.5  60.4  77.1   47.9
BORT †   33.6  38.7   32.0  62.7  85.6   50.5
CSTOD    60.8  48.6   44.8  95.3  67.4   63.4
Table 7. Few-shot cross-domain end-to-end evaluation. †: we cite these results from the original paper [15]. “Few” refers to using all the training data from the four source domains combined with 100 dialogues sampled from the target domain’s training data for training, with testing conducted on the target domain. “Full” refers to using all the training data from the target domain as the training dataset. We only report the Combined scores for brevity.

Model         Att         Hotel       Res          Taxi        Train       Avg
UBARv2 †      59.7        77.7        91.0         84.1        38.0        87.4
CSTOD (zero)  60.8        48.6        44.8         95.3        67.4        63.4
CSTOD (few)   86.8 ± 3.1  99.3 ± 7.5  106.9 ± 4.1  98.7 ± 1.9  98.5 ± 5    100.6 ± 0.8
CSTOD (full)  116.6       108.9       116.4        118.8       112.1       114.5
Disclaimer/Publisher’s Note: The statements, opinions and data contained in all publications are solely those of the individual author(s) and contributor(s) and not of MDPI and/or the editor(s). MDPI and/or the editor(s) disclaim responsibility for any injury to people or property resulting from any ideas, methods, instructions or products referred to in the content.


