Using Large Language Models for Goal-Oriented Dialogue Systems
Abstract
1. Introduction
2. Literature Review
- We investigated seven large language models in Russian and English on intent mining, NER, response structure, resistance to typos, and possibility of local deployment.
- We presented two approaches to build an LLM-based dialogue agent: a heuristic approach with additional training on labeled dialogues and a general approach without additional training.
- We described a general iterative approach to build a dialogue agent using LLMs and RAG.
- We investigated the impact of the context on the intent mining procedure.
- We compared and evaluated two approaches to construct a dialogue graph using a locally deployed large language model.
3. Materials and Methods
- The English-language dataset MultiWOZ 2.2 (Multi-Domain Wizard-of-Oz) [49] contains text dialogues between people in seven different categories of service provision. The data under study consist of turn-by-turn dialogues between the USER and the SYSTEM with an average length of 14 responses; each turn represents one utterance of the user or the system. In total, the dataset contains 113,748 utterances.
- The multi-domain information-seeking dialogue MANtIS dataset [50] contains 80,000 information-seeking conversations. In total, the dataset contains 6701 labeled utterances.
- BERTScore [51]: an automatic metric for evaluating text generated by large language models, which computes token-level similarity between the generated text and the original labeled dialogue sentence using contextual embeddings;
- ROUGE [52]: a metric used to evaluate automatic summarization and machine translation;
- BLEU [53]: a metric that measures how closely a machine translation matches reference human translations of the same source sentence;
- METEOR [54]: a metric for assessing the quality of machine translation, based on n-gram matching between the generated text and the reference text.
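As a rough illustration of what n-gram overlap metrics such as BLEU measure, the sketch below computes the clipped (modified) n-gram precision of a candidate sentence against a single reference. It is a simplified example under stated assumptions (whitespace tokenization, one reference, no brevity penalty), not the evaluation code used in this study; the function names `ngrams` and `clipped_ngram_precision` are hypothetical.

```python
from collections import Counter


def ngrams(tokens, n):
    """Return the list of n-grams (as tuples) in a token sequence."""
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]


def clipped_ngram_precision(candidate, reference, n=1):
    """Modified n-gram precision: each candidate n-gram is credited at most
    as many times as it occurs in the reference."""
    # Assumption: lowercase whitespace tokenization stands in for a real tokenizer.
    cand_counts = Counter(ngrams(candidate.lower().split(), n))
    ref_counts = Counter(ngrams(reference.lower().split(), n))
    total = sum(cand_counts.values())
    if total == 0:
        return 0.0
    matched = sum(min(count, ref_counts[gram]) for gram, count in cand_counts.items())
    return matched / total


if __name__ == "__main__":
    print(clipped_ngram_precision("have a nice stay", "have a nice day", n=1))  # 0.75
```

Clipping prevents a candidate from being rewarded for repeating a word that appears only once in the reference, which is why "the the the the" scores 0.25 against "the cat".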
3.1. Generation of a Scenario Graph for a Goal-Oriented Dialogue System with Preservation of the Dialogue Context Based on an LLM with Additional Training on Labeled Dialogues
- Prepare training and test sets. In the training set, sentences from the MultiWOZ 2.2 dialogue corpus were grouped by three parameters (the step number, the intent of the previous message, and the intent of the current message). Each triple reflected the transition from one state to another, which allowed for a more accurate modeling of the interaction between participants during a dialogue.
- Construct the graph. At this stage, nodes were created from pairs [step number; cluster]. Each vertex contained the user’s message or intent that would be extracted when moving to this vertex. Edges were built from the information triples: each triple defined the incoming (target) node [step number; current cluster], the outgoing (source) node [step number − 1; previous cluster], and the edge connecting them, labeled [step number; current cluster]. No outgoing node was allowed to have two edges with the same label, whereas several edges with the same label could lead to the same incoming node. This ensured that the graph had unambiguous transition scenarios.
- During the dialogue, the user’s intent was selected, and a step was made along the edge with the selected intent.
- Having reached a certain vertex, the utterance assigned to this vertex was extracted.
Algorithm 1 Pseudo-code of heuristic-based approach

    Graph Construction
    Input: Training set
    Output: Graph
    Procedure BuildDialogueGraph(D)
        // Corpus preparation
        For each dialogue in D:
            Add start and end markers to the dialogue
        // Step numbering
        For each dialogue in D:
            For each message in the dialogue:
                Assign a step number to the message
        // Vectorization and clustering
        For each message in D:
            Convert the message into a vector
        Cluster vectors based on feature similarity and step number
        // Extracting information triplets
        For each cluster Ci:
            For each next cluster Cj:
                If (Ci step number + 1 = Cj step number) AND (any messages in Ci and Cj belong to the same dialogue):
                    Create an information triplet (Ci, Cj, Ci step number)
        // Building the graph
        For each triplet:
            Create vertices and edges in the graph based on the triplet
        Return the constructed graph G
    End procedure
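The graph-building part of Algorithm 1 can be sketched in a few lines of Python. This is a minimal illustration, not the authors' implementation: it assumes vectorization and clustering have already been done, so each dialogue arrives as a list of cluster (intent) labels, and the function name `build_dialogue_graph` and the `<start>`/`<end>` markers are hypothetical.

```python
from collections import defaultdict


def build_dialogue_graph(dialogues):
    """Build a scenario graph from dialogues given as lists of cluster labels.

    Nodes are (step number, cluster) pairs. An edge leaves the node of the
    previous message and is keyed by the cluster of the current message.
    """
    graph = defaultdict(dict)
    for labels in dialogues:
        # Start and end markers, mirroring the corpus-preparation step.
        path = ["<start>"] + list(labels) + ["<end>"]
        for step in range(1, len(path)):
            source = (step - 1, path[step - 1])
            target = (step, path[step])
            # Keying edges by label keeps at most one outgoing edge per label,
            # so transitions from a node stay unambiguous.
            graph[source][path[step]] = target
    return dict(graph)


if __name__ == "__main__":
    toy = [["greet", "book"], ["greet", "cancel"]]
    for node, edges in build_dialogue_graph(toy).items():
        print(node, "->", edges)
```

Storing outgoing edges in a dictionary keyed by label is one simple way to enforce the "no duplicate edges from one outgoing node" constraint, while still letting several source nodes point at the same target.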
- The standard version assumed the use of a graph whose edges represented a group [step number; cluster].
- The global heuristic approach used, in addition to edges formed by groups [step number; cluster], edges that indicated a global step. Global steps in the graph were created if there was no local transition for a given intent. This heuristic made it possible to create more scenarios and avoid abrupt endings of dialogues.
- With the tree heuristic approach, the dialogue graph was formed as a tree: after the first divergence, each branch continued to develop separately. This heuristic improved the accuracy of the answers given, while increasing the number of probable scenarios without an answer.
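The difference between the standard and the global heuristic can be illustrated with a small traversal sketch. The graph layout (a dictionary mapping a (step, cluster) node to its labeled outgoing edges) and the fallback rule are assumptions made for this example: the text only states that a global step is used when no local transition exists, so jumping to the earliest node with a matching cluster is a hypothetical choice, as are the names `next_node` and `EXAMPLE_GRAPH`.

```python
# Toy graph: node -> {intent label: target node}; nodes are (step, cluster).
EXAMPLE_GRAPH = {
    (0, "<start>"): {"greet": (1, "greet")},
    (1, "greet"): {"book": (2, "book")},
    (2, "book"): {"pay": (3, "pay")},
}


def next_node(graph, current, intent):
    """Follow the local edge for `intent`; otherwise try a global step."""
    local = graph.get(current, {})
    if intent in local:
        return local[intent]
    # Global step (assumed rule): jump to the earliest node whose cluster
    # matches the intent, wherever it appears in the graph.
    candidates = {
        target
        for edges in graph.values()
        for label, target in edges.items()
        if label == intent
    }
    return min(candidates) if candidates else None


if __name__ == "__main__":
    print(next_node(EXAMPLE_GRAPH, (1, "greet"), "book"))    # local transition
    print(next_node(EXAMPLE_GRAPH, (1, "greet"), "pay"))     # global step
    print(next_node(EXAMPLE_GRAPH, (1, "greet"), "refund"))  # no scenario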
3.2. Generation of a Scenario Graph for a Goal-Oriented Dialogue System with Preservation of the Dialogue Context Based on an LLM Without Additional Training on Labeled Dialogues
Algorithm 2 Pseudo-code of prompt-based approach

    Procedure dialogue_agent(Ut):
        # Initialization
        init_prompt = Pinit             # Define initialization prompt
        intent_mining_prompt = Pintent  # Define intent mining prompt
        is_end_of_dialog = Pend         # Define successful dialog completion prompt
        validate_response = Pval        # Define validation prompt
        Context = []                    # Define dialog context
        t = 1                           # Define iteration number
        # Loop until end of dialog intent is reached
        while True:
            # Intent extraction
            It = extract_intent(Ut, intent_mining_prompt)
            # End of dialog detection
            if is_end_of_dialog(It):
                St = "end"
                break
            # LLM output generation
            Rt = generate_response(It, Context)
            # Output validation
            R't = validate_response(Rt)
            # Update Context
            Context = {(Ui, Ri), i = 0, .., k}
            # Send response to user
            send_response(R't)
            # Update iteration number
            t++
1. The first step is to initialize the dialogue agent in the form of an assistant for user service. The dialogue agent initialization prompt Pinit in Russian/English is presented in Figure 4.
2. Having received a user utterance Ut, it is necessary to extract the intent using prompt Pintent from Figure 1a.
3. The next step is to check for the successful dialogue completion. Prompt Pend in Russian/English is presented in Figure 5a. If the dialogue completion is successful, the dialogue state becomes St = “end”. Otherwise, the following is necessary:
   - 3.1. The next step is to generate an output Rt based on the intent It and the current context of dialogue Context.
   - 3.2. After generating the LLM output Rt, it is necessary to validate this output. Prompt Pval in Russian/English is presented in Figure 5b.
   - 3.3. After validation, Context is updated, and the validated response R’t is sent back to the user.
   - 3.4. The iteration number t is increased by one.
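The iterative procedure above can be sketched as a short Python loop. This is an illustrative sketch only: the prompt strings are placeholders (the actual Pinit, Pend, and Pval prompts are given in the figures and are not reproduced here), `llm` is assumed to be any callable that maps a prompt string to a response string, and `run_dialogue` and `fake_llm` are hypothetical names. Unlike the pseudo-code, the sketch reads one user utterance per iteration from a list.

```python
def run_dialogue(utterances, llm):
    """Run the iterative agent loop over a sequence of user utterances.

    Returns the dialogue context as a list of (utterance, response) pairs.
    """
    # Placeholder prompts (assumptions), standing in for P_init, P_intent, P_end, P_val.
    p_init = "You are a customer-service assistant."
    p_intent = ('What is the user intent in the text? '
                'Describe it with one or two words: text: "{u}"')
    p_end = "Does the intent '{i}' mean the dialogue is finished? Answer yes or no."
    p_val = "Check the response and return a corrected version. response: {r}"

    context = []
    for utterance in utterances:
        # Step 2: intent extraction.
        intent = llm(p_intent.format(u=utterance))
        # Step 3: end-of-dialogue check.
        if llm(p_end.format(i=intent)).strip().lower() == "yes":
            break
        # Step 3.1: generate a response from the intent and the context.
        history = " ".join(f"USER: {u} SYSTEM: {r}" for u, r in context)
        draft = llm(f"{p_init}\nContext: {history}\nIntent: {intent}\nUser: {utterance}")
        # Step 3.2: validate the response.
        response = llm(p_val.format(r=draft))
        # Step 3.3: update the context (sending to the user is omitted here).
        context.append((utterance, response))
    return context


def fake_llm(prompt):
    """Deterministic stand-in for a real model, used only to exercise the loop."""
    if prompt.startswith("What is the user intent"):
        return "farewell" if "bye" in prompt.lower() else "booking"
    if prompt.startswith("Does the intent"):
        return "yes" if "farewell" in prompt else "no"
    if prompt.startswith("Check the response"):
        return prompt.split("response: ", 1)[1]
    return "Sure, I can help."


if __name__ == "__main__":
    print(run_dialogue(["I need a hotel", "Bye."], fake_llm))
```

Passing the model as a callable keeps the loop independent of any particular LLM client, so a locally deployed model or a remote API could be substituted without changing the control flow.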
4. Results
4.1. Comparison of Large Language Models
- good_a—number of responses indicating “The best answer is from the ChatGPT/LLAMA model”;
- good_b—number of responses indicating “The best answer is from the YandexGPT/MIXTRAL model”;
- both—number of responses indicating “Both models responded well”;
- none—number of responses indicating “Both models responded poorly”.
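For illustration, the four response categories above can be tallied into counts and shares with a few lines of Python. This is a hypothetical helper, not part of the study's tooling; the function name `tally_votes` and the input format (one label per assessor response) are assumptions.

```python
from collections import Counter

LABELS = ("good_a", "good_b", "both", "none")


def tally_votes(votes):
    """Return {label: (count, share)} for pairwise model-comparison votes."""
    counts = Counter(votes)
    unknown = set(counts) - set(LABELS)
    if unknown:
        raise ValueError(f"unexpected labels: {sorted(unknown)}")
    total = sum(counts.values())
    return {
        label: (counts[label], counts[label] / total if total else 0.0)
        for label in LABELS
    }


if __name__ == "__main__":
    print(tally_votes(["good_a", "both", "good_a", "none"]))
```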
4.2. Visualization of Dialogue Graphs with Intent Sequences
4.3. Examples of Dialogue Agent Responses
- Check-in date: December 15
- Check-out date: December 20
- Number of nights: 5
- Number of guests: 2
- Room type: Standard
- Additional services: Breakfast.
4.4. Evaluation of Two Methods for Constructing a Dialogue Graph
5. Discussion
6. Conclusions
Author Contributions
Funding
Institutional Review Board Statement
Informed Consent Statement
Data Availability Statement
Conflicts of Interest
Abbreviations
DAPT | Domain-adaptive pre-training |
GAN | Generative adversarial network |
HMM | Hidden Markov models |
LDA | Latent Dirichlet allocation |
LLM | Large language model |
LQR | Layered query retrieval |
NER | Named-entity recognition |
NLP | Natural language processing |
NMF | Non-negative matrix factorization |
OHS | Occupational health and safety |
RAG | Retrieval augmented generation |
TF-IDF | Term frequency–inverse document frequency |
TOD | Task-oriented dialogue |
References
- Sood, P.; Tanwar, H.; Singh, J.; Ruhela, A.K.; Gupta, N.; Kumar, R. Revolutionizing Customer Service: An AI-powered Chatbot Approach using Advanced NLP Techniques. In Proceedings of the 2024 3rd Edition of IEEE Delhi Section Flagship Conference (DELCON), New Delhi, India, 21–23 November 2024; pp. 1–5.
- Rustamov, S.; Bayramova, A.; Alasgarov, E. Development of dialogue management system for banking services. Appl. Sci. 2021, 11, 10995.
- Ngai, E.W.; Lee, M.C.; Luo, M.; Chan, P.S.; Liang, T. An intelligent knowledge-based chatbot for customer service. Electron. Commer. Res. Appl. 2021, 50, 101098.
- Addlesee, A.; Sieińska, W.; Gunson, N.; Garcia, D.H.; Dondrup, C.; Lemon, O. Multi-party Goal Tracking with LLMs: Comparing Pre-training, Fine-tuning, and Prompt Engineering. arXiv 2023, arXiv:2308.15231.
- Lahiri, A.; Sanyal, D.K.; Mukherjee, I. CitePrompt: Using Prompts to Identify Citation Intent in Scientific Papers. arXiv 2023, arXiv:2304.12730.
- Nambanoor Kunnath, S.; Pride, D.; Knoth, P. Prompting Strategies for Citation Classification. In Proceedings of the 32nd ACM International Conference on Information and Knowledge Management, Birmingham, UK, 21–25 October 2023.
- Chang, K.W.; Tseng, W.C.; Li, S.W.; Lee, H.Y. SpeechPrompt: An exploration of prompt tuning on generative spoken language model for speech processing tasks. arXiv 2022, arXiv:2203.16773.
- Dighe, P.; Su, Y.; Zheng, S.; Liu, Y.; Garg, V.; Niu, X.; Tewfik, A. Leveraging Large Language Models for Exploiting ASR Uncertainty. arXiv 2023, arXiv:2309.04842.
- Gao, N.; Zhao, Z.; Zeng, Z.; Zhang, S.; Weng, D.; Bao, Y. GesGPT: Speech Gesture Synthesis with Text Parsing from GPT. arXiv 2023, arXiv:2303.13013.
- Zhang, R.H.; Sell, P.; Zhang, Y.; Che, L.; Gao, A.; Sathiyajith, K.S.; Bhatt, R.; Nagasubramaniam, P.; Vummanthala, S.; Dave, S.; et al. EvoquerBot: A Multimedia Chatbot Leveraging Synthetic Data for Cross-Domain Assistance. Penn State University: University Park, PA, USA, 2023.
- Bragg, J.; Cohan, A.; Lo, K.; Beltagy, I. Flex: Unifying evaluation for few-shot NLP. Adv. Neural Inf. Process. Syst. 2021, 34, 15787–15800.
- Loukas, L.; Stogiannidis, I.; Malakasiotis, P.; Vassos, S. Breaking the Bank with ChatGPT: Few-Shot Text Classification for Finance. arXiv 2023, arXiv:2308.14634.
- Wang, P.; He, K.; Wang, Y.; Song, X.; Mou, Y.; Wang, J.; Xian, Y.; Cai, X.; Xu, W. Beyond the Known: Investigating LLMs Performance on Out-of-Domain Intent Detection. arXiv 2024, arXiv:2402.17256.
- Abdullin, Y.; Molla-Aliod, D.; Ofoghi, B.; Yearwood, J.; Li, Q. Synthetic Dialogue Dataset Generation using LLM Agents. arXiv 2024, arXiv:2401.17461.
- Ahmed, R.; Rauf, S.A.; Latif, S. Leveraging Large Language Models and Prompt Settings for Context-Aware Financial Sentiment Analysis. In Proceedings of the 2024 5th International Conference on Advancements in Computational Sciences (ICACS), Lahore, Pakistan, 19–20 February 2024; pp. 1–9.
- Grootendorst, M. BERTopic: Neural topic modeling with a class-based TF-IDF procedure. arXiv 2022, arXiv:2203.05794.
- Egger, R.; Yu, J. A Topic Modeling Comparison Between LDA, NMF, Top2Vec, and BERTopic to Demystify Twitter Posts. Front. Sociol. 2022, 7, 886498.
- Chang, Y.; Wang, X.; Wang, J.; Wu, Y.; Yang, L.; Zhu, K.; Chen, H.; Yi, X.; Wang, C.; Wang, Y.; et al. A survey on evaluation of large language models. ACM Trans. Intell. Syst. Technol. 2024, 15, 1–45.
- Liang, P.; Bommasani, R.; Lee, T.; Tsipras, D.; Soylu, D.; Yasunaga, M.; Zhang, Y.; Narayanan, D.; Wu, Y.; Kumar, A.; et al. Holistic evaluation of language models. arXiv 2022, arXiv:2211.09110.
- Zhang, T.; Ladhak, F.; Durmus, E.; Liang, P.; McKeown, K.; Hashimoto, T.B. Benchmarking large language models for news summarization. Trans. Assoc. Comput. Linguist. 2024, 12, 39–57.
- Zhao, H.; Chen, H.; Yang, F.; Liu, N.; Deng, H.; Cai, H.; Wang, S.; Yin, D.; Du, M. Explainability for large language models: A survey. ACM Trans. Intell. Syst. Technol. 2024, 15, 1–38.
- Wang, Z. CausalBench: A Comprehensive Benchmark for Evaluating Causal Reasoning Capabilities of Large Language Models. In Proceedings of the 10th SIGHAN Workshop on Chinese Language Processing (SIGHAN-10), Bangkok, Thailand, 16 August 2024; pp. 143–151.
- Huang, Y.; Song, J.; Wang, Z.; Zhao, S.; Chen, H.; Juefei-Xu, F.; Ma, L. Look before you leap: An exploratory study of uncertainty measurement for large language models. arXiv 2023, arXiv:2307.10236.
- Wang, C.; Liu, X.; Yue, Y.; Tang, X.; Zhang, T.; Jiayang, C.; Yao, Y.; Gao, W.; Hu, X.; Qi, Z.; et al. Survey on factuality in large language models: Knowledge, retrieval and domain-specificity. arXiv 2023, arXiv:2310.07521.
- Huang, Y.; Sun, L.; Wang, H.; Wu, S.; Zhang, Q.; Li, Y.; Gao, C.; Huang, Y.; Lyu, W.; Zhang, Y.; et al. Trustllm: Trustworthiness in large language models. arXiv 2024, arXiv:2401.05561.
- McIntosh, T.R.; Susnjak, T.; Arachchilage, N.; Liu, T.; Watters, P.; Halgamuge, M.N. Inadequacies of large language model benchmarks in the era of generative artificial intelligence. arXiv 2024, arXiv:2402.09880.
- Ilse, B.; Blackwood, F. Comparative analysis of finetuning strategies and automated evaluation metrics for large language models in customer service chatbots. Res. Sq. 2024, 1–18.
- Kaushal, A.; Lin, C.C.; Chauhan, R.; Kumar, R. Charting the Growth of Text Summarisation: A Data-Driven Exploration of Research Trends and Technological Advancements. Appl. Sci. 2024, 14, 11462.
- Falegnami, A.; Tomassi, A.; Corbelli, G.; Nucci, F.S.; Romano, E. A Generative Artificial-Intelligence-Based Workbench to Test New Methodologies in Organisational Health and Safety. Appl. Sci. 2024, 14, 11586.
- Chen, Q.; Zhou, W.; Cheng, J.; Yang, J. An Enhanced Retrieval Scheme for a Large Language Model with a Joint Strategy of Probabilistic Relevance and Semantic Association in the Vertical Domain. Appl. Sci. 2024, 14, 11529.
- Vaškevičius, M.; Kapočiūtė-Dzikienė, J. Language Models for Predicting Organic Synthesis Procedures. Appl. Sci. 2024, 14, 11526.
- Dai, Q.; Mao, Y.; Tang, J.; Rong, Y. STGPT2UGAN: Spatio-Temporal GPT-2 United Generative Adversarial Network for Wind Speed Prediction in Turbine Network. Appl. Sci. 2024, 14, 11217.
- Huang, J.; Wang, M.; Cui, Y.; Liu, J.; Chen, L.; Wang, T.; Li, H.; Wu, J. Layered Query Retrieval: An Adaptive Framework for Retrieval-Augmented Generation in Complex Question Answering for Large Language Models. Appl. Sci. 2024, 14, 11014.
- Bensch, C.; Müller, A.; Chojnowski, O.; Richert, A. Beyond Binary Dialogues: Research and Development of a Linguistically Nuanced Conversation Design for Social Robots in Group–Robot Interactions. Appl. Sci. 2024, 14, 10316.
- Smutny, P.; Bojko, M. Comparative Analysis of Chatbots Using Large Language Models for Web Development Tasks. Appl. Sci. 2024, 14, 10048.
- Benzinho, J.; Ferreira, J.; Batista, J.; Pereira, L.; Maximiano, M.; Távora, V.; Gomes, R.; Remédios, O. LLM Based Chatbot for Farm-to-Fork Blockchain Traceability Platform. Appl. Sci. 2024, 14, 8856.
- Liu, J.; Tan, Y.K.; Fu, B.; Lim, K.H. Intent-Aware Dialogue Generation and Multi-Task Contrastive Learning for Multi-Turn Intent Classification. arXiv 2024, arXiv:2411.14252.
- Gao, J.; Xiang, L.; Wu, H.; Zhao, H.; Tong, Y.; He, Z. An Adaptive Prompt Generation Framework for Task-oriented Dialogue System. In Findings of the Association for Computational Linguistics: EMNLP 2023; Association for Computational Linguistics: Stroudsburg, PA, USA, 2023; pp. 1078–1089.
- Cao, L. Diaggpt: An llm-based chatbot with automatic topic management for task-oriented dialogue. arXiv 2023, arXiv:2308.08043.
- Hu, Z.; Feng, Y.; Deng, Y.; Li, Z.; Ng, S.K.; Luu, A.T.; Hooi, B. Enhancing Large Language Model Induced Task-Oriented Dialogue Systems Through Look-Forward Motivated Goals. arXiv 2023, arXiv:2309.08949.
- Wu, S.; Shen, X.; Xia, R. A New Dialogue Response Generation Agent for Large Language Models by Asking Questions to Detect User’s Intentions. arXiv 2023, arXiv:2310.03293.
- Stacey, J.; Cheng, J.; Torr, J.; Guigue, T.; Driesen, J.; Coca, A.; Gaynor, M.; Johannsen, A. LUCID: LLM-Generated Utterances for Complex and Interesting Dialogues. arXiv 2024, arXiv:2403.00462.
- Li, H.; Yang, C.; Zhang, A.; Deng, Y.; Wang, X.; Chua, T.S. Hello Again! LLM-powered Personalized Agent for Long-term Dialogue. arXiv 2024, arXiv:2406.05925.
- Zhang, X.; Peng, B.; Li, K.; Zhou, J.; Meng, H. Sgp-tod: Building task bots effortlessly via schema-guided llm prompting. arXiv 2023, arXiv:2305.09067.
- Okadome, Y.; Yuguchi, A.; Fukui, R.; Matsumoto, Y. Prompt design using past dialogue summarization for llms to generate the current appropriate dialogue. In International Conference on Artificial Neural Networks; Springer Nature: Cham, Switzerland, 2024; pp. 33–41.
- Robino, G. Conversation Routines: A Prompt Engineering Framework for Task-Oriented Dialog Systems. arXiv 2025, arXiv:2501.11613.
- De Baer, J.; Doğruöz, A.S.; Demeester, T.; Develder, C. Single-vs. Dual-Prompt Dialogue Generation with LLMs for Job Interviews in Human Resources. arXiv 2025, arXiv:2502.18650.
- Huang, Q.; Liu, X.; Ko, T.; Wu, B.; Wang, W.; Zhang, Y.; Tang, L. Selective Prompting Tuning for Personalized Conversations with LLMs. arXiv 2024, arXiv:2406.18187.
- Zang, X.; Rastogi, A.; Sunkara, S.; Gupta, R.; Zhang, J.; Chen, J. Multiwoz 2.2: A dialogue dataset with additional annotation corrections and state tracking baselines. arXiv 2020, arXiv:2007.12720.
- Penha, G.; Balan, A.; Hauff, C. Introducing mantis: A novel multi-domain information seeking dialogues dataset. arXiv 2019, arXiv:1912.04639.
- Zhang, T.; Kishore, V.; Wu, F.; Weinberger, K.Q.; Artzi, Y. Bertscore: Evaluating text generation with bert. arXiv 2019, arXiv:1904.09675.
- Lin, C.Y. Rouge: A package for automatic evaluation of summaries. In Proceedings of the Text Summarization Branches Out, Barcelona, Spain, 25–26 July 2004; pp. 74–81.
- Papineni, K.; Roukos, S.; Ward, T.; Zhu, W.J. Bleu: A method for automatic evaluation of machine translation. In Proceedings of the 40th Annual Meeting of the Association for Computational Linguistics, Philadelphia, PA, USA, 6–12 July 2002; pp. 311–318.
- Banerjee, S.; Lavie, A. METEOR: An automatic metric for MT evaluation with improved correlation with human judgments. In Proceedings of the ACL Workshop on Intrinsic and Extrinsic Evaluation Measures for Machine Translation and/or Summarization, Ann Arbor, MI, USA, 29 June 2005; pp. 65–72.
- Lewis, P.; Perez, E.; Piktus, A.; Petroni, F.; Karpukhin, V.; Goyal, N.; Küttler, H.; Lewis, M.; Yih, W.; Rocktäschel, T.; et al. Retrieval-augmented generation for knowledge-intensive NLP tasks. Adv. Neural Inf. Process. Syst. 2020, 33, 9459–9474.
LLM | Dataset/Database | Language | RAG | Dialogue Generation | Metric | Ref. |
---|---|---|---|---|---|---|
LLaMA 2 | Knowledge database consisting of 222 potential user intents | German, English | + | User’s statement, dialogue history, user count, and language are fed to the LLM; GPT-3.5 is used to synthesize plural responses to provide linguistically nuanced responses | - | [34] |
Mixtral-8x7B-Instruct-v0.1, Mistral-7B-Instruct-v0.2, Meta-Llama-3-8B-Instruct, Google Gemma-2B | FAISS, ChromaDB | English, Portuguese | + | Regular expressions are applied to the text blocks to define data types; the user’s chat history is used to provide context to the LLM | Answer precision on a scale of 1–5 | [36] |
XLM-RoBERTa-base | e-commerce question-intent datasets | Portuguese, Indonesian, English/Malay, English/Filipino, English, Thai, traditional Chinese, Vietnamese | - | Multi-turn intent classification chain-of-intent method to generate intent-aware dialogues | Contrastive loss | [37] |
Chatgpt-3.5-turbo | MultiWOZ 2.0 | English | - | Adaptive prompt generation | Inform, success, BLEU | [38] |
gpt-4-0613, gpt-3.5-turbo, gpt-4-turbo-2024-04-09 | LLM-TOD dataset | English | - | Proactive question asking, users’ guidance, dialogue state maintenance | Round count, completion rate, response quality, comparison score | [39] |
GPT-3.5-turbo | MultiWOZ 2.1 | English | - | Proactively goal-driven LLM-induced approach, future dialogue actions and goal-oriented reward | Inform, success | [40] |
text-davinci-001, text-davinci-002, text-davinci-003, gpt-3.5-turbo | Context-open-question dataset | English | - | Question generation to generate a variety of questions related to the context of the dialogue; extra knowledge retrieval; enhance the LLM response | BLEU, ROUGE, human evaluation | [41] |
T5, Flan-T5 | LUCID dataset | English | - | Generation of intents, a conversational planner, turn-by-turn generation of conversations, and validation procedure | Intent accuracy, joint goal accuracy | [42] |
ChatGPT, ChatGLM, BlenderBot, BART | MSC and CC datasets | English | - | Historical event perception, dynamic persona extraction, response generation based on retrieved relevant memories | BLEU-N, ROUGE-L, METEOR, accuracy, human evaluation | [43] |
ChatGPT, GPT3.5 | MultiWOZ 2.0 and 2.2, RADDLE and STAR datasets | English | - | LLM to generate with user, DST prompter to retrieve database items, policy prompter to elicit proper responses adhering to the provided dialogue policy | Inform, success, BLEU, combined, BERTScore | [44] |
T5 | NUCC, Livedoor news summarization, dolly-15k-ja | Japanese | - | Dialogue summarization-based prompt design with context database | ROUGE-1, ROUGE-2, ROUGE-L, BERT score, Sentence-BERT | [45] |
OpenAI GPT-4o-mini | Train ticket booking system, interactive troubleshooting Copilot data | English, Italian | + | Conversation routine-based embedded business logic within LLM prompts | - | [46] |
GPT-4o, Llama 3.3 | - | English | - | Single-prompt dialogue generation, two-agent dialogue generation | Agreement rate | [47] |
OPT, LLaMA 2 | CONVAI2 | English | + | Selective prompt tuning-based dialogue generation | F1, BLEU, ROUGE-1, ROUGE-2, ROUGE-L | [48] |
LLaMA | MultiWOZ 2.2, MANtIS | Russian, English | + | Context-based, LLM-based iterative dialogue generation with and without additional training on labeled dialogues | BERTScore, BLEU, METEOR, human evaluation | Our research |
LLM | Intent Mining | NER | Clearly Structured Response | Resistance to Typos and Word Rearrangements in Prompts | Local Deployment |
---|---|---|---|---|---|
ChatGPT | −+ | + | +− | +− | − |
Mistral-Saiga | −+ | − | −+ | −+ | + |
GigaChat | −+ | − | − | − | − |
Yandex GPT | −+ | −+ | −+ | −+ | −+ |
Gemini | −+ | −+ | + | −+ | − |
LLaMA | −+ | −+ | + | + | + |
MIXTRAL | −+ | − | − | + | + |
Dialogue Utterances | Intent Sequence (Isolated Utterances) | Intent Sequence (Current Dialogue Context as an Input) | Intent Sequence (Whole Dialogue Context as an Input) |
---|---|---|---|
Prompt | What is the user intent in the text? Describe it with one or two words: text: “[input]” | What is the user intent in the text? Describe it with one or two words. Use the following pieces of context to answer the question. text: “[context]” text: “[input]” | What is the user intent in the text? Describe it with one or two words. Use the following pieces of context to answer the question. text: “[context]” text: “[input]” |
Guten tag, I am staying overnight in Cambridge and need a place to sleep. I need free parking and internet. | Accommodation inquiry | Accommodation request | Accommodation inquiry |
I have 4 different options for you. I have two cheaper guesthouses and two expensive hotels. Do you have a preference? | Preference gathering | Preference inquiry | Presenting options |
No, but I’d really like to be on the south end of the city. Do any of those fit the bill? | Location preference | Location query | Location preference |
Sure. Does price matter? We can narrow it down and find exactly what you need. | Clarification | Preference clarification | Inquiry and assistance |
No I don’t care about the price. Which one do you recommend? | Seek recommendation | Decision-making | Seeking recommendation |
I would recommend Aylesbray Lodge Guest House. Would you like me to book that for you? | Recommendation, booking | Booking assistance | Recommendation and booking |
Yes, book it for 4 people and 4 nights starting from Tuesday. | Reservation | Confirm a booking | Booking request |
The booking was unsuccessful. Would you like another day or a shorter stay? | Options | Adjust booking | Booking adjustment |
How about for 3 nights? | Booking accommodation | Modification | Booking confirmation |
Booked! Reference number is: 84ESP6F5 | Confirmation | Booking accommodation | Confirmation/notification |
Great. I am all set then. Have a nice day. Bye. | Closure | Farewell | Confirmation and farewell |
Have a nice stay. Bye. | Closure | Farewell | Confirmation and farewell |
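The three prompting modes compared in the table above can be sketched as a small prompt builder. The prompt wording is taken from the table; everything else is an assumption made for illustration: "current dialogue context" is interpreted here as the utterances preceding the current one, "whole dialogue context" as the full dialogue, and the names `build_intent_prompt`, `ISOLATED`, and `WITH_CONTEXT` are hypothetical.

```python
ISOLATED = (
    "What is the user intent in the text? "
    'Describe it with one or two words: text: "{input}"'
)
WITH_CONTEXT = (
    "What is the user intent in the text? "
    "Describe it with one or two words. "
    "Use the following pieces of context to answer the question. "
    'text: "{context}" text: "{input}"'
)


def build_intent_prompt(dialogue, index, mode="isolated"):
    """Build the intent-mining prompt for the utterance at `index`.

    mode: "isolated" (no context), "current" (preceding utterances only),
    or "whole" (all utterances of the dialogue).
    """
    utterance = dialogue[index]
    if mode == "isolated":
        return ISOLATED.format(input=utterance)
    if mode == "current":
        context = " ".join(dialogue[:index])  # assumed meaning of "current context"
    elif mode == "whole":
        context = " ".join(dialogue)          # assumed meaning of "whole context"
    else:
        raise ValueError(f"unknown mode: {mode}")
    return WITH_CONTEXT.format(context=context, input=utterance)


if __name__ == "__main__":
    turns = ["How about for 3 nights?", "Booked! Reference number is: 84ESP6F5"]
    print(build_intent_prompt(turns, 1, mode="current"))
```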
Dataset | Model | BERTScore | BLEU | METEOR |
---|---|---|---|---|
MultiWOZ 2.2 | LLaMA with fine-tuning on dialogues | 0.85 | 0.22 | 0.17 |
MultiWOZ 2.2 | LLaMA without fine-tuning on dialogues | 0.75 | 0.60 | 0.15 |
MANtIS | LLaMA with fine-tuning on dialogues | 0.82 | 0.24 | 0.20 |
MANtIS | LLaMA without fine-tuning on dialogues | 0.72 | 0.62 | 0.14 |
© 2025 by the authors. Licensee MDPI, Basel, Switzerland. This article is an open access article distributed under the terms and conditions of the Creative Commons Attribution (CC BY) license (https://creativecommons.org/licenses/by/4.0/).
Legashev, L.; Shukhman, A.; Badikov, V.; Kurynov, V. Using Large Language Models for Goal-Oriented Dialogue Systems. Appl. Sci. 2025, 15, 4687. https://doi.org/10.3390/app15094687