Using Large Language Models for Goal-Oriented Dialogue Systems

Legashev, Leonid; Shukhman, Alexander; Badikov, Vadim; Kurynov, Vladislav

doi:10.3390/app15094687


by Leonid Legashev *, Alexander Shukhman, Vadim Badikov and Vladislav Kurynov

Research Institute of Digital Intelligent Technologies, Orenburg State University, Pobedy Pr. 13, Orenburg 460018, Russia

* Author to whom correspondence should be addressed.

Appl. Sci. 2025, 15(9), 4687; https://doi.org/10.3390/app15094687

Submission received: 3 March 2025 / Revised: 16 April 2025 / Accepted: 22 April 2025 / Published: 23 April 2025



Featured Application

By leveraging pretrained large language models and the RAG technique, this study enhances the scenario graph generation process. The findings highlight the potential of AI approaches in goal-oriented dialogue systems that preserve the dialogue context.

Abstract

In the development of goal-oriented dialogue systems, neural network topic modeling and clustering methods are traditionally used to extract user intentions and operator response scenario blocks. The emergence of generative large language models makes it possible to radically change the approach to generating dialogue scenarios in the form of a graph with context preservation. In this article, we analyzed seven popular large language models on prepared test prompts in Russian and English for intent mining and named entity recognition. The study investigated the effectiveness of two methods for constructing dialogues in goal-oriented dialogue systems: a heuristic-based approach with additional training on labeled data and a prompt-based approach without such training. The primary objective was to evaluate the impact of incorporating labeled dialogue data on the quality of constructed dialogues, with a focus on dialogue context. The study emphasizes that dialogue systems need to consider the dialogue context when constructing goal-oriented dialogues. The two approaches were compared on the MultiWOZ 2.2 and MANtIS dialogue corpora using a locally deployed LLaMA model. Without training on labeled dialogues, the LLaMA model achieved a BERTScore of 0.75 on the MultiWOZ dataset and 0.72 on the MANtIS dataset; with training on labeled dialogues, it achieved a BERTScore of 0.85 on the MultiWOZ dataset and 0.82 on the MANtIS dataset. This finding has practical implications for developing more effective customer service dialogue systems that can engage users in more productive and meaningful machine-to-human interactions.

Keywords:

large language models; natural language processing; prompt engineering; dialogue graph

1. Introduction

The field of artificial intelligence related to natural language processing (NLP) has received a new round of development due to the emergence of generative large language models (LLMs). The main purpose of large language models is to predict text tokens based on the input. One of the most popular areas of research is the development of dialogue systems capable of conducting an intelligent dialogue with users, performing many routine tasks related to the provision of information and services [1,2,3]. As a rule, scenarios for dialogue systems are developed manually by constructing various variations of dialogue trees depending on the user’s intents. At the same time, the manually developed dialogue system is often unable to respond to any user request that differs from the one in the database. In this case, the system refers the user to read the documentation or offers a standard text placeholder, which negatively affects the quality of dialogue perception. Modern machine learning methods can be used to automatically generate scenarios for goal-oriented dialogue systems based on the analysis of available dialogue data to determine the “weaknesses” in the dialogue structure. The key problem is to construct a dialogue scenario graph based on the analysis of user intents within the structure of an arbitrary dialogue in text format.

The rest of this article is organized as follows. Section 2 provides a literature review on the use of large language models in dialogue generation. Section 3 provides the materials and methods to build dialogue agents based on prompt engineering. Section 4 provides a comparison of seven large language models for English and Russian, a visualization of dialogue graphs with intent sequences, examples of dialogue agent responses, and an evaluation of two methods for constructing a dialogue graph. Section 5 discusses the results.

2. Literature Review

The development of modern generative transformer-based models has led to the concept of prompt engineering, which consists of processing text information in such a way that artificial intelligence models understand it. In particular, prompt engineering can be used to extract user intents. Addlesee et al. [4] use prompt engineering in conjunction with the GPT-3.5-turbo family of language models to classify intents. In publications [5,6], prompts are used to identify citation intents in the texts of scientific articles. Chang K. W. et al. [7] describe a paradigm for setting up prompts to optimize the parameters of generative language models. Dighe et al. [8] use prompt engineering to classify user intents in conversational speech understanding tasks. Gao et al. [9] use prompt engineering to transform the problem of generating gestures contextually corresponding to speech or text input into the problem of intent classification. Zhang et al. [10] describe the development of a multimedia chatbot EvoquerBot that solves a variety of problems, including intent classification. Bragg et al. [11] describe UniFew, a prompt-based model that combines pre-training and fine-tuning of prompt formats to train models with a limited amount of labeled data. Loukas et al. [12] use prompt engineering technique to classify texts in the financial sector domain. Wang et al. [13] study the use of large language models in goal-oriented dialogue systems to check whether the user’s intent goes beyond the domain of the dialogue system. Abdullin et al. [14] use large language model agents to synthesize dialogue datasets and assess the quality of the generated data. Ahmed et al. [15] compare four language models for extracting sentiment and contextual information from text based on trained prompts. Neural topic modeling is used to cluster arbitrary text corpora into “topics” and can be applied to user intent extraction and dialogue scenario block tasks. 
Grootendorst [16] presents a neural topic modeling technique, BERTopic, which generates vector representations of documents using pre-trained language models and forms topic representations using the term frequency–inverse document frequency (TF-IDF) procedure based on classes. Egger et al. [17] evaluate the performance of four topic modeling methods: latent Dirichlet allocation (LDA), non-negative matrix factorization (NMF), Top2Vec, and BERTopic on Twitter data analysis. Most papers focus on intent classification, text analysis, or sentiment extraction; prompt engineering is used in conjunction with generative language models; and several papers use neural topic modeling to analyze text corpora and extract user intents. Papers employ various techniques, such as LDA, NMF, Top2Vec, and TF-IDF for topic modeling and intent classification. Papers use different model architectures, like GPT-3.5-turbo, BERTopic, and UniFew, which vary in their design and training objectives. The text highlights the increasing importance of prompt engineering in natural language processing and showcases the application of prompt engineering and NLP techniques to various domains.

The performance of large language models is closely tied to how the output quality of such models is evaluated. When comparing the performance of several LLMs, robustness, ethics, bias, and inference validity need to be taken into account. Chang et al. [18] perform an extensive review on the topic of assessing the quality of LLMs and conclude that current LLMs exhibit certain limitations in a variety of tasks, in particular, in reasoning and inference robustness tasks. Liang et al. [19] present a multi-metric approach to holistically evaluate LLMs based on measuring seven metrics (accuracy, calibration, robustness, validity, bias, toxicity, and efficiency). Zhang et al. [20] perform a comprehensive human evaluation of ten LLMs on two popular news summarization benchmarks, CNN/DM and XSUM, and conclude that current LLMs show excellent results on this task. Zhao et al. [21] present a structured review of transformer-based explanation methods for large language models, examining such metrics for assessing the quality of large language models as comprehensiveness (COMP), sufficiency (SUFF), decision flip - fraction of tokens (DFFOT), and decision flip - most informative token (DFMIT). The authors of [22] present a multivariate method for assessing the causal reasoning of large language models in the form of the CausalBench benchmark. Huang et al. [23] investigate methods for assessing the uncertainty of the inference of large language models using two datasets, HumanEval and MBPP, as benchmarks. Wang et al. [24] examine the issue of the factuality of large language models in terms of the reliability and accuracy of their inferences, highlighting the following metrics: Brier score, MC1 and MC2, BLEU, ROUGE, METEOR, ADEM, BERTScore, BLEURT, and BARTScore. Huang et al. [25] present a method, TrustLLM, for assessing the trustworthiness of large language models, which examines LLMs across six criteria: truthfulness, security, fairness, reliability, privacy, and machine ethics. McIntosh et al. [26] perform a critical analysis of twenty-three modern LLMs, identifying significant limitations including bias, difficulties in measuring genuine reasoning, adaptability, and ignoring cultural and ideological norms in a single comprehensive assessment. Ilse et al. [27] present a domain-adaptive pre-training (DAPT) approach to assess the accuracy, reliability, and response quality levels of large language models in the field of customer service chatbot development. Most of these papers focus on evaluating the performance of LLMs and discuss the limitations and challenges of current LLMs in various tasks and metrics. They vary in focus: some address specific tasks (e.g., news summarization or customer service chatbot development), while others take a more general approach to evaluating LLMs. They also employ various evaluation methods, including human evaluation, multi-metric approaches, and structured reviews.

With the rapid development of techniques in LLM research, the application of such models to a wide range of problems is being explored. Kaushal et al. [28] explore current trends for text summarization tasks to efficiently process large volumes of text data. Falegnami et al. [29] integrate LLMs and synthetic data generation tools to create a dynamic modeling and evaluation platform for new occupational health and safety (OHS) methodologies. Chen et al. [30] develop the BM25 algorithm-based retrieval-augmented generation (RAG) attention mechanism to construct a vector database of the text knowledge on ocean fronts and eddies. Vaškevičius et al. [31] investigate several transformer models to generate synthesis procedures for a wide range of organic synthesis reactions, and a fine-tuned molT5-large model shows the best results with the obtained BLEU metric value of 47.75. Dai et al. [32] develop a novel Spatio-Temporal GPT-2 United Generative Adversarial Network architecture to predict the future wind speed sequences using a generative adversarial network (GAN) to improve the training of the proposed model. Huang et al. [33] introduce a novel adaptive RAG framework named Layered Query Retrieval (LQR) focusing on query complexity classification, retrieval strategies, and relevance analysis, showing high accuracy and efficiency on the HotpotQA dataset. Bensch et al. [34] develop an adaptive conversation design system for managing multi-party interactions that uses a robust person detection system based on the YOLOv5m model. Smutny et al. [35] compare the performance of five LLM-based chatbots (ChatGPT 3.5, Copilot, Gemini, TabNine AI, and BlackBox) in handling web development tasks, assessing them by an average weighted score based on seven criteria. All these papers are related to the application of LLMs to various tasks, such as text summarization, data generation, retrieval-augmented generation, and conversation design. 
Papers span multiple domains, including health and safety, oceanography, organic synthesis, wind prediction, and web development.

There are several state-of-the-art studies on dialogue generation using large language models. Benzinho et al. [36] develop an LLM-based conversational agent using RAG to aggregate non-textual media with the embeddings to enrich the user experience. Liu et al. [37] use LLM-enhanced hidden Markov models (HMMs) to sample a chain of intents and generate intent-aware multiturn dialogues. Gao et al. [38] present an adaptive framework for dynamic prompt generation for task-oriented dialogue. Cao [39] presents an innovative DiagGPT approach that extends the capabilities of the LLM to enable efficient management of the status of all topics throughout the development of a dialogue. Hu et al. [40] present a proactively goal-driven LLM-induced task-oriented dialogue (ToD) approach that anticipates future dialogue actions and the reward for achieving the dialogue goal. Wu et al. [41] present an EDIT framework for the dialogue generation task, solving the problem of generating plain LLM responses. Stacey et al. [42] present an LLM-driven data generation system to create high-quality dialogue datasets. Li et al. [43] present the Long-term Dialogue Agent (LD-Agent), a model-agnostic framework with event perception, persona extraction, and response generation for appropriate long-term dialogue responses. Zhang et al. [44] present schema-guided prompting to build ToD systems with schema-guided LLM prompting. Okadome et al. [45] propose a dialogue generation prompt design based on a summarization model. Robino [46] presents a structured LLM-based prompt engineering framework for developing task-oriented dialogue systems that embeds task-oriented logic within LLM prompts. De Baer et al. [47] conduct a comparison of two LLM-based dialogue generation methods with a judge LLM-based evaluation of dialogue quality. Huang et al. [48] propose a selective prompt tuning (SPT) framework for personalized dialogue generation by training a dense retriever with dynamic soft prompt selection. 
Most papers propose new approaches or frameworks for improving dialogue generation. Many papers discuss the use of LLMs in conjunction with other techniques, such as hidden Markov models (HMMs), schema-guided prompting, or event perception. Several papers discuss the importance of prompt engineering in dialogue generation, with some proposing new frameworks for designing effective prompts. Papers demonstrate the potential of LLMs in dialogue generation, with many proposing new approaches for improving the quality and effectiveness of LLM-based dialogue systems.

Table 1 aggregates the results obtained in the studies on LLM-based dialogue generation.

The analysis of state-of-the-art publications shows the relevance of the research and the great interest in leveraging large language models and RAG to achieve high accuracy in machine-to-human conversations. It should be noted that many of the authors focus on English language studies for either locally deployed models or online accessible models. This study aimed to bridge this research gap by exploring the application of RAG technology and dialogue context to build a dialogue agent in Russian and English. Novel aspects of this study are as follows:

−: We investigated seven large language models in Russian and English on intent mining, NER, response structure, resistance to typos, and possibility of local deployment.
−: We presented two approaches to build an LLM-based dialogue agent: a heuristic approach with additional training on labeled dialogues and a general approach without additional training.

Our main contributions in this article are as follows:

−: We described a general iterative approach to build a dialogue agent using LLMs and RAG.
−: We investigated the impact of the context on the intent mining procedure.
−: We compared and evaluated two approaches to construct a dialogue graph using a locally deployed large language model.

3. Materials and Methods

In this study, we evaluated and compared the performance of seven LLMs: ChatGPT (https://chatgpt.com/ (accessed on 20 August 2024)), Mistral (https://mistral.ai/ (accessed on 20 August 2024)), GigaChat (https://giga.chat/ (accessed on 20 August 2024)), Yandex GPT (https://ya.ru/ai/gpt (accessed on 20 August 2024)), LLaMA (https://www.llama.com/ (accessed on 20 August 2024)), MIXTRAL (https://mistral.ai/ (accessed on 20 August 2024)), and Google Gemini (https://gemini.google.com/ (accessed on 20 August 2024)). The ChatGPT 3.5 language model was developed by OpenAI based on a fine-tuned GPT-3 model, which contains 175 billion parameters. The Mistral-Saiga language model is based on the Mistral-7B-v0.1 large language model, which is a pre-trained generative text model with 7 billion parameters. The Russian language model GigaChat is based on the ruGPT-3.5-13B large language model with 13 billion parameters. The Yandex GPT language model (YaLM 100B) uses 100 billion parameters. The Google Gemini language model was obtained by fine-tuning the PaLM model, which contains 540 billion parameters. The LLaMA and MIXTRAL language models each contain 7 billion parameters. The results of the comparison of the seven models are presented in Section 4.1.

We used the prompt engineering technique to prepare a limited number of text prompts to classify user intent and perform named-entity recognition (NER) with an LLM. We used the English and Russian prompts presented in Figure 1.

We evaluated the effectiveness of the proposed approaches on the following datasets:

The English-language dataset MultiWOZ 2.2 (Multi-Domain Wizard-of-Oz) [49] contains text dialogues between people in seven different categories of service provision. The data under study contain a turn-by-turn dialogue of an average length of 14 responses between the USER and the SYSTEM, with each turn of the dialogue representing one utterance of the user or the system. In total, the dataset contains 113,748 utterances.
The multi-domain information-seeking dialogue MANtIS dataset [50] contains 80,000 information-seeking conversations. In total, the dataset contains 6701 labeled utterances.

Using a locally deployed LLM, we compared two methods for constructing a dialogue scenario graph. In both methods, a large language model is configured through prompts to act as an iterative dialogue manager with context preservation. The first method uses no training on labeled dialogues, whereas the second adds training on labeled dialogues from the MultiWOZ 2.2 and MANtIS datasets.

The quality assessment of the large language models was performed based on the following:

BERTScore [51]: an automatic evaluation metric for text generation, which computes the similarity between the generated text and the original labeled dialogue sentence using contextual embeddings of their tokens;
ROUGE [52]: a metric used to evaluate automatic summarization and machine translation;
BLEU [53]: a metric that scores a machine translation against reference human translations of the same source sentence;
METEOR [54]: a metric for assessing the quality of machine translation, based on matching n-grams between the machine output and a reference text.
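To make the token-matching idea behind BERTScore concrete, the following is a minimal sketch of its greedy matching step. It assumes the contextual token embeddings are already available; the two-dimensional vectors below are toy values rather than real model outputs, and the function names are illustrative. The optional inverse document frequency weighting and baseline rescaling of the published metric are omitted.

```python
import math

def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0

def greedy_match_f1(candidate_vecs, reference_vecs):
    """Return (precision, recall, f1) using greedy cosine matching of tokens."""
    if not candidate_vecs or not reference_vecs:
        return 0.0, 0.0, 0.0
    # Precision: each candidate token is matched to its most similar reference token.
    precision = sum(
        max(cosine(c, r) for r in reference_vecs) for c in candidate_vecs
    ) / len(candidate_vecs)
    # Recall: each reference token is matched to its most similar candidate token.
    recall = sum(
        max(cosine(r, c) for c in candidate_vecs) for r in reference_vecs
    ) / len(reference_vecs)
    total = precision + recall
    f1 = 2 * precision * recall / total if total else 0.0
    return precision, recall, f1

# Toy embeddings (assumption): one vector per token.
reference = [[1.0, 0.0], [0.0, 1.0]]
candidate = [[1.0, 0.0], [1.0, 1.0]]
p, r, f = greedy_match_f1(candidate, reference)
```

In practice the vectors would come from a pretrained encoder, and a maintained implementation of the metric would normally be preferred over this sketch.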

3.1. Generation of a Scenario Graph for a Goal-Oriented Dialogue System with Preservation of the Dialogue Context Based on an LLM with Additional Training on Labeled Dialogues

The general algorithm was as follows:

1. Prepare training and test sets. In the training set, sentences from the MultiWOZ 2.2 dialogue corpus were grouped by three parameters (the step number, the intent of the previous message, and the intent of the current message). Each triple reflected the transition from one state to another, which allowed for a more accurate modeling of the interaction between participants during a dialogue.
2. Construct the graph. At this stage, nodes were created from pairs of [step number; cluster]. The vertices contained the user's message or intent that would be extracted when moving to this vertex. Edges were built on the basis of information triples: for each triple, information was formed about the incoming node [step number; current cluster], the outgoing node [step number-1; previous cluster], and the edge connecting them [step number; current cluster]. The graph must not contain duplicate edges with the same pair leaving one outgoing node, although several edges with the same pair may enter one incoming node. This ensured that the graph would have unambiguous transition scenarios.
3. During the dialogue, the user's intent was selected, and a step was made along the edge with the selected intent.
4. Having reached a certain vertex, the utterance assigned to this vertex was extracted.

The proposed heuristic-based approach is summarized in Algorithm 1.

Algorithm 1 Pseudo-code of heuristic-based approach

Graph Construction
Input: Training set
Output: Graph
Procedure BuildDialogueGraph(D)
        // Corpus preparation
        For each dialogue in D:
              Add start and end markers to the dialogue

        // Step numbering
        For each dialogue in D:
              For each message in the dialogue:
                  Assign a step number to the message

        // Vectorization and clustering
        For each message in D:
              Convert the message into a vector
        Cluster vectors based on feature similarity and step number

        // Extracting information triplets
        For each cluster C_i:
              For each next cluster C_j:
                    If (C_i step number + 1 = C_j step number) AND
                        (any messages in C_i and C_j belong to the same dialogue):
                           Create an information triplet (C_i, C_j, C_i step number)

        // Building the graph
        For each triplet:
              Create vertices and edges in the graph based on the triplet

        Return the constructed graph G
End procedure
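The graph construction can be sketched in Python as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: intent labels stand in for the clusters obtained by vectorization, the input format (a list of dialogues, each a list of (intent, utterance) pairs) is hypothetical, and the names build_dialogue_graph, "<start>", and "<end>" are illustrative.

```python
from collections import defaultdict

def build_dialogue_graph(dialogues):
    """Build a step-indexed transition graph from labeled dialogues.

    dialogues: list of dialogues, each a list of (intent, utterance) pairs.
    Returns (edges, utterances), where edges maps a source node to
    {intent: target node} and a node is a (step number, intent) pair.
    """
    edges = defaultdict(dict)
    utterances = {}
    for turns in dialogues:
        prev_node = (0, "<start>")  # start marker added to every dialogue
        for step, (intent, utterance) in enumerate(turns, start=1):
            node = (step, intent)
            # At most one edge per (source node, intent) keeps transitions unambiguous.
            edges[prev_node].setdefault(intent, node)
            # Keep the first utterance seen for each node as its response.
            utterances.setdefault(node, utterance)
            prev_node = node
        # End marker closes the dialogue.
        edges[prev_node].setdefault("<end>", (len(turns) + 1, "<end>"))
    return dict(edges), utterances

# Hypothetical labeled dialogues used only for illustration.
dialogues = [
    [("find_hotel", "I need a hotel."), ("ask_area", "Which area?"),
     ("give_area", "The centre.")],
    [("find_hotel", "Looking for a place to stay."), ("ask_price", "What price range?")],
]
edges, utterances = build_dialogue_graph(dialogues)
```

Both example dialogues share the node (1, "find_hotel"), so the graph branches there into two step-2 nodes, which mirrors how several dialogues are merged into one scenario graph.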

We implemented three different variations of the dialogue graph with different approaches:

The standard version assumed the use of a graph whose edges represented a group [step number; cluster].
With the global heuristic approach, not only were edges formed by groups [step number; cluster], but also edges that indicated a global step were used. Such a heuristic allowed one to create more scenarios and avoid abrupt endings of dialogues. Global steps in the graph were created if there was no local transition for a given intent.
With the tree heuristic approach, the dialogue graph was formed as a tree. Tree heuristics assumed continuation of separate development after the first disconnection. Such a heuristic allowed one to improve the accuracy of the answers given by increasing the probable scenarios without an answer.
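One possible reading of the global heuristic is a fallback lookup: follow a local edge when it exists, otherwise jump to a node elsewhere in the graph that carries the requested intent. The sketch below illustrates that reading only. The edge format, the choice of the earliest matching node, and the name next_node are assumptions, since the text does not specify how global steps are selected.

```python
def next_node(edges, node, intent):
    """Return the node reached from `node` for `intent`, or None.

    edges maps a source node to {intent: target node}; nodes are
    (step number, intent) pairs.
    """
    local = edges.get(node, {})
    if intent in local:
        return local[intent]  # standard local transition
    # Global step (assumed behaviour): reuse any node in the graph labeled
    # with this intent, preferring the earliest step.
    candidates = sorted(
        target
        for targets in edges.values()
        for label, target in targets.items()
        if label == intent
    )
    return candidates[0] if candidates else None

# Hypothetical graph used only for illustration.
edges = {
    (0, "<start>"): {"find_hotel": (1, "find_hotel")},
    (1, "find_hotel"): {"ask_area": (2, "ask_area")},
    (2, "ask_area"): {"give_area": (3, "give_area")},
}
local_step = next_node(edges, (1, "find_hotel"), "ask_area")
global_step = next_node(edges, (1, "find_hotel"), "give_area")
no_step = next_node(edges, (1, "find_hotel"), "cancel")
```

With only local edges, the second lookup would end the dialogue without a response; the fallback is what lets the graph avoid such abrupt endings.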

Twenty dialogues were randomly selected from the MultiWOZ 2.2 dataset, and the dialogue graph was visualized for each of the three approaches described above. The visualization of the three approaches is shown in Figure 2. The vertices characterize the intent of each utterance, and the edges characterize the transitions between them. The column number characterizes the dialogue turn. The number of vertices in each column corresponds to the number of intents at each turn of the dialogue. The BLEU and BERTScore results on the LLaMA model showed that the dialogue graph with the global heuristic performed best and the graph with the tree heuristic performed worst. Based on these results, we concluded that the lack of a response in a dialogue was the critical factor.

The large number of vertices at the last step of the global heuristic method serves to handle cases in which the graph has no transition from step i to step i + 1. Visualization of some dialogue graphs is presented in Section 4.2. The results of the comparison of the two methods are presented in Section 4.4.

3.2. Generation of a Scenario Graph for a Goal-Oriented Dialogue System with Preservation of the Dialogue Context Based on an LLM Without Additional Training on Labeled Dialogues

The basic idea of an LLM-based dialogue agent is shown in Figure 3. RAG technology [55] is used to add relevant information to the context to build a model response to the user. In this way, RAG improves the quality of the output of a large language model.
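The retrieval step of RAG can be illustrated with a small self-contained sketch. It is not the retriever used in the study, which the text does not describe: a bag-of-words cosine similarity stands in for dense embeddings and a vector store, and the knowledge snippets, prompt layout, and function names are hypothetical.

```python
import math
import re
from collections import Counter

def bag_of_words(text):
    """Lowercased word counts; a simple stand-in for an embedding."""
    return Counter(re.findall(r"\w+", text.lower()))

def cosine(a, b):
    """Cosine similarity between two word-count vectors."""
    dot = sum(a[t] * b[t] for t in a if t in b)
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(sum(v * v for v in b.values()))
    return dot / norm if norm else 0.0

def retrieve(query, documents, k=1):
    """Return the k documents most similar to the query."""
    q = bag_of_words(query)
    ranked = sorted(documents, key=lambda d: cosine(q, bag_of_words(d)), reverse=True)
    return ranked[:k]

def build_prompt(query, documents, k=1):
    """Prepend the retrieved snippets to the user query."""
    context = "\n".join(retrieve(query, documents, k))
    return f"Context:\n{context}\n\nUser: {query}\nAssistant:"

# Hypothetical knowledge base used only for illustration.
documents = [
    "Breakfast is served from 7 to 10 am.",
    "Check-out time is 12 noon.",
    "The pool is open daily.",
]
query = "When is breakfast served?"
prompt = build_prompt(query, documents)
```

The resulting prompt string would then be passed to the language model, so that the answer can draw on the retrieved snippet rather than on the model's parameters alone.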

The proposed prompt-based approach is summarized in Algorithm 2.

Algorithm 2 Pseudo-code of prompt-based approach

Procedure dialogue_agent(U_t):
    # Initialization
    init_prompt = P^init                # Define initialization prompt
    intent_mining_prompt = P^intent     # Define intent mining prompt
    is_end_of_dialog = P^end            # Define successful dialog completion prompt
    validate_response = P^val           # Define validation prompt
    Context = []                        # Define dialog context
    t = 1                               # Define iteration number

    # Loop until end of dialog intent is reached
    while True:
        # Intent extraction
        I_t = extract_intent(U_t, intent_mining_prompt)

        # End of dialog detection
        if is_end_of_dialog(I_t):
            S_t = "end"
            break

        # LLM output generation
        R_t = generate_response(I_t, Context)

        # Output validation
        R^’_t = validate_response(R_t)

        # Update Context
        Context = {(U_i, R_i), i = 0, .., k}

        # Send response to user
        send_response(R^’_t)

        # Update iteration number
        t++

The general algorithm of an LLM-based dialogue agent is as follows:

1. The first step is to initialize the dialogue agent in the form of an assistant for user service. The dialogue agent initialization prompt P^init in Russian/English is presented in Figure 4.

2. Having received a user utterance U_t, it is necessary to extract the intent using prompt P^intent from Figure 1a.

3. The next step is to check for the successful dialogue completion. Prompt P^end in Russian/English is presented in Figure 5a. If the dialogue completion is successful, the dialogue state becomes S_t = “end”. Otherwise, the following is necessary:

3.1. The next step is to generate an output R_t based on the intent I_t and the current context of dialogue Context.
3.2. After generating the LLM output R_t, it is necessary to validate this output. Prompt P^val in Russian/English is presented in Figure 5b.
3.3. After validation, Context is updated, and the validated response R^’_t is sent back to the user.
3.4. The iteration number t is increased by one.

All steps are repeated until the dialogue end intent is reached in step 3. Examples of LLM responses are provided in Section 4.3. The results of comparison of the two methods are presented in Section 4.4.
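The iterative loop of the prompt-based approach can be sketched in Python as follows. This is a minimal illustration under stated assumptions rather than the authors' code: the language model is abstracted as a callable that maps a prompt string to a response string, the prompt texts are short placeholders for P^init, P^intent, P^end, and P^val, and fake_llm is a hypothetical stub that makes the example runnable without a model.

```python
def run_dialogue(user_turns, llm, prompts):
    """Run the agent loop over a sequence of user utterances.

    llm: callable(str) -> str. prompts: dict with keys
    'init', 'intent', 'end', and 'val'.
    Returns (state, responses, context).
    """
    context = []    # validated (utterance, response) pairs
    responses = []
    state = "active"
    for utterance in user_turns:
        # Step 2: extract the intent of the current utterance.
        intent = llm(f"{prompts['intent']}\n{utterance}")
        # Step 3: stop when the end-of-dialogue intent is detected.
        if llm(f"{prompts['end']}\n{intent}").strip().lower() == "yes":
            state = "end"
            break
        # Step 3.1: generate a response from the intent and the dialogue context.
        history = "\n".join(f"User: {u}\nAgent: {r}" for u, r in context)
        draft = llm(f"{prompts['init']}\n{history}\nIntent: {intent}")
        # Step 3.2: validate the draft response.
        validated = llm(f"{prompts['val']}\n{draft}")
        # Step 3.3: update the context and send the response.
        context.append((utterance, validated))
        responses.append(validated)
    return state, responses, context

def fake_llm(prompt):
    """Hypothetical stub that imitates the four prompt types."""
    if prompt.startswith("INTENT"):
        return "goodbye" if "bye" in prompt.lower() else "book_hotel"
    if prompt.startswith("END"):
        return "yes" if "goodbye" in prompt else "no"
    if prompt.startswith("VALIDATE"):
        return prompt.split("\n", 1)[1]  # accept the draft unchanged
    return "Sure, which dates?"

prompts = {"init": "INIT", "intent": "INTENT", "end": "END", "val": "VALIDATE"}
state, responses, context = run_dialogue(
    ["I need a hotel room", "Thanks, bye"], fake_llm, prompts
)
```

Passing the model as a callable keeps the loop independent of any particular deployment, so the same control flow could wrap a locally deployed model or a remote one.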

4. Results

Section 4.1 provides a comparison of large language models on prepared test prompts in a variety of tasks related to intent mining. Section 4.2 provides visualization of some dialogue graphs for the proposed heuristic-based approach. Section 4.3 provides examples of dialogue agent responses for the proposed prompt-based approach. Section 4.4 provides an evaluation of the results of the two proposed methods.

4.1. Comparison of Large Language Models

We conducted a comparison of some English-language and Russian-language LLMs. We evaluated how clearly the model identified user intents/scenario blocks of the operator’s response, how clearly it recognized named entities, in what form the response was presented for subsequent parsing, how resistant the language model’s response was to typos in the prompt text, and whether the model could be deployed locally. The comparison results were evaluated on 150 prepared utterances in Russian and English. Respondents compared the models’ results across four categories: “+” was excellent, “+−” was good, “−+” was average, and “−” was bad. The comparison results are presented in Table 2. Access to the models was obtained in August 2024.

For the ChatGPT 3.5 (OpenAI) and Yandex GPT 2 models, as well as for the locally deployed LLaMA and MIXTRAL models, we conducted pairwise comparisons using the side-by-side (SBS) metric. The metric is calculated as follows:

result_a = \frac{good_a + \frac{both}{2}}{good_a + good_b + both},

where

good_a—number of responses indicating “The best answer is from the ChatGPT/LLAMA model”;
good_b—number of responses indicating “The best answer is from the YandexGPT/MIXTRAL model”;
both—number of responses indicating “Both models responded well”;
none—number of responses indicating “Both models responded poorly”.
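The SBS formula can be checked with a few lines of Python. The vote counts below are hypothetical and serve only to illustrate the arithmetic; they are not the survey data of this study.

```python
def sbs_score(good_a, good_b, both):
    """Side-by-side score of model a: (good_a + both / 2) / (good_a + good_b + both)."""
    total = good_a + good_b + both
    if total == 0:
        raise ValueError("at least one response must favour a model or both")
    return (good_a + both / 2) / total

# Hypothetical counts: 20 votes for model a, 10 for model b, 10 for "both".
score_a = sbs_score(20, 10, 10)
score_b = sbs_score(10, 20, 10)
```

Because "both" votes are split evenly and "none" votes do not enter the formula, the scores of the two models in a pair always sum to one.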

Respondents were surveyed on 50 Russian-language queries to identify the user intent in five categories: “restaurant booking”, “hotel booking”, “leisure and entertainment”, “buying train tickets”, and “searching for and making an appointment with a doctor”. About 5% of the queries were rejected by the models because of sensitive context (e.g., “jumping on a train”, “making an appointment with an emergency room”); in these cases, the model refused to provide an answer. The following results were obtained for the SBS metric: result_ChatGPT = 0.67 versus result_YaGPT = 0.33, and result_LLAMA = 0.40 versus result_MIXTRAL = 0.59. Although MIXTRAL scored higher on the SBS metric, the LLaMA model produced more structured responses in Russian than the MIXTRAL model. We therefore used the LLaMA model for local deployment and testing of the dialogue agent in Russian.

Using the dialogue corpus MultiWOZ 2.2, we compared various prompts to build an intent sequence from the dialogue’s utterances in English. We randomly selected a dialogue from the dataset and, using the large ChatGPT 3.5 language model, performed intent mining. The comparison results are presented in Table 3. In the second column, isolated utterances of the dialogue are fed to the input of the LLM one by one. In the third column, the current utterance and all previous utterances of the dialogue are fed to the input of the LLM. In the fourth column, the full dialogue is fed to the input of the LLM. The model was accessed in September 2024.

As expected, the full dialogue context fed to the input of a large language model showed better results in recognizing most user intents and operator responses with one exception. In the ninth sentence, “- How about for 3 nights?” a modification of the original intent occurred, and this intent was recognized only by the model with a partial dialogue context as an input. Such cases in dialogues are critical for assessing the quality of the dialogue structure of a text dialogue on an arbitrary topic.
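The three input modes compared in Table 3 differ only in how much of the dialogue is passed to the model with each utterance. A minimal sketch of how such inputs could be assembled is shown below; the function name, the mode labels, and the sample utterances are illustrative assumptions, not the prompts used in the experiment.

```python
def build_input(utterances, index, mode):
    """Assemble the text passed to the LLM for the utterance at `index`.

    mode: 'isolated' (current utterance only), 'partial' (current and all
    previous utterances), or 'full' (the whole dialogue).
    """
    if mode == "isolated":
        return utterances[index]
    if mode == "partial":
        return "\n".join(utterances[: index + 1])
    if mode == "full":
        return "\n".join(utterances)
    raise ValueError(f"unknown mode: {mode}")

# Hypothetical dialogue fragment used only for illustration.
utterances = [
    "I need a hotel for 5 nights.",
    "Sorry, nothing is available for 5 nights.",
    "How about for 3 nights?",
    "Yes, I can book that for you.",
]
isolated = build_input(utterances, 2, "isolated")
partial = build_input(utterances, 2, "partial")
full = build_input(utterances, 2, "full")
```

In the isolated mode, a follow-up such as "How about for 3 nights?" reaches the model without the request it modifies, which illustrates why context-dependent intents are hard to recover from single utterances.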

4.2. Visualization of Dialogue Graphs with Intent Sequences

We fed the locally deployed LLaMA model the MultiWOZ 2.2 dataset and asked it to recognize the user/system intent for each conversational utterance. We obtained a labeled dataset of 113,747 records. The intent column contained over 10,000 unique intents, many of which were close in meaning (e.g., “Acceptance”, “Acceptance agreement”, “Acceptance and gratitude”, “Acceptance approval”, etc.). Using clustering, we combined similar intents into groups. Using the networkx 3.4.2 library in Python, we could visualize the dialogue graph for each conversation based on the intent sequence labeled by LLaMA. Examples of several dialogue graphs are shown in Figure 6. The average length of a dialogue graph was 6 utterances, the longest dialogue consisted of 15 utterances, and the shortest dialogue consisted of 2 utterances.
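Merging near-duplicate intent labels and turning an intent sequence into graph edges can be sketched with the standard library alone. The clustering method used in the study is not specified, so the greedy string-similarity grouping below is a stand-in; the threshold of 0.5, the function names, and the "Booking request" label are assumptions for illustration.

```python
from difflib import SequenceMatcher

def group_intents(intents, threshold=0.5):
    """Greedily map each intent label to the first representative it resembles."""
    representatives = []
    mapping = {}
    for intent in intents:
        for rep in representatives:
            ratio = SequenceMatcher(None, intent.lower(), rep.lower()).ratio()
            if ratio >= threshold:
                mapping[intent] = rep
                break
        else:
            # No similar representative found: the label starts a new group.
            representatives.append(intent)
            mapping[intent] = intent
    return mapping

def intent_edges(intent_sequence, mapping):
    """Return directed edges between consecutive grouped intents of one dialogue."""
    grouped = [mapping.get(i, i) for i in intent_sequence]
    return list(zip(grouped, grouped[1:]))

labels = ["Acceptance", "Acceptance agreement", "Acceptance and gratitude",
          "Acceptance approval", "Booking request"]
mapping = group_intents(labels)
edges = intent_edges(
    ["Booking request", "Acceptance agreement", "Acceptance and gratitude"], mapping
)
```

An edge list of this form can be loaded into a networkx DiGraph with add_edges_from for drawing. Character-level similarity is crude, so an embedding-based clustering would likely separate intents that merely share words more reliably.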

4.3. Examples of Dialogue Agent Responses

An example of a conversation with dialogue agents after the initialization prompt P^init was applied is presented in Figure 7.

The dialogue agent’s positive and negative responses after a successful dialogue completion prompt P^end was applied are presented in Figure 8.

Dialogue agent validation of each response line with the validation prompt P^val applied is presented in Figure 9.

An example of a full conversation between a human and a dialogue agent is presented in Figure 10.
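One way the three prompts could interact in an iterative agent loop is sketched below. The control flow, the retry limit, the "yes"/"no" answer convention, and every prompt string are assumptions made for illustration; the actual prompts are those shown in the figures, and `fake_llm` is a deterministic stub standing in for a real model call.

```python
from typing import Callable, List


def run_dialogue(
    llm: Callable[[str], str],
    user_turns: List[str],
    p_init: str,
    p_val: str,
    p_end: str,
    max_retries: int = 2,
) -> List[str]:
    """Generate a reply per user turn, validate it, then check for completion."""
    history = [p_init]
    replies = []
    for turn in user_turns:
        history.append(f"User: {turn}")
        reply = llm("\n".join(history))
        # Regenerate while the validation prompt rejects the reply;
        # after max_retries the last attempt is accepted as-is.
        for _ in range(max_retries):
            if llm(f"{p_val}\n{reply}").strip().lower().startswith("yes"):
                break
            reply = llm("\n".join(history))
        history.append(f"Agent: {reply}")
        replies.append(reply)
        # Stop once the completion-check prompt reports success.
        verdict = llm(f"{p_end}\n" + "\n".join(history))
        if verdict.strip().lower().startswith("yes"):
            break
    return replies


def fake_llm(prompt: str) -> str:
    """Deterministic stub; a real system would call the deployed model here."""
    if prompt.startswith("VALIDATE"):
        return "yes"
    if prompt.startswith("END?"):
        return "yes" if "confirm" in prompt.lower() else "no"
    return "Agent reply #" + str(prompt.count("User:"))


turns = ["I need a room", "Yes, I confirm the booking", "Thanks"]
print(run_dialogue(fake_llm, turns, "INIT: You are a hotel booking agent.",
                   "VALIDATE", "END?"))
```

Because the full history is passed on every call, the dialogue context is preserved across turns without any state inside the model itself.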

After successfully completing a conversation, the dialogue agent saved the following information in the slots:

- Check-in date: December 15
- Check-out date: December 20
- Number of nights: 5
- Number of guests: 2
- Room type: Standard
- Additional services: Breakfast.
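The saved slots can be represented as a small typed record, which also makes it easy to check that the values are mutually consistent. The class and field names below are hypothetical, and the year is an assumption, since the dialogue gives only the day and month.

```python
from dataclasses import dataclass, field
from datetime import date
from typing import List


@dataclass
class BookingSlots:
    """Illustrative container for the slots filled during the dialogue."""
    check_in: date
    check_out: date
    guests: int
    room_type: str
    extras: List[str] = field(default_factory=list)

    @property
    def nights(self) -> int:
        # Derived from the two dates rather than stored separately.
        return (self.check_out - self.check_in).days


# The year 2024 is assumed for the example only.
slots = BookingSlots(
    check_in=date(2024, 12, 15),
    check_out=date(2024, 12, 20),
    guests=2,
    room_type="Standard",
    extras=["Breakfast"],
)
print(slots.nights)
```

Deriving the number of nights from the dates gives a simple consistency check against the value the agent reported (5).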

4.4. Evaluation of Two Methods for Constructing a Dialogue Graph

Table 4 presents the results of the LLaMA model validation for the two dialogue graph construction approaches using the BERTScore, BLEU, and METEOR metrics.

The LLaMA model with additional training on labeled dialogues achieved a higher BERTScore than the approach without training on labeled dialogues (0.85 vs. 0.75 on MultiWOZ 2.2 and 0.82 vs. 0.72 on MANtIS). The LLaMA model without additional training on labeled dialogues could, however, be used as a dialogue agent for an arbitrary subject area, which was an advantage.
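For readers unfamiliar with the overlap-based metrics in Table 4, the sketch below shows the core of BLEU: clipped n-gram precision combined with a brevity penalty. It is a simplified single-reference version limited to unigrams and bigrams, written for illustration only; it is not the implementation used in the study, and the function names are hypothetical. BERTScore, by contrast, compares contextual embeddings rather than surface n-grams, so the two kinds of metric need not move together.

```python
import math
from collections import Counter
from typing import List


def ngrams(tokens: List[str], n: int) -> Counter:
    """Count all n-grams of length n in a token list."""
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))


def simple_bleu(candidate: str, reference: str, max_n: int = 2) -> float:
    """Geometric mean of clipped n-gram precisions times a brevity penalty."""
    cand, ref = candidate.lower().split(), reference.lower().split()
    if not cand:
        return 0.0
    log_precisions = []
    for n in range(1, max_n + 1):
        cand_counts, ref_counts = ngrams(cand, n), ngrams(ref, n)
        # Clip each candidate n-gram count by its count in the reference.
        overlap = sum(min(c, ref_counts[g]) for g, c in cand_counts.items())
        total = sum(cand_counts.values())
        if overlap == 0 or total == 0:
            return 0.0
        log_precisions.append(math.log(overlap / total))
    # Penalize candidates that are shorter than the reference.
    brevity = 1.0 if len(cand) >= len(ref) else math.exp(1 - len(ref) / len(cand))
    return brevity * math.exp(sum(log_precisions) / max_n)


print(simple_bleu("the room is booked", "the room is booked for three nights"))
```

In this example every candidate n-gram appears in the reference, so the score is determined entirely by the brevity penalty.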

5. Discussion

This article analyzed seven popular large language models on prepared test prompts for Russian and English, comparing their ability to extract knowledge from texts and to recognize user intents and operator responses. On the test prompts, the YandexGPT model showed the best results for extracting user intentions and operator response scenario blocks for Russian, and the Gemini model showed the best results for English. We assume that the different grammatical structures and linguistic features of Russian and English affect how language models process and understand text. YandexGPT has been fine-tuned to better handle the nuances of the Russian language, and Gemini those of the English language, which could have contributed to their better performance on extracting user intentions and scenario blocks.

The ChatGPT model showed the best results for extracting named entities for both Russian and English. The Gemini model had the most clearly structured responses, and most of the studied language models showed good resistance to typos and token permutations in the prompt. We assume that ChatGPT is probably better fine-tuned on NER tasks and Gemini on generating coherent and structured responses. The Mistral-Saiga, LLaMA, and MIXTRAL models can be deployed locally, which is an advantage when processing sensitive, personal, and confidential data.

In pairwise comparison by the side-by-side metric, the best results were shown by the ChatGPT model in comparison with the YandexGPT model and by the locally deployed MIXTRAL model in comparison with the LLaMA model. We assume that the YandexGPT and LLaMA models might not have been fine-tuned as extensively as the ChatGPT and MIXTRAL models. Using the dialogue context as the input to a large language model led to more efficient and accurate intent extraction.

Intent sequences were constructed with three prompt variants: isolated sentences, partial dialogue context, and full dialogue context. The results suggested that the dialogue context was an important factor in scenario graph generation and that the model needed to be able to preserve the context to generate accurate scenario graphs. Several dialogue graphs were visualized using the MultiWOZ 2.2 dataset. Two approaches based on large language models for generating a scenario graph of a goal-oriented dialogue system with preservation of the dialogue context were considered: an approach with additional training on labeled dialogues and an approach without training on labeled dialogues. The results suggested that fine-tuning the model on labeled dialogues was crucial to achieve a higher accuracy in scenario graph generation. The metric values reported for LLMs on the MultiWOZ 2.2 dataset in other studies [35,37,41,42] were consistent with the results obtained in this study. The results obtained on the scenario graph generation process showed that the RAG technique can be used to add relevant information to the context for model inference and can be a useful tool for enhancing the performance of large language models in dialogue systems.
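The idea of adding relevant information to the context before inference can be shown with a minimal retrieval sketch. It ranks documents by Jaccard word overlap with the query, which is a simple stand-in chosen so the example runs without external dependencies; a practical RAG pipeline would typically use embedding-based search instead. The knowledge snippets, function names, and prompt layout are all hypothetical.

```python
from typing import List, Set


def tokenize(text: str) -> Set[str]:
    """Lowercase words with surrounding punctuation stripped."""
    return {w.strip(".,!?").lower() for w in text.split() if w.strip(".,!?")}


def retrieve(query: str, documents: List[str], k: int = 1) -> List[str]:
    """Return the k documents with the highest Jaccard overlap with the query."""
    q = tokenize(query)

    def score(doc: str) -> float:
        d = tokenize(doc)
        return len(q & d) / len(q | d) if q | d else 0.0

    return sorted(documents, key=score, reverse=True)[:k]


def augment_prompt(query: str, documents: List[str], k: int = 1) -> str:
    """Prepend the retrieved snippets to the user query."""
    context = "\n".join(retrieve(query, documents, k))
    return f"Context:\n{context}\n\nUser: {query}\nAgent:"


# Hypothetical knowledge base for a hotel booking agent.
knowledge = [
    "Breakfast is served from 7 to 10 and costs extra for standard rooms.",
    "Free parking is available for all hotel guests.",
    "Check-in starts at 14:00 and check-out is until 12:00.",
]
print(augment_prompt("Is parking free for guests?", knowledge))
```

Only the retrieved snippet is placed in the prompt, so the model sees information relevant to the current turn without the whole knowledge base having to fit in the context window.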

6. Conclusions

The study had several limitations that should be acknowledged. First, the datasets used, MultiWOZ 2.2 and MANtIS, may not be representative of all possible dialogue scenarios, which could limit the generalizability of the findings. Second, the study focused on only two specific methods for constructing dialogues in goal-oriented dialogue systems.

Several areas for future research emerge from the present study. First, further investigation is needed to explore the impact of incorporating labeled dialogue data on the performance of dialogue systems in different domains and scenarios. Second, the study highlights the importance of considering dialogue context in constructing goal-oriented dialogues, so future research should investigate the development of more sophisticated dialogue systems that can effectively incorporate context into the dialogue construction process. Finally, future research should investigate how these factors (domain, scenario, and dialogue context) affect the effectiveness of the two proposed methods.

In conclusion, the present study provides valuable insights into the effectiveness of two methods for constructing dialogue scenarios in goal-oriented dialogue systems. The study highlights the importance of incorporating labeled dialogue data into dialogue systems and emphasizes the need for dialogue systems to consider dialogue context in LLM response generation. The findings of the study have significant theoretical and practical implications for the development of more effective dialogue systems in the field of customer service.

Author Contributions

Software, V.B.; investigation, V.B.; visualization, V.B.; writing—review and editing, L.L.; writing—original draft, L.L.; project administration, A.S.; formal analysis, A.S.; methodology, A.S.; investigation, V.K.; data curation, V.K. All authors have read and agreed to the published version of the manuscript.

Funding

This research was funded by the Russian Science Foundation (Grant Number 23-21-00503).

Institutional Review Board Statement

Not applicable.

Informed Consent Statement

Not applicable.

Data Availability Statement

The original contributions presented in this study are included in the article. Further inquiries can be directed to the corresponding author.

Conflicts of Interest

The authors declare no conflicts of interest.

Abbreviations

DAPT	Domain-adaptive pre-training
GAN	Generative adversarial network
HMM	Hidden Markov models
LDA	Latent Dirichlet allocation
LLM	Large language model
LQR	Layered query retrieval
NER	Named-entity recognition
NLP	Natural language processing
NMF	Non-negative matrix factorization
OHS	Occupational health and safety
RAG	Retrieval augmented generation
TF-IDF	Term frequency–inverse document frequency
TOD	Task-oriented dialogue

References

1. Sood, P.; Tanwar, H.; Singh, J.; Ruhela, A.K.; Gupta, N.; Kumar, R. Revolutionizing Customer Service: An AI-powered Chatbot Approach using Advanced NLP Techniques. In Proceedings of the 2024 3rd Edition of IEEE Delhi Section Flagship Conference (DELCON), New Delhi, India, 21–23 November 2024; pp. 1–5.
2. Rustamov, S.; Bayramova, A.; Alasgarov, E. Development of dialogue management system for banking services. Appl. Sci. 2021, 11, 10995.
3. Ngai, E.W.; Lee, M.C.; Luo, M.; Chan, P.S.; Liang, T. An intelligent knowledge-based chatbot for customer service. Electron. Commer. Res. Appl. 2021, 50, 101098.
4. Addlesee, A.; Sieińska, W.; Gunson, N.; Garcia, D.H.; Dondrup, C.; Lemon, O. Multi-party Goal Tracking with LLMs: Comparing Pre-training, Fine-tuning, and Prompt Engineering. arXiv 2023, arXiv:2308.15231.
5. Lahiri, A.; Sanyal, D.K.; Mukherjee, I. CitePrompt: Using Prompts to Identify Citation Intent in Scientific Papers. arXiv 2023, arXiv:2304.12730.
6. Nambanoor Kunnath, S.; Pride, D.; Knoth, P. Prompting Strategies for Citation Classification. In Proceedings of the 32nd ACM International Conference on Information and Knowledge Management, Birmingham, UK, 21–25 October 2023.
7. Chang, K.W.; Tseng, W.C.; Li, S.W.; Lee, H.Y. SpeechPrompt: An exploration of prompt tuning on generative spoken language model for speech processing tasks. arXiv 2022, arXiv:2203.16773.
8. Dighe, P.; Su, Y.; Zheng, S.; Liu, Y.; Garg, V.; Niu, X.; Tewfik, A. Leveraging Large Language Models for Exploiting ASR Uncertainty. arXiv 2023, arXiv:2309.04842.
9. Gao, N.; Zhao, Z.; Zeng, Z.; Zhang, S.; Weng, D.; Bao, Y. GesGPT: Speech Gesture Synthesis with Text Parsing from GPT. arXiv 2023, arXiv:2303.13013.
10. Zhang, R.H.; Sell, P.; Zhang, Y.; Che, L.; Gao, A.; Sathiyajith, K.S.; Bhatt, R.; Nagasubramaniam, P.; Vummanthala, S.; Dave, S.; et al. EvoquerBot: A Multimedia Chatbot Leveraging Synthetic Data for Cross-Domain Assistance. Penn State University: University Park, PA, USA, 2023.
11. Bragg, J.; Cohan, A.; Lo, K.; Beltagy, I. Flex: Unifying evaluation for few-shot NLP. Adv. Neural Inf. Process. Syst. 2021, 34, 15787–15800.
12. Loukas, L.; Stogiannidis, I.; Malakasiotis, P.; Vassos, S. Breaking the Bank with ChatGPT: Few-Shot Text Classification for Finance. arXiv 2023, arXiv:2308.14634.
13. Wang, P.; He, K.; Wang, Y.; Song, X.; Mou, Y.; Wang, J.; Xian, Y.; Cai, X.; Xu, W. Beyond the Known: Investigating LLMs Performance on Out-of-Domain Intent Detection. arXiv 2024, arXiv:2402.17256.
14. Abdullin, Y.; Molla-Aliod, D.; Ofoghi, B.; Yearwood, J.; Li, Q. Synthetic Dialogue Dataset Generation using LLM Agents. arXiv 2024, arXiv:2401.17461.
15. Ahmed, R.; Rauf, S.A.; Latif, S. Leveraging Large Language Models and Prompt Settings for Context-Aware Financial Sentiment Analysis. In Proceedings of the 2024 5th International Conference on Advancements in Computational Sciences (ICACS), Lahore, Pakistan, 19–20 February 2024; pp. 1–9.
16. Grootendorst, M. BERTopic: Neural topic modeling with a class-based TF-IDF procedure. arXiv 2022, arXiv:2203.05794.
17. Egger, R.; Yu, J. A Topic Modeling Comparison Between LDA, NMF, Top2Vec, and BERTopic to Demystify Twitter Posts. Front. Sociol. 2022, 7, 886498.
18. Chang, Y.; Wang, X.; Wang, J.; Wu, Y.; Yang, L.; Zhu, K.; Chen, H.; Yi, X.; Wang, C.; Wang, Y.; et al. A survey on evaluation of large language models. ACM Trans. Intell. Syst. Technol. 2024, 15, 1–45.
19. Liang, P.; Bommasani, R.; Lee, T.; Tsipras, D.; Soylu, D.; Yasunaga, M.; Zhang, Y.; Narayanan, D.; Wu, Y.; Kumar, A.; et al. Holistic evaluation of language models. arXiv 2022, arXiv:2211.09110.
20. Zhang, T.; Ladhak, F.; Durmus, E.; Liang, P.; McKeown, K.; Hashimoto, T.B. Benchmarking large language models for news summarization. Trans. Assoc. Comput. Linguist. 2024, 12, 39–57.
21. Zhao, H.; Chen, H.; Yang, F.; Liu, N.; Deng, H.; Cai, H.; Wang, S.; Yin, D.; Du, M. Explainability for large language models: A survey. ACM Trans. Intell. Syst. Technol. 2024, 15, 1–38.
22. Wang, Z. CausalBench: A Comprehensive Benchmark for Evaluating Causal Reasoning Capabilities of Large Language Models. In Proceedings of the 10th SIGHAN Workshop on Chinese Language Processing (SIGHAN-10), Bangkok, Thailand, 16 August 2024; pp. 143–151.
23. Huang, Y.; Song, J.; Wang, Z.; Zhao, S.; Chen, H.; Juefei-Xu, F.; Ma, L. Look before you leap: An exploratory study of uncertainty measurement for large language models. arXiv 2023, arXiv:2307.10236.
24. Wang, C.; Liu, X.; Yue, Y.; Tang, X.; Zhang, T.; Jiayang, C.; Yao, Y.; Gao, W.; Hu, X.; Qi, Z.; et al. Survey on factuality in large language models: Knowledge, retrieval and domain-specificity. arXiv 2023, arXiv:2310.07521.
25. Huang, Y.; Sun, L.; Wang, H.; Wu, S.; Zhang, Q.; Li, Y.; Gao, C.; Huang, Y.; Lyu, W.; Zhang, Y.; et al. Trustllm: Trustworthiness in large language models. arXiv 2024, arXiv:2401.05561.
26. McIntosh, T.R.; Susnjak, T.; Arachchilage, N.; Liu, T.; Watters, P.; Halgamuge, M.N. Inadequacies of large language model benchmarks in the era of generative artificial intelligence. arXiv 2024, arXiv:2402.09880.
27. Ilse, B.; Blackwood, F. Comparative analysis of finetuning strategies and automated evaluation metrics for large language models in customer service chatbots. Res. Sq. 2024, 1–18.
28. Kaushal, A.; Lin, C.C.; Chauhan, R.; Kumar, R. Charting the Growth of Text Summarisation: A Data-Driven Exploration of Research Trends and Technological Advancements. Appl. Sci. 2024, 14, 11462.
29. Falegnami, A.; Tomassi, A.; Corbelli, G.; Nucci, F.S.; Romano, E. A Generative Artificial-Intelligence-Based Workbench to Test New Methodologies in Organisational Health and Safety. Appl. Sci. 2024, 14, 11586.
30. Chen, Q.; Zhou, W.; Cheng, J.; Yang, J. An Enhanced Retrieval Scheme for a Large Language Model with a Joint Strategy of Probabilistic Relevance and Semantic Association in the Vertical Domain. Appl. Sci. 2024, 14, 11529.
31. Vaškevičius, M.; Kapočiūtė-Dzikienė, J. Language Models for Predicting Organic Synthesis Procedures. Appl. Sci. 2024, 14, 11526.
32. Dai, Q.; Mao, Y.; Tang, J.; Rong, Y. STGPT2UGAN: Spatio-Temporal GPT-2 United Generative Adversarial Network for Wind Speed Prediction in Turbine Network. Appl. Sci. 2024, 14, 11217.
33. Huang, J.; Wang, M.; Cui, Y.; Liu, J.; Chen, L.; Wang, T.; Li, H.; Wu, J. Layered Query Retrieval: An Adaptive Framework for Retrieval-Augmented Generation in Complex Question Answering for Large Language Models. Appl. Sci. 2024, 14, 11014.
34. Bensch, C.; Müller, A.; Chojnowski, O.; Richert, A. Beyond Binary Dialogues: Research and Development of a Linguistically Nuanced Conversation Design for Social Robots in Group–Robot Interactions. Appl. Sci. 2024, 14, 10316.
35. Smutny, P.; Bojko, M. Comparative Analysis of Chatbots Using Large Language Models for Web Development Tasks. Appl. Sci. 2024, 14, 10048.
36. Benzinho, J.; Ferreira, J.; Batista, J.; Pereira, L.; Maximiano, M.; Távora, V.; Gomes, R.; Remédios, O. LLM Based Chatbot for Farm-to-Fork Blockchain Traceability Platform. Appl. Sci. 2024, 14, 8856.
37. Liu, J.; Tan, Y.K.; Fu, B.; Lim, K.H. Intent-Aware Dialogue Generation and Multi-Task Contrastive Learning for Multi-Turn Intent Classification. arXiv 2024, arXiv:2411.14252.
38. Gao, J.; Xiang, L.; Wu, H.; Zhao, H.; Tong, Y.; He, Z. An Adaptive Prompt Generation Framework for Task-oriented Dialogue System. In Findings of the Association for Computational Linguistics: EMNLP 2023; Association for Computational Linguistics: Stroudsburg, PA, USA, 2023; pp. 1078–1089.
39. Cao, L. Diaggpt: An llm-based chatbot with automatic topic management for task-oriented dialogue. arXiv 2023, arXiv:2308.08043.
40. Hu, Z.; Feng, Y.; Deng, Y.; Li, Z.; Ng, S.K.; Luu, A.T.; Hooi, B. Enhancing Large Language Model Induced Task-Oriented Dialogue Systems Through Look-Forward Motivated Goals. arXiv 2023, arXiv:2309.08949.
41. Wu, S.; Shen, X.; Xia, R. A New Dialogue Response Generation Agent for Large Language Models by Asking Questions to Detect User’s Intentions. arXiv 2023, arXiv:2310.03293.
42. Stacey, J.; Cheng, J.; Torr, J.; Guigue, T.; Driesen, J.; Coca, A.; Gaynor, M.; Johannsen, A. LUCID: LLM-Generated Utterances for Complex and Interesting Dialogues. arXiv 2024, arXiv:2403.00462.
43. Li, H.; Yang, C.; Zhang, A.; Deng, Y.; Wang, X.; Chua, T.S. Hello Again! LLM-powered Personalized Agent for Long-term Dialogue. arXiv 2024, arXiv:2406.05925.
44. Zhang, X.; Peng, B.; Li, K.; Zhou, J.; Meng, H. Sgp-tod: Building task bots effortlessly via schema-guided llm prompting. arXiv 2023, arXiv:2305.09067.
45. Okadome, Y.; Yuguchi, A.; Fukui, R.; Matsumoto, Y. Prompt design using past dialogue summarization for llms to generate the current appropriate dialogue. In International Conference on Artificial Neural Networks; Springer Nature: Cham, Switzerland, 2024; pp. 33–41.
46. Robino, G. Conversation Routines: A Prompt Engineering Framework for Task-Oriented Dialog Systems. arXiv 2025, arXiv:2501.11613.
47. De Baer, J.; Doğruöz, A.S.; Demeester, T.; Develder, C. Single-vs. Dual-Prompt Dialogue Generation with LLMs for Job Interviews in Human Resources. arXiv 2025, arXiv:2502.18650.
48. Huang, Q.; Liu, X.; Ko, T.; Wu, B.; Wang, W.; Zhang, Y.; Tang, L. Selective Prompting Tuning for Personalized Conversations with LLMs. arXiv 2024, arXiv:2406.18187.
49. Zang, X.; Rastogi, A.; Sunkara, S.; Gupta, R.; Zhang, J.; Chen, J. Multiwoz 2.2: A dialogue dataset with additional annotation corrections and state tracking baselines. arXiv 2020, arXiv:2007.12720.
50. Penha, G.; Balan, A.; Hauff, C. Introducing mantis: A novel multi-domain information seeking dialogues dataset. arXiv 2019, arXiv:1912.04639.
51. Zhang, T.; Kishore, V.; Wu, F.; Weinberger, K.Q.; Artzi, Y. Bertscore: Evaluating text generation with bert. arXiv 2019, arXiv:1904.09675.
52. Lin, C.Y. Rouge: A package for automatic evaluation of summaries. In Proceedings of the Text Summarization Branches Out, Barcelona, Spain, 25–26 July 2004; pp. 74–81.
53. Papineni, K.; Roukos, S.; Ward, T.; Zhu, W.J. Bleu: A method for automatic evaluation of machine translation. In Proceedings of the 40th Annual Meeting of the Association for Computational Linguistics, Philadelphia, PA, USA, 6–12 July 2002; pp. 311–318.
54. Banerjee, S.; Lavie, A. METEOR: An automatic metric for MT evaluation with improved correlation with human judgments. In Proceedings of the ACL Workshop on Intrinsic and Extrinsic Evaluation Measures for Machine Translation and/or Summarization, Ann Arbor, MI, USA, 29 June 2005; pp. 65–72.
55. Lewis, P.; Perez, E.; Piktus, A.; Petroni, F.; Karpukhin, V.; Goyal, N.; Küttler, H.; Lewis, M.; Yih, W.; Rocktäschel, T.; et al. Retrieval-augmented generation for knowledge-intensive NLP tasks. Adv. Neural Inf. Process. Syst. 2020, 33, 9459–9474.

Figure 1. (a) Prompt for intent mining in Russian/English. (b) Prompt for named entity recognition in Russian/English.

Figure 2. (a) Visualization of a standard dialogue graph with BLEU = 0.21483 and BERTScore = 0.82314 metric values. (b) Visualization of a dialogue graph with a global heuristic represented by the last column with BLEU = 0.22152 and BERTScore = 0.85113 metric values. (c) Visualization of a dialogue graph with a tree-based heuristic with BLEU = 0.19863 and BERTScore = 0.76367 metric values.

Figure 3. The basic idea of an LLM-based dialogue agent iterative operation.

Figure 4. The dialogue agent initialization prompt in Russian/English.

Figure 5. (a) The dialogue completion check prompt in Russian/English. (b) The output validation prompt in Russian/English.

Figure 6. Examples of visualizations of four (a–d) completed dialogue graphs.

Figure 7. Conversation with dialogue agent after initialization prompt in Russian/English.

Figure 8. Positive and negative responses after “end of dialogue check” prompt in Russian/English.

Figure 9. Dialogue agent validation results after validation prompt in Russian/English.

Figure 10. Full conversation of a human with a dialogue agent in Russian/English.

Table 1. A short review of the scientific literature on the use of large language models in dialogue generation.

LLM	Dataset/Database	Language	RAG	Dialogue Generation	Metric	Ref.
LlaMA 2	Knowledge database consisting of 222 potential user intents	German, English	+	User’s statement, dialogue history, user count, and language are fed to the LLM; GPT-3.5 is used to synthesize plural responses to provide linguistically nuanced responses	–	[34]
Mistral-8x7B-Instruct-V0.1, Mistral-7B-Instruct-V0.2, Meta-Llama-3-8B-Instruct, Google Gemma-2B	FAISS, ChromaDB	English, Portuguese	+	Regular expressions are applied to the text blocks to define data types; the user’s chat history is used to provide context to the LLM	Answer precision on a scale of 1–5	[36]
XLM-RoBERTa-base	e-commerce question-intent datasets	Portuguese, Indonesian, English/Malay, English/Filipino, English, Thai, traditional Chinese, Vietnamese	-	Multi-turn intent classification chain-of-intent method to generate intent-aware dialogues	Contrastive loss	[37]
Chatgpt-3.5-turbo	MultiWOZ 2.0	English	-	Adaptive prompt generation	Inform, success, BLEU	[38]
gpt-4-0613, gpt-3.5-turbo, gpt-4-turbo-2024-04-09	LLM-TOD dataset	English	-	Proactive question asking, users’ guidance, dialogue state maintenance	Round count, completion rate, response quality, comparison score	[39]
GPT-3.5-turbo	MultiWoZ 2.1	English	-	Proactively goal-driven LLM-induced approach, future dialogue actions and goal-oriented reward	Inform, success	[40]
text-davinci-001, text-davinci-002, text-davinci-003, gpt-3.5-turbo	Context-open-question dataset	English	-	Question generation to generate a variety of questions related to the context of the dialogue; extra knowledge retrieval; enhance the LLM response	BLEU, ROUGE, human evaluation	[41]
T5, Flan-T5	LUCID dataset	English	-	Generation of intents, a conversational planner, turn-by-turn generation of conversations, and validation procedure	Intent accuracy, joint goal accuracy	[42]
ChatGPT; ChatGLM, BlenderBot, BART	MSC and CC datasets	English	-	Historical event perception, dynamic persona extraction, response generation based on retrieved relevant memories	BLEU-N, ROUGE-L, METEOR, accuracy, human evaluation	[43]
ChatGPT, GPT3.5	Multiwoz 2.0 and 2.2, RADDLE and STAR datasets	English	-	LLM to generate with user, DST prompter to retrieve database items, policy prompter to elicit proper responses adhering to the provided dialogue policy	Inform, success, BLEU, combined, BERTScore	[44]
T5	NUCC, Livedoor news summarization, dolly-15k-ja	Japanese	-	Dialogue summarization-based prompt design with context database	ROUGE-1, ROUGE-2, ROUGE-L, BERT score, Sentence-BERT	[45]
OpenAI GPT-4o-mini	Train ticket booking system, interactive troubleshooting Copilot data	English, Italian	+	Conversation routine-based embedded business logic within LLM prompts	-	[46]
GPT-4o, Llama 3.3	-	English	-	Single-prompt dialogue generation, two-agent dialogue generation	Agreement rate	[47]
OPT, Llama 2	CONVAI2	English	+	Selective prompt tuning-based dialogue generation	F1, BLEU, ROUGE-1, ROUGE-2, ROUGE-L	[48]
LLaMA	MultiWOZ 2.2, MANtIS	Russian, English	+	Context-based, LLM-based iterative dialogue generation with and without additional training on labeled dialogues	BERTScore, BLEU, METEOR, human evaluation	Our research

Table 2. Comparison of large language models on prepared test prompts.

LLM	Intent Mining	NER	Clearly Structured Response	Resistance to Typos and Word Rearrangements in Prompts	Local Deployment
ChatGPT	−+	+	+−	+−	−
Mistral-Saiga	−+	−	−+	−+	+
GigaChat	−+	−	−	−	−
Yandex GPT	−+	−+	−+	−+	−+
Gemini	−+	−+	+	−+	−
LLaMA	−+	−+	+	+	+
MIXTRAL	−+	−	−	+	+

Table 3. Construction of intent sequence with various prompts for the ChatGPT 3.5 LLM.

Dialogue Utterances	Intent Sequence (Isolated Utterances)	Intent Sequence (Current Dialogue Context as an Input)	Intent Sequence (Whole Dialogue Context as an Input)
	Prompt: What is the user intent in the text? Describe it with one or two words: text: “[input]”	Prompt: What is the user intent in the text? Describe it with one or two words. Use the following pieces of context to answer the question. text: “[context]” text: “[input]”	Prompt: What is the user intent in the text? Describe it with one or two words. Use the following pieces of context to answer the question. text: “[context]” text: “[input]”
- Guten tag, I am staying overnight in Cambridge and need a place to sleep. I need free parking and internet.	Accommodation inquiry	Accommodation request	Accommodation inquiry
- I have 4 different options for you. I have two cheaper guesthouses and two expensive hotels. Do you have a preference?	Preference gathering	Preference inquiry	Presenting options
- No, but I’d really like to be on the south end of the city. Do any of those fit the bill?	Location preference	Location query	Location preference
- Sure. Does price matter? We can narrow it down and find exactly what you need.	Clarification	Preference clarification	Inquiry and assistance
- No I don’t care about the price. Which one do you recommend?	Seek recommendation	Decision-making	Seeking recommendation
- I would recommend Aylesbray Lodge Guest House. Would you like me to book that for you?	Recommendation, booking	Booking assistance	Recommendation and booking
- Yes, book it for 4 people and 4 nights starting from Tuesday.	Reservation	Confirm a booking	Booking request
- The booking was unsuccessful. Would you like another day or a shorter stay?	Options	Adjust booking	Booking adjustment
- How about for 3 nights?	Booking accommodation	Modification	Booking confirmation
- Booked! Reference number is: 84ESP6F5	Confirmation	Booking accommodation	Confirmation/notification.
- Great. I am all set then. Have a nice day. Bye.	Closure	Farewell	Confirmation and farewell
- Have a nice stay. Bye.	Closure	Farewell	Confirmation and farewell

Table 4. Comparison of two methods for constructing a dialogue graph.

Dataset	Model	BERTScore	BLEU	METEOR
MultiWOZ 2.2	LLaMA with fine-tuning on dialogues	0.85	0.22	0.17
MultiWOZ 2.2	LLaMA without fine-tuning on dialogues	0.75	0.60	0.15
MANtIS	LLaMA with fine-tuning on dialogues	0.82	0.24	0.20
MANtIS	LLaMA without fine-tuning on dialogues	0.72	0.62	0.14

Disclaimer/Publisher’s Note: The statements, opinions and data contained in all publications are solely those of the individual author(s) and contributor(s) and not of MDPI and/or the editor(s). MDPI and/or the editor(s) disclaim responsibility for any injury to people or property resulting from any ideas, methods, instructions or products referred to in the content.

© 2025 by the authors. Licensee MDPI, Basel, Switzerland. This article is an open access article distributed under the terms and conditions of the Creative Commons Attribution (CC BY) license (https://creativecommons.org/licenses/by/4.0/).
