1. Introduction
Modern businesses increasingly use chatbots to improve communication with customers. Their role as the first line of contact in customer service is steadily growing, as they increasingly automate the handling of routine tasks. This dynamic development drives intensive research on AI-based dialog systems, aiming to enhance their capabilities, improve user experience, and address the challenges of human-like interaction.
The growing demand for and success of conversational agents have spurred the development of various technologies for creating these systems. Tech giants such as Google (Dialogflow), IBM (Watson Assistant), Microsoft (Azure AI Bot Service), and Amazon (Lex) have released their own chatbot creation platforms. Smaller companies such as Rasa, ManyChat, FlowXO, and Pandorabots have also released their own tools, offering a wider range of solutions for businesses of different sizes [1,2].
Given the diversity of terminology across these systems, this work primarily adopted the terminology of Rasa Pro, a widely used platform built on the open-source Rasa framework and backed by a large and active community [3]. Its highly modular and configurable architecture, combined with the active involvement of a global community of over six hundred developers and ten thousand forum members, allows for continuous innovation and the integration of cutting-edge AI research and techniques [4].
In the initial phase of development, dialog systems were primarily based on Natural Language Understanding (NLU) models. These models work by defining intents, which represent the user’s goal or purpose, and training on a set of example phrases that express each intent, enabling the NLU model to recognize user intents in new utterances. The defined intents are then combined into conversation scenarios, forming the foundation for more complex interactions [5,6]. Chatbots built this way need to be continuously retrained, both for new intents and for intents that are not effectively recognized. A significant advantage of NLU models is their relatively low computational requirements: they can be trained quickly and efficiently, and their small size minimizes the need for extensive hardware resources, making them cost-effective to deploy and maintain [7]. In practice, this meant that companies could deploy such chatbots without investing in expensive infrastructure. NLU models are also more understandable and easier to manage for developers, which speeds up the iterative learning process and improves response quality.
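As an illustration of this approach, the sketch below shows an intent inventory paired with training phrases, written in plain Python. The intent names and phrases are hypothetical and the structure is illustrative rather than the exact configuration syntax of any particular NLU platform.

```python
# Illustrative sketch: an NLU-style intent inventory with training phrases.
# Intent names and phrases are hypothetical, not taken from the study's dataset.
training_data = {
    "report_device_damage": [
        "my phone screen is cracked",
        "the device stopped working after it fell",
    ],
    "payment_inquiry": [
        "when is my next bill due",
        "how much do I owe this month",
    ],
}

# A classifier trained on these pairs maps a new utterance to the closest intent;
# utterances outside the trained intents require retraining with new examples.
for intent, phrases in training_data.items():
    print(f"{intent}: {len(phrases)} training phrases")
```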
The emergence of large language models (LLMs) has revolutionized AI dialog platforms, enabling more sophisticated conversation flows. Each flow is designed not only to recognize the customer’s initial intent but also to manage the entire conversation. This is possible because conversation flows leverage the powerful capabilities of LLMs, which allow for continuous understanding of the conversation context and can effectively process subsequent customer statements without the need for explicit definition of additional intents [8].
The aim of this research was to analyze the performance of various LLMs on the Rasa Pro platform, with particular attention to the effectiveness of smaller models. This paper presents several key contributions to the field of AI-based dialog systems:
A comprehensive analysis of the performance of various LLMs on the Rasa Pro platform, highlighting the effectiveness of smaller models such as Gemini-1.5-Flash-8B and Gemma2-9B-IT.
Demonstration of the significant impact of prompt engineering techniques, such as using structured formats like YAML and JSON, on the accuracy and efficiency of chatbot responses.
Presentation of practical insights for chatbot designers, emphasizing the importance of model selection and prompt construction in optimizing chatbot performance.
The structure of this paper is as follows: Section 2 presents a review of related work, Section 3 describes the methodology used in this study, Section 4 discusses the results, and Section 5 concludes with insights and future research directions.
2. Related Work
A key advantage of LLM-based conversation flows compared to NLU-based scenarios is the reduced need for frequent retraining. In many cases, it is sufficient to formulate a suitable prompt that allows the LLM to indicate the appropriate action to take, making the system more adaptable to changing conversational needs [9]. However, utilizing LLMs typically involves significant computational resources: running them on cloud platforms incurs costs, while deploying them on-premises requires substantial hardware, including high-performance GPUs and ample memory [10]. Knowledge of how individual LLMs perform in specific applications is therefore crucial for selecting the most suitable model for one’s needs.
In their research, the creators of the Rasa Pro system emphasize the importance of exploring the potential of smaller, more resource-efficient language models. They highlight the need to investigate whether these models can achieve comparable performance to larger models while offering significant cost and latency advantages. A more comprehensive evaluation of the current system, including real-world case studies of production systems, is suggested to better understand the practical implications of these findings and guide future research and development [11].
Various LLMs, including Meta’s Llama2 Chat 7B and 13B, Mistral Instruct v0.1 7B and v0.2 7B, Google’s Gemma 2B and 7B, and OpenAI’s GPT-3.5 and GPT-4, were compared across a range of tasks, such as factuality, toxicity, hallucination, bias, jailbreaking, out-of-scope requests, and multi-step conversations. Llama2 demonstrated strong performance in tasks related to factuality and toxicity handling but showed limitations in identifying and appropriately responding to out-of-scope requests. Mistral excelled in tasks involving hallucination and multi-step conversations yet exhibited weaknesses in detecting and mitigating toxic content generation. Gemma achieved the highest scores in tasks related to bias and jailbreaking, although it frequently declined to respond to certain prompts, particularly those that were deemed inappropriate or potentially harmful. Notably, GPT-4 significantly outperformed all other models in safety tests, highlighting its advanced technological capabilities [12]. Recent studies have highlighted the vulnerabilities of safety alignment in open-access LLMs. Research has demonstrated that safety-aligned LLMs could be reverse-aligned to output harmful content through techniques like reverse supervised fine-tuning and reverse preference optimization, emphasizing the need for robust safety alignment methods [13]. Additionally, studies have explored clean-label backdoor attacks in language models, introducing a method that injects text style as an abstract trigger without external triggers, which poses significant risks to model integrity [14]. Further research investigated jailbreaking attacks against multimodal large language models, revealing the potential for these models to generate objectionable responses to harmful queries through image-based prompts, further underscoring the importance of robust safety measures [15].
The performance of Gemma-2B and Gemma-7B was evaluated across various domains, including cybersecurity, medicine, and finance, by comparing their responses to general knowledge questions and domain-specific queries. The study found that both model size and prompt type significantly influenced the length, quality, and relevance of the generated responses. While general queries often elicited diverse and inconsistent outputs, domain-specific queries consistently produced more concise and relevant responses within reasonable timeframes [16]. This finding highlights the importance of prompt engineering in optimizing LLM performance across diverse domains.
A family of lightweight, modern open models, known as Gemma, was introduced, built upon the foundational research of the Gemini models, with the aim of providing high-quality language generation capabilities while being more accessible and resource-efficient. Available in 2B and 7B parameter versions, Gemma models underwent rigorous evaluation across various domains using both automated benchmarks and human assessments, including human preference tests and expert evaluations. In human preference tests, Gemma-7B demonstrated superior performance, achieving 61.2% positive results, indicating that human evaluators preferred its outputs more often, compared to 45% for Gemma-2B [17].
A diverse collection of powerful open-source language models, known as Llama, ranging in size from 7B to 65B parameters, was introduced. Trained exclusively on publicly available datasets, these models achieved performance comparable to leading models such as Chinchilla-70B and Palm-540B, demonstrating the potential of high-quality models trained on open data. Notably, Llama-13B surpassed GPT-3 (175B) in performance across most benchmarks, demonstrating significant capabilities despite being ten times smaller, highlighting the potential for developing powerful yet resource-efficient language models. The release of these models to the research community aims to democratize access to advanced language models [18].
The impact of varying prompt templates, including plain text, Markdown, YAML, and JSON formats, on the performance of LLM models, specifically GPT-3.5 and GPT-4, was investigated across a range of tasks. Experimental findings demonstrated that GPT-3.5-turbo’s performance could exhibit significant variability, with up to a 40% difference in accuracy observed across different prompt templates. While larger models, such as GPT-4, demonstrated greater robustness to prompt format variations compared to GPT-3.5, the performance of all GPT models evaluated in that study was observed to be influenced by the chosen prompt format. These findings underscore the critical importance of prompt engineering and highlight the lack of a universally optimal prompt template for all GPT models. The significant impact of prompt formatting on model performance emphasizes the crucial need to consider the influence of different prompt formats in future LLM evaluations to ensure accurate assessments and facilitate performance improvements [19].
The effectiveness of various generative language models (LLMs), including Claude v3 Haiku, SetFit with negative augmentation, and Mistral-7B, for intent detection within dialog systems was examined. Claude v3 Haiku demonstrated the highest performance, while SetFit with negative augmentation exhibited an 8% performance decrease. Notably, Mistral-7B demonstrated a significant improvement in query detection accuracy and F1 score, exceeding baseline performance by over 5%, indicating its potential for robust intent detection in dialog systems [20].
The impact of varying prompt formulations, including different levels of instruction detail, the inclusion of examples, and the use of different linguistic styles, on the performance of large language models (LLMs), including ChatGPT-4 and models from the Llama, Mistral, and Gemma families, was investigated. Experimental results revealed significant performance variability across different prompt formulations, with a notable 45.48% difference in accuracy observed between the best and worst performing prompts for Llama2-70B. Furthermore, the study found that common prompt improvement techniques, such as self-improving prompts, voting, and distillation, had limited success in mitigating the impact of poorly performing prompts, highlighting the critical importance of careful prompt engineering in achieving optimal LLM performance [21].
The literature review highlights several key findings. Firstly, it emphasizes the potential of smaller LLMs, such as Llama2 and Gemma, which can offer performance comparable to larger models at lower computational cost and latency. Secondly, the review underscores the diverse strengths and weaknesses of different LLMs: for instance, Llama2 excels in factuality tasks, while GPT-4 demonstrates superior safety performance. Furthermore, the review highlights the significant impact of prompt engineering on LLM performance; experiments with various prompt formats (plain text, Markdown, YAML, JSON) demonstrate significant performance variations across different models. Notably, prompt engineering techniques, while beneficial in some cases, do not consistently mitigate the impact of poorly designed prompts. Finally, the literature review indicates a gap in research regarding the application of LLMs in real-world dialog systems. The authors of the Rasa system, among others, emphasize the need for more comprehensive evaluations, including case studies of production systems, to better understand the practical implications of these findings.
Despite the extensive research on LLMs, there remains a gap in understanding the practical application of these models in real-world dialog systems. This study aims to fill this gap by providing a comprehensive analysis of the performance of various LLMs on the Rasa Pro platform, highlighting the effectiveness of smaller models. By focusing on prompt engineering techniques and model selection, this research offers practical insights for optimizing chatbot performance, thereby contributing to the advancement of AI-based dialog systems. Rasa Pro was chosen for this study due to its robust features and flexibility. It is an open-core product that extends Rasa Open Source, which has over 50 million downloads and is the most popular open-source framework for building chat- and voice-based AI assistants. Rasa Pro adds Conversational AI with Language Models (CALM), a generative AI-native approach to developing assistants, combined with enterprise-ready analytics, security, and observability capabilities. Additionally, Rasa Pro allows for the customization of LLMs and prompt structures, making it an ideal platform for research and experimentation. The platform is widely used by other researchers, further validating its reliability and effectiveness [3].
3. Research Methodology
This study employs a mixed-methods approach, combining quantitative and qualitative research methods to investigate how the choice of LLM and the prompt engineering techniques used influence the quality of responses generated by chatbots.
3.1. Quantitative Methods
The quantitative part of the study involved the analysis of historical customer service conversations conducted by a mobile phone provider. A total of 1054 real phone conversations and 545 real chat conversations were analyzed, which are described in more detail in my previous studies [22,23]. Based on this analysis, 10 main conversation flows were identified, and a total of 400 sample test phrases were developed for these flows (Table 1). The datasets were carefully constructed to ensure accurate and reliable evaluation. An expert reviewed each phrase to ensure its correct assignment to the appropriate conversation flow while introducing lexical diversity. For example, in the context of reporting device damage, synonyms such as phone, device, gadget, and screen were used.
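A minimal sketch of how such a test set can be organized for automated evaluation follows: each phrase is paired with the conversation flow it is expected to trigger, and lexical variants are grouped under the same flow. The flow names and phrases below are hypothetical, not taken from Table 1.

```python
# Hypothetical test cases: (input phrase, expected conversation flow).
# Lexical variants ("phone", "device", "gadget") map to the same expected flow.
test_cases = [
    ("my phone is broken", "report_device_damage"),
    ("the device screen shattered", "report_device_damage"),
    ("my gadget won't turn on", "report_device_damage"),
    ("I want to know about my bill", "payment_inquiry"),
]

flows = sorted({flow for _, flow in test_cases})
print(f"{len(test_cases)} phrases covering {len(flows)} flows")
```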
3.2. Qualitative Methods
The qualitative part of the study involved an in-depth analysis of the literature and a detailed study of the Rasa Pro platform. This analysis led to the formulation of the following research hypotheses:
Smaller LLMs can achieve results comparable to larger models in terms of accuracy for user intent recognition and the selection of appropriate conversation flows [11].
Transforming the bot description within the prompt, such as the information about the chatbot’s purpose, capabilities, and intended use cases, from plain text to a structured format (Markdown, YAML, JSON) can increase the accuracy of responses generated by LLMs [19].
Precisely specifying the expected outcome within the prompt should translate into greater accuracy of responses generated by LLMs [21].
3.3. Inter-Rater Agreement
To ensure the reliability of the qualitative analysis, the inter-rater agreement was assessed. Several invited experts independently reviewed the assignment of phrases to conversation flows. The level of agreement among the experts was high. In cases where disagreements occurred, they were resolved through discussion and consensus among the experts.
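The paper reports agreement qualitatively; if one wished to quantify it for a pair of annotators, a standard choice is Cohen’s kappa. The sketch below is a hedged illustration, assuming scikit-learn is available; the flow labels are invented placeholders, not the experts’ actual assignments.

```python
# Hedged sketch: quantifying inter-rater agreement with Cohen's kappa.
# The label sequences are invented; substitute the experts' real flow assignments.
from sklearn.metrics import cohen_kappa_score

rater_a = ["report_device_damage", "payment_inquiry", "purchase_phone_number", "payment_inquiry"]
rater_b = ["report_device_damage", "payment_inquiry", "purchase_phone_number", "report_device_damage"]

kappa = cohen_kappa_score(rater_a, rater_b)
print(f"Cohen's kappa: {kappa:.2f}")  # values near 1.0 indicate high agreement
```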
To ensure the reliability and repeatability of studies comparing the accuracy of various language models (LLMs) in the context of chatbots serving telecommunications customers, a research environment was designed to meet the following criteria:
Availability and scalability: the chatbot system based on the Rasa Pro platform and the tested language models were configured to operate in a cloud environment, allowing for the easy scaling of computational resources and availability for other researchers.
Openness and reproducibility: all system components were selected to be available for free, at least in a limited version. This allows other researchers to easily replicate the experiments conducted, both for the same and for different conversation flows and datasets.
In this study, a telecommunications chatbot was developed using the Rasa Pro platform. The solution architecture, shown in Figure 1, includes the Rasa Inspector web application [24] designed for user interaction. The core of the system is the Rasa Pro platform, deployed in the GitHub Codespaces cloud environment [25], which has been integrated with other cloud environments such as the Gemini API on Google Cloud [26] and Groq Cloud [27]. This integration provides the chatbot with access to a wide range of advanced language models, enhancing its capabilities and enabling the exploration of different AI/ML models.
Table 2 presents an overview of the LLMs used in the experiments. For each model, the following information was provided:
Working name: adopted in this study for easier identification.
Cloud platform: where the model is available.
Full model name: according to the provider’s naming convention.
Number of parameters: characterizing the size and complexity of the model.
Reference to the literature: allowing for a detailed review of the model description.
To compare the efficiency of different LLMs and various prompt formulations, a series of experiments was conducted. Each experiment consisted of 400 iterations, during which each of the analyzed phrases was tested against the 10 defined conversation flows. Before starting each experiment, the Rasa Pro configuration was modified, changing the prompt template as well as the LLM provider and model. Sample configurations are listed in Table 3. This approach ensured a variety of experiments, comparing different LLMs and various prompt formulations. The specific models were selected based on several key factors: performance, scalability, customization capabilities, and their availability on popular cloud platforms. These models represent a range of sizes and complexities, allowing for a comprehensive evaluation of their effectiveness in different scenarios. Additionally, the selected models are widely used in the research community, providing a solid foundation for comparison and validation of results [17,18,30].
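For readers who want to reproduce the setup, the change between experiments essentially amounts to swapping the LLM provider/model and the prompt template. The snippet below is a hedged sketch of such a per-experiment configuration expressed as a plain Python dictionary; the key names and model identifier are illustrative, and the exact Rasa Pro schema should be taken from Table 3 and the official documentation rather than from this sketch.

```python
# Hedged sketch of the per-experiment configuration, expressed as plain Python.
# Key names are illustrative; consult Table 3 and the Rasa Pro docs for the actual YAML schema.
experiment_config = {
    "llm": {
        "provider": "gemini",            # e.g., Gemini API on Google Cloud or Groq Cloud
        "model": "gemini-1.5-flash-8b",  # working name in this study: Gemini1.5F-8B
    },
    "prompt_template": "prompts/command_generator_yaml_precise.jinja2",  # format + final command variant
}

def describe(config: dict) -> str:
    llm = config["llm"]
    return f'{llm["provider"]}/{llm["model"]} with template {config["prompt_template"]}'

print(describe(experiment_config))
```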
For a detailed analysis of interactions, Rasa Inspector was run in debug mode. This made it possible to track the prompts sent to the LLMs, the models’ responses, and the resulting chatbot actions.
Table 4 presents the abbreviated prompt structure, including potential conversation flows, an example input phrase, possible actions, and the final command that instructs the LLM on how to generate the chatbot’s response. These fragments allow for the reconstruction of the full prompts used in the study using the Rasa Pro documentation and by running Rasa Inspector in debug mode. The full prompts were not included in the article due to their length, as each prompt contained between 58 and 116 lines for each format.
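To make the prompt structure from Table 4 more concrete, the sketch below assembles the same ingredients (available flows, the user’s phrase, the permitted actions, and the final command) into a single prompt string. It is a simplified reconstruction under my own assumptions, not the verbatim Rasa Pro template; the flow names and user phrase are illustrative.

```python
# Simplified reconstruction of the prompt skeleton described in Table 4.
# The real templates are much longer (58-116 lines per format); this only shows the ingredients.
flows = {
    "report_device_damage": "The user wants to report a damaged phone or device.",
    "payment_inquiry": "The user asks about bills or payments.",
}

def build_prompt(user_phrase: str, final_command: str) -> str:
    flow_lines = "\n".join(f"- {name}: {desc}" for name, desc in flows.items())
    return (
        "Available conversation flows:\n"
        + flow_lines
        + f'\n\nUser message: "{user_phrase}"\n\n'
        + "Possible actions: StartFlow(flow_name), Clarify(flow_name[1..n])\n"
        + final_command
    )

print(build_prompt("my phone screen is cracked", "Your action list"))
```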
Despite the expectation that the LLM would consistently generate actions in the required format, instances were observed in which the model produced more descriptive responses. Such responses disrupted the further course of the conversation, as the Rasa Pro system was unable to interpret them. Since the purpose of the phrases was to unambiguously initiate a specific flow, only the actions StartFlow(flow_name) and Clarify(flow_name[1..n]) were considered correct; other types of responses were treated as incorrect.
To quantitatively assess the model’s effectiveness in generating correct actions, the commonly used accuracy metric was applied [33]. For each LLM response, accuracy was calculated as follows:
Actions StartFlow(flow_name): If the given flow_name matched the flow expected for the test phrase, the accuracy was 1; otherwise, it was 0.
Actions Clarify(flow_name[1..n]): If the expected flow was among the provided flow names, the accuracy was calculated as the inverse of the number of provided flow names (1/n).
If none of the provided flows were correct, the accuracy was 0. Sample accuracy values for different model configurations are presented in Table 5. The accuracy for each row is explained as follows:
The LLM response correctly identified the flow name, resulting in an accuracy of 100.00%.
The LLM response did not match the correct flow name, resulting in an accuracy of 0.00%.
The LLM response included the correct flow name purchase_phone_number among the provided options. Since one out of two provided flow names was correct, the accuracy was calculated as one half, resulting in an accuracy of 50.00%.
The LLM response included the correct flow name payment_inquiry among the provided options. Since one out of three provided flow names was correct, the accuracy was calculated as one-third, resulting in an accuracy of 33.33%.
The LLM response did not include the correct flow name, resulting in an accuracy of 0.00%.
In the further part of the study, accuracy, expressed as a percentage, serves as the primary measure for evaluating the quality of responses generated by the model.
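A minimal sketch of this scoring rule is given below, assuming the model’s reply has already been parsed into an action name and a list of flow names; parsing details and the test harness are omitted. The worked calls mirror the Table 5 examples discussed above.

```python
# Scoring rule from Section 3: StartFlow is all-or-nothing against the expected flow;
# Clarify scores 1/n when the expected flow is among the n proposed flows, otherwise 0.
def response_accuracy(action: str, proposed_flows: list[str], expected_flow: str) -> float:
    if action == "StartFlow":
        return 1.0 if proposed_flows == [expected_flow] else 0.0
    if action == "Clarify":
        return 1.0 / len(proposed_flows) if expected_flow in proposed_flows else 0.0
    return 0.0  # any other response type is treated as incorrect

print(response_accuracy("StartFlow", ["payment_inquiry"], "payment_inquiry"))                      # 1.0
print(response_accuracy("Clarify", ["purchase_phone_number", "report_device_damage"],
                        "purchase_phone_number"))                                                  # 0.5
print(response_accuracy("Clarify", ["report_device_damage", "purchase_phone_number",
                                    "payment_inquiry"], "payment_inquiry"))                        # ~0.3333
```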
To verify the hypothesis that modifying the bot description in the prompt from plain text to a structured format affects the accuracy of LLM responses, experiments were conducted with different prompt formats. The bot descriptions and available actions were presented in the following formats: plain text, Markdown, YAML, and JSON (Table 6).
Given the high level of detail in the default Rasa Pro prompts, both in the flow definitions and in the action descriptions (including extensive explanations of the chatbot’s capabilities and potential user interactions), and given the observation that smaller LLMs tend to generate verbose descriptions instead of specific actions, tests were also conducted with modified final commands. The default command, “Your action list”, was labeled the “Concise” command, and the more explicit command, “Return only the action or only the list of actions, no additional descriptions”, was labeled the “Precise” command. The purpose of these tests was to examine the impact of the final command, particularly on the ability of smaller language models to generate the desired response.
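As an illustration of the format variants, the sketch below renders the same bot description as plain text, YAML, and JSON. It assumes PyYAML is installed, and the description content is invented; only the format conversion itself is of interest here.

```python
# Rendering one bot description in the structured formats compared in the study.
# The description content is invented; the study's actual descriptions appear in Table 6.
import json
import yaml  # PyYAML

description = {
    "purpose": "Customer service assistant for a mobile phone provider",
    "capabilities": ["report device damage", "answer payment inquiries"],
}

plain_text = (
    f"{description['purpose']}. "
    f"Capabilities: {', '.join(description['capabilities'])}."
)
as_yaml = yaml.dump(description, sort_keys=False)
as_json = json.dumps(description, indent=2)

for label, rendered in [("Plain text", plain_text), ("YAML", as_yaml), ("JSON", as_json)]:
    print(f"--- {label} ---\n{rendered}\n")
```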
4. Results and Analysis
The aim of the study was to determine the impact of different language models and prompt formats on the accuracy and efficiency of the chatbot. To this end, 64 experiments were conducted, each consisting of 400 iterations, in which various input phrases were tested on 10 defined conversation flows. A wide range of combinations of different language models, prompt formats (plain text, Markdown, YAML, JSON), and final commands (“Concise”, “Precise”) were used. Each iteration corresponded to a single user interaction with the chatbot.
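The count of 64 experiments follows from crossing the models with the four prompt formats and the two final commands (8 × 4 × 2 = 64), assuming the eight working names that appear in the results tables below correspond to the models listed in Table 2. The sketch enumerates that grid.

```python
from itertools import product

# Working model names as they appear in the results tables; full names are given in Table 2.
models = ["Llama-1B", "Llama-3B", "Llama-8B", "Llama-70B",
          "Gemini1.5F-8B", "Gemini1.5F", "Gemini2.0F", "Gemma-9B"]
formats = ["plain text", "Markdown", "YAML", "JSON"]
commands = ["Concise", "Precise"]

experiments = list(product(models, formats, commands))
assert len(experiments) == 8 * 4 * 2 == 64  # one 400-iteration run per combination
```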
Table 7 and Figure 2 present the results for all models using different prompt formats. The analysis of the experimental results for the plain text format indicated a positive correlation between model size and performance. The smallest models, Llama-1B and Llama-3B, achieved very low scores of 7.46 and 32.33, respectively. Models with 8B and 9B parameters (Llama-8B, Gemini1.5F-8B, Gemma-9B) showed significantly better results: 68.99, 64.65, and 69.11. The largest models, Llama-70B, Gemini1.5F, and Gemini2.0F, achieved the best results of 84.03, 90.25, and 83.27, respectively, which is not surprising given their sizes.
Subsequent experiments investigated the impact of different prompt formats on model performance. The accuracy analysis for structured formats indicated that they could improve results. The JSON format improved the performance of the Llama-3B, Gemini1.5F-8B, Gemma-9B, and Llama-70B models, while the YAML format improved the performance of the Gemini1.5F and Gemini2.0F models. The greatest improvement was observed for the Gemini1.5F-8B model, whose accuracy increased from 64.65 for the plain text format to 86.27 for the JSON format, an improvement of 21.62 percentage points.
These findings are consistent with previous studies that have demonstrated the significant impact of prompt engineering on LLM performance [19]. Similar to my results, He et al. (2024) found that structured prompt formats, such as JSON and YAML, could significantly improve the accuracy of LLM responses.
The analysis of experimental results demonstrated that while larger language models generally exhibited higher performance, the use of structured prompt formats, such as JSON and YAML, enabled smaller LLMs to achieve comparable accuracy, highlighting the crucial role of prompt engineering in optimizing model performance.
These findings also align with the work of Bocklisch et al. (2024), who emphasized the potential of smaller, more resource-efficient language models. They suggested that smaller models could achieve comparable performance to larger models, which is supported by the results showing that models like Gemini1.5F-8B and Gemma-9B can perform nearly as well as larger models when using optimized prompts.
Table 8 and Figure 3 present the results of the experiments after changing the prompt command from “Concise” to “Precise”. The analysis of the results shows that the prompt modification proved particularly effective for models that initially exhibited lower baseline accuracy, likely due to their tendency to generate more verbose responses. The most significant improvements were observed in the plain text format for Llama-1B (+3.04) and Llama-3B (+14.48). In the Markdown format, improvements were noted for Llama-1B (+6.68), Llama-3B (+7.11), and Llama-8B (+6.35). The YAML format showed varied responses, with significant improvements for Llama-3B (+11.07), Gemma-9B (+7.20), and Gemini1.5F-8B (+4.00), while Llama-1B (−4.73) and Gemini1.5F (−1.78) experienced a decrease in performance. In the JSON format, improvements were observed for Llama-1B (+5.83), Llama-3B (+5.36), Llama-8B (+3.63), and Gemini1.5F-8B (+1.83), while Gemini2.0F (−1.82) showed negative changes. Notably, Gemini1.5F-8B and Gemma-9B showed improvements with the “Precise” command in some formats.
Table 9 and Figure 4 compare the effectiveness of the highest-performing prompts against the baseline “Plain text/Concise” configuration for each LLM. The smallest models, Llama-1B (13.50), Llama-3B (51.90), and Llama-8B (73.17), despite improvements, ultimately achieved low accuracy. In contrast, Gemini1.5F-8B (88.10) and Gemma-9B (83.28), despite their relatively small size, demonstrated surprisingly good results with the “JSON/Precise” and “YAML/Precise” formats, respectively. For the larger models, Llama-70B (89.64) and Gemini2.0F (89.92), changing the prompt format was beneficial, but changing the command did not help. The most mature model, Gemini1.5F (91.15), achieved the best results, but neither the format nor the final command had a significant impact on its performance.
The analysis of the results shows that the YAML format often outperformed plain text, especially for smaller models, resulting in higher accuracy scores. The JSON format was effective for some models, particularly those of medium size. Significantly, Gemini1.5F-8B and Gemma-9B, despite their relatively small size, demonstrated strong performance after changing the prompt format and command precision. The “Precise” command was beneficial mainly for models that initially achieved lower results, but it did not always bring improvement for high-performing models. The largest models, such as Llama-70B, Gemini1.5F and Gemini2.0F, achieved the best results in the YAML and JSON formats, but changing the command to “Precise” was not always beneficial. The analysis of the results highlights the crucial role of prompt engineering in optimizing LLM performance. The choice of the appropriate prompt format and command can significantly impact the accuracy of smaller models, enabling them to achieve results comparable to larger models.
5. Conclusions
The aim of the conducted research was to determine the optimal operating conditions for a telecommunications chatbot based on the Rasa Pro platform. To this end, a series of experiments were conducted using various language models (LLMs) and diverse prompts. The research focused on the impact of model size, prompt format, and command precision on the quality of the chatbot’s responses.
The obtained results confirmed several significant hypotheses, as summarized in Table 10. Firstly, it was found that smaller language models, such as Gemini1.5F-8B and Gemma-9B, could achieve results only slightly worse than more complex models, such as Gemini1.5F. This finding suggests that in some cases, it is not necessary to use the most complex and computationally expensive models to achieve satisfactory results.
Secondly, the study confirmed the significant impact of prompt format on response quality. Using structured formats, such as YAML or JSON, brought a clear improvement in response accuracy for many of the tested models. Particularly beneficial effects were observed for the Llama-3B, Gemini1.5F-8B, and Gemma-9B models, where the difference in results was most noticeable.
Thirdly, the hypothesis assuming a direct relationship between command precision in the prompt and response quality was confirmed for smaller models Llama-1B, Llama-3B, Llama-8B, Gemini1.5F-8B, and Gemma-9B, while for the largest models, such as Llama-70B, Gemini1.5F, and Gemini2.0F, it was not significant.
The most important discovery is that relatively small models such as Gemini1.5F-8B and Gemma-9B, after applying prompt engineering (formats and the “Precise” command), improved their results and did not significantly differ from the largest models. This finding suggests that these models are suitable for use in chatbots, offering satisfactory performance at lower computational costs.
The methodology and experimental results presented in this study, although tested on the Rasa Pro system, can be generalized to other dialogue platforms based on language models (LLMs). Since the methodology focuses on constructing prompts for LLMs, the principles of prompt engineering and the evaluation metrics used are universal and can be applied to any system utilizing LLMs for generating responses. This makes the findings of this study broadly applicable beyond the specific context of Rasa Pro. In particular, these principles can be effectively applied to other platforms that allow both replacing the LLM and modifying prompts, such as ManyChat, FlowXO, and Pandorabots. For larger platforms like Google Dialogflow, IBM Watson Assistant, and Microsoft Azure AI Bot Service, while it may not always be possible to modify the LLM, prompt modifications are still feasible [1,2]. The flexibility and modularity of these platforms allow for the adaptation of the prompt engineering techniques discussed in this study, ensuring that the insights gained can enhance the performance and reliability of various conversational AI systems. By leveraging the capabilities of these platforms, researchers and developers can experiment with different LLMs and prompt structures, further validating and extending the applicability of the findings presented here.
Limitations of this study include the restriction to specific language models available in the cloud environment, as listed in Table 2. These models were chosen due to their availability and the possibility of free testing. As mentioned in the methodology, the research environment was designed to ensure availability, scalability, openness, and reproducibility, allowing other researchers to easily replicate the experiments conducted.
Future research should focus on more complex conversation scenarios, such as emotion recognition [23,34], contextual language understanding, and multitasking. Additionally, the impact of various machine learning techniques on improving chatbot performance is worth investigating [35]. Future studies should also include additional LLM models for which licenses can be obtained [36]. An interesting research direction is to evaluate how LLMs handle non-obvious phrases, such as those containing sarcasm, to better understand their capabilities and limitations in real-world applications [37].
The results of the conducted research open new perspectives in the field of LLM-based chatbots. They suggest that optimizing chatbot performance does not always require the use of the most powerful available models. Equally important is the proper preparation of the prompt and the choice of the appropriate data format.
The conducted research provided valuable insights into the impact of various factors on the quality of LLM-based chatbots. The research results can contribute to the development of more advanced and efficient solutions in the field of customer service.