Article

How the Choice of LLM and Prompt Engineering Affects Chatbot Effectiveness

Department of Information Systems, Kielce University of Technology, 7 Tysiąclecia Państwa Polskiego Ave., 25-314 Kielce, Poland
Electronics 2025, 14(5), 888; https://doi.org/10.3390/electronics14050888
Submission received: 7 February 2025 / Revised: 20 February 2025 / Accepted: 21 February 2025 / Published: 24 February 2025
(This article belongs to the Special Issue New Trends in Artificial Neural Networks and Its Applications)

Abstract

Modern businesses increasingly rely on chatbots to enhance customer communication and automate routine tasks. This research aimed to determine the optimal configuration of a telecommunications chatbot on the Rasa Pro platform, covering the selection of large language models (LLMs), prompt formats, and command structures, and analyzed how these choices affect response quality. Smaller models, such as Gemini-1.5-Flash-8B and Gemma2-9B-IT, can achieve results comparable to larger models, offering a cost-effective solution; in particular, Gemini-1.5-Flash-8B improved its accuracy by 21.62 percentage points when the JSON prompt format was used. This underscores the importance of prompt engineering techniques, such as structured formats (YAML, JSON) and precise commands. The study used a dataset of 400 sample test phrases created on the basis of real customer service conversations with a mobile phone operator’s customers. The results suggest that optimizing chatbot performance does not always require the most powerful models; proper prompt preparation and data format choice are equally important. The theoretical framework focuses on the interaction between model size, prompt format, and command precision. The findings provide chatbot designers with insights for optimizing performance through LLM selection and prompt construction, and have practical implications for businesses seeking cost-effective and efficient chatbot solutions.

1. Introduction

Modern businesses increasingly use chatbots to improve communication with customers. Their role as the first line of contact in customer service is steadily growing, increasingly automating the handling of routine tasks. This dynamic development drives intensive research on AI-based dialog systems, aiming to enhance their capabilities, improve user experience, and address the challenges of human-like interaction.
The growing demand for and success of conversational agents have spurred the development of various technologies to create these systems. Tech giants like Google (Dialogflow), IBM (Watson Assistant), Microsoft (Azure AI Bot Service), and Amazon (Lex) have released their own chatbot creation platforms. Smaller companies like Rasa, ManyChat, FlowXO, and Pandorabots have also proposed their tools, offering a wider range of solutions for businesses of different sizes [1,2].
Given the diversity of terminology across these systems, this work primarily adopts the terminology of Rasa Pro, a widely used open-source platform with a large and active community [3]. Its highly modular and configurable architecture, combined with the active involvement of a global community of over six hundred developers and ten thousand forum members, allows for continuous innovation and the integration of cutting-edge AI research and techniques [4].
In the initial phase of development, dialog systems were primarily based on Natural Language Understanding (NLU) models. These models function by defining intents, which represent the user’s goal or purpose, and training them on a set of example phrases that express each intent. This allows the NLU model to recognize user intents based on the provided phrases. These defined intents are then combined into conversation scenarios, forming the foundation for more complex interactions [5,6]. Chatbots built this way need to be continuously retrained for both new intents and those that are not effectively recognized. A significant advantage of NLU models is their relatively low computational requirements. They can be trained quickly and efficiently, and their small size minimizes the need for extensive hardware resources, making them cost-effective to deploy and maintain [7]. In practice, this meant that companies could deploy such chatbots without investing in expensive infrastructure. NLU models are also more understandable and easier to manage for developers, speeding up the iterative learning process and improving response quality.
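To make the intent-based approach concrete, the sketch below trains a minimal intent classifier on a handful of example phrases. It is an illustrative, framework-agnostic example built with scikit-learn rather than Rasa’s actual NLU pipeline; the intent names and example phrases are loosely based on the conversation flows discussed later and are not taken from a real training set.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline

# Illustrative training data: each intent is defined by a few example phrases.
TRAINING_DATA = {
    "payment_inquiry": [
        "Can you explain the recent charges on my account?",
        "Why is my bill higher this month?",
    ],
    "report_service_issues": [
        "My internet is acting up. Can you fix it?",
        "The network keeps dropping every few minutes.",
    ],
}

texts = [phrase for phrases in TRAINING_DATA.values() for phrase in phrases]
labels = [intent for intent, phrases in TRAINING_DATA.items() for _ in phrases]

# A lightweight NLU-style model: bag-of-words features plus a linear classifier.
classifier = make_pipeline(TfidfVectorizer(), LogisticRegression())
classifier.fit(texts, labels)

print(classifier.predict(["I do not understand my latest invoice"]))
```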
The emergence of large language models (LLMs) has revolutionized AI dialog platforms, enabling more sophisticated conversation flows. Each flow is designed to not only recognize the customer’s initial intent but also to manage the entire conversation. This is possible because conversation flows leverage the powerful capabilities of LLMs, which allow for continuous understanding of the conversation context and can effectively process subsequent customer statements without the need for explicit definition of additional intents [8].
The aim of this research was to analyze the performance of various large language models (LLMs) on the Rasa Pro platform, with particular attention to the effectiveness of smaller models. This paper makes the following key contributions to the field of AI-based dialog systems:
  • A comprehensive analysis of the performance of various large language models (LLMs) on the Rasa Pro platform, highlighting the effectiveness of smaller models such as Gemini-1.5-Flash-8B and Gemma2-9B-IT.
  • Demonstration of the significant impact of prompt engineering techniques, such as using structured formats like YAML and JSON, on the accuracy and efficiency of chatbot responses.
  • Presentation of practical insights for chatbot designers, emphasizing the importance of model selection and prompt construction in optimizing chatbot performance.
The structure of this paper is as follows: Section 2 presents a review of related work, Section 3 describes the methodology used in this study, Section 4 discusses the results, and Section 5 concludes with insights and future research directions.

2. Related Work

A key advantage of LLM-based conversation flows compared to NLU-based scenarios is the reduced need for frequent retraining. In many cases, it is sufficient to formulate an appropriate prompt that allows the LLM to indicate the appropriate action to be taken, making the system more adaptable to changing conversational needs [9]. However, utilizing LLMs typically involves significant computational resources. Running LLMs on cloud platforms incurs costs, while deploying them on-premises requires substantial hardware resources, including high-performance GPUs and ample memory [10]. Therefore, knowledge about the efficiency of individual LLMs in specific applications is crucial to select the most optimal model for one’s needs.
In their research, the creators of the Rasa Pro system emphasize the importance of exploring the potential of smaller, more resource-efficient language models. They highlight the need to investigate whether these models can achieve comparable performance to larger models while offering significant cost and latency advantages. A more comprehensive evaluation of the current system, including real-world case studies of production systems, is suggested to better understand the practical implications of these findings and guide future research and development [11].
Various LLMs, including Meta’s Llama2 Chat 7B and 13B, Mistral Instruct v0.1 7B and v0.2 7B, Google’s Gemma 2B and 7B, and OpenAI’s GPT-3.5 and GPT-4, were compared across a range of tasks, such as factuality, toxicity, hallucination, bias, jailbreaking, out-of-scope requests, and multi-step conversations. Llama2 demonstrated strong performance in tasks related to factuality and toxicity handling but showed limitations in identifying and appropriately responding to out-of-scope requests. Mistral excelled in tasks involving hallucination and multi-step conversations yet exhibited weaknesses in detecting and mitigating toxic content generation. Gemma achieved the highest scores in tasks related to bias and jailbreaking, although it frequently declined to respond to certain prompts, particularly those that were deemed inappropriate or potentially harmful. Notably, GPT-4 significantly outperformed all other models in safety tests, highlighting its advanced technological capabilities [12]. Recent studies have highlighted the vulnerabilities of safety alignment in open-access LLMs. Research has demonstrated that safety-aligned LLMs could be reverse-aligned to output harmful content through techniques like reverse supervised fine-tuning and reverse preference optimization, emphasizing the need for robust safety alignment methods [13]. Additionally, studies have explored clean-label backdoor attacks in language models, introducing a method that injects text style as an abstract trigger without external triggers, which poses significant risks to model integrity [14]. Further research investigated jailbreaking attacks against multimodal large language models, revealing the potential for these models to generate objectionable responses to harmful queries through image-based prompts, further underscoring the importance of robust safety measures [15].
The performance of Gemma-2B and Gemma-7B was evaluated across various domains, including cybersecurity, medicine, and finance, by comparing their responses to general knowledge questions and domain-specific queries. The study demonstrated a significant correlation between both model size and prompt type and the length, quality, and relevance of the generated responses. While general queries often elicited diverse and inconsistent outputs, domain-specific queries consistently produced more concise and relevant responses within reasonable timeframes [16]. This finding highlights the importance of prompt engineering in optimizing LLM performance across diverse domains.
A family of lightweight, modern open models, known as Gemma, was introduced, built upon the foundational research of the Gemini models, with the aim of providing high-quality language generation capabilities while being more accessible and resource-efficient. Available in 2B and 7B parameter versions, Gemma models underwent rigorous evaluation across various domains using both automated benchmarks and human assessments, including human preference tests and expert evaluations. In human preference tests, Gemma-7B demonstrated superior performance, achieving 61.2% positive results, indicating that human evaluators preferred its outputs more often, compared to 45% for Gemma-2B [17].
A diverse collection of powerful open-source language models, known as Llama, ranging in size from 7B to 65B parameters, was introduced. Trained exclusively on publicly available datasets, these models achieved performance comparable to leading models such as Chinchilla-70B and Palm-540B, demonstrating the potential of high-quality models trained on open data. Notably, Llama-13B surpassed GPT-3 (175B) in performance across most benchmarks, demonstrating significant capabilities despite being ten times smaller, highlighting the potential for developing powerful yet resource-efficient language models. The release of these models to the research community aims to democratize access to advanced language models [18].
The impact of varying prompt templates, including plain text, Markdown, YAML, and JSON formats, on the performance of LLM models, specifically GPT-3.5 and GPT-4, was investigated across a range of tasks. Experimental findings demonstrated that GPT-3.5-turbo’s performance could exhibit significant variability, with up to a 40% difference in accuracy observed across different prompt templates. While larger models, such as GPT-4, demonstrated greater robustness to prompt format variations compared to GPT-3.5, the performance of all GPT models evaluated in that study was observed to be influenced by the chosen prompt format. These findings underscore the critical importance of prompt engineering and highlight the lack of a universally optimal prompt template for all GPT models. The significant impact of prompt formatting on model performance emphasizes the crucial need to consider the influence of different prompt formats in future LLM evaluations to ensure accurate assessments and facilitate performance improvements [19].
The effectiveness of various generative language models (LLMs), including Claude v3 Haiku, SetFit with negative augmentation, and Mistral-7B, for intent detection within dialog systems was examined. Claude v3 Haiku demonstrated the highest performance, while SetFit with negative augmentation exhibited an 8% performance decrease. Notably, Mistral-7B demonstrated a significant improvement in query detection accuracy and F1 score, exceeding baseline performance by over 5%, indicating its potential for robust intent detection in dialog systems [20].
The impact of varying prompt formulations, including different levels of instruction detail, the inclusion of examples, and the use of different linguistic styles, on the performance of large language models (LLMs), including ChatGPT-4 and models from the Llama, Mistral, and Gemma families, was investigated. Experimental results revealed significant performance variability across different prompt formulations, with a notable 45.48% difference in accuracy observed between the best and worst performing prompts for Llama2-70B. Furthermore, the study found that common prompt improvement techniques, such as self-improving prompts, voting, and distillation, had limited success in mitigating the impact of poorly performing prompts, highlighting the critical importance of careful prompt engineering in achieving optimal LLM performance [21].
The literature review highlights several key findings. Firstly, it emphasizes the potential of smaller LLM models, such as Llama2 and Gemma, which can offer comparable performance to larger models with lower computational costs and latency. Secondly, the review underscores the diverse strengths and weaknesses of different LLMs. For instance, Llama2 excels in factuality tasks, while GPT-4 demonstrates superior safety performance. Furthermore, the review highlights the significant impact of prompt engineering on LLM performance. Experiments with various prompt formats (plain text, Markdown, YAML, JSON) demonstrate significant performance variations across different models. Notably, prompt engineering techniques, while beneficial in some cases, do not consistently mitigate the impact of poorly designed prompts. Finally, the literature review indicates a gap in research regarding the application of LLMs in real-world dialog systems. The authors of the Rasa system, among others, emphasize the need for more comprehensive evaluations, including case studies of production systems, to better understand the practical implications of these findings.
Despite the extensive research on LLMs, there remains a gap in understanding the practical application of these models in real-world dialog systems. This study aims to fill this gap by providing a comprehensive analysis of the performance of various LLMs on the Rasa Pro platform, highlighting the effectiveness of smaller models. By focusing on prompt engineering techniques and model selection, this research offers practical insights for optimizing chatbot performance, thereby contributing to the advancement of AI-based dialog systems. Rasa Pro was chosen for this study due to its robust features and flexibility. It is an open-core product built on Rasa Open Source, which has over 50 million downloads and is the most popular open-source framework for building chat and voice-based AI assistants. Rasa Pro extends Rasa Open Source with Conversational AI with Language Models (CALM), a generative AI-native approach to developing assistants, combined with enterprise-ready analytics, security, and observability capabilities. Additionally, Rasa Pro allows for the customization of LLM models and prompt structures, making it an ideal platform for research and experimentation. The platform is widely used by other researchers, further validating its reliability and effectiveness [3].

3. Research Methodology

This study employs a mixed-methods approach, combining both quantitative and qualitative research methods to investigate how the choice of LLM language model and prompt engineering techniques influence the quality of responses generated by chatbots.

3.1. Quantitative Methods

The quantitative part of the study involved the analysis of historical customer service conversations conducted by a mobile phone provider. A total of 1054 real phone conversations and 545 real chat conversations were analyzed, which are described in more detail in my previous studies [22,23]. Based on this analysis, 10 main conversation flows were identified, and a total of 400 sample test phrases were developed for these flows (Table 1). The datasets were carefully constructed to ensure accurate and reliable evaluation. An expert reviewed each phrase to ensure its correct assignment to the appropriate conversation flow while introducing lexical diversity. For example, in the context of reporting device damage, synonyms such as phone, device, gadget, screen were used.
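For illustration, the evaluation set can be thought of as a list of (flow, phrase) pairs. The minimal Python sketch below shows this structure with two phrases taken from Table 1 and one hypothetical lexical variant; the flow identifiers are assumed names, not necessarily those used in the actual dataset.

```python
# Illustrative structure of the 400-phrase evaluation set (10 flows, see Table 1).
TEST_PHRASES = [
    {"flow": "payment_inquiry",       "phrase": "Can you explain the recent charges on my account?"},
    {"flow": "request_device_repair", "phrase": "I need help repairing my broken phone."},
    # Hypothetical lexical variant illustrating the synonym substitution described above.
    {"flow": "request_device_repair", "phrase": "The screen on my device is cracked, can it be repaired?"},
    # ... 400 entries in total, each labeled with its expected conversation flow.
]
```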

3.2. Qualitative Methods

The qualitative part of the study involved an in-depth analysis of the literature and a detailed study of the Rasa Pro platform. This analysis led to the formulation of the following research hypotheses:
  • The use of smaller language models (LLMs) can lead to achieving comparable results in terms of accuracy for user intent recognition and the selection of appropriate conversation flows compared to larger models [11].
  • Transforming the bot description within the prompt, such as providing information about the chatbot’s purpose, capabilities, and intended use cases, from plain text to a structured format (Markdown, YAML, JSON) can contribute to increasing the accuracy of responses generated by LLM models [19].
  • Precisely specifying the expected outcomes within the prompt should translate into greater accuracy of the responses generated by LLM models [21].

3.3. Inter-Rater Agreement

To ensure the reliability of the qualitative analysis, the inter-rater agreement was assessed. Several invited experts independently reviewed the assignment of phrases to conversation flows. The level of agreement among the experts was high. In cases where disagreements occurred, they were resolved through discussion and consensus among the experts.
To ensure the reliability and repeatability of studies comparing the accuracy of various language models (LLMs) in the context of chatbots serving telecommunications customers, a research environment was designed to meet the following criteria:
  • Availability and scalability: the chatbot system based on the Rasa Pro platform and the tested language models were configured to operate in a cloud environment, allowing for the easy scaling of computational resources and availability for other researchers.
  • Openness and reproducibility: all system components were selected to be available for free, at least in a limited version.
This allows other researchers to easily replicate the experiments conducted, both for the same and different conversation flows and datasets.
In this study, a telecommunications chatbot was developed using the Rasa Pro platform. The solution architecture, shown in Figure 1, includes the Rasa Inspector web application [24] designed for user interaction. The core of the system is the Rasa Pro platform, deployed in the GitHub Codespaces cloud environment [25], which has been integrated with other cloud environments such as Gemini API on Google Cloud [26] and Groq Cloud [27]. This integration provides the chatbot with access to a wide range of advanced language models, enhancing its capabilities and enabling the exploration of different AI/ML models.
Table 2 presents an overview of the LLMs used in the experiments. For each model, the following information was provided:
  • Working name: adopted in this study for easier identification.
  • Cloud platform: where the model is available.
  • Full model name: according to the provider’s naming convention.
  • Number of parameters: characterizing the size and complexity of the model.
  • Reference to the literature: allowing for a detailed review of the model description.
To compare the efficiency of different language models (LLMs) and various prompt formulations, a series of experiments was conducted. Each experiment consisted of 400 iterations, during which each of the analyzed phrases was tested on 10 defined conversation flows. Before starting each experiment, the Rasa Pro configuration was modified, changing both the prompt template and the provider and language model (LLM). Sample configurations are listed in Table 3. This approach ensured a variety of experiments, comparing different LLM models and various prompt formulations. The specific models were selected based on several key factors: performance, scalability, customization capabilities, and their availability on popular cloud platforms. These models represent a range of sizes and complexities, allowing for a comprehensive evaluation of their effectiveness in different scenarios. Additionally, the selected models are widely used in the research community, providing a solid foundation for comparison and validation of results [17,18,30].
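As a rough illustration of how these configuration variants could be enumerated programmatically, the sketch below generates one SingleStepLLMCommandGenerator entry per model/prompt-template combination, mirroring the fragments in Table 3. The template file names partly follow Table 3 (the Markdown template name and the overall file layout are assumptions), and the model list is truncated; this is not the exact setup used in the study.

```python
import itertools

import yaml  # PyYAML

# Working names and provider/model identifiers follow Table 2 (list truncated here).
MODELS = {
    "Llama-3B":   ("groq",   "llama-3.2-3b-preview"),
    "Gemma-9B":   ("groq",   "gemma2-9b-it"),
    "Gemini1.5F": ("gemini", "gemini-1.5-flash"),
    # ... remaining five models from Table 2
}
# "default" uses Rasa Pro's built-in prompt template; the others point to custom Jinja2 templates.
PROMPT_TEMPLATES = {
    "default":  None,
    "markdown": "data/markdown.jinja2",  # assumed file name
    "yaml":     "data/yaml.jinja2",
    "json":     "data/json.jinja2",
}

for (name, (provider, model)), (fmt, template) in itertools.product(
        MODELS.items(), PROMPT_TEMPLATES.items()):
    generator = {"name": "SingleStepLLMCommandGenerator",
                 "llm": {"provider": provider, "model": model}}
    if template is not None:
        generator["prompt_template"] = template
    print(f"# experiment: {name} / {fmt}")
    print(yaml.dump([generator], sort_keys=False))
```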
For a detailed analysis of interactions, Rasa Inspector was run in debug mode. This made it possible to track the prompts sent to the LLM models, the LLMs’ responses, and the chatbot’s actions. Table 4 presents the abbreviated prompt structure, including potential conversation flows, an example input phrase, possible actions, and the final command that instructs the LLM on how to generate the chatbot’s response. These fragments, together with the Rasa Pro documentation and Rasa Inspector running in debug mode, allow the full prompts used in the study to be reconstructed. The full prompts were not included in the article due to their length, as each prompt contained between 58 and 116 lines for each format.
Despite the expectation that the LLM would consistently generate actions in the required format, instances were observed where the model provided more descriptive responses. This inconsistency with expectations affected the further course of the conversation, as the Rasa Pro system was unable to interpret the model’s responses. Since the purpose of the phrases was to unambiguously initiate a specific flow, only the actions StartFlow(flow_name) and Clarify(flow_name[1..n]) were considered correct. Other types of responses were treated as incorrect.
To quantitatively assess the model’s effectiveness in generating correct actions, the commonly used metric “Accuracy” was applied [33]. For each LLM response, accuracy was calculated as follows:
  • Action StartFlow(flow_name): if the given flow_name matched the flow expected for the test phrase, the accuracy was 1; otherwise, it was 0.
  • Action Clarify(flow_name[1..n]): if the expected flow was among the n provided flow names, the accuracy was 1/n; if none of the provided flow names was correct, the accuracy was 0.
Sample accuracy values for different model configurations are presented in Table 5. The accuracy for each row is explained as follows:
  • The LLM response correctly identified the flow name, resulting in an accuracy of 100.00%.
  • The LLM response did not match the correct flow name, resulting in an accuracy of 0.00%.
  • The LLM response included the correct flow name purchase_phone_number among the provided options. Since one out of two provided flow names was correct, the accuracy was calculated as one half, resulting in an accuracy of 50.00%.
  • The LLM response included the correct flow name payment_inquiry among the provided options. Since one out of three provided flow names was correct, the accuracy was calculated as one-third, resulting in an accuracy of 33.33%.
  • The LLM response did not include the correct flow name, resulting in an accuracy of 0.00%.
In the further part of the study, accuracy, expressed as a percentage, serves as the primary measure for evaluating the quality of responses generated by the model.
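The scoring rule above can be condensed into a short function. The sketch below is an illustrative re-implementation of the metric rather than code used in the study; it assumes the LLM’s action string arrives exactly in the StartFlow(...) / Clarify(...) form shown in Table 5.

```python
import re

def action_accuracy(llm_response: str, expected_flow: str) -> float:
    """Score one LLM action string against the expected conversation flow (0.0-1.0)."""
    response = llm_response.strip()

    start = re.fullmatch(r"StartFlow\((\w+)\)", response)
    if start:
        # Full credit only when the single started flow is the expected one.
        return 1.0 if start.group(1) == expected_flow else 0.0

    clarify = re.fullmatch(r"Clarify\(([\w\s,]+)\)", response)
    if clarify:
        flows = [name.strip() for name in clarify.group(1).split(",")]
        # Partial credit: inverse of the number of proposed flows, if the expected one is listed.
        return 1.0 / len(flows) if expected_flow in flows else 0.0

    # Descriptive or malformed responses are treated as incorrect.
    return 0.0

# Reproduces the third row of Table 5: one correct flow out of two proposed -> 50%.
print(action_accuracy("Clarify(purchase_internet_service, purchase_phone_number)",
                      "purchase_phone_number"))  # 0.5
```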
To verify the hypothesis that modifying the bot description in the prompt from plain text to a structured format affects the accuracy of LLM responses, experiments were conducted with different prompt formats. The bot descriptions and available actions were presented in the following formats: plain text, Markdown, YAML and JSON (Table 6).
Given the high level of detail of the default prompts in Rasa Pro, both in the flow definitions and in the action descriptions, which include extensive explanations of the chatbot’s capabilities and potential user interactions, and given the observation that smaller LLMs tend to generate extensive descriptions instead of specific actions, tests were also conducted with modified final commands. The default command “Your action list” was named the “Concise” command, and the more precise command “Return only the action or only the list of actions, no additional descriptions” was named the “Precise” command. The purpose of these tests was to examine the impact of this change, particularly on the ability of smaller language models to generate the desired response.
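To illustrate how the same flow description can be rendered in the four formats of Table 6 and combined with either final command, a minimal sketch follows. In Rasa Pro the prompts are actually produced from Jinja2 templates, so the helper below is only a hypothetical approximation of that mechanism.

```python
import json

import yaml  # PyYAML

FLOWS = [{
    "flow_name": "purchase_phone_number",
    "flow_description": "Assist users in purchasing a new phone number.",
}]

FINAL_COMMANDS = {
    "concise": "Your action list:",
    "precise": "Return only the action or only the list of actions, no additional descriptions.",
}

def render_flows(fmt: str) -> str:
    """Render the flow descriptions in one of the four studied prompt formats."""
    if fmt == "plain":
        lines = ["These are the flows that can be started:"]
        lines += [f"{f['flow_name']}: {f['flow_description']}" for f in FLOWS]
        return "\n".join(lines)
    if fmt == "markdown":
        return "### Flows\n" + "\n".join(
            f"* **{f['flow_name']}:** {f['flow_description']}" for f in FLOWS)
    if fmt == "yaml":
        return yaml.dump({"flows": FLOWS}, sort_keys=False)
    if fmt == "json":
        return json.dumps({"flows": FLOWS}, indent=2)
    raise ValueError(f"unknown prompt format: {fmt}")

# Example: JSON-formatted flow section followed by the "Precise" final command.
print(render_flows("json") + "\n" + FINAL_COMMANDS["precise"])
```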

4. Results and Analysis

The aim of the study was to determine the impact of different language models and prompt formats on the accuracy and efficiency of the chatbot. To this end, 64 experiments were conducted, each consisting of 400 iterations, in which various input phrases were tested on 10 defined conversation flows. The experiments covered the combinations of the eight language models, four prompt formats (plain text, Markdown, YAML, JSON), and two final commands (“Concise”, “Precise”). Each iteration corresponded to a single user interaction with the chatbot.
In Table 7 and Figure 2, the results for all models using different prompt formats are presented. For the plain text format, the analysis of the experimental results indicated a positive correlation between model size and performance. The smallest models, such as Llama-1B and Llama-3B, achieved very low scores of 7.46 and 32.33, respectively. Models with 8B and 9B parameters (Llama-8B, Gemini1.5F-8B, Gemma-9B) showed significantly better results: 68.99, 64.65, and 69.11. The largest models, such as Llama-70B, Gemini1.5F, and Gemini2.0F, achieved the best results of 84.03, 90.25, and 83.27, respectively, which is not surprising given their sizes.
Subsequent experiments investigated the impact of different prompt formats on model performance. The accuracy analysis for structured formats indicated that they could improve results. The JSON format improved the performance of the Llama-3B, Gemini1.5F-8B, Gemma-9B, and Llama-70B models, while the YAML format improved the performance of the Gemini1.5F and Gemini2.0F models. The best improvement was shown by the Gemini1.5F-8B model, whose accuracy increased from 64.65 for the plain text format to 86.27 for the JSON format, representing an improvement of 21.62 points.
These findings are consistent with previous studies that have demonstrated the significant impact of prompt engineering on LLM performance [19]. In line with the present results, He et al. [19] found that structured prompt formats, such as JSON and YAML, can significantly improve the accuracy of LLM responses.
The analysis of experimental results demonstrated that while larger language models generally exhibited higher performance, the use of structured prompt formats, such as JSON and YAML, enabled smaller LLM models to achieve comparable accuracy, highlighting the crucial role of prompt engineering in optimizing model performance.
These findings also align with the work of Bocklisch et al. [11], who emphasized the potential of smaller, more resource-efficient language models. They suggested that smaller models could achieve comparable performance to larger models, which is supported by the results showing that models like Gemini1.5F-8B and Gemma-9B can perform nearly as well as larger models when using optimized prompts.
In Table 8 and Figure 3, the results of the experiments after changing the prompt command from “Concise” to “Precise” are presented. The analysis of the results shows that the prompt modification proved particularly effective for models that initially exhibited lower baseline accuracy, likely due to their tendency to generate more verbose responses. The most significant improvements were observed in the plain text format for Llama-1B (+3.04) and Llama-3B (+14.48). In the Markdown format, improvements were noted for Llama-1B (+6.68), Llama-3B (+7.11), and Llama-8B (+6.35). The YAML format showed varied responses, with significant improvements for Llama-3B (+11.07), Gemma-9B (+7.20), and Gemini1.5F-8B (+4.00), while Llama-1B (−4.73) and Gemini1.5F (−1.78) experienced a decrease in performance. In the JSON format, improvements were observed for Llama-1B (+5.83), Llama-3B (+5.36), Llama-8B (+3.63), and Gemini1.5F-8B (+1.83), while Gemini2.0F (−1.82) showed negative changes. Notably, Gemini1.5F-8B and Gemma-9B showed improvements with the “Precise” command in some formats.
Table 9 and Figure 4 compare the highest-performing prompt for each LLM model against the baseline “Plain text/Concise” configuration. The smallest models, Llama-1B (13.50), Llama-3B (51.90), and Llama-8B (73.17), despite improvements, ultimately achieved low accuracy. In contrast, Gemini1.5F-8B (88.10) and Gemma-9B (83.28), despite their relatively small size, demonstrated surprisingly good results with the “JSON/Precise” and “YAML/Precise” configurations, respectively. For the larger models, Llama-70B (89.64) and Gemini2.0F (89.92), changing the prompt format was beneficial, but changing the command did not help. The most mature model, Gemini1.5F (91.15), achieved the best results, but neither the format nor the final command had a significant impact on its performance.
The analysis of the results shows that the YAML format often outperformed plain text, especially for smaller models, resulting in higher accuracy scores. The JSON format was effective for some models, particularly those of medium size. Significantly, Gemini1.5F-8B and Gemma-9B, despite their relatively small size, demonstrated strong performance after changing the prompt format and command precision. The “Precise” command was beneficial mainly for models that initially achieved lower results, but it did not always bring improvement for high-performing models. The largest models, such as Llama-70B, Gemini1.5F and Gemini2.0F, achieved the best results in the YAML and JSON formats, but changing the command to “Precise” was not always beneficial. The analysis of the results highlights the crucial role of prompt engineering in optimizing LLM performance. The choice of the appropriate prompt format and command can significantly impact the accuracy of smaller models, enabling them to achieve results comparable to larger models.

5. Conclusions

The aim of the conducted research was to determine the optimal operating conditions for a telecommunications chatbot based on the Rasa Pro platform. To this end, a series of experiments were conducted using various language models (LLMs) and diverse prompts. The research focused on the impact of model size, prompt format, and command precision on the quality of the chatbot’s responses.
The obtained results confirmed several significant hypotheses, as summarized in Table 10. Firstly, it was found that smaller language models, such as Gemini1.5F-8B and Gemma-9B, could achieve results only slightly worse than more complex models, such as Gemini1.5F. This finding suggests that in some cases, it is not necessary to use the most complex and computationally expensive models to achieve satisfactory results.
Secondly, the study confirmed the significant impact of prompt format on response quality. Using structured formats, such as YAML or JSON, brought a clear improvement in response accuracy for many of the tested models. Particularly beneficial effects were observed for the Llama-3B, Gemini1.5F-8B, and Gemma-9B models, where the difference in results was most noticeable.
Thirdly, the hypothesis assuming a direct relationship between command precision in the prompt and response quality was confirmed for smaller models Llama-1B, Llama-3B, Llama-8B, Gemini1.5F-8B, and Gemma-9B, while for the largest models, such as Llama-70B, Gemini1.5F, and Gemini2.0F, it was not significant.
The most important discovery is that relatively small models such as Gemini1.5F-8B and Gemma-9B, after applying prompt engineering (formats and the “Precise” command), improved their results and did not significantly differ from the largest models. This finding suggests that these models are suitable for use in chatbots, offering satisfactory performance at lower computational costs.
The methodology and experimental results presented in this study, although tested on the Rasa Pro system, can be generalized to other dialogue platforms based on language models (LLMs). Since the methodology focuses on constructing prompts for LLMs, the principles of prompt engineering and the evaluation metrics used are universal and can be applied to any system utilizing LLMs for generating responses. This makes the findings of this study broadly applicable beyond the specific context of Rasa Pro. In particular, these principles can be effectively applied to other platforms where there is the possibility of replacing the LLM, such as ManyChat, FlowXO, and Pandorabots, as well as modifying prompts. For larger platforms like Google Dialogflow, IBM Watson Assistant, and Microsoft Azure AI Bot Service, while there may not always be the possibility to modify the LLM, prompt modifications are still feasible [1,2]. The flexibility and modularity of these platforms allow for the adaptation of the prompt engineering techniques discussed in this study, ensuring that the insights gained can enhance the performance and reliability of various conversational AI systems. By leveraging the capabilities of these platforms, researchers and developers can experiment with different LLMs and prompt structures, further validating and extending the applicability of the findings presented here.
Limitations of this study include the restriction to specific language models available in the cloud environment, as listed in Table 2. These models were chosen due to their availability and the possibility of free testing. As mentioned in the methodology, the research environment was designed to ensure availability, scalability, openness, and reproducibility, allowing other researchers to easily replicate the experiments conducted.
Future research should focus on more complex conversation scenarios, such as emotion recognition [23,34], contextual language understanding, and multitasking. Additionally, the impact of various machine learning techniques on improving chatbot performance is worth investigating [35]. Future studies should also include additional LLM models for which licenses can be obtained [36]. An interesting research direction is to evaluate how LLMs handle non-obvious phrases, such as those containing sarcasm, to better understand their capabilities and limitations in real-world applications [37].
The results of the conducted research open new perspectives in the field of LLM-based chatbots. They suggest that optimizing chatbot performance does not always require the use of the most powerful available models. Equally important is the proper preparation of the prompt and the choice of the appropriate data format.
The conducted research provided valuable insights into the impact of various factors on the quality of LLM-based chatbots. The research results can contribute to the development of more advanced and efficient solutions in the field of customer service.

Funding

This research received no external funding.

Data Availability Statement

The original contributions presented in this study are included in the article. Further inquiries can be directed to the corresponding author.

Conflicts of Interest

The author declares no conflicts of interest.

References

  1. Benaddi, L.; Ouaddi, C.; Khriss, I.; Ouchao, B. Analysis of Tools for the Development of Conversational Agents. Comput. Sci. Math. Forum 2023, 6, 5.
  2. Dagkoulis, I.; Moussiades, L. A Comparative Evaluation of Chatbot Development Platforms. In Proceedings of the 26th Pan-Hellenic Conference on Informatics, Athens, Greece, 25–27 November 2022.
  3. Introduction to Rasa Pro. 2025. Available online: https://rasa.com/docs/rasa-pro/ (accessed on 22 January 2025).
  4. Costa, L.A.L.F.d.; Melchiades, M.B.; Girelli, V.S.; Colombelli, F.; Araujo, D.A.d.; Rigo, S.J.; Ramos, G.d.O.; Costa, C.A.d.; Righi, R.d.R.; Barbosa, J.L.V. Advancing Chatbot Conversations: A Review of Knowledge Update Approaches. J. Braz. Comput. Soc. 2024, 30, 55–68.
  5. Tamrakar, R.; Wani, N. Design and Development of CHATBOT: A Review. In Proceedings of the International Conference on “Latest Trends in Civil, Mechanical and Electrical Engineering”, Online, 12–13 April 2021.
  6. Brabra, H.; Baez, M.; Benatallah, B.; Gaaloul, W.; Bouguelia, S.; Zamanirad, S. Dialogue Management in Conversational Systems: A Review of Approaches, Challenges, and Opportunities. IEEE Trans. Cogn. Dev. Syst. 2022, 14, 783–798.
  7. Matic, R.; Kabiljo, M.; Zivkovic, M.; Cabarkapa, M. Extensible Chatbot Architecture Using Metamodels of Natural Language Understanding. Electronics 2021, 10, 2300.
  8. Sanchez Cuadrado, J.; Perez-Soler, S.; Guerra, E.; De Lara, J. Automating the Development of Task-oriented LLM-based Chatbots. In Proceedings of the 6th ACM Conference on Conversational User Interfaces, Luxembourg, 8–10 July 2024; CUI ’24. Association for Computing Machinery: New York, NY, USA, 2024; pp. 1–10.
  9. Marvin, G.; Hellen, N.; Jjingo, D.; Nakatumba-Nabende, J. Prompt Engineering in Large Language Models. In Proceedings of the Data Intelligence and Cognitive Informatics, Tirunelveli, India, 27–28 June 2023; Jacob, I.J., Piramuthu, S., Falkowski-Gilski, P., Eds.; IEEE: Piscataway, NJ, USA, 2024; pp. 387–402.
  10. Benram, G. Understanding the Cost of Large Language Models (LLMs). 2024. Available online: https://www.tensorops.ai/post/understanding-the-cost-of-large-language-models-llms (accessed on 20 January 2025).
  11. Bocklisch, T.; Werkmeister, T.; Varshneya, D.; Nichol, A. Task-Oriented Dialogue with In-Context Learning. arXiv 2024.
  12. Nadeau, D.; Kroutikov, M.; McNeil, K.; Baribeau, S. Benchmarking Llama2, Mistral, Gemma and GPT for Factuality, Toxicity, Bias and Propensity for Hallucinations. arXiv 2024.
  13. Yi, J.; Ye, R.; Chen, Q.; Zhu, B.; Chen, S.; Lian, D.; Sun, G.; Xie, X.; Wu, F. On the Vulnerability of Safety Alignment in Open-Access LLMs. In Proceedings of the Findings of the Association for Computational Linguistics: ACL 2024, Bangkok, Thailand, 11–16 August 2024; Ku, L.W., Martins, A., Srikumar, V., Eds.; IEEE: Piscataway, NJ, USA, 2024; pp. 9236–9260.
  14. Zhao, S.; Tuan, L.A.; Fu, J.; Wen, J.; Luo, W. Exploring Clean Label Backdoor Attacks and Defense in Language Models. IEEE/ACM Trans. Audio Speech Lang. Process. 2024, 32, 3014–3024.
  15. Niu, Z.; Ren, H.; Gao, X.; Hua, G.; Jin, R. Jailbreaking Attack against Multimodal Large Language Model. arXiv 2024, arXiv:2402.02309.
  16. Amujo, O.E.; Yang, S.J. Evaluating the Efficacy of Foundational Models: Advancing Benchmarking Practices to Enhance Fine-Tuning Decision-Making. arXiv 2024, arXiv:2407.11006.
  17. Gemma Team. Gemma: Open Models Based on Gemini Research and Technology. arXiv 2024.
  18. Touvron, H.; Lavril, T.; Izacard, G.; Martinet, X.; Lachaux, M.A.; Lacroix, T.; Roziere, B.; Goyal, N.; Hambro, E.; Azhar, F.; et al. LLaMA: Open and Efficient Foundation Language Models. arXiv 2024.
  19. He, J.; Rungta, M.; Koleczek, D.; Sekhon, A.; Wang, F.X.; Hasan, S. Does Prompt Formatting Have Any Impact on LLM Performance? arXiv 2024.
  20. Arora, G.; Jain, S.; Merugu, S. Intent Detection in the Age of LLMs. arXiv 2024.
  21. Cao, B.; Cai, D.; Zhang, Z.; Zou, Y.; Lam, W. On the Worst Prompt Performance of Large Language Models. arXiv 2024.
  22. Płaza, M.; Pawlik, Ł.; Deniziak, S. Call Transcription Methodology for Contact Center Systems. IEEE Access 2021, 9, 110975–110988.
  23. Pawlik, L.; Plaza, M.; Deniziak, S.; Boksa, E. A method for improving bot effectiveness by recognising implicit customer intent in contact centre conversations. Speech Commun. 2022, 143, 33–45.
  24. Rasa Inspector. 2025. Available online: https://rasa.com/docs/rasa-pro/production/inspect-assistant/ (accessed on 21 January 2025).
  25. Codespaces Documentation. Available online: https://docs.github.com/en/codespaces (accessed on 21 January 2025).
  26. Gemini API. Available online: https://ai.google.dev/gemini-api/docs (accessed on 21 January 2025).
  27. GroqCloud. Available online: https://groq.com/groqcloud/ (accessed on 21 January 2025).
  28. llama-models/models/llama3_2/MODEL_CARD.md at main · meta-llama/llama-models. Available online: https://github.com/meta-llama/llama-models/blob/main/models/llama3_2/MODEL_CARD.md (accessed on 22 January 2025).
  29. llama-models/models/llama3_1/MODEL_CARD.md at main · meta-llama/llama-models. Available online: https://github.com/meta-llama/llama-models/blob/main/models/llama3_1/MODEL_CARD.md (accessed on 22 January 2025).
  30. Gemini Team Google. Gemini 1.5: Unlocking multimodal understanding across millions of tokens of context. arXiv 2024.
  31. google/gemma-2-9b-it · Hugging Face. 2024. Available online: https://huggingface.co/google/gemma-2-9b-it (accessed on 22 January 2025).
  32. Gemini 2.0 Flash (Experimental) | Gemini API. Available online: https://ai.google.dev/gemini-api/docs/models/gemini-v2 (accessed on 22 January 2025).
  33. Banerjee, D.; Singh, P.; Avadhanam, A.; Srivastava, S. Benchmarking LLM powered Chatbots: Methods and Metrics. arXiv 2023.
  34. Kossack, P.; Unger, H. Emotion-Aware Chatbots: Understanding, Reacting and Adapting to Human Emotions in Text Conversations. In Proceedings of the Advances in Real-Time and Autonomous Systems; Unger, H., Schaible, M., Eds.; Springer: Cham, Switzerland, 2024; pp. 158–175.
  35. Vishal, M.; Vishalakshi Prabhu, H. A Comprehensive Review of Conversational AI-Based Chatbots: Types, Applications, and Future Trends. In Internet of Things (IoT): Key Digital Trends Shaping the Future; Misra, R., Rajarajan, M., Veeravalli, B., Kesswani, N., Patel, A., Eds.; Springer: Singapore, 2023; pp. 293–303.
  36. Dam, S.K.; Hong, C.S.; Qiao, Y.; Zhang, C. A Complete Survey on LLM-based AI Chatbots. arXiv 2024, arXiv:2406.16937.
  37. Wake, N.; Kanehira, A.; Sasabuchi, K.; Takamatsu, J.; Ikeuchi, K. Bias in Emotion Recognition with ChatGPT. arXiv 2023, arXiv:2310.11753.
Figure 1. Chatbot solution architecture.
Figure 2. Bar chart comparing the accuracy of different LLM models across various prompt formats.
Figure 3. Bar chart comparing accuracy improvements for different LLMs with “Precise” vs. “Concise” command.
Figure 4. Bar chart illustrating the highest accuracy achieved by different LLM models with the best prompt formats and commands.
Table 1. Defined conversation flows and example phrases.

Flow | Example Phrase
Add additional phone number | How can I add another phone to my plan?
Cancel device insurance | How do I cancel the insurance on my phone?
Payment inquiry | Can you explain the recent charges on my account?
Purchase internet service | What internet plans do you offer?
Purchase phone number | What’s the process for getting a new phone number with you?
Report service issues | My internet is acting up. Can you fix it?
Request device insurance | I’m interested in getting insurance for my new phone.
Request device repair | I need help repairing my broken phone.
Top up account | How can I increase my data usage?
Transfer phone number | Can you help me move my phone number to you?
Table 2. Overview of LLMs used in the research.

Name | Platform | Model | Parameters | Ref.
Llama-1B | Groq Cloud | llama-3.2-1b-preview | 1.23 billion | [28]
Llama-3B | Groq Cloud | llama-3.2-3b-preview | 3.21 billion | [28]
Llama-8B | Groq Cloud | llama-3.1-8b-instant | 8 billion | [29]
Gemini1.5F-8B | Google Cloud | gemini-1.5-flash-8b | 8 billion | [30]
Gemma-9B | Groq Cloud | gemma2-9b-it | 9.24 billion | [31]
Llama-70B | Groq Cloud | llama-3.1-70b-versatile | 70 billion | [29]
Gemini1.5F | Google Cloud | gemini-1.5-flash | Very large 1 | [30]
Gemini2.0F | Google Cloud | gemini-2.0-flash-exp | Very large 1 | [32]

1 Number of parameters is not publicly available.
Table 3. Selected configurations of prompt templates, LLM provider, and models.

Prompt Template: Default | Provider: Groq | Model: Llama-3B
- name: SingleStepLLMCommandGenerator
  llm:
   provider: groq
   model: llama-3.2-1b-preview

Prompt Template: YAML | Provider: Groq | Model: Gemma-9B
- name: SingleStepLLMCommandGenerator
  prompt_template: data/yaml.jinja2
  llm:
   provider: groq
   model: gemma2-9b-it

Prompt Template: JSON | Provider: Gemini | Model: Gemini1.5F
- name: SingleStepLLMCommandGenerator
  prompt_template: data/json.jinja2
  llm:
   provider: gemini
   model: gemini-1.5-flash
Table 4. Abbreviated prompt structure.

Prompt Section | Prompt Content
Flows | These are the flows that can be started:
        purchase_phone_number: Assist users in purchasing a new phone number.
        cancel_device_insurance: Help users cancel their device insurance.
Phrase | The user just said "Please cancel my device insurance".
Actions | These are your available actions:
        * Starting another flow, described by "StartFlow(flow_name)"
        * Clarifying which flow should be started.
          An example would be Clarify(list_contacts, add_contact)
Command | Your action list:
Table 5. Sample accuracy results for different LLM responses.

Flow | LLM Response | Accuracy [%]
Purchase internet service | StartFlow(purchase_internet_service) | 100.00
Purchase phone number | StartFlow(purchase_internet_service) | 0.00
Purchase phone number | Clarify(purchase_internet_service, purchase_phone_number) | 50.00
Payment inquiry | Clarify(payment_inquiry, top_up_account, purchase_phone_number) | 33.33
Transfer phone number | Clarify(purchase_internet_service, purchase_phone_number) | 0.00
Table 6. Prompt fragments in different formats.

Plain text:
These are the flows that can be started:
purchase_phone_number: Assist users in purchasing a new phone number.
These are your available actions:
* Starting another flow, described by StartFlow(flow_name)

Markdown:
### Flows
* **purchase_phone_number:** Assist users in purchasing a new phone number.
### Actions
* **StartFlow(flow_name):** Starting another flow.

YAML:
flows:
  - flow_name: "purchase_phone_number"
    flow_description: "Assist users in purchasing a new phone number."
actions:
  - action_name: StartFlow(flow_name)
    action_description: Starting another flow.

JSON:
"flows": [{"flow_name": "purchase_phone_number",
  "flow_description": "Assist users in purchasing a new phone number."}]
"actions": [{"action_name": "StartFlow(flow_name)",
  "action_description": "Starting another flow"}]
Table 7. Accuracy of different LLM models across various prompt formats.

LLM | Plain Text | Markdown | YAML | JSON | Best Format | Best Improv.
Llama-1B | 7.46 | 6.82 | 12.71 | 5.32 | YAML | 5.25
Llama-3B | 32.33 | 38.68 | 40.83 | 46.37 | JSON | 14.04
Llama-8B | 68.99 | 66.82 | 68.42 | 65.53 | Plain text | −0.57
Gemini1.5F-8B | 64.65 | 81.52 | 81.40 | 86.27 | JSON | 21.62
Gemma-9B | 69.11 | 75.37 | 76.08 | 77.86 | JSON | 6.26
Llama-70B | 84.03 | 83.50 | 82.89 | 89.64 | JSON | 5.61
Gemini1.5F | 90.25 | 88.63 | 91.15 | 88.74 | YAML | 0.90
Gemini2.0F | 83.27 | 85.38 | 87.77 | 89.92 | YAML | 4.50
Table 8. Accuracy improvement with “Precise” command compared to “Concise” command for different LLMs.

LLM | Plain text (Precise / Improv.) | Markdown (Precise / Improv.) | YAML (Precise / Improv.) | JSON (Precise / Improv.)
Llama-1B | 10.50 / +3.04 | 13.50 / +6.68 | 7.98 / −4.73 | 11.15 / +5.83
Llama-3B | 46.81 / +14.48 | 45.79 / +7.11 | 51.90 / +11.07 | 51.73 / +5.36
Llama-8B | 69.16 / +0.17 | 73.17 / +6.35 | 71.59 / +3.17 | 69.16 / +3.63
Gemini1.5F-8B | 67.34 / +2.69 | 85.20 / +3.68 | 85.40 / +4.00 | 88.10 / +1.83
Gemma-9B | 67.90 / −1.21 | 79.77 / +4.40 | 83.28 / +7.20 | 77.07 / −0.79
Llama-70B | 82.71 / −1.32 | 84.22 / +0.72 | 83.32 / +0.43 | 88.72 / −0.92
Gemini1.5F | 88.49 / −1.76 | 88.74 / +0.11 | 89.37 / −1.78 | 90.65 / +1.91
Gemini2.0F | 84.50 / +1.23 | 86.65 / +1.27 | 86.98 / −0.79 | 88.10 / −1.82
Table 9. The highest accuracy achieved by different LLM models with the best prompt formats and commands.

LLM | Best Prompt | Plain Text/Concise Accuracy | Best Prompt Accuracy | Best Improv.
Llama-1B | Markdown/Precise | 7.46 | 13.50 | +6.04
Llama-3B | YAML/Precise | 32.33 | 51.90 | +19.57
Llama-8B | Markdown/Precise | 68.99 | 73.17 | +4.18
Gemini1.5F-8B | JSON/Precise | 64.65 | 88.10 | +23.45
Gemma-9B | YAML/Precise | 69.11 | 83.28 | +14.17
Llama-70B | JSON/Concise | 84.03 | 89.64 | +5.61
Gemini1.5F | YAML/Concise | 90.25 | 91.15 | +0.90
Gemini2.0F | JSON/Concise | 83.27 | 89.92 | +6.65
Table 10. Summary of main findings.

Finding | Description
Smaller models’ performance | Smaller models (e.g., Gemini1.5F-8B, Gemma-9B) can achieve results close to larger models with optimized prompts.
Impact of prompt format | Structured formats (YAML, JSON) significantly improve response accuracy for many models.
Command precision | Precise commands improve response quality for smaller models but have less impact on larger models.
Disclaimer/Publisher’s Note: The statements, opinions and data contained in all publications are solely those of the individual author(s) and contributor(s) and not of MDPI and/or the editor(s). MDPI and/or the editor(s) disclaim responsibility for any injury to people or property resulting from any ideas, methods, instructions or products referred to in the content.
