1. Introduction
Knee osteoarthritis (KOA) is a common condition worldwide associated with pain and disability [
1]. Globally, 22.9% of people aged 40 and older suffer from KOA, corresponding to around 654 million people in this age group as of 2020 [
2]. This condition can cause joint pain, muscle weakness, and physical disability and significantly reduce quality of life [
3,
4,
5,
6]. Risk factors for developing KOA include obesity [
2], anterior cruciate ligament injuries, meniscal injuries, chondral injuries, and knee fractures [
7]. With rising obesity rates, healthcare systems are likely to face increasing challenges in managing persistent knee pain and physical disability related to KOA [
8].
Treatment of KOA through knee arthroplasty is recommended only for people in the late stage of the disease when conservative treatments are not effective [
9,
10]. In line with this, the clinical practice guidelines [
10,
11,
12,
13] for KOA emphasize the importance of self-efficacy, exercise, and body weight control (where necessary) as fundamental management strategies. However, long-term adherence to exercise programs is low, especially without continuous support from a healthcare professional [
14,
For example, in a recent trial, only 65% of patients with KOA adhered to an exercise program prescribed by a physiotherapist for 8 weeks [
16], and other studies have reported that only 30% of patients with hip or knee osteoarthritis maintained long-term adherence [
17].
Until a few years ago, SMS (Short Message Service) text messaging via mobile phones had established itself as a tool capable of generating positive changes in health behaviors [
Research in which text messages are used to support adherence to knee osteoarthritis treatments [
19,
20,
21] has found that promoting physical activity through this channel can positively influence the behavioral and psychosocial outcomes of patients undergoing treatment. The software applications developed in these studies automated message delivery to provide reminders to patients. In one of these studies, patients were also interviewed to identify barriers and facilitators that could improve treatment adherence. These applications were based on heuristic methodologies grounded in the evidence and experience of the researchers, and algorithms automated the processes to ease administrative tasks while keeping questions and answers standardized. More recently, language models that simulate human conversation, known as chatbots, have substantially improved the user interaction experience. In medicine, there is evidence that chatbots promote healthy lifestyles and improve mental well-being; they are widely used in treatment, education, and the detection of various conditions, standing out for their accessibility [
22].
Recent advances in large language models (LLMs) in the medical field have shown great potential for improving the accuracy and relevance of responses generated by chatbots through the integration of Retrieval-Augmented Generation (RAG) systems. LLM-based chatbots offer several general benefits in healthcare. Firstly, they provide immediate and accessible assistance to patients, enhancing the efficiency of the healthcare system by reducing the workload on medical staff for routine consultations [
23]. Additionally, these chatbots can offer consistent, evidence-based responses, increasing the accuracy and reliability of the information provided [
24]. They can also personalize interactions based on the patient’s history and specific needs, improving user experience and promoting greater adherence to treatments [
25].
For instance, a study described how RAG can significantly enhance the accuracy of medical responses by integrating information retrieved from external databases [
26]. Similarly, another study highlighted the importance of the quality of retrieved documents to ensure precise responses [
27]. Other studies have emphasized the ability of RAG to combine various data sources and improve the accuracy of responses through iterative processes [
28,
29]. RAG can be implemented in several ways, each offering benefits that should be weighed for a given application. The choice of implementation affects the system’s efficiency, the relevance of the retrieved information, and the model’s ability to handle variability in user queries. In this work, we analyze three main RAG variants in the healthcare domain: standard RAG, Corrective Retrieval-Augmented Generation (CRAG), and Self-Reflective Retrieval-Augmented Generation (SELF-RAG) [
30,
31,
32].
RAG integrates documents retrieved from an external database to improve the accuracy of the responses generated by the model, with the relevance of the retrieved documents being critical to ensuring the accuracy and relevance of the provided information [
28,
29]. On the other hand, CRAG introduces a retrieval evaluator that automatically corrects the retrieved information before it is used to generate responses, mitigating the possibility of hallucinations and improving the reliability of the responses [
28]. SELF-RAG includes self-reflection mechanisms that allow the model to continuously evaluate the relevance of the retrieved information and automatically reformulate user queries to ensure that the responses are accurate and relevant. This approach also enhances the fluency of conversations and addresses significant hallucination problems in medical domains [
32]. The relevance of responses is crucial in the healthcare context, as incorrect or irrelevant information can have significant consequences for patient well-being. The ability of these systems to provide accurate, evidence-based responses is fundamental to their effectiveness and reliability in medical applications.
Chatbots have been utilized in various areas of healthcare, including cancer care [
33], behavioral change [
34], and psychiatry [
35], among others. Specifically, in musculoskeletal care, they have been studied in individuals with chronic pain [
36], shoulder arthroplasty [
37], and back pain [
38]. Regarding adherence, Blasco et al. [
37] found that a virtual assistant, functioning as a chatbot through an instant messaging smartphone application, can be an effective approach to enhance adherence and improve compliance rates with early postoperative home rehabilitation in patients undergoing reverse shoulder arthroplasty. However, in the osteoarthritis population, there is currently no evidence of its effectiveness.
Hallucinations in LLMs, where the model generates incorrect or unsupported information, represent a critical challenge in the medical domain, where accuracy and relevance are essential. SELF-RAG is an advanced approach that addresses this issue by integrating self-reflection mechanisms into traditional RAG systems. SELF-RAG dynamically evaluates the relevance and accuracy of the retrieved information, allowing the model to reformulate user queries when necessary to improve the quality and contextual alignment of its responses [
39]. This mechanism reduces hallucinations by enabling the model to iteratively assess and refine its outputs, a feature particularly important for applications where incorrect information can have serious implications. Additionally, SELF-RAG enhances conversational coherence and fluency, making interactions more effective and contextually relevant. Studies have shown that self-reflection allows the model to detect and correct inaccuracies during response generation, significantly improving reliability [
40]. In this study, SELF-RAG plays a central role in developing a chatbot designed to improve treatment adherence in patients with knee osteoarthritis (KOA). By generating evidence-based responses tailored to patient needs, SELF-RAG supports personalized medical care and promotes adherence to prescribed treatments. Moreover, the reflective and structured processes within SELF-RAG enable the model to handle variability in user queries effectively, ensuring response accuracy in dynamic and complex clinical scenarios [
41,
42]. Existing research further highlights the importance of integrating robust retrieval and reflection mechanisms to enhance both the relevance and precision of outputs, especially in high-stakes environments like healthcare [
43,
44].
The Chain of Thought (CoT) [
45] is a technique in the context of LLMs that enhances the reasoning capabilities of these models by generating a series of intermediate natural language steps leading to a final response. This technique has proven effective in improving accuracy in complex tasks by breaking down problems into more manageable steps, allowing models to allocate more computational resources to problems requiring a higher level of reasoning. CoT facilitates the generation of reasoning sequences similar to human thought processes, enhancing not only the accuracy and relevance of responses but also the fluency and coherence of conversations.
SELF-RAG benefits from the CoT structure by reducing hallucinations through the creation of a logical sequence of reasoning steps that the model follows. This process ensures that the generated responses are accurate, relevant, and easy to understand, which is essential for effective interaction between the chatbot and patients [
20]. In summary, implementing Chain of Thought within SELF-RAG for the KOA chatbot provided a robust approach that ensured precise and coherent responses, improving patient experience and treatment adherence. Effectively implementing CoT and SELF-RAG requires an infrastructure that can model complex processes and manage iterative and conditional workflows: breaking problems into intermediate steps and continuously adjusting responses demands a robust and flexible system. Graphs are fundamental for modeling these processes, as they represent sequences of tasks and the conditions required to progress between them, and they facilitate the visualization and management of complex workflows, ensuring that each step is executed in the correct order and under the appropriate conditions.
LangGraph is a library specifically designed for this purpose (LangChain, n.d.). In this study, version 0.2.50 of LangGraph was utilized, which provides tools for creating and managing graphs that represent complex workflows in LLM applications as part of LangChain. The use of this specific version ensures reproducibility and consistency in the described processes. LangChain allows defining chains of computation (Directed Acyclic Graphs or DAGs), while LangGraph introduces the ability to add cycles, enabling more complex, agent-like behaviors where you can call an LLM in a loop, asking it what action to take next [
21,
22]. This framework can model graphs by defining workflows that include nodes for document retrieval, relevance evaluation, response generation, query transformation, and translation [
21]. The ability to handle cycles and branching allows for continuous iteration over queries and responses, refining them continually. The edges can be both sequential and conditional, adding flexibility to the decision-making process within the workflow [
22], which also provides a solid foundation for implementing SELF-RAG in our chatbot, facilitating the creation of complex workflows and ensuring the accuracy and relevance of generated responses [
23]. Our main aim was to develop a chatbot based on LLMs with SELF-RAG to improve treatment adherence in patients with KOA.
Previous interventions for improving treatment adherence, such as SMS-based systems, relied on standardized and heuristic processes that lacked personalization and adaptability to individual patient needs. While chatbots have demonstrated potential in areas such as mental health and behavioral modification, their application to treatment adherence in knee osteoarthritis (KOA) remains underexplored. Moreover, existing systems fail to incorporate advanced techniques such as Self-Reflective Retrieval-Augmented Generation (SELF-RAG) and Chain of Thought (CoT), which are critical for mitigating hallucinations and ensuring the delivery of accurate, context-aware, and evidence-based responses. In clinical applications, response quality is crucial, as inaccurate or irrelevant information can directly affect patient outcomes. This study addresses these challenges by curating a robust knowledge base through a systematic review guided by the PRISMA framework, ensuring the chatbot’s responses are grounded in high-quality, evidence-based clinical guidelines. Additionally, the implementation of a graph-based infrastructure enhances the conversational tone, aligning it with the needs of clinical interactions. This approach improves the chatbot’s capacity to deliver precise and empathetic responses while fostering a more engaging user experience. By integrating these innovations, this work establishes a comprehensive framework for addressing adherence challenges in the management of chronic conditions such as KOA.
2. Materials and Methods
We conducted a systematic review following PRISMA guidelines to build an evidence-based knowledge base. The chatbot was developed using a SELF-RAG framework, which combines advanced language modeling with iterative self-evaluation, and was deployed on the Telegram platform. The entire process is detailed in the following sections.
2.1. Knowledge Base Construction
To generate the body of knowledge, a rapid systematic review was conducted, adhering to the PRISMA [
46] statement for systematic reviews. Regarding the study selection criteria, we included clinical practice guidelines, along with the studies cited in them that we were able to retrieve, and used the AGREE II tool to assess their methodological rigor and transparency. Guidelines were included if they scored ≥60% in domains 3 (rigor of development) and 6 (editorial independence) and in at least one other domain, ensuring that only high-quality recommendations were considered. Studies with a high risk of bias were excluded. Participants were individuals diagnosed with knee osteoarthritis (KOA), and any intervention related to KOA management was accepted. Additionally, all types of outcome measures were included.
The search strategy involved querying the following databases up to March 2024 without restrictions on language, publication year, or publication status: CENTRAL, MEDLINE (via PubMed), and SCOPUS. We also searched guideline repositories (Clinical Practice Guidelines Portal (National Health and Medical Research Council), Epistemonikos, Guidelines International Network, Turning Research Into Practice (TRIP), Cochrane, Orthoguidelines.org, and the National Institute for Health and Care Excellence (NICE)). The search strategy combined the following terms: “knee osteoarthritis” OR “osteoarthritis of the knee” OR gonarthritis AND “clinical practice guideline” OR “practice guideline” OR guideline OR recommendation OR consensus.
The selection and management of studies were conducted independently by two authors (J.G.A. and H.F.) using the Rayyan software (
https://www.rayyan.ai/, accessed on 16 January 2025). They screened the titles and abstracts of all studies identified in the initial searches against the inclusion criteria, excluding clearly irrelevant studies at this stage. When eligibility could not be determined from the title and abstract, the full-text article was retrieved for further assessment. The two authors then independently reviewed all full-text studies, and disagreements were resolved by discussion. For quality appraisal, two reviewers (J.G. and D.O.) independently appraised each guideline using the online version of the AGREE II tool. The AGREE II includes 23 items across 6 domains (scope and purpose, stakeholder involvement, rigor of development, clarity of presentation, applicability, and editorial independence), along with an overall rating item and a recommendation decision for the guideline [
47]. They scored each item on a 7-point scale (1 = strongly disagree; 7 = strongly agree). As mentioned above, we included guidelines of high quality, defined as a score of ≥60% in domains 3 (rigor of development) and 6 (editorial independence) and in at least one other domain.
Data Preprocessing
The final dataset consists of 113 full-text articles screened from a total of 9355 articles initially identified.
Figure 1 describes the selection process. Twenty-nine guidelines were assessed for quality appraisal; however, only six high-quality guidelines were identified [
10,
11,
12,
13,
48,
49]. We also added the high-quality studies (defined by the clinical guideline) that we were able to retrieve from each included clinical guideline (
Appendix A).
Table 1 describes the results of the AGREE II tool for each domain, and
Table 2 summarizes the general characteristics of each guideline.
Given that the proposed system utilizes a Retrieval-Augmented Generation approach, the data preprocessing primarily focused on structuring and formatting the knowledge base described in the Knowledge Base Construction section to ensure compatibility with the retrieval system. The dataset, available at (
https://ialab.s3.us-east-2.amazonaws.com/osteoarthritis_publications.zip) (accessed on 20 January 2025), was first tokenized and indexed to enable efficient retrieval during user interactions. No additional feature engineering or significant transformations were necessary, as the model directly retrieves the relevant information. All documents were embedded and stored in the ChromaDB vector database, which facilitated efficient vector searches during inference. This tool is relevant to the system because it allows efficient storage and retrieval of the vector representations (embeddings) of the documents selected as relevant during inference, according to user interactions. Furthermore, the embeddings were processed to ensure alignment with the query format used by the large language model.
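To make the preprocessing step concrete, the following is a minimal sketch of how the guideline documents could be chunked, embedded, and stored in ChromaDB with LangChain. The directory name, chunk sizes, and the nomic-embed-text embedding model are illustrative assumptions; the paper does not specify these parameters.

```python
# Minimal sketch of the knowledge-base ingestion step (assumptions noted in comments).
from langchain_community.document_loaders import PyPDFDirectoryLoader
from langchain.text_splitter import RecursiveCharacterTextSplitter
from langchain_community.embeddings import OllamaEmbeddings
from langchain_community.vectorstores import Chroma

# Load the guideline PDFs that make up the knowledge base (hypothetical local folder).
docs = PyPDFDirectoryLoader("osteoarthritis_publications/").load()

# Split documents into overlapping chunks so retrieval returns focused passages
# (chunk size and overlap are illustrative values).
chunks = RecursiveCharacterTextSplitter(chunk_size=1000, chunk_overlap=100).split_documents(docs)

# Embed the chunks and persist them in a local ChromaDB collection
# (the embedding model served by Ollama is an assumption).
vectorstore = Chroma.from_documents(
    documents=chunks,
    embedding=OllamaEmbeddings(model="nomic-embed-text"),
    collection_name="koa_guidelines",
    persist_directory="chroma_koa",
)
retriever = vectorstore.as_retriever(search_kwargs={"k": 4})
```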
2.2. Model Selection: Detailed Description of Mistral and Its Integration
At the time of writing, Mistral 7B [
50] was among the leading models in natural language processing (NLP) benchmarks relevant to the medical domain of this study. Its performance in language comprehension and generation tasks, coupled with its deployment efficiency, positioned it as a suitable choice for developing the proposed chatbot. Moreover, its computational efficiency and adaptability through techniques such as Parameter-Efficient Fine-Tuning (PEFT) enabled the optimization of available resources without compromising the accuracy of generated responses. Mistral 7B is an open-weight model designed for high-performance NLP tasks. It employs a Transformer architecture, which facilitates the handling of complex linguistic relationships and supports a wide range of tasks with high accuracy. With seven billion parameters, the model achieves a balance between computational efficiency and robust task performance, making it particularly well-suited for integration into resource-constrained systems such as healthcare chatbots. Additionally, Mistral 7B incorporates architectural optimizations such as grouped-query attention and sliding-window attention, which reduce the memory and computational cost of inference, particularly over long inputs, while maintaining accuracy. The model was pretrained on diverse, high-quality datasets, including medical corpora, enabling effective generalization to clinical tasks. Its capabilities in text generation, translation, and question-answering form the basis of the chatbot’s functionality. In this work, Mistral 7B is instrumental in generating accurate and context-aware responses, meeting the stringent requirements of clinical precision and reliability.
Recent studies have demonstrated Mistral’s efficacy in the medical domain. For instance, the implementation of BioMistral has improved pretraining efficiency and accuracy in answering complex medical questions [
51]. Similarly, Mistral-Plus has developed reward models that ensure helpful and safe responses in conversational interactions [
52]. These advancements illustrate that Mistral is a suitable choice for our project, providing a solid foundation for generating accurate and relevant responses in the medical context. Additionally, advanced techniques such as supervised fine-tuning and Parameter-Efficient Fine-Tuning (PEFT) enable efficient utilization of computational resources while maintaining high performance [
53,
54]. Mistral 7B incorporates strategies to mitigate hallucinations, a critical challenge in medical applications. For instance, reward models, such as those implemented in Mistral-Plus, adjust responses based on utility and safety criteria, ensuring that generated outputs align with clinical requirements. Furthermore, self-reflection mechanisms are integrated to continuously evaluate the relevance and accuracy of generated responses, improving the model’s reliability in delivering evidence-based, contextually appropriate outputs.
The efficiency of Mistral 7B at inference time is further supported by these architectural optimizations, which reduce computational overhead and enable the deployment of the model in resource-constrained environments without compromising accuracy. Combined with the Retrieval-Augmented Generation (RAG) approach, Mistral 7B enhances the chatbot’s ability to deliver personalized, evidence-based responses, improving patient adherence to treatments. This functionality supports continuous refinement of interactions while upholding rigorous standards of accuracy and relevance, which is key to the success of the proposed system.
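As an illustration of how the locally served Mistral 7B model can be invoked for grounded generation, the sketch below builds a minimal LangChain prompt-to-model chain through Ollama. The prompt text, model tag, and temperature are assumptions rather than the exact configuration used in the study.

```python
# Minimal sketch of calling the locally served Mistral model through Ollama;
# prompt wording, model tag, and temperature are illustrative assumptions.
from langchain_community.chat_models import ChatOllama
from langchain_core.prompts import ChatPromptTemplate
from langchain_core.output_parsers import StrOutputParser

llm = ChatOllama(model="mistral", temperature=0)

prompt = ChatPromptTemplate.from_messages([
    ("system", "You are an assistant that answers questions about knee osteoarthritis "
               "using only the provided context. Context: {context}"),
    ("human", "{question}"),
])

chain = prompt | llm | StrOutputParser()
answer = chain.invoke({
    "context": "Exercise therapy is a core treatment for knee osteoarthritis.",
    "question": "What non-pharmacological treatment is recommended for knee osteoarthritis?",
})
print(answer)
```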
2.3. Computing Infrastructure
As previously mentioned, this proposal is based on the RAG approach, which eliminates the need to fine-tune pre-trained models. This approach optimizes computational resources by retrieving relevant information from a large document collection and generating responses efficiently. Given that the system requires only a moderately capable GPU but substantial RAM, the infrastructure facilitated agile and accurate user interactions with the chatbot while optimizing resource utilization.
The system utilized for this work is equipped with an NVIDIA GeForce RTX 3060 Ti GPU featuring 8 GB of memory and 125.7 GB of RAM, which enables efficient handling of deep learning processes and information retrieval.
Table 3 summarizes the key hardware and software components employed in the infrastructure. On the software side, several critical tools were employed. Ollama 0.1.17 was used to support the Mistral model, optimizing natural language processing tasks. PyTorch 1.10.1 served as the primary framework for developing and training deep learning models, while LangChain 0.1.20 facilitated the implementation of SELF-RAG, orchestrating retrieval and response generation processes. Furthermore, ChromaDB 0.5.0 was used for efficient embedding storage and retrieval, ensuring robust data management throughout the system.
2.4. Chatbot Architecture
The implementation of this application is based on several services deployed in Docker containers that integrate to respond to user requests through the Telegram application. As detailed in
Figure 2, the software architecture of this application consists of three main components: the Telegram application, a web server that acts as a webhook [
55], and a set of applications that respond to user queries, referred to as the LLM server [
56].
Telegram is a messaging application that provides the interface between the user and the responses generated by the language model in reply to the user’s questions or comments. The webhook server has two main functions. The first is to schedule messages reminding the user to perform the daily exercises prescribed by the specialist. The second is to structure the user’s conversation flow based on Nelligan et al. [
21] (
Figure 3).
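The following sketch illustrates, under stated assumptions rather than as the authors’ actual implementation, how a webhook server could fulfil these two roles: forwarding Telegram updates to the LLM server and scheduling a daily exercise reminder. The bot token, LLM server URL, response format, and reminder time are hypothetical placeholders (the endpoint path matches the illustrative LangServe sketch shown later).

```python
# Illustrative sketch of the two webhook-server roles described above.
# TOKEN, LLM_SERVER_URL, and the reminder time are hypothetical placeholders.
import requests
from flask import Flask, request
from apscheduler.schedulers.background import BackgroundScheduler

TOKEN = "YOUR_TELEGRAM_BOT_TOKEN"
LLM_SERVER_URL = "http://llm-server:8000/koa-chat/invoke"
TELEGRAM_API = f"https://api.telegram.org/bot{TOKEN}/sendMessage"

app = Flask(__name__)
subscribed_chats: set[int] = set()

def send_message(chat_id: int, text: str) -> None:
    requests.post(TELEGRAM_API, json={"chat_id": chat_id, "text": text}, timeout=30)

def daily_reminder() -> None:
    # Remind every subscribed patient to perform the exercises prescribed by the specialist.
    for chat_id in subscribed_chats:
        send_message(chat_id, "Reminder: please complete today's prescribed knee exercises.")

@app.post("/webhook")
def webhook():
    update = request.get_json(force=True)
    chat_id = update["message"]["chat"]["id"]
    question = update["message"].get("text", "")
    subscribed_chats.add(chat_id)
    # Forward the user's question to the LLM server (LangServe-style endpoint) and relay the answer.
    reply = requests.post(LLM_SERVER_URL, json={"input": {"question": question}}, timeout=120).json()
    send_message(chat_id, reply.get("output", "Sorry, I could not generate a response."))
    return {"ok": True}

scheduler = BackgroundScheduler()
scheduler.add_job(daily_reminder, "cron", hour=9)  # 09:00 daily reminder (assumed time)
scheduler.start()

if __name__ == "__main__":
    app.run(host="0.0.0.0", port=8080)
```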
The relevant component of this architecture is the LLM server, which organizes all the tools that enable the generation of bot responses according to user inputs. As detailed in
Figure 4, this architecture integrates the following libraries to achieve these results:
Ollama: Ollama provides a runtime and accompanying library that make it straightforward to run open-weight language models locally and integrate them into applications, enabling automatic response generation, text analysis, and other natural language processing tasks. In this project, the library was used to deploy the Mistral model and integrate it with the workflow that processes user inputs to generate an appropriate response in the context of KOA.
LangServe: A service that facilitates the implementation and management of large-scale NLP models. It is designed to streamline the integration of advanced language models, enabling efficient deployment, scaling, and maintenance of these models. It provided the project’s application with an Application Programming Interface (API) to handle requests to the NLP service, facilitating interoperability with the project’s webhook service.
LangChain: An open-source framework designed to help developers build language model-driven applications more effectively and efficiently. LangChain enables the integration of LLM models without the need to manage all the underlying infrastructure and interoperates with other tools or services. This allows for easy integration with APIs, databases, and other components of the software ecosystem.
LangGraph: A framework designed to create complex, stateful workflows in applications like chatbots and automated support agents. It allows developers to define nodes and edges that represent different states and transitions within a workflow. Each node can be a function or a model call, and the edges determine how the process flows from one node to another.
ChromaDB: An open-source vector store used to store and retrieve embeddings. Its primary use in the project is to store document vectors along with metadata for later use by the Mistral LLM model through the workflow defined in LangGraph.
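Taken together, these components can be wired into a small LLM server. The sketch below is a minimal illustration assuming a simple prompt-to-Mistral chain exposed through LangServe on a hypothetical /koa-chat path and port; in the actual system, the compiled LangGraph workflow described in the following subsections would be served instead.

```python
# Minimal sketch of the LLM server: a LangChain chain backed by the local Mistral model
# (via Ollama) exposed as a REST API with LangServe. Path and port are assumptions.
from fastapi import FastAPI
from langserve import add_routes
from langchain_community.chat_models import ChatOllama
from langchain_core.prompts import ChatPromptTemplate
from langchain_core.output_parsers import StrOutputParser

llm = ChatOllama(model="mistral", temperature=0)
prompt = ChatPromptTemplate.from_template(
    "Answer the following question about knee osteoarthritis management:\n{question}"
)
chain = prompt | llm | StrOutputParser()

app = FastAPI(title="KOA chatbot LLM server")
# Exposes /koa-chat/invoke, /koa-chat/batch, and /koa-chat/stream endpoints.
add_routes(app, chain, path="/koa-chat")

if __name__ == "__main__":
    import uvicorn
    uvicorn.run(app, host="0.0.0.0", port=8000)
```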
2.5. Chatbot Workflow
The chatbot workflow was designed to optimize efficiency and accuracy in processing medical queries, utilizing a combination of advanced technologies. In the initial stage, the Mistral 7B model is deployed on a cloud-based server to ensure scalability and accessibility. This deployment is managed using Docker containers, facilitating the scalability and maintenance of the system. The connection with LangChain and LangGraph is crucial for the seamless integration of the Mistral 7B model into the chatbot’s workflow. LangChain enables the smooth incorporation of the model into chatbot applications, while LangGraph manages the complex workflows necessary to effectively handle user queries. LangGraph’s ability to define workflow graphs and, beyond simple Directed Acyclic Graphs (DAGs), to introduce cycles ensures that the chatbot can handle iterative processes and continuously refine its responses.
The chatbot’s workflow design, as depicted in the provided diagram (
Figure 5), is implemented using LangGraph, which allows for the definition of nodes and edges that represent different states and transitions within the interaction process. Key nodes include document retrieval (Retrieve), relevance evaluation (GradeDocuments), response generation (Generate), and query transformation (TransformQuery). The edges define the flow of operations, ensuring that each step is executed in the correct sequence and under appropriate conditions.
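A condensed sketch of how such a workflow can be expressed with LangGraph is shown below. The node bodies are reduced to simple placeholders standing in for the ChromaDB retriever and the Mistral-backed chains described elsewhere in this section, and the routing logic is an assumption about how the conditional edges are wired.

```python
# Sketch of the Figure 5 workflow as a LangGraph state graph. The helper functions are
# simplified placeholders; in the real system they wrap the ChromaDB retriever and
# Mistral-backed LangChain chains.
from typing import List, TypedDict
from langgraph.graph import StateGraph, START, END

class ChatState(TypedDict):
    question: str
    documents: List[str]
    generation: str

# --- Placeholder components (stand-ins for the retriever and LLM chains) ---
def retrieve_documents(question: str) -> List[str]:
    return ["Exercise therapy is a core treatment for knee osteoarthritis."]

def is_relevant(document: str, question: str) -> bool:
    return "osteoarthritis" in document.lower()

def generate_answer(documents: List[str], question: str) -> str:
    return "Based on the clinical guidelines: " + " ".join(documents)

def rewrite_question(question: str) -> str:
    return question + " (knee osteoarthritis management)"

# --- Graph nodes -----------------------------------------------------------
def retrieve(state: ChatState) -> dict:
    return {"documents": retrieve_documents(state["question"])}

def grade_documents(state: ChatState) -> dict:
    return {"documents": [d for d in state["documents"] if is_relevant(d, state["question"])]}

def generate(state: ChatState) -> dict:
    return {"generation": generate_answer(state["documents"], state["question"])}

def transform_query(state: ChatState) -> dict:
    return {"question": rewrite_question(state["question"])}

def decide_next(state: ChatState) -> str:
    # If no relevant documents survived grading, rewrite the query; otherwise generate.
    return "transform_query" if not state["documents"] else "generate"

workflow = StateGraph(ChatState)
workflow.add_node("retrieve", retrieve)
workflow.add_node("grade_documents", grade_documents)
workflow.add_node("generate", generate)
workflow.add_node("transform_query", transform_query)

workflow.add_edge(START, "retrieve")
workflow.add_edge("retrieve", "grade_documents")
workflow.add_conditional_edges("grade_documents", decide_next,
                               {"transform_query": "transform_query", "generate": "generate"})
workflow.add_edge("transform_query", "retrieve")  # cycle: retry retrieval with the rewritten query
workflow.add_edge("generate", END)

chatbot_graph = workflow.compile()
print(chatbot_graph.invoke({"question": "What exercise is recommended for knee osteoarthritis?",
                            "documents": [], "generation": ""}))
```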
Finally, to facilitate user interaction, the chatbot is integrated into the Telegram messaging platform. This integration allows patients to conveniently interact with the chatbot, receiving timely reminders and responses tailored to their specific needs.
2.6. First Iteration of SELF-RAG Implementation: Workflow for Initial Query Processing and Spanish Translation Evaluated via User Surveys
The first version of the SELF-RAG implementation features a detailed workflow designed to process user queries and generate accurate, contextually relevant responses. As shown in
Figure 5, the workflow begins with the submission of a user query, which is then processed through several stages: document retrieval, relevance evaluation, query transformation, response generation, and final evaluation. The output is provided in the target language, ensuring its suitability for medical contexts. The integration of the CoT approach enhances the language model’s reasoning capabilities by breaking down complex queries into intermediate steps. This method ensures that each stage of the workflow, from document retrieval to response generation, incorporates reflective processes that continuously refine and validate the information.
Figure 5 shows the workflow.
The workflow begins with retrieving relevant documents from ChromaDB based on the user’s initial query (node: Retrieve). These documents are then evaluated for relevance (node: GradeDocuments); those considered relevant move to the generation phase (node: Generate), while non-relevant documents trigger a transformation of the original query to improve retrieval outcomes (node: TransformQuery). Using the relevant documents, the system generates a response (node: Generate), which is then evaluated (node: GradeGeneration). If the response is supported and useful, it proceeds to the translation phase (node: TranslateQuery); otherwise, it may be sent back for reformulation or regeneration. Decisions between these stages are represented by LangGraph edges that define the sequence and conditions under which information moves through the system. This structured approach, integrating CoT within the SELF-RAG framework, significantly enhances the reliability and fluency of the responses, addressing key challenges in medical chatbot applications. The prompts for each workflow stage are detailed in
Appendix B.
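As an illustration of the GradeDocuments step, the sketch below asks the local Mistral model for a binary relevance judgement in JSON format. The prompt wording is an assumption for illustration only; the prompts actually used at each stage are those reported in Appendix B.

```python
# Illustrative sketch of a document-relevance grader for the GradeDocuments node;
# the prompt text is an assumption, not the prompt used in the study (see Appendix B).
from langchain_community.chat_models import ChatOllama
from langchain_core.prompts import ChatPromptTemplate
from langchain_core.output_parsers import JsonOutputParser

grader_llm = ChatOllama(model="mistral", temperature=0, format="json")

grade_prompt = ChatPromptTemplate.from_template(
    "You are grading whether a retrieved document is relevant to a patient question "
    "about knee osteoarthritis.\n"
    "Document: {document}\n"
    "Question: {question}\n"
    'Reply with JSON: {{"score": "yes"}} if relevant, otherwise {{"score": "no"}}.'
)

document_grader = grade_prompt | grader_llm | JsonOutputParser()

verdict = document_grader.invoke({
    "document": "Land-based exercise reduces pain in knee osteoarthritis.",
    "question": "What non-pharmacological treatment do you recommend for my knee osteoarthritis?",
})
print(verdict)  # e.g. {"score": "yes"}
```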
To explore healthcare providers’ satisfaction with the answers provided by the chatbot, a convenience sample of 40 providers was recruited in the city of La Serena, Chile, between May 2024 and July 2024. Recruitment was carried out through digital media (electronic mailing and social networks); forty-five potentially eligible persons were identified, five of whom were excluded because they did not meet the inclusion and exclusion criteria. The 40 participants who met the inclusion criteria were invited to participate and provided written consent. The inclusion criteria were to (1) be a general practitioner or medical specialist, physiotherapist, occupational therapist, or nurse and (2) know the core treatment of KOA. An online survey instrument using a five-point Likert scale (1 = lowest score, 5 = highest) was applied to a series of five questions related to the treatment of KOA: (1) What non-pharmacological treatment do you recommend for my knee osteoarthritis? (2) What pharmacological treatment do you recommend for my knee osteoarthritis? (3) What treatment do you recommend for my knee osteoarthritis? (4) Is surgery better than exercise for my knee osteoarthritis? (5) Do you recommend electric stimulation for my knee osteoarthritis? Participants were asked to pose these questions to our chatbot and to ChatGPT and then to rate how satisfied they were with the answer given by each chatbot on the five-point Likert scale. The survey sample comprised 35% general practitioners, 32.9% nurses, 20% orthopedic surgeons, and 12.5% physiotherapists; 50% of the participants were women.
The initial implementation of the chatbot, integrating SELF-RAG and CoT methodologies, demonstrated significant improvements in generating coherent and contextually relevant responses. This was evident from the survey results comparing the KOA chatbot to generic ChatGPT (GPT-4o mini, version of 18 July 2024), where the former showed superior performance in content accuracy and relevance (
Figure 6). However, despite these improvements, the tone of the responses lacked the sensitivity and professionalism expected in a patient–doctor interaction. In response to this feedback, as illustrated in the updated Algorithm 1, the second version of the chatbot incorporated an additional node in the workflow dedicated to formatting and improving the tone and wording of the translated responses. This enhancement was motivated by the need to ensure that the chatbot’s interactions were not only accurate but also empathetic and appropriate for medical contexts.
As shown in Algorithm 1, the updated workflow integrates a key enhancement: FormalQuery, a node specifically designed to refine the tone and phrasing of LLM-generated responses so that they adhere to the professional and empathetic communication style appropriate for patient–doctor interactions. This additional step leverages natural language processing techniques to refine the translated responses, ensuring they are delivered in a professional and empathetic manner. By addressing the tone and wording of the responses, the second version aims to enhance user satisfaction and trust, which are crucial for effective patient engagement and adherence to medical advice. Algorithm 1 illustrates the comprehensive workflow incorporating this new process, outlining the key stages of initialization, retrieval, evaluation, generation, translation, and the final formatting of the response to align with a patient-centric medical tone. Furthermore, as shown in
Figure 7, the specific prompt code used to achieve this alignment ensures the outputs meet the required standards for medical communication, addressing the limitations identified in the previous version.
Algorithm 1 SELF-RAG: Self-Reflective Retrieval-Augmented Generation. This algorithm outlines the steps of the SELF-RAG process, detailing the sequence from receiving a user question to generating and self-reflecting on responses, ensuring enhanced accuracy and relevance. The updated workflow includes the addition of a FormalQuery node, which refines responses to make them more patient-friendly, improving communication between patients and healthcare providers.
Require: Input query Q, knowledge base K, LLM model M
Ensure: Generated response R
1: Initialize empty list D for retrieved documents
2: Retrieve documents D from K using query Q
3: R ← M(Q, D) ▷ Generate initial response
4: E ← ExtractEntities(R) ▷ Extract entities from response
5: for each e ∈ E do
6:   D′ ← Retrieve(K, e) ▷ Retrieve documents for each entity
7:   if D′ is relevant then
8:     D ← D ∪ D′ ▷ Update retrieved documents
9:   end if
10: end for
11: R′ ← M(Q, D) ▷ Generate refined response
12: Reflect on and validate R′ against D
13: if validation fails then
14:   Q′ ← ReformulateQuery(Q, R′) ▷ Reformulate query based on reflection
15:   R″ ← M(Q′, D) ▷ Generate final response
16: else
17:   R″ ← R′ ▷ Keep the validated response
18: end if
19: Translate R″ to Spanish
20: Rewrite R″ in a patient-friendly medical tone (FormalQuery)
21: return R″
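A minimal sketch of the added FormalQuery step is given below: a final chain that rewrites the translated answer in a professional, empathetic, patient-friendly tone. The prompt wording here is illustrative only; the prompt actually used for this alignment is the one shown in Figure 7.

```python
# Sketch of the FormalQuery node added in the second version; the prompt wording is an
# illustrative assumption (the actual prompt is shown in Figure 7).
from langchain_community.chat_models import ChatOllama
from langchain_core.prompts import ChatPromptTemplate
from langchain_core.output_parsers import StrOutputParser

formal_prompt = ChatPromptTemplate.from_template(
    "Rewrite the following answer for a patient with knee osteoarthritis. "
    "Keep all clinical content unchanged, use a warm, respectful, and professional tone, "
    "and avoid technical jargon.\n\nAnswer: {draft_answer}"
)

formal_query = formal_prompt | ChatOllama(model="mistral", temperature=0) | StrOutputParser()

def formal_query_node(state: dict) -> dict:
    # Final node of the graph: polish the tone of the already translated response.
    return {"generation": formal_query.invoke({"draft_answer": state["generation"]})}
```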
The updated survey results (
Figure 8) show an increase in user satisfaction across all questions, with a reduction in the percentage of dissatisfied responses. This indicates that the chatbot not only maintained its strength in providing accurate and relevant content but also succeeded in delivering responses that are better aligned with the expectations of a medical consultation.
To more accurately evaluate the responses provided by the chatbot in its clinical dimension, a categorization of the questionnaire questions was developed, taking into account the professional profiles of the participants, which included orthopedic surgeons, general practitioners, nurses, and physical therapists. As shown in
Table 4, the questions were grouped into four main categories: Non-Pharmacological Treatments (NPTs), Pharmacological Treatments (PTs), General Treatments (GTs), and Treatment Comparison (TC). This grouping was performed with the aim of reflecting the areas of expertise and clinical focus of each professional. For instance, physical therapists and nurses, who frequently address non-pharmacological interventions, significantly contributed to the NPT category, while orthopedic surgeons and general practitioners, with a greater emphasis on pharmacological prescriptions and surgical decision-making, contributed to the PT and TC categories, respectively. This categorization allows for a structured and comparative evaluation of the chatbot’s recommendations, facilitating the identification of specific improvements in the quality of the clinical responses provided.
3. Results
Consequently, a comparative analysis of the pre- and post-improvement versions of the chatbot was conducted, focusing on user satisfaction with the answers provided, using the aforementioned clinical categories as a reference framework. The results of this analysis are presented in
Figure 9, which graphically represents the comparison of average scores between the two versions of the chatbot, grouped according to the established clinical categories. The boxplots summarize the distribution of the data, highlighting the medians, interquartile ranges, and overall variability. Additionally, the orange points in the figure represent individual data values (raw scores) within each category and version. These points provide critical insights into the distribution of the data beyond the statistical summaries offered by the boxplots, allowing the identification of potential outliers or patterns that may not be immediately evident. The metrics displayed in
Figure 9 include medians, interquartile ranges (IQRs), and raw scores (individual evaluations). These metrics were chosen to provide both summary statistics and a detailed view of the data distribution, allowing the identification of trends and potential outliers in the chatbot’s performance. In the NPT category, the average score increased from approximately 3.75 in the pre-improvement version to 4.25 in the post-improvement version, showing a reduction in data dispersion and suggesting greater consistency in the recommendations. Similarly, in the PT category, an increase in the median scores is observed, with less variability in the data from the improved version, indicating an enhancement in the accuracy of the responses. The individual evaluations, represented by the orange points, further confirm this trend by showing a tighter clustering of scores around the median in the post-improvement version.
Overall, all categories demonstrate an increase in median scores and reduced dispersion, reflecting a notable improvement in the quality and consistency of the responses following the implemented enhancements to the chatbot. The inclusion of individual data points alongside the summary statistics provides a comprehensive view of the performance improvements, reinforcing the robustness of the observed results.
A more detailed analysis by professional specialization, as illustrated in
Figure 10, reveals how the aggregated total scores vary across different professions—orthopedic surgeons, general practitioners, nurses, and physical therapists. The post-improvement version of the chatbot generally shows higher median scores and reduced data dispersion compared to the pre-improvement version, particularly among general practitioners and physical therapists. The metrics displayed in
Figure 10 include medians, interquartile ranges (IQRs), and raw scores (individual evaluations), which provide both summary statistics and a detailed view of the data distribution. These metrics enable the identification of trends and potential outliers, offering valuable insights into how the chatbot’s performance varies across professional groups. This suggests that the enhancements made to the chatbot have improved its ability to deliver more consistent and accurate recommendations tailored to the specific clinical focus of each profession. Interestingly, while the ChatGPT-generated responses tend to have scores comparable to or slightly better than the pre-improvement version, they do not consistently achieve the level of consistency observed in the post-improvement version. This indicates that the refinements implemented in the post-improvement version have resulted in a tool that more effectively meets the clinical expectations and needs of healthcare professionals across various specialties. A more in-depth analysis of these variations and their implications for clinical practice will be explored in the next section.
3.1. Evaluation Metrics
To assess the quality of the chatbot’s interactions, the primary metric focused on capturing user perception regarding the relevance of the responses and the appropriateness of the tone in medical conversations. Relevance refers to how suitable and helpful the chatbot’s responses were in addressing the users’ inquiries, while the tone evaluation centered on determining whether the chatbot could generate responses that were not only accurate but also empathetic and contextually appropriate. These metrics were crucial in ensuring that the tool met the communication standards expected in healthcare settings, promoting effective and humanized interactions between the chatbot and its users.
3.2. Survey Results: Analysis of Feedback from Healthcare Professionals on Chatbot Responses
This section provides a detailed analysis of the feedback given by healthcare professionals on the responses generated by the chatbot in its pre-improvement and post-improvement versions. The survey questions were categorized into clinical areas reflecting the specialization and clinical focus of the participants, as outlined in
Table 4. This approach allowed for a structured analysis and comparison of the chatbot’s response consistency and accuracy across different treatment areas.
Figure 6 illustrates the results obtained from the survey before the chatbot was improved. This initial version, based on the SELF-RAG and Chain of Thought approach, delivered generally good results in terms of response accuracy and relevance. However, the pre-improvement responses showed considerable variability in clinical recommendation accuracy, with average scores fluctuating around 4.00 across several categories. For example, in the PT category, the score distribution was notably wide, ranging from 2 to 5, reflecting inconsistencies in the accuracy of the responses provided. This variability suggests that while the initial strategy provided a solid framework for generating responses, key areas such as tone consistency and recommendation accuracy, especially in categories requiring high clinical precision, needed optimization.
In the comparative analysis, the post-improvement version of the chatbot, based on a large language model (LLM) and incorporating an additional graph node to optimize a patient-friendly tone, demonstrated significant improvements across all categories. This version showed higher average scores and reduced data dispersion, indicating greater consistency in the recommendations provided. The enhancement in patient-friendly tone was a key factor in achieving results that exceeded those of ChatGPT. For example, in the Non-Pharmacological Treatments (NPTs) category, the average score increased from 4.15 ± 0.75 in the pre-improvement version to 4.40 ± 0.62 in the post-improvement version (t-statistic = −2.85, p-value = 0.0068), confirming the statistical significance of this improvement. Similarly, in the Pharmacological Treatments (PTs) category, the average score increased from 4.00 ± 0.81 to 4.30 ± 0.65 (t-statistic = −3.12, p-value = 0.0032), reflecting the chatbot’s enhanced ability to provide accurate and clinically relevant recommendations. All statistical analyses were performed with a significance threshold of p < 0.05, ensuring consistency and interpretability in evaluating the differences between versions. Cohen’s d effect size of 0.77 for the total aggregate score further supports these findings, indicating a moderate to large effect of the implemented improvements.
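For clarity on how such comparisons can be computed, the following is a minimal sketch of a t-test and Cohen’s d calculation. The score arrays are placeholders for illustration only and are not the study data, and the assumption of an unpaired (independent-samples) comparison is ours.

```python
# Illustrative sketch of the statistical comparison (t-test and Cohen's d).
# The arrays below are placeholder values, NOT the study data; an unpaired test is assumed.
import numpy as np
from scipy import stats

def cohens_d(pre: np.ndarray, post: np.ndarray) -> float:
    # Effect size using the pooled standard deviation of two independent samples.
    pooled_sd = np.sqrt(((len(pre) - 1) * pre.std(ddof=1) ** 2 +
                         (len(post) - 1) * post.std(ddof=1) ** 2) / (len(pre) + len(post) - 2))
    return (post.mean() - pre.mean()) / pooled_sd

pre_scores = np.array([4, 5, 3, 4, 4, 3, 5, 4])   # placeholder pre-improvement ratings
post_scores = np.array([5, 4, 4, 5, 5, 4, 5, 5])  # placeholder post-improvement ratings

t_stat, p_value = stats.ttest_ind(post_scores, pre_scores)
print(f"t = {t_stat:.2f}, p = {p_value:.4f}, d = {cohens_d(pre_scores, post_scores):.2f}")
```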
The results obtained after the chatbot was improved are presented in
Figure 8. In this figure, a significant reduction in data dispersion is observed, with most average scores rising above 4.20 across all categories. For example, in the NPTs category, the score range narrowed significantly, and the median rose to 4.40, indicating greater consistency in the recommendations. This change suggests that the improvements made, particularly the addition of the node to optimize the patient-friendly tone, resulted in a more consistent and precise performance. Additionally, the higher scores in the TC category, with a median of 4.35 in the post-improvement version, underscore the chatbot’s enhanced ability to deliver responses more aligned with best clinical practices, thereby surpassing the results of both the pre-improvement version and ChatGPT.
Finally,
Figure 10 shows a comparative analysis by profession, revealing how the post-improvement version of the chatbot, due to the incorporation of the node in the graph, achieved higher median scores and lower variability compared to the pre-improvement version. This effect is particularly evident among general practitioners, where the median score increased to 19.5 in the post-improvement version compared to 18 in the pre-improvement version, suggesting an improvement in the quality and consistency of the recommendations provided to this group. Similarly, physiotherapists also experienced notable improvements, with a reduction in score dispersion and an increase in the median to 20 in the post-improvement version. These findings indicate that the targeted improvements allowed the chatbot to better adapt to the specific clinical needs of each profession, offering recommendations more aligned with the expectations of healthcare professionals and reducing the inconsistencies observed in the pre-improvement version and in ChatGPT.