1. Introduction
Knee osteoarthritis (KOA) is a common condition worldwide associated with pain and disability [
1]. Globally, 22.9% of people aged 40 and older suffer from KOA, corresponding to around 654 million people in this age group as of 2020 [
2]. This condition can cause joint pain, muscle weakness, and physical disability and significantly reduce quality of life [
3,
4,
5,
6]. Risk factors for developing KOA include obesity [
2], anterior cruciate ligament injuries, meniscal injuries, chondral injuries, and knee fractures [
7]. With rising obesity rates, healthcare systems are likely to face increasing challenges in managing persistent knee pain and physical disability related to KOA [
8].
Treatment of KOA through knee arthroplasty is recommended only for people in the late stage of the disease when conservative treatments are not effective [
9,
10]. In line with this, the clinical practice guidelines [
10,
11,
12,
13] for KOA emphasize the importance of self-efficacy, exercise, and body weight control (where necessary) as fundamental management strategies. However, long-term adherence to exercise programs is low, especially without continuous support from a healthcare professional [
14,
For example, in a recent trial, only 65% of patients with KOA adhered to an exercise program prescribed by a physiotherapist for 8 weeks [
16], and other studies have reported that only 30% of patients with hip or knee osteoarthritis maintained long-term adherence [
17].
Until a few years ago, SMS (Short Message Service) text messaging via mobile phones had established itself as a tool capable of generating positive changes in health behaviors [
Research in which text messages are used to support adherence to knee osteoarthritis treatments [
19,
20,
21] has found that promoting physical activity through this channel can positively influence the behavioral and psychosocial outcomes of patients undergoing treatment. The software applications developed in these studies automated message delivery to provide reminders to patients. In one of these studies, patients were also interviewed to identify barriers and facilitators that could improve treatment adherence. These applications were based on heuristic methodologies grounded in the evidence and experience of the researchers, and algorithms automated the processes to ease administrative tasks while keeping questions and answers standardized. More recently, language models that simulate human conversation, known as chatbots, have substantially improved the user interaction experience. In medicine, there is evidence that chatbots promote healthy lifestyles and improve mental well-being; they are widely used in treatment, education, and the detection of various conditions, standing out for their accessibility [
22].
Recent advances in large language models (LLMs) in the medical field have shown great potential for improving the accuracy and relevance of responses generated by chatbots through the integration of Retrieval-Augmented Generation (RAG) systems. LLM-based chatbots offer several general benefits in healthcare. Firstly, they provide immediate and accessible assistance to patients, enhancing the efficiency of the healthcare system by reducing the workload on medical staff for routine consultations [
23]. Additionally, these chatbots can offer consistent, evidence-based responses, increasing the accuracy and reliability of the information provided [
24]. They can also personalize interactions based on the patient’s history and specific needs, improving user experience and promoting greater adherence to treatments [
25].
For instance, a study described how RAG can significantly enhance the accuracy of medical responses by integrating information retrieved from external databases [
26]. Similarly, another study highlighted the importance of the quality of retrieved documents to ensure precise responses [
27]. Other studies have emphasized the ability of RAG to combine various data sources and improve the accuracy of responses through iterative processes [
28,
29]. RAG can be implemented in several ways, each offering benefits that should be weighed for a given application. The choice of implementation affects the system’s efficiency, the relevance of the retrieved information, and the model’s ability to handle variability in user queries. In this work, we analyze three main RAG variants in the healthcare domain: standard RAG, Corrective Retrieval-Augmented Generation (CRAG), and Self-Reflective Retrieval-Augmented Generation (SELF-RAG) [
30,
31,
32].
RAG integrates documents retrieved from an external database to improve the accuracy of the responses generated by the model, with the relevance of the retrieved documents being critical to ensuring the accuracy and relevance of the provided information [
28,
29]. On the other hand, CRAG introduces a retrieval evaluator that automatically corrects the retrieved information before it is used to generate responses, mitigating the possibility of hallucinations and improving the reliability of the responses [
28]. SELF-RAG includes self-reflection mechanisms that allow the model to continuously evaluate the relevance of the retrieved information and automatically reformulate user queries to ensure that the responses are accurate and relevant. This approach also enhances the fluency of conversations and addresses significant hallucination problems in medical domains [
32]. The relevance of responses is crucial in the healthcare context, as incorrect or irrelevant information can have significant consequences for patient well-being. The ability of these systems to provide accurate, evidence-based responses is fundamental to their effectiveness and reliability in medical applications.
Chatbots have been utilized in various areas of healthcare, including cancer care [
33], behavioral change [
34], and psychiatry [
35], among others. Specifically, in musculoskeletal care, they have been studied in individuals with chronic pain [
36], shoulder arthroplasty [
37], and back pain [
38]. Regarding adherence, Blasco et al. [
37] found that a virtual assistant, functioning as a chatbot through an instant messaging smartphone application, can be an effective approach to enhance adherence and improve compliance rates with early postoperative home rehabilitation in patients undergoing reverse shoulder arthroplasty. However, in the osteoarthritis population, there is currently no evidence of its effectiveness.
Hallucinations in LLMs, where the model generates incorrect or unsupported information, represent a critical challenge in the medical domain, where accuracy and relevance are essential. SELF-RAG is an advanced approach that addresses this issue by integrating self-reflection mechanisms into traditional RAG systems. SELF-RAG dynamically evaluates the relevance and accuracy of the retrieved information, allowing the model to reformulate user queries when necessary to improve the quality and contextual alignment of its responses [
39]. This mechanism reduces hallucinations by enabling the model to iteratively assess and refine its outputs, a feature particularly important for applications where incorrect information can have serious implications. Additionally, SELF-RAG enhances conversational coherence and fluency, making interactions more effective and contextually relevant. Studies have shown that self-reflection allows the model to detect and correct inaccuracies during response generation, significantly improving reliability [
40]. In this study, SELF-RAG plays a central role in developing a chatbot designed to improve treatment adherence in patients with knee osteoarthritis (KOA). By generating evidence-based responses tailored to patient needs, SELF-RAG supports personalized medical care and promotes adherence to prescribed treatments. Moreover, the reflective and structured processes within SELF-RAG enable the model to handle variability in user queries effectively, ensuring response accuracy in dynamic and complex clinical scenarios [
41,
42]. Existing research further highlights the importance of integrating robust retrieval and reflection mechanisms to enhance both the relevance and precision of outputs, especially in high-stakes environments like healthcare [
43,
44].
The Chain of Thought (CoT) [
45] is a technique in the context of LLMs that enhances the reasoning capabilities of these models by generating a series of intermediate natural language steps leading to a final response. This technique has proven effective in improving accuracy in complex tasks by breaking down problems into more manageable steps, allowing models to allocate more computational resources to problems requiring a higher level of reasoning. CoT facilitates the generation of reasoning sequences similar to human thought processes, enhancing not only the accuracy and relevance of responses but also the fluency and coherence of conversations.
SELF-RAG benefits from the CoT structure by reducing hallucinations through the creation of a logical sequence of reasoning steps that the model follows. This process ensures that the generated responses are accurate, relevant, and easy to understand, which is essential for effective interaction between the chatbot and patients [
20]. In summary, implementing Chain of Thought within SELF-RAG for the KOA chatbot provided a robust approach that ensured precise and coherent responses, improving patient experience and treatment adherence. Effectively implementing CoT and SELF-RAG requires an infrastructure that can model complex processes and manage iterative and conditional workflows: breaking problems into intermediate steps and continuously adjusting responses demands a robust and flexible system. Graphs are fundamental for modeling these processes, as they represent sequences of tasks and the conditions required to progress between them, and they facilitate the visualization and management of complex workflows, ensuring that each step is executed in the correct order and under the appropriate conditions.
LangGraph is a library specifically designed for this purpose (LangChain, n.d.). In this study, version 0.2.50 of LangGraph was utilized, which provides tools for creating and managing graphs that represent complex workflows in LLM applications as part of LangChain. The use of this specific version ensures reproducibility and consistency in the described processes. LangChain allows defining chains of computation (Directed Acyclic Graphs or DAGs), while LangGraph introduces the ability to add cycles, enabling more complex, agent-like behaviors where you can call an LLM in a loop, asking it what action to take next [
21,
22]. This framework can model graphs by defining workflows that include nodes for document retrieval, relevance evaluation, response generation, query transformation, and translation [
21]. The ability to handle cycles and branching allows for continuous iteration over queries and responses, refining them continually. The edges can be both sequential and conditional, adding flexibility to the decision-making process within the workflow [
22], which also provides a solid foundation for implementing SELF-RAG in our chatbot, facilitating the creation of complex workflows and ensuring the accuracy and relevance of generated responses [
23]. Our main aim was to develop a chatbot based on LLMs with SELF-RAG to improve treatment adherence in patients with KOA.
Previous interventions for improving treatment adherence, such as SMS-based systems, relied on standardized and heuristic processes that lacked personalization and adaptability to individual patient needs. While chatbots have demonstrated potential in areas such as mental health and behavioral modification, their application to treatment adherence in knee osteoarthritis (KOA) remains underexplored. Moreover, existing systems fail to incorporate advanced techniques such as Self-Reflective Retrieval-Augmented Generation (SELF-RAG) and Chain of Thought (CoT), which are critical for mitigating hallucinations and ensuring the delivery of accurate, context-aware, and evidence-based responses. In clinical applications, response quality is crucial, as inaccurate or irrelevant information can directly affect patient outcomes. This study addresses these challenges by curating a robust knowledge base through a systematic review guided by the PRISMA framework, ensuring the chatbot’s responses are grounded in high-quality, evidence-based clinical guidelines. Additionally, the implementation of a graph-based infrastructure enhances the conversational tone, aligning it with the needs of clinical interactions. This approach improves the chatbot’s capacity to deliver precise and empathetic responses while fostering a more engaging user experience. By integrating these innovations, this work establishes a comprehensive framework for addressing adherence challenges in the management of chronic conditions such as KOA.
2. Materials and Methods
We conducted a systematic review following PRISMA guidelines to build an evidence-based knowledge base. The chatbot was developed using a SELF-RAG framework, which combines advanced language modeling with iterative self-evaluation, and was deployed on the Telegram platform. The entire process is detailed in the following sections.
2.1. Knowledge Base Construction
To generate the body of knowledge, a rapid systematic review was conducted, adhering to the PRISMA [
46] statement for systematic reviews. Regarding the study selection criteria, we included clinical practice guidelines, along with the studies cited in them that we were able to retrieve, and used the AGREE II tool to assess their methodological rigor and transparency. Guidelines were included if they scored ≥60% in domains 3 (rigor of development) and 6 (editorial independence) and in at least one other domain, ensuring that only high-quality recommendations were considered. Studies with a high risk of bias were excluded. Participants were individuals diagnosed with knee osteoarthritis (KOA), and any intervention related to KOA management was accepted. Additionally, all types of outcome measures were included.
The search strategy involved querying the following databases up to March 2024 without restrictions on language, publication year, or publication status: CENTRAL, MEDLINE (via PubMed), and SCOPUS. We also searched guideline repositories (Clinical Practice Guidelines Portal (National Health and Medical Research Council), Epistemonikos, Guidelines International Network, Turning Research Into Practice (TRIP), Cochrane, Orthoguidelines.org, and the National Institute for Health and Care Excellence (NICE)). The search strategy combined the following terms: “knee osteoarthritis” OR “osteoarthritis of the knee” OR gonarthritis AND “clinical practice guideline” OR “practice guideline” OR guideline OR recommendation OR consensus.
The selection and management of studies were conducted independently by two authors (J.G.A. and H.F.) using the Rayyan software (
https://www.rayyan.ai/, accessed on 16 January 2025). They screened the titles and abstracts of all studies identified in the initial searches against the inclusion criteria, excluding clearly irrelevant studies at this stage. When eligibility could not be determined from the title and abstract, the full-text article was retrieved for further assessment. The two authors then independently reviewed all full-text studies, and disagreements were resolved by discussion. For quality appraisal, two reviewers (J.G. and D.O.) independently appraised each guideline using the online version of the AGREE II tool. The AGREE II includes 23 items across 6 domains (scope and purpose, stakeholder involvement, rigor of development, clarity of presentation, applicability, and editorial independence), along with an overall rating item and a recommendation decision for the guideline [
47]. They scored each item on a 7-point scale (1 = strongly disagree; 7 = strongly agree). As mentioned above, we included guidelines of high quality, defined as a score of ≥60% in domains 3 (rigor of development) and 6 (editorial independence) and in at least one other domain.
Data Preprocessing
The final dataset consists of 113 full-text articles screened from a total of 9355 articles initially identified.
Figure 1 describes the selection process. Twenty-nine guidelines were assessed for quality appraisal; however, only six high-quality guidelines were identified [
10,
11,
12,
13,
48,
49]. We also added the high-quality studies (defined by the clinical guideline) that we were able to retrieve from each included clinical guideline (
Appendix A).
Table 1 describes the results of the AGREE II tool for each domain, and
Table 2 summarizes the general characteristics of each guideline.
Given that the proposed system utilizes a Retrieval-Augmented Generation approach, the data preprocessing primarily focused on structuring and formatting the knowledge base described in the Knowledge Base Construction section to ensure compatibility with the retrieval system. The dataset, available at (
https://ialab.s3.us-east-2.amazonaws.com/osteoarthritis_publications.zip) (accessed on 20 January 2025), was first tokenized and indexed to enable efficient retrieval during user interactions. No additional feature engineering or significant transformations were necessary, as the model directly retrieves the relevant information. All documents were embedded and stored in the ChromaDB vector database, which facilitated efficient vector searches during inference. This tool is relevant to the system because it allows efficient storage and retrieval of the vector representations (embeddings) of the documents selected as relevant during inference, according to user interactions. Furthermore, the embeddings were processed to ensure alignment with the query format used by the large language model.
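To make the preprocessing step concrete, the following is a minimal sketch of how the guideline documents could be chunked, embedded, and stored in ChromaDB with LangChain. The directory name, chunk sizes, and the nomic-embed-text embedding model are illustrative assumptions; the paper does not specify these parameters.

```python
# Minimal sketch of the knowledge-base ingestion step (assumptions noted in comments).
from langchain_community.document_loaders import PyPDFDirectoryLoader
from langchain.text_splitter import RecursiveCharacterTextSplitter
from langchain_community.embeddings import OllamaEmbeddings
from langchain_community.vectorstores import Chroma

# Load the guideline PDFs that make up the knowledge base (hypothetical local folder).
docs = PyPDFDirectoryLoader("osteoarthritis_publications/").load()

# Split documents into overlapping chunks so retrieval returns focused passages
# (chunk size and overlap are illustrative values).
chunks = RecursiveCharacterTextSplitter(chunk_size=1000, chunk_overlap=100).split_documents(docs)

# Embed the chunks and persist them in a local ChromaDB collection
# (the embedding model served by Ollama is an assumption).
vectorstore = Chroma.from_documents(
    documents=chunks,
    embedding=OllamaEmbeddings(model="nomic-embed-text"),
    collection_name="koa_guidelines",
    persist_directory="chroma_koa",
)
retriever = vectorstore.as_retriever(search_kwargs={"k": 4})
```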
2.2. Model Selection: Detailed Description of Mistral and Its Integration
At the time of writing, Mistral 7B [
50] was among the leading models in natural language processing (NLP) benchmarks relevant to the medical domain of this study. Its performance in language comprehension and generation tasks, coupled with its deployment efficiency, positioned it as a suitable choice for developing the proposed chatbot. Moreover, its computational efficiency and adaptability through techniques such as Parameter-Efficient Fine-Tuning (PEFT) enabled the optimization of available resources without compromising the accuracy of generated responses. Mistral 7B is an open-weight model designed for high-performance NLP tasks. It employs a Transformer architecture, which facilitates the handling of complex linguistic relationships and supports a wide range of tasks with high accuracy. With seven billion parameters, the model achieves a balance between computational efficiency and robust task performance, making it particularly well-suited for integration into resource-constrained systems such as healthcare chatbots. Additionally, Mistral 7B incorporates architectural optimizations such as grouped-query attention and sliding-window attention, which reduce the memory and computational cost of inference, particularly over long inputs, while maintaining accuracy. The model was pretrained on diverse, high-quality datasets, including medical corpora, enabling effective generalization to clinical tasks. Its capabilities in text generation, translation, and question-answering form the basis of the chatbot’s functionality. In this work, Mistral 7B is instrumental in generating accurate and context-aware responses, meeting the stringent requirements of clinical precision and reliability.
Recent studies have demonstrated Mistral’s efficacy in the medical domain. For instance, the implementation of BioMistral has improved pretraining efficiency and accuracy in answering complex medical questions [
51]. Similarly, Mistral-Plus has developed reward models that ensure helpful and safe responses in conversational interactions [
52]. These advancements illustrate that Mistral is a suitable choice for our project, providing a solid foundation for generating accurate and relevant responses in the medical context. Additionally, advanced techniques such as supervised fine-tuning and Parameter-Efficient Fine-Tuning (PEFT) enable efficient utilization of computational resources while maintaining high performance [
53,
54]. Mistral 7B incorporates strategies to mitigate hallucinations, a critical challenge in medical applications. For instance, reward models, such as those implemented in Mistral-Plus, adjust responses based on utility and safety criteria, ensuring that generated outputs align with clinical requirements. Furthermore, self-reflection mechanisms are integrated to continuously evaluate the relevance and accuracy of generated responses, improving the model’s reliability in delivering evidence-based, contextually appropriate outputs.
The efficiency of Mistral 7B at inference time is further supported by these architectural optimizations, which reduce computational overhead and enable the deployment of the model in resource-constrained environments without compromising accuracy. Combined with the Retrieval-Augmented Generation (RAG) approach, Mistral 7B enhances the chatbot’s ability to deliver personalized, evidence-based responses, improving patient adherence to treatments. This functionality supports continuous refinement of interactions while upholding rigorous standards of accuracy and relevance, which is key to the success of the proposed system.
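As an illustration of how the locally served Mistral 7B model can be invoked for grounded generation, the sketch below builds a minimal LangChain prompt-to-model chain through Ollama. The prompt text, model tag, and temperature are assumptions rather than the exact configuration used in the study.

```python
# Minimal sketch of calling the locally served Mistral model through Ollama;
# prompt wording, model tag, and temperature are illustrative assumptions.
from langchain_community.chat_models import ChatOllama
from langchain_core.prompts import ChatPromptTemplate
from langchain_core.output_parsers import StrOutputParser

llm = ChatOllama(model="mistral", temperature=0)

prompt = ChatPromptTemplate.from_messages([
    ("system", "You are an assistant that answers questions about knee osteoarthritis "
               "using only the provided context. Context: {context}"),
    ("human", "{question}"),
])

chain = prompt | llm | StrOutputParser()
answer = chain.invoke({
    "context": "Exercise therapy is a core treatment for knee osteoarthritis.",
    "question": "What non-pharmacological treatment is recommended for knee osteoarthritis?",
})
print(answer)
```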
2.3. Computing Infrastructure
As previously mentioned, this proposal is based on the RAG approach, which eliminates the need to fine-tune pre-trained models. This approach optimizes computational resources by retrieving relevant information from a large document collection and generating responses efficiently. Given that the system requires only a moderately capable GPU but substantial RAM, the infrastructure facilitated agile and accurate user interactions with the chatbot while optimizing resource utilization.
The system utilized for this work is equipped with an NVIDIA GeForce RTX 3060 Ti GPU featuring 8 GB of memory and 125.7 GB of RAM, which enables efficient handling of deep learning processes and information retrieval.
Table 3 summarizes the key hardware and software components employed in the infrastructure. On the software side, several critical tools were employed. Ollama 0.1.17 was used to support the Mistral model, optimizing natural language processing tasks. PyTorch 1.10.1 served as the primary framework for developing and training deep learning models, while LangChain 0.1.20 facilitated the implementation of SELF-RAG, orchestrating retrieval and response generation processes. Furthermore, ChromaDB 0.5.0 was used for efficient embedding storage and retrieval, ensuring robust data management throughout the system.
2.4. Chatbot Architecture
The implementation of this application is based on several services deployed in Docker containers that integrate to respond to user requests through the Telegram application. As detailed in
Figure 2, the software architecture of this application consists of three main components: the Telegram application, a web server that acts as a webhook [
55], and a set of applications that respond to user queries, referred to as the LLM server [
56].
Telegram is a messaging application that provides the interface between the user and the responses generated by the language model in reply to the user’s questions or comments. The webhook server has two main functions. The first is to schedule messages reminding the user to perform the daily exercises prescribed by the specialist. The second is to structure the user’s conversation flow based on Nelligan et al. [
21] (
Figure 3).
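The following sketch illustrates, under stated assumptions rather than as the authors’ actual implementation, how a webhook server could fulfil these two roles: forwarding Telegram updates to the LLM server and scheduling a daily exercise reminder. The bot token, LLM server URL, response format, and reminder time are hypothetical placeholders (the endpoint path matches the illustrative LangServe sketch shown later).

```python
# Illustrative sketch of the two webhook-server roles described above.
# TOKEN, LLM_SERVER_URL, and the reminder time are hypothetical placeholders.
import requests
from flask import Flask, request
from apscheduler.schedulers.background import BackgroundScheduler

TOKEN = "YOUR_TELEGRAM_BOT_TOKEN"
LLM_SERVER_URL = "http://llm-server:8000/koa-chat/invoke"
TELEGRAM_API = f"https://api.telegram.org/bot{TOKEN}/sendMessage"

app = Flask(__name__)
subscribed_chats: set[int] = set()

def send_message(chat_id: int, text: str) -> None:
    requests.post(TELEGRAM_API, json={"chat_id": chat_id, "text": text}, timeout=30)

def daily_reminder() -> None:
    # Remind every subscribed patient to perform the exercises prescribed by the specialist.
    for chat_id in subscribed_chats:
        send_message(chat_id, "Reminder: please complete today's prescribed knee exercises.")

@app.post("/webhook")
def webhook():
    update = request.get_json(force=True)
    chat_id = update["message"]["chat"]["id"]
    question = update["message"].get("text", "")
    subscribed_chats.add(chat_id)
    # Forward the user's question to the LLM server (LangServe-style endpoint) and relay the answer.
    reply = requests.post(LLM_SERVER_URL, json={"input": {"question": question}}, timeout=120).json()
    send_message(chat_id, reply.get("output", "Sorry, I could not generate a response."))
    return {"ok": True}

scheduler = BackgroundScheduler()
scheduler.add_job(daily_reminder, "cron", hour=9)  # 09:00 daily reminder (assumed time)
scheduler.start()

if __name__ == "__main__":
    app.run(host="0.0.0.0", port=8080)
```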
The relevant component of this architecture is the LLM server, which organizes all the tools that enable the generation of bot responses according to user inputs. As detailed in
Figure 4, this architecture integrates the following libraries to achieve these results:
Ollama: Ollama provides a runtime and accompanying library that make it straightforward to run open-weight language models locally and integrate them into applications, enabling automatic response generation, text analysis, and other natural language processing tasks. In this project, the library was used to deploy the Mistral model and integrate it with the workflow that processes user inputs to generate an appropriate response in the context of KOA.
LangServe: A service that facilitates the implementation and management of large-scale NLP models. It is designed to streamline the integration of advanced language models, enabling efficient deployment, scaling, and maintenance of these models. It provided the project’s application with an Application Programming Interface (API) to handle requests to the NLP service, facilitating interoperability with the project’s webhook service.
LangChain: An open-source framework designed to help developers build language model-driven applications more effectively and efficiently. LangChain enables the integration of LLM models without the need to manage all the underlying infrastructure and interoperates with other tools or services. This allows for easy integration with APIs, databases, and other components of the software ecosystem.
LangGraph: A framework designed to create complex, stateful workflows in applications like chatbots and automated support agents. It allows developers to define nodes and edges that represent different states and transitions within a workflow. Each node can be a function or a model call, and the edges determine how the process flows from one node to another.
ChromaDB: An open-source vector store used to store and retrieve embeddings. Its primary use in the project is to store document vectors along with metadata for later use by the Mistral LLM model through the workflow defined in LangGraph.
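Taken together, these components can be wired into a small LLM server. The sketch below is a minimal illustration assuming a simple prompt-to-Mistral chain exposed through LangServe on a hypothetical /koa-chat path and port; in the actual system, the compiled LangGraph workflow described in the following subsections would be served instead.

```python
# Minimal sketch of the LLM server: a LangChain chain backed by the local Mistral model
# (via Ollama) exposed as a REST API with LangServe. Path and port are assumptions.
from fastapi import FastAPI
from langserve import add_routes
from langchain_community.chat_models import ChatOllama
from langchain_core.prompts import ChatPromptTemplate
from langchain_core.output_parsers import StrOutputParser

llm = ChatOllama(model="mistral", temperature=0)
prompt = ChatPromptTemplate.from_template(
    "Answer the following question about knee osteoarthritis management:\n{question}"
)
chain = prompt | llm | StrOutputParser()

app = FastAPI(title="KOA chatbot LLM server")
# Exposes /koa-chat/invoke, /koa-chat/batch, and /koa-chat/stream endpoints.
add_routes(app, chain, path="/koa-chat")

if __name__ == "__main__":
    import uvicorn
    uvicorn.run(app, host="0.0.0.0", port=8000)
```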
2.5. Chatbot Workflow
The chatbot workflow was designed to optimize efficiency and accuracy in processing medical queries, utilizing a combination of advanced technologies. In the initial stage, the Mistral 7B model is deployed on a cloud-based server to ensure scalability and accessibility. This deployment is managed using Docker containers, facilitating the scalability and maintenance of the system. The connection with LangChain and LangGraph is crucial for the seamless integration of the Mistral 7B model into the chatbot’s workflow. LangChain enables the smooth incorporation of the model into chatbot applications, while LangGraph manages the complex workflows necessary to effectively handle user queries. LangGraph’s ability to define workflow graphs and, beyond simple Directed Acyclic Graphs (DAGs), to introduce cycles ensures that the chatbot can handle iterative processes and continuously refine its responses.
The chatbot’s workflow design, as depicted in the provided diagram (
Figure 5), is implemented using LangGraph, which allows for the definition of nodes and edges that represent different states and transitions within the interaction process. Key nodes include document retrieval (Retrieve), relevance evaluation (GradeDocuments), response generation (Generate), and query transformation (TransformQuery). The edges define the flow of operations, ensuring that each step is executed in the correct sequence and under appropriate conditions.
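A condensed sketch of how such a workflow can be expressed with LangGraph is shown below. The node bodies are reduced to simple placeholders standing in for the ChromaDB retriever and the Mistral-backed chains described elsewhere in this section, and the routing logic is an assumption about how the conditional edges are wired.

```python
# Sketch of the Figure 5 workflow as a LangGraph state graph. The helper functions are
# simplified placeholders; in the real system they wrap the ChromaDB retriever and
# Mistral-backed LangChain chains.
from typing import List, TypedDict
from langgraph.graph import StateGraph, START, END

class ChatState(TypedDict):
    question: str
    documents: List[str]
    generation: str

# --- Placeholder components (stand-ins for the retriever and LLM chains) ---
def retrieve_documents(question: str) -> List[str]:
    return ["Exercise therapy is a core treatment for knee osteoarthritis."]

def is_relevant(document: str, question: str) -> bool:
    return "osteoarthritis" in document.lower()

def generate_answer(documents: List[str], question: str) -> str:
    return "Based on the clinical guidelines: " + " ".join(documents)

def rewrite_question(question: str) -> str:
    return question + " (knee osteoarthritis management)"

# --- Graph nodes -----------------------------------------------------------
def retrieve(state: ChatState) -> dict:
    return {"documents": retrieve_documents(state["question"])}

def grade_documents(state: ChatState) -> dict:
    return {"documents": [d for d in state["documents"] if is_relevant(d, state["question"])]}

def generate(state: ChatState) -> dict:
    return {"generation": generate_answer(state["documents"], state["question"])}

def transform_query(state: ChatState) -> dict:
    return {"question": rewrite_question(state["question"])}

def decide_next(state: ChatState) -> str:
    # If no relevant documents survived grading, rewrite the query; otherwise generate.
    return "transform_query" if not state["documents"] else "generate"

workflow = StateGraph(ChatState)
workflow.add_node("retrieve", retrieve)
workflow.add_node("grade_documents", grade_documents)
workflow.add_node("generate", generate)
workflow.add_node("transform_query", transform_query)

workflow.add_edge(START, "retrieve")
workflow.add_edge("retrieve", "grade_documents")
workflow.add_conditional_edges("grade_documents", decide_next,
                               {"transform_query": "transform_query", "generate": "generate"})
workflow.add_edge("transform_query", "retrieve")  # cycle: retry retrieval with the rewritten query
workflow.add_edge("generate", END)

chatbot_graph = workflow.compile()
print(chatbot_graph.invoke({"question": "What exercise is recommended for knee osteoarthritis?",
                            "documents": [], "generation": ""}))
```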
Finally, to facilitate user interaction, the chatbot is integrated into the Telegram messaging platform. This integration allows patients to conveniently interact with the chatbot, receiving timely reminders and responses tailored to their specific needs.
2.6. First Iteration of SELF-RAG Implementation: Workflow for Initial Query Processing and Spanish Translation Evaluated via User Surveys
The first version of the SELF-RAG implementation features a detailed workflow designed to process user queries and generate accurate, contextually relevant responses. As shown in
Figure 5, the workflow begins with the submission of a user query, which is then processed through several stages: document retrieval, relevance evaluation, query transformation, response generation, and final evaluation. The output is provided in the target language, ensuring its suitability for medical contexts. The integration of the CoT approach enhances the language model’s reasoning capabilities by breaking down complex queries into intermediate steps. This method ensures that each stage of the workflow, from document retrieval to response generation, incorporates reflective processes that continuously refine and validate the information.
Figure 5 shows the workflow.
The workflow begins with retrieving relevant documents from ChromaDB based on the user’s initial query (node: Retrieve). These documents are then evaluated for relevance (node: GradeDocuments); those considered relevant move to the generation phase (node: Generate), while non-relevant documents trigger a transformation of the original query to improve retrieval outcomes (node: TransformQuery). Using the relevant documents, the system generates a response (node: Generate), which is then evaluated (node: GradeGeneration). If the response is supported and useful, it proceeds to the translation phase (node: TranslateQuery); otherwise, it may be sent back for reformulation or regeneration. Decisions between these stages are represented by LangGraph edges that define the sequence and conditions under which information moves through the system. This structured approach, integrating CoT within the SELF-RAG framework, significantly enhances the reliability and fluency of the responses, addressing key challenges in medical chatbot applications. The prompts for each workflow stage are detailed in
Appendix B.
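As an illustration of the GradeDocuments step, the sketch below asks the local Mistral model for a binary relevance judgement in JSON format. The prompt wording is an assumption for illustration only; the prompts actually used at each stage are those reported in Appendix B.

```python
# Illustrative sketch of a document-relevance grader for the GradeDocuments node;
# the prompt text is an assumption, not the prompt used in the study (see Appendix B).
from langchain_community.chat_models import ChatOllama
from langchain_core.prompts import ChatPromptTemplate
from langchain_core.output_parsers import JsonOutputParser

grader_llm = ChatOllama(model="mistral", temperature=0, format="json")

grade_prompt = ChatPromptTemplate.from_template(
    "You are grading whether a retrieved document is relevant to a patient question "
    "about knee osteoarthritis.\n"
    "Document: {document}\n"
    "Question: {question}\n"
    'Reply with JSON: {{"score": "yes"}} if relevant, otherwise {{"score": "no"}}.'
)

document_grader = grade_prompt | grader_llm | JsonOutputParser()

verdict = document_grader.invoke({
    "document": "Land-based exercise reduces pain in knee osteoarthritis.",
    "question": "What non-pharmacological treatment do you recommend for my knee osteoarthritis?",
})
print(verdict)  # e.g. {"score": "yes"}
```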
To explore healthcare providers’ satisfaction with the answers provided by the chatbot, a convenience sample of 40 providers was recruited in the city of La Serena, Chile, between May 2024 and July 2024. Recruitment was carried out through digital media (electronic mailing and social networks); forty-five potentially eligible persons were identified, five of whom were excluded because they did not meet the inclusion and exclusion criteria. The 40 participants who met the inclusion criteria were invited to participate and provided written consent. The inclusion criteria were to (1) be a general practitioner or medical specialist, physiotherapist, occupational therapist, or nurse and (2) know the core treatment of KOA. An online survey instrument using a five-point Likert scale (1 = lowest score, 5 = highest) was applied to a series of five questions related to the treatment of KOA: (1) What non-pharmacological treatment do you recommend for my knee osteoarthritis? (2) What pharmacological treatment do you recommend for my knee osteoarthritis? (3) What treatment do you recommend for my knee osteoarthritis? (4) Is surgery better than exercise for my knee osteoarthritis? (5) Do you recommend electric stimulation for my knee osteoarthritis? Participants were asked to pose these questions to our chatbot and to ChatGPT and then to rate how satisfied they were with the answer given by each chatbot on the five-point Likert scale. The survey sample comprised 35% general practitioners, 32.9% nurses, 20% orthopedic surgeons, and 12.5% physiotherapists; 50% of the participants were women.
The initial implementation of the chatbot, integrating SELF-RAG and CoT methodologies, demonstrated significant improvements in generating coherent and contextually relevant responses. This was evident from the survey results comparing the KOA chatbot to generic ChatGPT (GPT-4o mini, version of 18 July 2024), where the former showed superior performance in content accuracy and relevance (
Figure 6). However, despite these improvements, the tone of the responses lacked the sensitivity and professionalism expected in a patient–doctor interaction. In response to this feedback, as illustrated in the updated Algorithm 1, the second version of the chatbot incorporated an additional node in the workflow dedicated to formatting and improving the tone and wording of the translated responses. This enhancement was motivated by the need to ensure that the chatbot’s interactions were not only accurate but also empathetic and appropriate for medical contexts.
As shown in Algorithm 1, the updated workflow integrates a key enhancement: FormalQuery, a node specifically designed to refine the tone and phrasing of LLM-generated responses so that they adhere to the professional and empathetic communication style appropriate for patient–doctor interactions. This additional step leverages natural language processing techniques to refine the translated responses, ensuring they are delivered in a professional and empathetic manner. By addressing the tone and wording of the responses, the second version aims to enhance user satisfaction and trust, which are crucial for effective patient engagement and adherence to medical advice. Algorithm 1 illustrates the comprehensive workflow incorporating this new process, outlining the key stages of initialization, retrieval, evaluation, generation, translation, and the final formatting of the response to align with a patient-centric medical tone. Furthermore, as shown in
Figure 7, the specific prompt code used to achieve this alignment ensures the outputs meet the required standards for medical communication, addressing the limitations identified in the previous version.
Algorithm 1 SELF-RAG: Self-Reflective Retrieval-Augmented Generation. This algorithm outlines the steps of the SELF-RAG process, detailing the sequence from receiving a user question to generating and self-reflecting on responses, ensuring enhanced accuracy and relevance. The updated workflow includes the addition of a FormalQuery node, which refines responses to make them more patient-friendly, improving communication between patients and healthcare providers.
Require: Input query Q, knowledge base K, LLM model M
Ensure: Generated response R
1: Initialize empty list D for retrieved documents
2: Retrieve documents D from K using query Q
3: R ← M(Q, D) ▷ Generate initial response
4: E ← ExtractEntities(R) ▷ Extract entities from response
5: for each e ∈ E do
6:   D′ ← Retrieve(K, e) ▷ Retrieve documents for each entity
7:   if D′ is relevant then
8:     D ← D ∪ D′ ▷ Update retrieved documents
9:   end if
10: end for
11: R′ ← M(Q, D) ▷ Generate refined response
12: Reflect on and validate R′ against D
13: if validation fails then
14:   Q′ ← ReformulateQuery(Q, R′) ▷ Reformulate query based on reflection
15:   R″ ← M(Q′, D) ▷ Generate final response
16: else
17:   R″ ← R′ ▷ Keep the validated response
18: end if
19: Translate R″ to Spanish
20: Rewrite R″ in a patient-friendly medical tone (FormalQuery)
21: return R″
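A minimal sketch of the added FormalQuery step is given below: a final chain that rewrites the translated answer in a professional, empathetic, patient-friendly tone. The prompt wording here is illustrative only; the prompt actually used for this alignment is the one shown in Figure 7.

```python
# Sketch of the FormalQuery node added in the second version; the prompt wording is an
# illustrative assumption (the actual prompt is shown in Figure 7).
from langchain_community.chat_models import ChatOllama
from langchain_core.prompts import ChatPromptTemplate
from langchain_core.output_parsers import StrOutputParser

formal_prompt = ChatPromptTemplate.from_template(
    "Rewrite the following answer for a patient with knee osteoarthritis. "
    "Keep all clinical content unchanged, use a warm, respectful, and professional tone, "
    "and avoid technical jargon.\n\nAnswer: {draft_answer}"
)

formal_query = formal_prompt | ChatOllama(model="mistral", temperature=0) | StrOutputParser()

def formal_query_node(state: dict) -> dict:
    # Final node of the graph: polish the tone of the already translated response.
    return {"generation": formal_query.invoke({"draft_answer": state["generation"]})}
```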
The updated survey results (
Figure 8) show an increase in user satisfaction across all questions, with a reduction in the percentage of dissatisfied responses. This indicates that the chatbot not only maintained its strength in providing accurate and relevant content but also succeeded in delivering responses that are better aligned with the expectations of a medical consultation.
To more accurately evaluate the responses provided by the chatbot in its clinical dimension, a categorization of the questionnaire questions was developed, taking into account the professional profiles of the participants, which included orthopedic surgeons, general practitioners, nurses, and physical therapists. As shown in
Table 4, the questions were grouped into four main categories: Non-Pharmacological Treatments (NPTs), Pharmacological Treatments (PTs), General Treatments (GTs), and Treatment Comparison (TC). This grouping was performed with the aim of reflecting the areas of expertise and clinical focus of each professional. For instance, physical therapists and nurses, who frequently address non-pharmacological interventions, significantly contributed to the NPT category, while orthopedic surgeons and general practitioners, with a greater emphasis on pharmacological prescriptions and surgical decision-making, contributed to the PT and TC categories, respectively. This categorization allows for a structured and comparative evaluation of the chatbot’s recommendations, facilitating the identification of specific improvements in the quality of the clinical responses provided.
3. Results
Consequently, a comparative analysis of the pre- and post-improvement versions of the chatbot was conducted, focusing on user satisfaction with the answers provided, using the aforementioned clinical categories as a reference framework. The results of this analysis are presented in
Figure 9, which graphically represents the comparison of average scores between the two versions of the chatbot, grouped according to the established clinical categories. The boxplots summarize the distribution of the data, highlighting the medians, interquartile ranges, and overall variability. Additionally, the orange points in the figure represent individual data values (raw scores) within each category and version. These points provide critical insights into the distribution of the data beyond the statistical summaries offered by the boxplots, allowing the identification of potential outliers or patterns that may not be immediately evident. The metrics displayed in
Figure 9 include medians, interquartile ranges (IQRs), and raw scores (individual evaluations). These metrics were chosen to provide both summary statistics and a detailed view of the data distribution, allowing the identification of trends and potential outliers in the chatbot’s performance. In the NPT category, the average score increased from approximately 3.75 in the pre-improvement version to 4.25 in the post-improvement version, showing a reduction in data dispersion and suggesting greater consistency in the recommendations. Similarly, in the PT category, an increase in the median scores is observed, with less variability in the data from the improved version, indicating an enhancement in the accuracy of the responses. The individual evaluations, represented by the orange points, further confirm this trend by showing a tighter clustering of scores around the median in the post-improvement version.
Overall, all categories demonstrate an increase in median scores and reduced dispersion, reflecting a notable improvement in the quality and consistency of the responses following the implemented enhancements to the chatbot. The inclusion of individual data points alongside the summary statistics provides a comprehensive view of the performance improvements, reinforcing the robustness of the observed results.
A more detailed analysis by professional specialization, as illustrated in
Figure 10, reveals how the aggregated total scores vary across different professions—orthopedic surgeons, general practitioners, nurses, and physical therapists. The post-improvement version of the chatbot generally shows higher median scores and reduced data dispersion compared to the pre-improvement version, particularly among general practitioners and physical therapists. The metrics displayed in
Figure 10 include medians, interquartile ranges (IQRs), and raw scores (individual evaluations), which provide both summary statistics and a detailed view of the data distribution. These metrics enable the identification of trends and potential outliers, offering valuable insights into how the chatbot’s performance varies across professional groups. This suggests that the enhancements made to the chatbot have improved its ability to deliver more consistent and accurate recommendations tailored to the specific clinical focus of each profession. Interestingly, while the ChatGPT-generated responses tend to have scores comparable to or slightly better than the pre-improvement version, they do not consistently achieve the level of consistency observed in the post-improvement version. This indicates that the refinements implemented in the post-improvement version have resulted in a tool that more effectively meets the clinical expectations and needs of healthcare professionals across various specialties. A more in-depth analysis of these variations and their implications for clinical practice will be explored in the next section.
3.1. Evaluation Metrics
To assess the quality of the chatbot’s interactions, the primary metric focused on capturing user perception regarding the relevance of the responses and the appropriateness of the tone in medical conversations. Relevance refers to how suitable and helpful the chatbot’s responses were in addressing the users’ inquiries, while the tone evaluation centered on determining whether the chatbot could generate responses that were not only accurate but also empathetic and contextually appropriate. These metrics were crucial in ensuring that the tool met the communication standards expected in healthcare settings, promoting effective and humanized interactions between the chatbot and its users.
3.2. Survey Results: Analysis of Feedback from Healthcare Professionals on Chatbot Responses
This section provides a detailed analysis of the feedback given by healthcare professionals on the responses generated by the chatbot in its pre-improvement and post-improvement versions. The survey questions were categorized into clinical areas reflecting the specialization and clinical focus of the participants, as outlined in
Table 4. This approach allowed for a structured analysis and comparison of the chatbot’s response consistency and accuracy across different treatment areas.
Figure 6 illustrates the results obtained from the survey before the chatbot was improved. This initial version, based on the SELF-RAG and Chain of Thought approach, delivered generally good results in terms of response accuracy and relevance. However, the pre-improvement responses showed considerable variability in clinical recommendation accuracy, with average scores fluctuating around 4.00 across several categories. For example, in the PT category, the score distribution was notably wide, ranging from 2 to 5, reflecting inconsistencies in the accuracy of the responses provided. This variability suggests that while the initial strategy provided a solid framework for generating responses, key areas such as tone consistency and recommendation accuracy, especially in categories requiring high clinical precision, needed optimization.
In the comparative analysis, the post-improvement version of the chatbot, based on a large language model (LLM) and incorporating an additional graph node to optimize a patient-friendly tone, demonstrated significant improvements across all categories. This version showed higher average scores and reduced data dispersion, indicating greater consistency in the recommendations provided. The enhancement in patient-friendly tone was a key factor in achieving results that exceeded those of ChatGPT. For example, in the Non-Pharmacological Treatments (NPTs) category, the average score increased from 4.15 ± 0.75 in the pre-improvement version to 4.40 ± 0.62 in the post-improvement version (t-statistic = −2.85, p-value = 0.0068), confirming the statistical significance of this improvement. Similarly, in the Pharmacological Treatments (PTs) category, the average score increased from 4.00 ± 0.81 to 4.30 ± 0.65 (t-statistic = −3.12, p-value = 0.0032), reflecting the chatbot’s enhanced ability to provide accurate and clinically relevant recommendations. All statistical analyses were performed with a significance threshold of p < 0.05, ensuring consistency and interpretability in evaluating the differences between versions. Cohen’s d effect size of 0.77 for the total aggregate score further supports these findings, indicating a moderate to large effect of the implemented improvements.
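For clarity on how such comparisons can be computed, the following is a minimal sketch of a t-test and Cohen’s d calculation. The score arrays are placeholders for illustration only and are not the study data, and the assumption of an unpaired (independent-samples) comparison is ours.

```python
# Illustrative sketch of the statistical comparison (t-test and Cohen's d).
# The arrays below are placeholder values, NOT the study data; an unpaired test is assumed.
import numpy as np
from scipy import stats

def cohens_d(pre: np.ndarray, post: np.ndarray) -> float:
    # Effect size using the pooled standard deviation of two independent samples.
    pooled_sd = np.sqrt(((len(pre) - 1) * pre.std(ddof=1) ** 2 +
                         (len(post) - 1) * post.std(ddof=1) ** 2) / (len(pre) + len(post) - 2))
    return (post.mean() - pre.mean()) / pooled_sd

pre_scores = np.array([4, 5, 3, 4, 4, 3, 5, 4])   # placeholder pre-improvement ratings
post_scores = np.array([5, 4, 4, 5, 5, 4, 5, 5])  # placeholder post-improvement ratings

t_stat, p_value = stats.ttest_ind(post_scores, pre_scores)
print(f"t = {t_stat:.2f}, p = {p_value:.4f}, d = {cohens_d(pre_scores, post_scores):.2f}")
```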
The results obtained after the chatbot was improved are presented in
Figure 8. In this figure, a significant reduction in data dispersion is observed, with most average scores rising above 4.20 across all categories. For example, in the NPTs category, the score range narrowed significantly, and the median rose to 4.40, indicating greater consistency in the recommendations. This change suggests that the improvements made, particularly the addition of the node to optimize the patient-friendly tone, resulted in a more consistent and precise performance. Additionally, the higher scores in the TC category, with a median of 4.35 in the post-improvement version, underscore the chatbot’s enhanced ability to deliver responses more aligned with best clinical practices, thereby surpassing the results of both the pre-improvement version and ChatGPT.
Finally,
Figure 10 shows a comparative analysis by profession, revealing how the post-improvement version of the chatbot, due to the incorporation of the node in the graph, achieved higher median scores and lower variability compared to the pre-improvement version. This effect is particularly evident among general practitioners, where the median score increased to 19.5 in the post-improvement version compared to 18 in the pre-improvement version, suggesting an improvement in the quality and consistency of the recommendations provided to this group. Similarly, physiotherapists also experienced notable improvements, with a reduction in score dispersion and an increase in the median to 20 in the post-improvement version. These findings indicate that the targeted improvements allowed the chatbot to better adapt to the specific clinical needs of each profession, offering recommendations more aligned with the expectations of healthcare professionals and reducing the inconsistencies observed in the pre-improvement version and in ChatGPT.