This section presents the results derived from the background literature. It is organized into three parts: the first part addresses the methods of fine-tuning a large language model, the second part explores techniques used to search for relevant data for users’ questions in databases, and the third part focuses on strategies for the implementation of retrieval-augmented generation. Each part aligns with the primary objectives of the systematic review.
3.1. Methods of Fine-Tuning a Large Language Model
A large language model is characterized by its extensive number of parameters, making it computationally challenging to pre-train or fine-tune the entire model to meet specific user requirements. Therefore, it is crucial to explore fine-tuning methods that are efficient in terms of time, computational power, and results.
The methods for the fine-tuning of models, as identified in the research articles reviewed in
Section 2, are summarized in
Table 2. This table provides the name of each method, the type of optimization that it represents, and a brief description.
For each identified method, we provide a detailed explanation, along with an analysis of its respective advantages and disadvantages.
LoRA, or Low-Rank Adaptation, introduced in reference [
4], is an efficient reparameterization technique that optimizes the memory and computation usage by introducing low-rank decompositions into large pre-trained weight matrices. As shown in
Figure 2, LoRA operates by adding two smaller trainable matrices, $A \in \mathbb{R}^{r \times k}$ and $B \in \mathbb{R}^{d \times r}$, in parallel to the original pre-trained weight matrix $W_0 \in \mathbb{R}^{d \times k}$. The rank $r$ is chosen such that $r \ll \min(d, k)$, ensuring that the number of additional parameters introduced is minimal relative to the size of $W_0$.
For a given input $x$, the output of the original weight matrix alone would be
$$h = W_0 x.$$
However, LoRA modifies this by incorporating an incremental update, $\Delta W = BA$, which captures task-specific information:
$$h = W_0 x + \Delta W x = W_0 x + \frac{\alpha}{r} B A x,$$
where $\alpha$ is a scaling factor that controls the contribution of the adaptation matrices. This approach allows LoRA to adapt the model to specific tasks without altering the pre-trained $W_0$, preserving the underlying knowledge in the original weights.
The training process for LoRA begins with $A$ initialized from a random Gaussian distribution and $B$ initialized to zero, ensuring that the initial value of $\Delta W = BA$ is zero, meaning no modification to $W_0$ at the start. Over the course of training, the updates to $A$ and $B$ enable the model to learn task-specific adjustments, which can then be integrated with the pre-trained model’s parameters.
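To make this reparameterization concrete, the following sketch implements a minimal LoRA-style linear layer in PyTorch, following the update rule and initialization described above; the dimensions, rank, and scaling value are illustrative assumptions rather than settings reported in the reviewed studies.

```python
import torch
import torch.nn as nn

class LoRALinear(nn.Module):
    """Minimal LoRA-style layer: frozen W0 plus trainable low-rank update (alpha/r) * B @ A."""

    def __init__(self, d_in: int, d_out: int, r: int = 8, alpha: float = 16.0):
        super().__init__()
        self.base = nn.Linear(d_in, d_out, bias=False)
        self.base.weight.requires_grad_(False)               # W0 stays frozen
        self.A = nn.Parameter(torch.randn(r, d_in) * 0.01)   # random Gaussian initialization
        self.B = nn.Parameter(torch.zeros(d_out, r))          # zero init -> delta W = 0 at start
        self.scaling = alpha / r

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # h = W0 x + (alpha/r) * B A x
        return self.base(x) + self.scaling * (x @ self.A.T @ self.B.T)

layer = LoRALinear(d_in=768, d_out=768, r=8)
x = torch.randn(4, 768)
print(layer(x).shape)  # torch.Size([4, 768])
```

Only `A` and `B` receive gradients in this sketch, which is what keeps the number of trainable parameters small relative to the frozen base weight.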
One of the primary advantages of LoRA is the drastic reduction in the number of parameters that need to be fine-tuned, allowing for efficient memory usage and reduced computational demands. This is particularly beneficial for large-scale models with billions of parameters, as it enables fine-tuning without a substantial increase in the model size or inference cost. Selecting an optimal rank
r is crucial in balancing model flexibility and efficiency, and it may require tuning depending on the complexity of the task [
4,
5,
6,
7].
BitFit, on the other hand, fine-tunes only the bias terms in the model, while keeping the rest of the model parameters fixed. This method is extremely parameter-efficient as it updates only a small subset of the model’s parameters, significantly reducing the computational cost and memory usage. However, BitFit may not capture all task-specific nuances due to its limited parameter updates, and, while it works well on deep neural networks, it often fails with large language models [
4,
5].
Adapters introduce small neural network layers within the model that can be fine-tuned for specific tasks. These layers are added after each layer of the pre-trained model, with only these new layers being updated during fine-tuning. This approach allows for modular updates specific to different tasks, while keeping the main model parameters fixed, thereby preserving the model’s general knowledge. However, this method adds some computational overhead during inference due to the additional layers [
4,
5].
Prefix-Tuning involves prepending a trainable continuous vector to the input embeddings. This vector is optimized while the rest of the model remains unchanged, making it an efficient method as it requires the optimization of only a small set of parameters. While this method is effective in modifying model behavior with minimal parameter changes, the added prefix can increase the input sequence length, slightly affecting the processing time [
4,
5,
8].
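As a rough illustration of the Prefix-Tuning mechanism, the sketch below prepends a block of trainable prefix vectors to a batch of token embeddings while everything else stays frozen; the embedding dimension and prefix length are assumed values chosen for illustration.

```python
import torch
import torch.nn as nn

class PrefixTuningEmbeddings(nn.Module):
    """Prepend trainable continuous prefix vectors to frozen token embeddings."""

    def __init__(self, d_model: int = 768, prefix_length: int = 10):
        super().__init__()
        self.prefix = nn.Parameter(torch.randn(prefix_length, d_model) * 0.02)

    def forward(self, token_embeddings: torch.Tensor) -> torch.Tensor:
        # token_embeddings: (batch, seq_len, d_model); the prefix is shared across the batch
        batch_size = token_embeddings.size(0)
        prefix = self.prefix.unsqueeze(0).expand(batch_size, -1, -1)
        return torch.cat([prefix, token_embeddings], dim=1)  # (batch, prefix_len + seq_len, d_model)

prefix_layer = PrefixTuningEmbeddings()
tokens = torch.randn(2, 16, 768)   # embeddings produced by a frozen model
print(prefix_layer(tokens).shape)  # torch.Size([2, 26, 768])
```

The increase in sequence length from 16 to 26 positions is the processing-time overhead mentioned above.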
Prompt-Tuning optimizes a small set of input tokens (prompts) while keeping the model parameters fixed. These prompts guide the model towards better performance on specific tasks without altering the model architecture, making it very lightweight in terms of additional parameters. However, the effectiveness of this method heavily depends on the quality and design of the prompts [
4,
5].
QLoRA combines quantization with low-rank adaptation to reduce the memory usage. It employs 4-bit quantization for weight matrices while applying low-rank adaptation techniques, greatly reducing the memory requirements and enabling the fine-tuning of very large models on limited hardware. Although QLoRA maintains performance close to that of full-precision models, managing the quantization errors is crucial to avoid performance degradations [
4,
5,
8].
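As a rough illustration of how QLoRA is typically assembled in practice, the sketch below loads a base model in 4-bit precision and attaches LoRA adapters; it assumes the Hugging Face transformers, bitsandbytes, and peft libraries, and the model identifier and hyperparameters are placeholders rather than values prescribed by the reviewed articles.

```python
import torch
from transformers import AutoModelForCausalLM, BitsAndBytesConfig
from peft import LoraConfig, get_peft_model

# 4-bit quantization of the frozen base weights (assumed bitsandbytes backend)
bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_compute_dtype=torch.bfloat16,
)

# Placeholder model identifier; any causal LM supported by these libraries could be used
model = AutoModelForCausalLM.from_pretrained(
    "facebook/opt-350m", quantization_config=bnb_config
)

# Low-rank adapters are the only trainable parameters
lora_config = LoraConfig(r=16, lora_alpha=32, lora_dropout=0.05,
                         target_modules=["q_proj", "v_proj"], task_type="CAUSAL_LM")
model = get_peft_model(model, lora_config)
model.print_trainable_parameters()  # reports the small fraction of parameters being tuned
```

Managing the quantization error mentioned above comes down to choices such as the quantization type and compute dtype in the configuration.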
LOMO, or LOw-Memory Optimization, fuses gradient computation and parameter updates into one step to minimize the memory usage during backpropagation. This method is advantageous as it reduces the memory footprint during training, enabling efficient fine-tuning on hardware with limited memory capacities. However, the fused operations can complicate its implementation and debugging [
4,
5].
Delta-Tuning restricts tuning to a low-dimensional manifold, acting as an optimal controller to guide model behavior with minimal parameter updates. This method is highly efficient in terms of the number of parameters that need updating and maintains a high level of control over the model’s behavior. However, this restriction to a low-dimensional space may limit the model’s ability to fully adapt to complex tasks [
4,
5].
Diff Pruning introduces sparsity in the parameter updates, reducing the number of parameters that need to be fine-tuned. This method can lead to more efficient models if the sparsity is managed well, as it reduces the number of parameters to update, saving computational resources. However, if not applied carefully, pruning can lead to the loss of important information [
4,
5].
LoftQ combines quantization with low-rank adaptation techniques to improve the memory efficiency. This approach balances memory efficiency with fine-tuning performance and enhances generalization for downstream tasks. However, it requires the careful handling of quantization errors [
4,
5].
AdapterDrop reduces the number of adapters used during inference to save memory without sacrificing the performance. This technique is advantageous as it saves memory during inference while maintaining the performance by using only the necessary adapters. However, dropped adapters may sometimes contain useful task-specific information [
4,
5].
AutoPEFT automates the configuration of parameter-efficient fine-tuning methods to find the most efficient setup for a given task. It automates the selection and configuration process, saving time and effort, and can find optimal configurations that may not be obvious. However, the automation process itself may require additional computational resources and time [
4,
5].
In one study [
9], an alternative approach to the fine-tuning of a large language model (LLM) is explored, utilizing the OpenAI API to enhance the performance in educational tasks with the GPT-3.5-turbo model. This technique is distinct as it does not modify the model’s weights or architecture. Instead, it employs training data to auto-complete the inference prompt with examples that aid the model in comprehending the task and generating more accurate responses. This approach is termed “few-shot learning”. According to [
10], few-shot learning demonstrates the potential to train models effectively with limited labeled data, leveraging examples embedded within the prompt to adapt to the task requirements. A related technique, “zero-shot learning”, is detailed in reference [
3]. This method leverages external data to augment the model’s knowledge by incorporating these data into the prompt, allowing the model to interpret the text and generate accurate answers to user queries. This approach reduces the likelihood of hallucinations and is beneficial for the integration of newer data. Both few-shot and zero-shot learning methods have been demonstrated to improve the response quality of large language models.
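A minimal sketch of how few-shot prompting can be assembled without touching model weights is shown below; the example question–answer pairs and the framing text are purely illustrative assumptions, and the resulting prompt would be passed to whichever completion endpoint the study in question used.

```python
# Assemble a few-shot prompt: labeled examples are embedded directly in the input text,
# so the underlying model weights are never modified.
few_shot_examples = [
    ("What is the powerhouse of the cell?", "The mitochondrion."),
    ("Who formulated the laws of motion?", "Isaac Newton."),
]

def build_few_shot_prompt(examples, user_question):
    parts = ["Answer the question concisely."]
    for question, answer in examples:
        parts.append(f"Q: {question}\nA: {answer}")
    parts.append(f"Q: {user_question}\nA:")
    return "\n\n".join(parts)

prompt = build_few_shot_prompt(
    few_shot_examples, "What gas do plants absorb during photosynthesis?"
)
print(prompt)  # the assembled text is sent as-is to the LLM's completion endpoint
```

Zero-shot prompting follows the same pattern with the example list left empty and, where applicable, retrieved external data placed in the framing text instead.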
Multiple LoRA-Adapter Fusion, as utilized in reference [
6], represents an innovative approach in the fine-tuning of large language models (LLMs). This technique involves independently fine-tuning multiple LoRA adapters on distinct datasets, subsequently fusing these adapters using learnable weights. This method capitalizes on the specific strengths of each adapter, trained on diverse data, to ensure that the resultant model is versatile and performs optimally across various tasks. A notable advantage of this approach is its mitigation of the performance degradation typically associated with the direct fusion of different datasets. By employing learnable fusion weights, the contribution of each adapter is optimized, enhancing the overall model performance. However, this method’s complexity necessitates careful optimization to achieve the best results.
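The fusion idea can be summarized in a few lines: each independently trained adapter contributes a low-rank update, and learnable fusion weights control how much each update is mixed in. The sketch below is a schematic PyTorch interpretation under assumed shapes, not the exact procedure from the cited work.

```python
import torch
import torch.nn as nn

class FusedLoRAAdapters(nn.Module):
    """Combine several pre-trained LoRA adapters (B_i, A_i) with learnable fusion weights."""

    def __init__(self, adapters, scaling: float = 1.0):
        super().__init__()
        # adapters: list of (B, A) pairs, each B: (d_out, r), A: (r, d_in), trained separately
        self.adapters = nn.ParameterList()
        for B, A in adapters:
            self.adapters.append(nn.Parameter(B, requires_grad=False))
            self.adapters.append(nn.Parameter(A, requires_grad=False))
        self.fusion_weights = nn.Parameter(torch.ones(len(adapters)) / len(adapters))
        self.scaling = scaling

    def delta(self, x: torch.Tensor) -> torch.Tensor:
        weights = torch.softmax(self.fusion_weights, dim=0)  # keep contributions normalized
        out = 0.0
        for i in range(len(weights)):
            B, A = self.adapters[2 * i], self.adapters[2 * i + 1]
            out = out + weights[i] * (x @ A.T @ B.T)
        return self.scaling * out  # added to the frozen base output, as in plain LoRA

adapters = [(torch.zeros(64, 4), torch.randn(4, 64)) for _ in range(3)]
fusion = FusedLoRAAdapters(adapters)
print(fusion.delta(torch.randn(2, 64)).shape)  # torch.Size([2, 64])
```

Only the fusion weights are trainable here, which mirrors the idea of optimizing each adapter's contribution rather than merging datasets directly.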
Traditional fine-tuning [
11] entails updating all parameters of a pre-trained LLM to tailor it to a specific downstream task. This process typically begins with a model trained on a broad corpus, which is then fine-tuned by adjusting the entire set of model parameters using a task-specific dataset. The primary advantage of traditional fine-tuning is its flexibility and capacity to fully adapt the model to new tasks, capturing intricate task-specific nuances while leveraging the comprehensive pre-trained knowledge of the model. However, this process is computationally intensive, requiring substantial memory and processing power, particularly as the model size increases. Additionally, it can be time-consuming and carries a risk of overfitting if not properly managed, given that the entire model is adjusted for potentially smaller, more specific datasets.
In reference [
11], a comparative analysis is presented of various fine-tuning methodologies, including traditional fine-tuning without prompts, hard prompting with unfrozen LLMs, soft prompting with unfrozen LLMs, and soft prompting with frozen LLMs. Traditional fine-tuning involves updating all model parameters, ensuring thorough adaptation to new tasks but at a high computational cost. Hard prompting with unfrozen LLMs integrates explicit textual prompts and updates all parameters, incorporating clear task-specific instructions directly into the input, yet still demands significant computational resources. Soft prompting with unfrozen LLMs, identified as prefix-tuning in
Table 1, employs continuous, trainable vectors added to the input sequence, facilitating adaptation with fewer parameter changes and thus balancing efficiency and performance. Notably, the cited study highlights that soft prompting with frozen LLMs, where only the prompt embeddings are updated while the model parameters remain fixed, significantly reduces the computational costs and simplifies the training process. This method not only enhances the efficiency but also improves few-shot learning and cross-institution generalizability, making it a viable strategy for the deployment of LLMs in clinical settings where multiple tasks must be managed efficiently by a single model.
In terms of applicability and trade-offs, LoRA offers a highly efficient approach for scenarios where hardware constraints exist, such as edge devices or environments with limited computational resources. Its ability to maintain the integrity of the original model’s parameters makes it ideal for use cases requiring model reuse across different tasks. However, tuning the low-rank matrices to achieve optimal performance can be a delicate and time-consuming process, as it depends on the selection of appropriate ranks for the matrices.
Similarly, QLoRA is beneficial when memory efficiency is paramount, as it combines quantization with low-rank adaptation to enable fine-tuning on resource-limited hardware. This makes QLoRA particularly applicable in settings where the memory bandwidth is restricted, such as mobile devices. However, the potential performance degradation due to quantization errors remains a trade-off that requires careful mitigation to maintain accuracy close to that of full-precision models.
Adapters and Prefix-Tuning are especially suitable for multi-task environments, as they allow modular updates tailored to specific tasks while preserving the model’s core knowledge. Adapters add only a minor computational overhead, but this can accumulate when handling many tasks, leading to longer inference times. Prefix-Tuning, while lightweight and parameter-efficient, can introduce additional sequence lengths, which may impact the processing time in latency-sensitive applications.
AdapterDrop is advantageous in applications where memory efficiency during inference is critical, as it selectively reduces the number of active adapters. This method strikes a balance between memory usage and task performance; however, it may inadvertently omit valuable task-specific information when certain adapters are excluded.
LoRA-Adapter Fusion provides enhanced adaptability by fusing multiple task-specific adapters, making it particularly useful for comprehensive models applied across diverse domains. Nevertheless, this technique requires substantial optimization to manage the increased complexity, as the fused adapters must be weighted carefully to avoid diminishing returns from suboptimal fusion configurations.
3.1.1. Computational Efficiency of Methods for Fine-Tuning of Large Language Models
One of the primary considerations for any fine-tuning method is its computational efficiency, particularly when working with large models like GPT or LLaMA. Among the methods analyzed, LoRA (Low-Rank Adaptation) and its variant, QLoRA, stand out for their exceptional efficiency. By focusing on low-rank updates to certain layers, these methods significantly reduce the memory usage and computational requirements while maintaining model performance [
4,
8]. BitFit, which fine-tunes only the bias terms, is also extremely lightweight, making it ideal for low-resource settings [
4]. However, traditional fine-tuning, which updates all model parameters, is resource-intensive and often requires parallel processing on multiple GPUs, making it less suitable for applications where the computational resources are limited [
4].
Techniques like adapters and Prefix-Tuning offer a balanced trade-off between resource consumption and performance, introducing additional parameters while only training these, thus reducing the need to update the entire model [
5]. AdapterDrop, a variation of the adapter method, improves the efficiency by activating the adapters only when necessary, thereby dynamically adjusting the resource usage [
6]. On the other hand, LoRA-Adapter Fusion, which combines the strengths of both LoRA and adapters, demonstrates moderate efficiency by enabling multi-domain fine-tuning without requiring the full resources of traditional methods [
6].
Delta-Tuning and Diff Pruning are noteworthy for their ability to reduce the computational overhead by selectively updating only the most important model parameters or pruning non-essential ones. These methods strike a balance between efficiency and performance [
4]. In contrast, few-shot learning and zero-shot learning offer the highest levels of computational efficiency since they require minimal or no training data for new tasks, leveraging the model’s pre-existing capabilities [
5].
Based on the findings from this systematic literature review, we conducted a comparative analysis of the efficiency levels of various fine-tuning methods for a large language model (LLM), with a focus on the computational power required for each method. The methods were categorized into five efficiency levels: low, low to moderate, moderate, high, and very high. These results, which are our conclusions based on information extracted from the articles selected through the SLR, are summarized and presented in
Figure 3.
For a more comprehensive comparison,
Table 3 presents the details regarding the efficiency of each fine-tuning method.
3.1.2. Accuracy of Methods for Fine-Tuning of Large Language Models
When it comes to accuracy, traditional fine-tuning generally offers the highest performance, as it updates all model parameters, ensuring complete adaptation to the target task [
4]. However, methods like LoRA, adapters, and Prefix-Tuning can achieve comparable levels of accuracy, particularly when fine-tuned for specific domains. LoRA maintains high accuracy across various NLP tasks, especially in domain-specific settings [
11], while adapters and LoRA-Adapter Fusion allow for strong performance in multi-domain applications [
6].
QLoRA, which combines quantization with LoRA, is particularly effective in memory-intensive tasks like medical summarization, maintaining high accuracy while significantly reducing the memory usage [
8]. On the other hand, BitFit and LOMO might sacrifice some accuracy due to their minimalist approach, but they still perform well in low-resource environments [
4]. Prompt-Tuning, while efficient, tends to show slightly lower accuracy in complex tasks due to its reliance on modifying input prompts rather than the model parameters [
5].
Methods like few-shot learning and zero-shot learning exhibit varying levels of accuracy depending on the similarity between the pre-trained model’s capabilities and the new task. While these methods perform remarkably well in structured tasks, they may struggle in highly specialized domains where more fine-grained adaptation is needed [
9]. Delta-Tuning and Diff Pruning also offer solid accuracy levels, particularly in tasks that require specialized adaptations without the need for full model retraining [
4].
Building on the insights gained from this systematic literature review, we carried out an extensive comparative analysis focusing on the accuracy levels of the different fine-tuning methods applied to a large language model (LLM). The analysis classified these methods into five distinct accuracy categories: low, low to moderate, moderate, high, and very high. A summary of these results, which are our conclusions based on information extracted from the articles selected through the systematic literature review, is illustrated in
Figure 4, providing a clear visual representation of how each method performs across these categories.
For a more detailed comparison,
Table 4 provides the accuracy specifics for each fine-tuning method.
3.1.3. Applicability of Methods for Fine-Tuning of Large Language Models
The applicability of fine-tuning methods depends largely on the target task and the availability of resources. LoRA and adapters are widely applicable across various NLP tasks, making them go-to options for both academic research and industry use cases. LoRA, with its efficiency and strong performance, is ideal for applications where the computational resources are constrained but high accuracy is still required [
11]. Adapters, and particularly LoRA-Adapter Fusion, are suitable for multi-domain tasks, allowing for flexible fine-tuning across different applications without the need to retrain the entire model [
6].
In environments with stringent resource limitations, BitFit and LOMO are highly applicable, offering lightweight fine-tuning solutions that can be deployed on low-end hardware [
4]. Prompt-Tuning, due to its focus on modifying input prompts, is particularly useful in zero-shot or rapid deployment scenarios where retraining the model is not feasible [
11]. Few-shot learning and zero-shot learning are applicable in cases where there are limited or no training data available, making them ideal for generalized tasks or domains with few labeled data [
5].
More specialized methods like QLoRA, Delta-Tuning, and Diff Pruning are highly applicable in domains such as healthcare or finance, where resource efficiency must be balanced with high accuracy and privacy protection [
8]. QLoRA, in particular, is well suited for applications involving large-scale datasets and sensitive information, as it minimizes the memory usage while maintaining the performance [
8]. Diff Pruning, which focuses on reducing the model size without sacrificing accuracy, is ideal for deployment on edge devices with limited computational power [
4].
The choice of fine-tuning method depends heavily on the specific requirements of the task, the available computational resources, and the desired level of accuracy. Methods like LoRA, adapters, and LoRA-Adapter Fusion offer strong performance across a wide range of tasks, while BitFit, LOMO, and Prompt-Tuning provide efficient alternatives for low-resource environments. Few-shot and zero-shot learning are invaluable when task-specific data are scarce, and QLoRA and Diff Pruning cater to specialized, resource-constrained applications. Each method presents a unique balance of computational efficiency, accuracy, and applicability, making it essential to choose the one that aligns best with the specific needs of the project.
Drawing on the insights obtained from this systematic literature review, we performed an in-depth comparative analysis that examined the applicability levels of the various fine-tuning methods utilized for a large language model (LLM). In this analysis, each method was categorized into five specific applicability levels: low, low to moderate, moderate, high, and very high. The findings from this assessment, derived from our conclusions based on information extracted from the articles selected through the SLR, are summarized in
Figure 5, which offers a comprehensive visual overview of how each fine-tuning method aligns across these categories. This detailed comparison sheds light on the relative applicability and suitability of the different techniques evaluated.
To offer a more detailed comparison,
Table 5 outlines the specifics of the applicability of each fine-tuning method.
3.2. Techniques for Searching of Relevant Data for User Questions in Databases
In the field of automatic question-answering systems, finding relevant data to address user questions is a critical task. This section delves into the various methodologies employed to search for relevant data within databases, ensuring precise and contextually appropriate responses. The methods highlighted here stem from comprehensive literature reviews, advanced retrieval techniques, and the innovative integration of machine learning models.
Initial QA systems heavily relied on pattern matching using regular expressions and context-free grammars (CFG). These methods involved hard-coding complex logical rules to identify specific sequences of characters or parts of speech within documents. Despite enabling basic QA capabilities, these techniques were rigid and limited in capturing the semantic nuances of natural language, restricting their application to simple keyword searches and scripted responses [
12].
As the field evolved, statistical methods emerged, representing words as numerical vectors and reducing documents to fixed-length lists of numbers. These methods involved preprocessing text by removing stop-words and stemming and then constructing matrices, such as document–term matrices. Statistical operations computed the distances and similarities between vectors, allowing more sophisticated QA by comparing vector representations of text. However, these methods still struggled to capture deeper contextual meanings and relationships within the text [
12].
Building on the foundational tasks of data identification, diverse methodologies have been developed for QA systems.
Text-Based QA Systems: These systems utilize unstructured documents, such as Wikipedia entries, through processes like question analysis, passage retrieval, and answer extraction. Question analysis dissects the input to understand its intent using morphological, syntactical, and semantic analysis. Passage retrieval searches for relevant documents, extracting and ranking excerpts based on their potential to answer the query. Finally, answer extraction generates and validates the answer from the retrieved passages [
13].
Knowledge-Based QA Systems: These systems leverage structured data from knowledge bases. They employ information retrieval to sort candidate answers and semantic parsing to convert sentences into executable semantic representations. Neural networks enhance the parsing capabilities, making these systems more robust in handling complex queries [
13].
Hybrid QA Systems: Combining structured and unstructured data, hybrid systems integrate text-based and knowledge-based approaches. They employ question analysis to reduce the search space by understanding the question’s context and intent, utilizing techniques like information retrieval and semantic parsing to enhance the efficiency and accuracy [
13].
Transformer-based models have significantly advanced QA systems. BERT (Bidirectional Encoder Representations from Transformers) and DistilBERT exemplify this progress. BERT is pre-trained on extensive text corpora using a masked language model objective and fine-tuned for tasks like question answering. It identifies the start and end positions of answers within passages, providing a robust mechanism for context understanding. DistilBERT, a compact and faster variant of BERT, achieves similar performance with reduced computational requirements. The CO-SE pipeline, presented in [
14], leverages Transformer-based models, consisting of two main components: the retriever and the reader. The retriever employs the TF-IDF (Term Frequency-Inverse Document Frequency) model to fetch relevant documents. These documents are then passed to the reader, which uses the fine-tuned DistilBERT model to extract precise answers by understanding the context within the text, efficiently addressing specific queries like those related to COVID-19 [
14,
15].
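A simplified retriever–reader pipeline in this spirit can be put together with scikit-learn and the Hugging Face transformers library, as sketched below; the document collection and the model checkpoint are illustrative assumptions rather than the exact CO-SE configuration described in the cited work.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity
from transformers import pipeline

documents = [
    "COVID-19 vaccines reduce the risk of severe illness and hospitalization.",
    "The incubation period of COVID-19 is typically between two and fourteen days.",
]

# Retriever: TF-IDF ranking of documents against the question
vectorizer = TfidfVectorizer(stop_words="english")
doc_matrix = vectorizer.fit_transform(documents)

question = "How long is the incubation period of COVID-19?"
scores = cosine_similarity(vectorizer.transform([question]), doc_matrix)[0]
best_doc = documents[scores.argmax()]

# Reader: an extractive QA model pulls the answer span from the retrieved passage
reader = pipeline("question-answering", model="distilbert-base-cased-distilled-squad")
print(reader(question=question, context=best_doc)["answer"])
```

The division of labor is the same as in the pipeline described above: a lightweight statistical retriever narrows the search space, and a Transformer reader extracts the precise answer span.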
Contextual embeddings, such as ELMo, BERT, and GPT, also enhance QA systems. ELMo uses a bidirectional LSTM model to derive word embeddings based on entire input sentences, capturing deeper, context-dependent word meanings. BERT employs a bidirectional Transformer encoder to predict masked words, integrating both left and right contexts, making it highly effective for QA. GPT combines unsupervised pre-training with supervised fine-tuning, learning long-range dependencies and acquiring substantial world knowledge, thus improving the QA performance [
12,
15].
An alternative approach to measuring textual similarity involves integrating two methods, word embedding and TF-IDF weighting, as described in reference [
16]. This approach leverages models such as Word2Vec, GloVe, and FastText to generate word embeddings, which capture the semantic relationships between words. Concurrently, the TF-IDF model is employed to assess the significance of the words within the text. By combining these models, the method calculates word embeddings for each word in a document and then determines the document’s overall embedding by computing a weighted average of these word embeddings, with weights derived from the TF-IDF scores. This integrated approach enables the retrieval of relevant information from extensive databases by calculating the cosine similarity between a user’s query and each document, thus facilitating efficient and accurate data retrieval.
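A compact sketch of this combination is given below, using randomly generated toy word vectors in place of a trained Word2Vec, GloVe, or FastText model; the documents, query, and vector values are invented for illustration.

```python
import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer

documents = [
    "solar panels convert sunlight into electricity",
    "wind turbines generate electricity from wind",
]

vectorizer = TfidfVectorizer()
tfidf = vectorizer.fit_transform(documents)
terms = vectorizer.get_feature_names_out()

# Toy word embeddings standing in for Word2Vec/GloVe/FastText vectors
rng = np.random.default_rng(0)
vocab_vectors = {term: rng.normal(size=50) for term in terms}

def embed_text(text):
    """TF-IDF-weighted average of the word embeddings for a piece of text."""
    weights = vectorizer.transform([text]).toarray()[0]
    weighted = sum(weights[i] * vocab_vectors[t] for i, t in enumerate(terms) if weights[i] > 0)
    return weighted / (weights.sum() + 1e-9)

def cosine(a, b):
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b) + 1e-9))

doc_embs = [embed_text(d) for d in documents]
query_emb = embed_text("how is electricity produced from sunlight")
print([round(cosine(query_emb, e), 3) for e in doc_embs])  # ranks documents by similarity
```

Replacing the toy vectors with real pre-trained embeddings reproduces the retrieval scheme described in the cited reference.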
A solution for low-resource languages consists of cross-lingual models like XLM-R and Unicoder, which leverage multilingual data to optimize QA tasks across different languages. XLM-R enhances the performance through cross-lingual transfer, while Unicoder employs various pre-training tasks to become a robust language-independent encoder for QA [
12,
15]. The idea of using bilingual models was also presented in reference [
17] as a solution for low-resource languages.
In some special cases, an approach like a template-based method can be used to address the limitations inherent in more complex question-answering systems. A template-based approach employs predefined templates to generate responses to specific types of queries. This method involves creating a set of patterns or templates that can be matched against the input query to produce a structured response. The templates are designed based on the anticipated structure of the questions, allowing the system to quickly identify the relevant template and fill in the necessary information from the provided data. This approach is particularly effective in handling straightforward, factual queries where the answer can be directly extracted from the available data, without the need for extensive reasoning or inference. However, template-based systems may struggle with more complex queries that require deeper understanding and contextual analysis, as they rely heavily on predefined templates and may not adapt well to variations in question phrasing or unexpected query types [
15].
The hybrid approach outlined in [
18] utilizes MultiWordNet to identify synonyms and hypernyms, along with a Word2Vec model trained on domain-specific data to rank and filter these expanded terms based on their semantic similarity to the query. This multi-step pipeline ensures that queries are enhanced with contextually relevant terms, thereby improving the precision and recall of relevant sentence retrieval. By contextualizing terms within the document collection, this approach addresses the limitations of previous methods, capturing both the semantic and syntactic nuances, which is essential for accurate QA. The integration of lexical resources and word embeddings represents a significant advancement in searching for relevant data in automatic question-answering systems, demonstrating improved performance in closed-domain QA tasks.
A technique to enhance the accuracy of a system that retrieves pertinent information for user queries from extensive databases is discussed in reference [
19]. Large databases present certain challenges, such as prolonged search times and the presence of similar data that may confuse the model. This paper introduces the method of clustering questions, where a user’s query is categorized into one of these clusters. After identifying the relevant cluster, the system conducts a search within this cluster to locate the most suitable answer to the user’s query. This process entails comparing the user’s query with the questions and answers within the cluster using predefined similarity measures. This comprehensive approach enables question-answering (QA) systems to address a wide range of question types and contexts, ultimately enhancing the user experience by delivering more precise and relevant answers.
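The clustering-then-search idea can be prototyped with scikit-learn as shown below, using TF-IDF vectors as a stand-in for whatever question representation a real system would use; the FAQ entries, cluster count, and query are assumptions made for illustration.

```python
from sklearn.cluster import KMeans
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

faq = [
    ("How do I reset my password?", "Use the 'Forgot password' link on the login page."),
    ("How can I change my password?", "Open account settings and choose 'Change password'."),
    ("What payment methods are accepted?", "Credit cards and bank transfers are accepted."),
    ("Can I pay with PayPal?", "PayPal payments are currently not supported."),
]

questions = [q for q, _ in faq]
vectorizer = TfidfVectorizer()
X = vectorizer.fit_transform(questions)

# Step 1: cluster the stored questions (two clusters assumed here)
kmeans = KMeans(n_clusters=2, n_init=10, random_state=0).fit(X)

# Step 2: assign the user query to a cluster, then search only inside that cluster
user_query = "I forgot my password, what should I do?"
q_vec = vectorizer.transform([user_query])
cluster = kmeans.predict(q_vec)[0]

members = [i for i, c in enumerate(kmeans.labels_) if c == cluster]
sims = cosine_similarity(q_vec, X[members])[0]
best = members[sims.argmax()]
print(faq[best][1])
```

Restricting the similarity search to a single cluster is what shortens the search time on large databases, as the cited paper argues.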
To thoroughly evaluate the performance of question-answering (QA) systems, several well-established metrics are employed, each providing unique insights into different aspects of accuracy and effectiveness. Firstly, accuracy is a straightforward metric often used for datasets containing single-answer questions. It measures the proportion of correctly answered questions by dividing the number of correct answers by the total number of questions answered. This metric is particularly useful for simple, fact-based questions where there is only one correct answer. However, it may not fully capture the nuances of QA systems designed to handle more complex or multi-answer queries.
For cases where multiple correct answers are possible, the F-score is more appropriate as it combines both precision and recall into a harmonic mean. Precision represents the fraction of correctly retrieved answers out of all answers retrieved, while recall indicates the fraction of correctly retrieved answers out of all relevant answers. By calculating the F-score, which gives equal weight to precision and recall, we obtain a balanced view of the system’s ability to generate and retrieve correct answers. This is particularly useful in multi-answer scenarios, where both the completeness and correctness of the answers are crucial for a meaningful evaluation.
Another critical metric is the mean reciprocal rank (MRR), which assesses the QA system’s capability to rank correct answers highly. The MRR is calculated by taking the mean of the reciprocal ranks of the first correct answer across all questions, providing an average measure of how well the system prioritizes the most relevant answers. This metric is especially valuable in scenarios where users expect the most accurate answer to appear first or near the top of the list.
Mean Reciprocal Rank (MRR): This metric evaluates the rank of the first relevant answer for a set of questions. It is calculated as the average reciprocal rank across all questions and is defined as
$$\mathrm{MRR} = \frac{1}{|Q|} \sum_{i=1}^{|Q|} \frac{1}{\mathrm{rank}_i}$$
Here, $|Q|$ is the total number of questions, and $\mathrm{rank}_i$ represents the rank position of the first correct answer for question $i$. A higher MRR value indicates that relevant answers are ranked higher on average.
Mean Average Precision (MAP): This metric extends the evaluation by considering the precision of relevant answers at multiple recall levels for each question. It averages the average precision (AP) across all questions and is defined as
$$\mathrm{MAP} = \frac{1}{|Q|} \sum_{i=1}^{|Q|} \mathrm{AP}_i$$
Here, $\mathrm{AP}_i$ is the average precision for question $i$, calculated as
$$\mathrm{AP}_i = \frac{1}{R_i} \sum_{j=1}^{n} P(j) \cdot \mathrm{rel}(j)$$
In this equation, $R_i$ is the total number of relevant answers for question $i$, $P(j)$ represents the precision at position $j$, and $\mathrm{rel}(j)$ is a binary indicator that is 1 if the answer at position $j$ is relevant and 0 otherwise. The MAP provides a comprehensive view of a QA system’s performance by considering the ranked list of possible answers.
By employing these metrics together, researchers can gain a multidimensional understanding of the strengths and weaknesses of QA systems, enabling them to make targeted improvements [
13].
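As a worked illustration of the two ranking metrics defined above, the following snippet computes the MRR and MAP for a small set of hypothetical ranked answer lists, where each list marks relevant answers with 1 and irrelevant ones with 0.

```python
def mean_reciprocal_rank(ranked_relevance):
    """MRR: average of 1/rank of the first relevant answer per question."""
    total = 0.0
    for rels in ranked_relevance:
        for rank, rel in enumerate(rels, start=1):
            if rel:
                total += 1.0 / rank
                break
    return total / len(ranked_relevance)

def mean_average_precision(ranked_relevance):
    """MAP: mean over questions of the average precision at each relevant position."""
    ap_values = []
    for rels in ranked_relevance:
        hits, precisions = 0, []
        for rank, rel in enumerate(rels, start=1):
            if rel:
                hits += 1
                precisions.append(hits / rank)
        ap_values.append(sum(precisions) / max(hits, 1))
    return sum(ap_values) / len(ap_values)

# Three hypothetical questions; 1 marks a relevant answer at that rank position
runs = [[0, 1, 0], [1, 0, 1], [0, 0, 1]]
print(round(mean_reciprocal_rank(runs), 3))    # 0.611
print(round(mean_average_precision(runs), 3))  # 0.556
```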
This overview highlights the diverse and evolving methodologies in automatic QA systems, showcasing the continuous advancements aimed at improving the accuracy and efficiency in answering user queries.
3.3. Strategies for Implementation of Retrieval-Augmented Generation
Retrieval-augmented generation (RAG) improves the responses of large language models (LLMs) by leveraging external, authoritative knowledge bases, rather than depending solely on potentially outdated training data or the model’s internal knowledge. This method addresses the critical issues of accuracy and timeliness in LLM outputs. By incorporating RAG, the problem of hallucination in LLMs is significantly reduced, leading to responses that are relevant, up-to-date, and verifiable [
1]. Furthermore, the lack of knowledge about private data often leads to hallucinations, but, by using RAG, the model can access these data and minimize such inaccuracies. This enhances user trust and provides developers with a cost-effective means to boost the reliability and usefulness of LLMs across various applications. For instance, RAG enables advanced chatbot functionalities by combining techniques like LangChain and performance-optimized LLM fusion to ensure effective and precise responses. It employs web scraping capabilities to integrate extensive information, improving the contextual relevance and informativeness of responses. When fine-tuned with efficiency-driven strategies like LoRA and QLoRA, RAG can provide a seamless blend of external knowledge retrieval and computational efficiency, empowering applications such as custom chatbot development in medical and other specialized domains [
10]. RAG implementation can be divided into strategies applied before generation, during generation, after generation, and end-to-end [
1].
The before generation strategies involve augmenting the LLM before the text generation process begins.
LLM-Augmenter [
1,
20] is a system that augments a black-box LLM with plug-and-play (PnP) modules. The system retrieves evidence from external knowledge bases and forms evidence chains. This evidence is then used to prompt the LLM, which generates a response grounded in the retrieved evidence. The response is iteratively verified and refined to ensure accuracy.
FreshPrompt [
1,
21] is a method to address the static nature of most LLMs by incorporating dynamic, up-to-date information through a search engine. This approach enhances the model’s ability to adapt to evolving knowledge, improving the correctness of the responses in fast-changing scenarios.
Context curation [
22] involves adjustments to the retrieved content before inputting it to the LLM, which include reranking and context selection/compression, to highlight the most pertinent results and reduce noise.
The during generation strategy focuses on integrating knowledge retrieval during the text generation process, ensuring that each generated sentence is supported by relevant information.
Knowledge retrieval [
1,
23] is a method where potential hallucinations are detected and corrected during sentence generation. By utilizing logit output values, the system identifies inaccuracies and retrieves supporting knowledge to correct them, thus reducing the likelihood of hallucinations.
The Decompose-and-Query (D&Q) framework [
1,
24] addresses challenges in question answering by guiding models to use external knowledge and constraining their reasoning to reliable information. This framework significantly improves the robustness and accuracy of LLMs in generating answers.
The EVER framework [
1,
25] introduces a real-time verification and rectification framework that detects and corrects hallucinations during the generation process. This stepwise approach ensures that the generated content is trustworthy and factually accurate.
Adaptive retrieval [
22] involves techniques like Self-RAG, which allow the model to determine optimal moments and content for retrieval, enhancing its efficiency and relevance.
Recursive retrieval [
22] involves iteratively refining the search queries based on the previous results to improve the depth and relevance of information. This is particularly useful for complex search scenarios.
Post-generation or after generation techniques involve refining the generated output by cross-verifying it with external sources.
RARR (Retrofit Attribution using Research and Revision) [
1,
26] represents a system that retrofits attributions by conducting research and revising the generated content based on the retrieved evidence. This post hoc verification ensures that the final output aligns with verifiable facts.
High-entropy word spotting and replacement [
1,
27] involves detecting high-entropy words, which are more prone to hallucinations, and replacing them with lower-entropy alternatives from reliable models. This method effectively reduces hallucinations by focusing on the most problematic words in the generated text.
The end-to-end strategy of RAG [
1,
2] integrates a pre-trained sequence-to-sequence (seq2seq) Transformer with a dense vector index of Wikipedia, accessed through the Dense Passage Retriever (DPR). This combination allows the model to generate outputs based on both the input query and relevant documents retrieved by the DPR. By training the generator and retriever jointly, the model achieves enhanced performance in knowledge-intensive tasks, demonstrating the efficacy of combining parametric and non-parametric memory in LLMs.
The Naive RAG paradigm, one of the earliest methodologies in the field, is characterized by a basic “Retrieve–Read” framework, consisting of indexing, retrieval, and generation phases [
22]. Initially, raw data are indexed by cleaning and transforming diverse document formats into a consistent vectorized form, which allows efficient similarity-based retrieval. Upon receiving a user query, the model retrieves relevant content chunks and synthesizes a response through a frozen language model. However, Naive RAG often suffers from limitations in its retrieval precision and generation coherence, which may lead to irrelevant or hallucinated outputs.
Figure 6 (left) illustrates the simple linear process of Naive RAG, showcasing its foundational approach.
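A minimal “Retrieve–Read” loop in the spirit of Naive RAG is sketched below: documents are embedded once at indexing time, the query is matched by cosine similarity, and the top chunks are injected into the prompt of a frozen language model. The embedding function and the final generation call are placeholders standing in for whatever embedding model and LLM an actual deployment would use.

```python
import numpy as np

corpus = [
    "Modern EV batteries typically last between 10 and 20 years before major degradation.",
    "Battery lifespan depends on charging habits, temperature, and chemistry.",
    "Solid-state batteries promise higher energy density than lithium-ion cells.",
]

def embed(text: str) -> np.ndarray:
    """Placeholder embedding: a hashed bag-of-words vector standing in for a real embedding model."""
    vec = np.zeros(256)
    for token in text.lower().split():
        vec[hash(token) % 256] += 1.0
    return vec / (np.linalg.norm(vec) + 1e-9)

# Indexing phase: embed every chunk once and keep the vectors
index = np.stack([embed(doc) for doc in corpus])

# Retrieval phase: rank chunks by cosine similarity to the query
query = "How long do electric vehicle batteries last?"
scores = index @ embed(query)
top_chunks = [corpus[i] for i in np.argsort(scores)[::-1][:2]]

# Generation phase: the retrieved context is placed in the prompt of a frozen LLM
prompt = ("Answer using only the context below.\n\nContext:\n" + "\n".join(top_chunks)
          + f"\n\nQuestion: {query}\nAnswer:")
print(prompt)  # this prompt would be sent to the LLM's generation endpoint
```

Advanced and Modular RAG insert additional steps around this skeleton, such as query rewriting before retrieval and re-ranking of the retrieved chunks before generation.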
Advanced RAG builds upon the Naive RAG paradigm by introducing optimizations in the pre- and post-retrieval processes. In the pre-retrieval phase, techniques like query expansion and rewriting are employed to enhance the query clarity, while post-retrieval processes such as re-ranking and summary fusion are used to refine the retrieved context [
22]. These enhancements address some of the precision and relevance issues seen in Naive RAG, making the Advanced RAG framework more robust. As depicted in
Figure 6 (middle), Advanced RAG maintains a sequential structure but incorporates additional steps for higher retrieval accuracy and content integration.
Modular RAG represents a further advancement by enabling flexible module configurations that can be adapted to various tasks. Unlike Naive and Advanced RAG, which follow a rigid retrieval-to-generation flow, Modular RAG allows for the dynamic reconfiguration of components, such as adding dedicated search, fusion, and memory modules [
22]. This flexibility supports both sequential and iterative retrieval processes, enhancing the framework’s adaptability across diverse applications. As seen in
Figure 6 (right), Modular RAG’s architecture supports complex interactions between modules, facilitating enhanced retrieval performance and task-specific adjustments.
Each RAG paradigm illustrates the progressive evolution of retrieval-augmented generation systems, from the simplicity of Naive RAG to the adaptable, task-oriented structure of Modular RAG. These developments underscore the importance of modular design and strategic optimization in addressing retrieval challenges and achieving more contextually relevant outputs in natural language generation.
The evolution of retrieval-augmented generation (RAG) paradigms—from Naive to Advanced and Modular—highlights the significant advancements in integrating external knowledge with language models, each with distinct advantages and limitations [
22]. The Naive RAG framework, characterized by its simple “Retrieve–Read” process, excels in its straightforward implementation and accessibility for foundational use cases. However, it often suffers from low retrieval precision and generation coherence, which can lead to irrelevant or hallucinated outputs [
22]. Advanced RAG addresses these shortcomings through optimizations in the pre-retrieval and post-retrieval stages, such as query expansion and re-ranking, enhancing the retrieval accuracy and relevance. Despite these improvements, its sequential framework remains rigid, limiting its adaptability in dynamic scenarios [
22]. Modular RAG, the most flexible of the paradigms, introduces configurable components like dedicated search and memory modules, enabling task-specific and iterative retrieval processes. This adaptability makes it suitable for diverse applications but also increases the architectural complexity and resource requirements [
22]. Collectively, these paradigms illustrate the trade-offs between simplicity, robustness, and flexibility in RAG systems [
22].
In our research, we also found a case study in clinical informatics [
3], which demonstrated the practical application and benefits of RAG by employing the Llama 2 model with zero-shot prompting and integrating RAG to summarize and extract key clinical information from electronic health records (EHRs) related to malnutrition management. The incorporation of RAG significantly improved the model’s performance by providing access to an extended knowledge base, enhancing the accuracy and relevance of the generated summaries and extracted information. This exemplifies a “before generation” strategy, where information was extracted at an initial phase and subsequently used in the prompt of the LLM, enabling it to utilize new data, combined with the multi-tasking power of LLMs. By integrating RAG, the model was able to access more detailed and accurate information, leading to improved performance in tasks such as the structured summarization of clinical notes and the extraction of malnutrition risk factors. This integration mitigated the hallucination problem commonly observed in LLMs, demonstrating the value of RAG in enhancing the reliability of generative models in clinical settings.
In reference [
28], a method was discovered that utilized customer reviews and Q&A pairs to produce contextually relevant answers during response generation. This approach involved calculating embeddings for all e-commerce data collected from the internet and saving them. By using the cosine similarity between the user’s question embeddings and the stored embeddings, the most relevant review and Q&A pair were identified and used as input for the model. This dynamic retrieval and incorporation of pertinent user-generated content improves the relevance and accuracy of the responses, especially in e-commerce and customer service settings. Consequently, this strategy ensures that the generated text is not only precise but also closely aligns with users’ expectations and real-world feedback.
In reference [
29], the implementation of retrieval-augmented generation (RAG) within the domain of healthcare administrative task automation was examined. This investigation introduced a multi-agent framework that incorporated RAG to manage and optimize administrative functions, including patient registration, medical billing, and appointment scheduling. This system employed RAG to retrieve pertinent patient information from external databases and seamlessly integrate it into the large language model’s processing pipeline, ensuring that the generated responses were both accurate and contextually relevant. The multi-agent architecture facilitated efficient task decomposition and execution, utilizing RAG to verify and refine the data throughout various stages of the workflow. The RAG system operated using data embeddings, extracted from PDF files and databases, which were segmented into fixed-size chunks. The system then calculated embeddings for user queries to extract the most relevant information. This methodology not only alleviated the administrative workload on healthcare professionals but also improved the overall efficiency and accuracy of administrative processes, demonstrating a comprehensive end-to-end strategy for the application of RAG in a complex, real-world scenario.
Some more strategies can be found in reference [
30], where the authors presented the development of a retrieval-augmented generation (RAG) medical assistant for infectious diseases. This incorporated three distinct retrieval techniques: Naive RAG, auto-merging retriever-based RAG, and ensemble retriever-based RAG. The Naive RAG approach utilized a direct retrieval process that synthesized documents based on user queries, although it may not always yield the most precise information. The auto-merging retriever-based RAG method enhanced the accuracy by dividing longer documents into smaller sections, enabling more precise retrieval and improving the relevance of the synthesized responses. The ensemble retriever-based RAG further refined the retrieval process by combining multiple algorithms, employing the Reciprocal Rank Fusion (RRF) algorithm to aggregate the rankings and ensure that the final results were highly relevant. This comprehensive approach significantly bolstered the chatbot’s capability to provide accurate, contextually pertinent answers from a graph database-structured knowledge graph. Notably, the ensemble retriever demonstrated exceptional performance in terms of answer accuracy and contextual relevance, underscoring RAG’s potential to enhance the reliability and effectiveness of large language models (LLMs) in healthcare by grounding the responses in the most pertinent and up-to-date information available.
In
Figure 7, the process of answering complex user queries using RAG is illustrated. The figure showcases a workflow where a user query triggers the retrieval of relevant documents from a database, which are then processed by a large language model to generate a factually accurate and contextually rich response. This example emphasizes the advantages of RAG by integrating real-time data retrieval with LLM capabilities, ensuring that the generated outputs are informed by up-to-date and contextually relevant information. Unlike traditional approaches that rely solely on static, pre-trained knowledge, RAG dynamically incorporates external knowledge, enhancing the precision and relevance of the answers. This approach is particularly advantageous in addressing queries that require current and accurate data, such as understanding the lifespans of electric vehicle batteries, as shown in the example provided. A similar example based on the Public Sector Decarbonization Scheme can be found in [
31], underscoring the effectiveness of RAG in real-world applications.
In order to address the privacy concerns associated with sensitive data in a retrieval-augmented generation (RAG) system, a user rule-based database can be utilized to ensure that only accredited individuals have access. This approach involves setting access control rules that define permissions based on user roles and accreditation, which the system checks before granting data access. By leveraging role-based access control and authentication protocols, the system restricts data retrieval to authorized users only. This method not only protects sensitive information but also supports logging and auditing to ensure compliance with privacy regulations.
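A schematic of how such role-based filtering might sit in front of the retrieval step is shown below; the roles, permissions, and documents are invented for illustration and do not correspond to any system described in the reviewed articles.

```python
# Hypothetical role-based access control applied before RAG retrieval
ACCESS_RULES = {
    "clinician": {"clinical_notes", "lab_results"},
    "billing_staff": {"invoices"},
}

documents = [
    {"id": 1, "category": "clinical_notes", "text": "Patient shows signs of malnutrition."},
    {"id": 2, "category": "invoices", "text": "Outstanding balance: 120 EUR."},
]

def retrieve_for_user(role: str, query: str):
    """Return only the documents the user's role is accredited to see; log the access attempt."""
    allowed = ACCESS_RULES.get(role, set())
    visible = [d for d in documents if d["category"] in allowed]
    print(f"AUDIT: role={role!r} query={query!r} matched={len(visible)} documents")
    # A real system would now run the usual similarity search over `visible` only
    return visible

print(retrieve_for_user("billing_staff", "open invoices for patient 42"))
```

Because filtering happens before retrieval, unauthorized content never reaches the LLM's prompt, and the audit log supports compliance checks.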
To address the empirical evidence supporting the effectiveness of retrieval-augmented generation (RAG) in enhancing accuracy and reducing hallucination in large language models (LLMs), various studies provide concrete examples across different domains. For instance, the application of RAG in clinical settings demonstrated improved summarization and extraction accuracy when dealing with electronic health records (EHRs), as [
3] reported that integrating RAG with a generative model increased the summarization accuracy by 6%, achieving an impressive 99.25% overall accuracy, while also reducing hallucinations in extracting critical information from EHRs. Additionally, RAG has the capability to diminish hallucinations by leveraging external, authoritative knowledge bases, which allows models to generate responses based on current and factual data [
22]. This approach has also been effective in tasks requiring precise numerical extraction, as illustrated in [
32] via document-based question answering, which showed that RAG improved the accuracy in handling exact answer selection and complex numerical data. By drawing on external databases, RAG has proven instrumental in enhancing the reliability of LLMs, particularly in domains where outdated or unverified information could compromise the performance [
1].
After extracting the strategies for the implementation of a retrieval-augmented generation (RAG) system, which were identified using the systematic literature review (SLR) method, it was essential to establish a methodology for the evaluation of the interpretative capacity of large language models (LLMs) in processing retrieved data. As highlighted in reference [
32], several evaluation tasks have been identified to assess this interpretative ability. These tasks involve various question types, such as single-choice, multiple-choice, single-choice (numbers), yes–no questions, and number extraction. Each of these question formats is designed to challenge the model’s capacity to select or generate accurate answers based on the retrieved content. For evaluation purposes, a dataset in English, tailored to advanced models like GPT-3 and GPT-4, was employed. Using predefined answers, the accuracy of the models was measured, allowing for a comparative analysis of their performance based on their ability to interpret and process the data retrieved.
In addition to the accuracy-focused metrics, reference [
22] introduces a set of quality scores, including context relevance, answer faithfulness, and answer relevance. These scores are crucial in assessing the efficiency of LLMs in handling retrieved data, as they measure the model’s ability to deal with noise, filter out irrelevant data, synthesize information from multiple documents, and identify counterfactuals. For performance assessment, task-specific metrics such as the exact match (EM),
F1 score, BLEU, and ROUGE are commonly applied to evaluate an LLM’s ability to produce responses that are both accurate and coherent. These metrics each contribute unique insights into the model’s performance. The
F1 score, for example, measures the balance between precision (the proportion of relevant items among the retrieved items) and recall (the proportion of relevant items that were retrieved), which is especially useful in evaluating whether the model’s responses are both relevant and complete in covering the expected information. BLEU (Bilingual Evaluation Understudy) compares the overlap of n-grams between the model-generated response and a known result, providing a quantitative measure of how similar the response is to an expected output. ROUGE (Recall-Oriented Understudy for Gisting Evaluation), on the other hand, assesses the quality of the generated response by evaluating recall-based overlaps of n-grams, which helps to determine how much of the reference text the response captures.
The F1 score, BLEU score, and ROUGE score are defined as follows.
F1 Score: The F1 score is a statistical measure that combines precision and recall into a single metric. It is particularly useful when the balance between precision (how many predicted positives are actual positives) and recall (how many actual positives are correctly identified) is critical. The F1 score is the harmonic mean of the precision and recall and is defined as
$$F1 = 2 \cdot \frac{\mathrm{Precision} \cdot \mathrm{Recall}}{\mathrm{Precision} + \mathrm{Recall}}$$
In this formula, a higher F1 score indicates better performance in balancing precision and recall.
BLEU (Bilingual Evaluation Understudy): BLEU is a metric designed to evaluate the quality of the text generated by a model by comparing it to a reference text. It focuses on the precision of n-grams (sequences of n words), penalizing outputs that are too short using a brevity penalty (BP). To evaluate BLEU with multiple n-grams, the formula is expressed as
$$\mathrm{BLEU} = \mathrm{BP} \cdot \exp\left(\sum_{n=1}^{N} w_n \log p_n\right)$$
Here, $p_n$ represents the precision of n-grams, $w_n$ indicates the weight assigned to each n-gram level (e.g., unigrams, bigrams, etc.), and BP is the brevity penalty used to discourage excessively short generated outputs. BLEU captures the precision of n-grams while accounting for the length of the generated text.
ROUGE (Recall-Oriented Understudy for Gisting Evaluation): ROUGE is a family of metrics that measure the quality of generated text by assessing the overlap of n-grams between the generated and reference texts. ROUGE-N, one of the commonly used variants, is defined as
$$\mathrm{ROUGE\text{-}N} = \frac{\sum_{S \in \mathrm{References}} \sum_{\mathrm{gram}_N \in S} \mathrm{Count}_{\mathrm{match}}(\mathrm{gram}_N)}{\sum_{S \in \mathrm{References}} \sum_{\mathrm{gram}_N \in S} \mathrm{Count}(\mathrm{gram}_N)}$$
In this formula, $N$ refers to the length of the n-grams being evaluated. ROUGE-N specifically focuses on recall, measuring how many n-grams from the reference text are present in the generated text.
Equations (
6)–(
8) offer detailed insights into the mathematical foundations of these widely used evaluation metrics, each tailored to specific evaluation scenarios in text generation tasks.
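To ground these formulas, the short snippet below computes a token-level F1 and a unigram ROUGE-1 recall for a generated answer against a reference; it is a simplified illustration (no stemming, casing rules, or brevity penalty) rather than a full implementation of the official metrics.

```python
from collections import Counter

def token_f1(prediction: str, reference: str) -> float:
    """Token-level F1: harmonic mean of precision and recall over overlapping tokens."""
    pred_tokens, ref_tokens = prediction.lower().split(), reference.lower().split()
    overlap = sum((Counter(pred_tokens) & Counter(ref_tokens)).values())
    if overlap == 0:
        return 0.0
    precision = overlap / len(pred_tokens)
    recall = overlap / len(ref_tokens)
    return 2 * precision * recall / (precision + recall)

def rouge_1_recall(prediction: str, reference: str) -> float:
    """ROUGE-1 recall: fraction of reference unigrams that appear in the prediction."""
    pred_tokens, ref_tokens = prediction.lower().split(), reference.lower().split()
    overlap = sum((Counter(pred_tokens) & Counter(ref_tokens)).values())
    return overlap / len(ref_tokens)

reference = "the treaty was signed in 1648"
prediction = "the treaty was signed in paris"
print(round(token_f1(prediction, reference), 3))        # 0.833
print(round(rouge_1_recall(prediction, reference), 3))  # 0.833
```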
Further elaboration of the evaluation techniques is provided in reference [
33], which discusses model scoring systems like GPT-Eval and LLM-Mini-CEX. These LLM-based scoring systems are specially trained to assess the responses generated by other models, adding an extra layer of validation and enabling a deeper understanding of the model performance across multiple dimensions.