Web Application for Retrieval-Augmented Generation: Implementation and Testing
Abstract
1. Introduction
- By implementing the PaSSER application, the study provides a practical framework that can be used and expanded upon in future RAG research.
- The paper illustrates the integration of RAG technology with blockchain, enhancing data security and verifiability, which could inspire further exploration into the secure and transparent application of RAG systems.
- By comparing different LLMs within the same RAG framework, the paper provides insights into the relative strengths and capabilities of the models, contributing knowledge on model selection in RAG contexts.
- The focus on applying and testing within the domain of smart agriculture adds to the understanding of how RAG technology can be tailored and utilized in specific fields, expanding the scope of its application and relevance.
- The use of open-source technologies in PaSSER's development allows users to review and trust the application's underlying mechanisms. Moreover, it enables collaboration, provides the flexibility to adapt the application to specific needs or research goals, reduces development costs, facilitates scientific accuracy by enabling exact replication of research setups, and serves as a resource for learning about RAG technology and LLMs in practical scenarios.
2. Web Application Development and Implementation
2.1. LLMs Integration
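PaSSER serves the selected models (Mistral:7b, Llama2:7b, Orca2:7b) through Ollama. As an illustration of the integration point, the minimal Python sketch below (not PaSSER's actual client code; the endpoint is Ollama's default local address and the prompt is illustrative) issues one non-streaming generation request and reads back the timing counters reported in Section 4.1:

```python
import requests

# Query a locally served model via Ollama's REST API
# (assumes a default installation listening on localhost:11434).
resp = requests.post(
    "http://localhost:11434/api/generate",
    json={
        "model": "mistral:7b",  # also tested: llama2:7b, orca2:7b
        "prompt": "What is retrieval-augmented generation?",
        "stream": False,        # return a single JSON object
    },
    timeout=600,
)
data = resp.json()

print(data["response"])  # generated answer
# Usage counters of the kind reported in Section 4.1
# (Ollama returns durations in nanoseconds):
for key in ("total_duration", "load_duration", "prompt_eval_count",
            "prompt_eval_duration", "eval_count", "eval_duration"):
    print(key, data.get(key))
```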
2.2. PaSSER App Functionalities
- Cleaning and standardizing text data: removing unnecessary characters (punctuation and special characters), converting the text to a uniform case (usually lower case), and separating the text into individual words or tokens. In the implementation considered here, the text is divided into chunks with different overlaps.
- Vector embedding: converting tokens into numeric vectors using pre-trained word embedding models from the selected LLMs (in this case, Mistral:7b, Llama2:7b, and Orca2:7b). These models map words or phrases to high-dimensional vectors, so that each word or phrase is represented by a vector capturing its semantic meaning in the context in which it appears.
- Aggregating embeddings for larger text units, so that whole sentences or documents are represented as vectors. This can be achieved by simple aggregation (averaging the vectors of all words in a sentence or document) or by sentence transformers and document embedding techniques that account for the sequential and contextual nature of words. Here, the transformers of the selected LLMs are used.
- Creating a vector store to hold the vector representations in a structured format, using data structures optimized for operations on high-dimensional vectors. ChromaDB is used for the vector store.
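A minimal sketch of this ingestion pipeline using LangChain, Ollama embeddings, and ChromaDB (the chunk size, overlap, file name, and model tag are illustrative assumptions, not PaSSER's exact settings):

```python
from langchain.text_splitter import RecursiveCharacterTextSplitter
from langchain_community.embeddings import OllamaEmbeddings
from langchain_community.vectorstores import Chroma

raw_text = open("smart_agriculture_corpus.txt").read()

# Split the cleaned text into overlapping chunks (sizes are illustrative).
splitter = RecursiveCharacterTextSplitter(chunk_size=500, chunk_overlap=50)
chunks = splitter.split_text(raw_text)

# Embed each chunk with one of the selected models served by Ollama,
# and persist the vectors in a ChromaDB collection.
embeddings = OllamaEmbeddings(model="mistral:7b")
vectorstore = Chroma.from_texts(
    chunks, embeddings, persist_directory="./chroma_db"
)

# Retrieve the chunks most similar to a query.
hits = vectorstore.similarity_search("How is drip irrigation scheduled?", k=4)
```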
The testing workflow comprises the following steps:
1. Selection of a specific knowledge base in a specific domain.
2. Creation of a reference dataset for that domain: a collection of answers related to the selected domain is gathered, each containing key information related to potential queries in the area. These answers are saved in a text file format.
3. A selected LLM is deployed to systematically generate a question for each predefined reference answer. This produces a structured dataset of question-answer pairs, which is saved in JSON format.
4. The finalized dataset is uploaded to the PaSSER App, initiating an automated sequence of response generation for each query within the target domain. Each generated response is forwarded to a dedicated Python backend script, which scores the responses against the established reference answers using the predefined metrics (see the sketch after this list). The outcomes of this evaluation are stored on the blockchain, ensuring a transparent and immutable ledger of the model's performance metrics.
5. The results are retrieved from the blockchain for further processing and analysis.
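A condensed sketch of steps 3 and 4, assuming the models are queried through LangChain's Ollama wrapper; the prompt wording and file names are illustrative, and evaluate() and push_to_chain() are hypothetical placeholders standing in for the metric suite of Section 3 and the blockchain submission:

```python
import json
from langchain_community.llms import Ollama

llm = Ollama(model="mistral:7b")

# Step 3: generate one question per reference answer.
answers = open("reference_answers.txt").read().split("\n\n")
dataset = []
for answer in answers:
    question = llm.invoke(
        "Write a single question that the following text answers:\n" + answer
    )
    dataset.append({"question": question.strip(), "answer": answer.strip()})

json.dump(dataset, open("qa_dataset.json", "w"), indent=2)

# Step 4: generate a response for each question and score it against
# the reference answer.
for pair in dataset:
    generated = llm.invoke(pair["question"])
    scores = evaluate(generated, pair["answer"])   # hypothetical helper
    push_to_chain(pair["question"], scores)        # hypothetical helper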
3. Evaluation Metrics
3.1. METEOR (Metric for Evaluation of Translation with Explicit Ordering)
METEOR is computed in four steps:
- Word alignment between candidate and reference translations based on exact, stem, synonym, and paraphrase matches, under the constraint that each word in the candidate and reference sentences can be used only once; the alignment maximizes the overall match between the candidate and references.
- Calculation of unigram precision and recall, where $m$ is the number of matched words, $w_c$ the number of words in the candidate, and $w_r$ the number of words in the reference: $P = m/w_c$ and $R = m/w_r$.
- Calculation of a penalty for chunkiness, which accounts for the arrangement and fluency of the matches, where $c$ is the number of chunks of contiguous matched unigrams in the candidate translation and $m$ is the number of matched unigrams: $\text{Penalty} = 0.5\,(c/m)^3$.
- The final score is computed from the harmonic mean of precision and recall, weighted toward recall, adjusted by the penalty factor: $F_{mean} = \frac{10PR}{R + 9P}$ and $\text{METEOR} = F_{mean}\,(1 - \text{Penalty})$.
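A minimal sketch using NLTK's implementation, whose default parameters correspond to the formulation above (the example sentences are illustrative):

```python
import nltk
from nltk.translate.meteor_score import meteor_score

nltk.download("wordnet", quiet=True)  # stem/synonym matching uses WordNet

reference = "drip irrigation supplies water directly to the plant root zone".split()
candidate = "drip irrigation delivers water directly to the root zone".split()

# References are passed as a list of tokenized sentences.
score = meteor_score([reference], candidate)
print(f"METEOR: {score:.3f}")
```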
3.2. ROUGE (Recall-Oriented Understudy for Gisting Evaluation)
For ROUGE-N (n-gram overlap; ROUGE-1 uses unigrams):
- Recall is the ratio of the number of overlapping n-grams between the system summary and the reference summaries to the total number of n-grams in the reference summaries: $R_n = \frac{\sum_{gram_n \in \text{ref}} \text{Count}_{match}(gram_n)}{\sum_{gram_n \in \text{ref}} \text{Count}(gram_n)}$.
- Precision is the ratio of the number of overlapping n-grams in the system summary to the total number of n-grams in the system summary itself: $P_n = \frac{\sum_{gram_n \in \text{sys}} \text{Count}_{match}(gram_n)}{\sum_{gram_n \in \text{sys}} \text{Count}(gram_n)}$.
- F-score is the harmonic mean of precision and recall: $F_n = \frac{2\,P_n R_n}{P_n + R_n}$.

For ROUGE-L, based on the longest common subsequence (LCS) between the reference $X$ (of length $m$) and the generated summary $Y$ (of length $n$):
- Recall is the length of the LCS divided by the total number of words in the reference summary, measuring the extent to which the generated summary captures the content of the reference summaries: $R_{lcs} = \frac{\text{LCS}(X,Y)}{m}$.
- Precision is the length of the LCS divided by the total number of words in the generated summary, assessing the extent to which the words in the generated summary appear in the reference summaries: $P_{lcs} = \frac{\text{LCS}(X,Y)}{n}$.
- F-score is the harmonic mean of the LCS-based precision and recall: $F_{lcs} = \frac{(1+\beta^2)\,R_{lcs} P_{lcs}}{R_{lcs} + \beta^2 P_{lcs}}$.
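A minimal usage sketch of the pltrdy/rouge Python package, one of the evaluation dependencies listed for the backend (the example sentences are illustrative):

```python
from rouge import Rouge

hypothesis = "drip irrigation delivers water directly to the root zone"
reference = "drip irrigation supplies water directly to the plant root zone"

scores = Rouge().get_scores(hypothesis, reference)[0]
print(scores["rouge-1"])  # {'r': ..., 'p': ..., 'f': ...}
print(scores["rouge-l"])  # LCS-based recall, precision, f-score
```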
3.3. BLEU (Bilingual Evaluation Understudy)
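BLEU combines modified n-gram precisions $p_n$ (typically up to 4-grams) with weights $w_n$ through a geometric mean, scaled by a brevity penalty $BP$ that penalizes overly short candidates: $\text{BLEU} = BP \cdot \exp\left(\sum_{n=1}^{N} w_n \log p_n\right)$. A minimal sketch using NLTK's sentence-level implementation (the example sentences are illustrative; smoothing avoids zero scores when a higher-order n-gram has no match):

```python
from nltk.translate.bleu_score import sentence_bleu, SmoothingFunction

reference = "drip irrigation supplies water directly to the plant root zone".split()
candidate = "drip irrigation delivers water directly to the root zone".split()

score = sentence_bleu([reference], candidate,
                      smoothing_function=SmoothingFunction().method1)
print(f"BLEU: {score:.3f}")
```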
3.4. Perplexity (PPL)
For a word sequence $W = w_1 \dots w_N$, perplexity is $PPL(W) = \left(\prod_{i=1}^{N} P(w_i)\right)^{-1/N}$; two smoothed estimates of $P(w_i)$ are used:
- PPL with Laplace smoothing adjusts the probability estimate for each word by adding one to the count of each word in the training corpus, including unseen words, so that no word has a zero probability. With $c(w_i)$ the count of word $w_i$, $N_t$ the number of tokens in the training corpus, and $V$ the vocabulary size, the adjusted estimate is $P(w_i) = \frac{c(w_i) + 1}{N_t + V}$.
- PPL with Lidstone smoothing is a generalization of Laplace smoothing in which, instead of adding one to each count, a fraction $\lambda$ (where $0 < \lambda < 1$) is added. This allows more flexibility than the fixed increment of Laplace smoothing: $P(w_i) = \frac{c(w_i) + \lambda}{N_t + \lambda V}$.
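A schematic unigram implementation of these estimates (setting λ = 1 recovers Laplace smoothing; the corpus and λ value are illustrative):

```python
import math
from collections import Counter

def perplexity(test_tokens, train_tokens, lam=1.0):
    """Unigram perplexity with additive smoothing.
    lam = 1.0 is Laplace smoothing; 0 < lam < 1 is Lidstone."""
    counts = Counter(train_tokens)
    n_t = len(train_tokens)   # tokens in the training corpus
    v = len(counts) + 1       # vocabulary size (+1 slot for unseen words)
    log_p = sum(math.log((counts[w] + lam) / (n_t + lam * v))
                for w in test_tokens)
    return math.exp(-log_p / len(test_tokens))

train = "water the crops in the early morning".split()
print(perplexity("water the crops".split(), train))            # Laplace
print(perplexity("water the crops".split(), train, lam=0.1))   # Lidstone
```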
3.5. Cosine Similarity
3.6. Pearson Correlation
3.7. F1 Score
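Sections 3.5-3.7 use the standard definitions: cosine similarity $\cos\theta = \frac{A \cdot B}{\lVert A \rVert \, \lVert B \rVert}$ between vector representations, the Pearson correlation coefficient between paired score series, and $F_1 = \frac{2PR}{P + R}$. A compact sketch with NumPy and SciPy; the token-overlap F1 variant shown is an assumption drawn from common question-answering practice, not necessarily the paper's exact formulation:

```python
from collections import Counter
import numpy as np
from scipy.stats import pearsonr

def cosine_similarity(a, b):
    """Cosine of the angle between two embedding vectors."""
    a, b = np.asarray(a, float), np.asarray(b, float)
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

# Pearson correlation between two score series.
r, _ = pearsonr([0.2, 0.5, 0.9], [0.3, 0.4, 0.8])

def f1_score(candidate, reference):
    """Token-overlap F1 between a generated and a reference answer."""
    c, ref = Counter(candidate.split()), Counter(reference.split())
    overlap = sum((c & ref).values())
    if overlap == 0:
        return 0.0
    precision = overlap / sum(c.values())
    recall = overlap / sum(ref.values())
    return 2 * precision * recall / (precision + recall)
```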
4. Testing
Two hardware configurations were used:
- Intel Xeon, 32 cores, no GPU, 128 GB RAM, Ubuntu 22.04 OS;
- Mac Mini M1, 8-core CPU, 10-core GPU, 16 GB RAM, macOS 13.4.
4.1. Q&A Time LLM Test Results
4.2. RAG Q&A Score Test Results
4.3. Blockchain RAM Resource Evaluation
5. Discussion
6. Conclusions
Author Contributions
Funding
Data Availability Statement
Conflicts of Interest
References
| Metric | Llama2:7b (macOS/M1) | Llama2:7b (Ubuntu/Xeon) | Mistral:7b (macOS/M1) | Mistral:7b (Ubuntu/Xeon) | Orca2:7b (macOS/M1) | Orca2:7b (Ubuntu/Xeon) |
|---|---|---|---|---|---|---|
| Evaluation time (s) | Faster (51.613) | Slower (115.176) | Faster (35.864) | Slower (45.325) | Fastest (24.759) | Slowest (74.431) |
| Evaluation count (tokens) | Slightly higher (720) | Comparable (717) | Higher (496) | Lower (284) | Lower (350) | Higher (471) |
| Load duration (s) | Faster (0.025) | Slower (0.043) | Fastest (0.016) | Slower (0.039) | Similar (0.037) | Similar (0.045) |
| Prompt evaluation count (tokens) | Lower (51) | Higher (68) | Lower (47) | Higher (54) | Lower (53) | Highest (96) |
| Prompt evaluation duration (s) | Shorter (0.571) | Longer (5.190) | Shorter (0.557) | Longer (4.488) | Shorter (0.588) | Longest (6.955) |
| Total duration (s) | Shorter (52.211) | Longer (120.413) | Shorter (36.440) | Longer (49.856) | Shortest (25.387) | Longer (81.434) |
| Tokens/second | Higher (14.07) | Lower (6.3) | Higher (13.91) | Lower (6.36) | Highest (14.38) | Lower (6.53) |

Tokens/second corresponds to the evaluation count divided by the evaluation time (e.g., 720 tokens / 51.613 s ≈ 14 tokens/s for Llama2:7b on macOS/M1).
| Metric | Llama2:7b | Mistral:7b | Orca2:7b | Best Model | Role in Text Generation and Summarization Tasks |
|---|---|---|---|---|---|
| METEOR | 0.248 | 0.271 | 0.236 | Mistral:7b | Assesses fluency and adequacy of the generated response, considering synonymy and paraphrase. |
| ROUGE-1 recall | 0.026 | 0.032 | 0.021 | Mistral:7b | Measures the extent to which a generated summary captures key points from a source text, indicating coverage. |
| ROUGE-1 precision | 0.146 | 0.161 | 0.122 | Mistral:7b | Evaluates the fraction of content in the generated summary that is relevant to the source text, implying conciseness. |
| ROUGE-1 f-score | 0.499 | 0.472 | 0.503 | Orca2:7b | Balances recall and precision to assess the overall quality of a generated summary. |
| ROUGE-L recall | 0.065 | 0.07 | 0.055 | Mistral:7b | Reflects the degree to which a generated summary encompasses the content of a reference summary, based on the longest common subsequence. |
| ROUGE-L precision | 0.131 | 0.143 | 0.108 | Mistral:7b | Measures the accuracy with which a generated summary replicates the significant elements of the source text, based on the longest common subsequence. |
| ROUGE-L f-score | 0.455 | 0.424 | 0.457 | Orca2:7b | Integrates LCS-based precision and recall to evaluate the quality of a generated summary holistically. |
| BLEU | 0.186 | 0.199 | 0.163 | Mistral:7b | Quantifies the similarity of the generated text to reference texts by comparing n-grams; useful for machine translation and summarization. |
| Laplace perplexity | 52.992 | 53.06 | 53.083 | Llama2:7b | Estimates the likelihood of a sequence in generated text, indicating how well the model predicts sample sequences. |
| Lidstone perplexity | 46.935 | 46.778 | 56.94 | Mistral:7b | Assesses the smoothness and predictability of a text generation model by evaluating sequence likelihood with small probability adjustments. |
| Cosine similarity | 0.728 | 0.773 | 0.716 | Mistral:7b | Determines the semantic similarity between the vector representations of generated and reference texts. |
| Pearson correlation | 0.843 | 0.861 | 0.845 | Mistral:7b | Quantifies the linear correspondence between generated-text scores and human-evaluated scores, indicating model predictability and reliability. |
| F1 score | 0.178 | 0.219 | 0.153 | Mistral:7b | Combines precision and recall of the generated text in summarization tasks into a single measure of informational quality. |
| Operation | Bytes | RAM Price (SYS/kB) | EOS Price (USD) | RAM Cost (SYS) | Equivalent Cost (USD) |
|---|---|---|---|---|---|
| Time tests * | 402,300 | 0.01503345 | 1.01 | 5.91 | 5.97 |
| Score tests ** | 2,896,560 | 0.01503345 | 1.01 | 42.52 | 42.95 |
| Total | 3,298,860 | | | 48.43 | 48.92 |
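The SYS and USD columns follow from the byte counts: RAM cost in SYS = (bytes / 1024) × price per kB, converted to USD at the quoted price (the conversion assumes 1 SYS is valued at the EOS price of USD 1.01, as in the table). A quick arithmetic check:

```python
RAM_PRICE_SYS_PER_KB = 0.01503345
EOS_PRICE_USD = 1.01

def ram_cost(bytes_used):
    """RAM cost in SYS and its USD equivalent for a given byte count."""
    sys_cost = bytes_used / 1024 * RAM_PRICE_SYS_PER_KB
    return sys_cost, sys_cost * EOS_PRICE_USD

print(ram_cost(402_300))    # ~ (5.91, 5.97)   -> time tests
print(ram_cost(2_896_560))  # ~ (42.52, 42.95) -> score tests
```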