1. Introduction
Large Language Models (LLMs) have transformed information access and processing, particularly in question-answering systems. These models understand and respond to queries in natural language, facilitating intuitive and accessible interfaces for users. Their ability to analyze context and produce relevant and coherent responses is crucial in areas such as the legal [1], medical [2], and financial sectors [3].
These models, trained on vast datasets, can answer specific queries by extracting and synthesizing knowledge acquired during training. However, a significant challenge arises when the required information is restricted, that is, when it contains specific details that might not appear in their training data. This is the case, for instance, with technical manuals such as construction regulations, whose stipulations differ from one country to another, or with company policies and procedures that may diverge substantially from general regulations.
The central issue addressed in this paper is the evaluation of an LLM’s behavior when it is asked to reason and answer based solely on a restricted piece of information. This information is ideally part of the model’s training data, but it is modified and questioned in a way that forces the LLM to focus on particular details requiring reasoning and precise answer identification rather than a generic response. This distinction is crucial because it forces LLMs to combine training information with newly unseen content in a consistent manner.
The challenge therefore lies in these models’ ability to interpret and apply their knowledge to cases that, while superficially similar to previously seen situations, differ in critical aspects that may affect the accuracy of the generated responses. For example, in the legal field, an LLM might be trained on the jurisprudence of a specific country but be asked to apply those principles to a case in a different legal context which, although similar, has unique regulations and precedents. The ability to adjust to these subtle yet fundamental differences is what we set out to observe in order to optimize the effectiveness of LLMs in practical applications.
This objective also allows us to test information retrieval systems: although the main results focus on the questionnaire and each language model’s performance, various technical results are reported as well, since they give insight into the inner workings of information processing with LLMs.
To conduct an effective evaluation aligned with our objectives, a set of questions was developed around a widely known domain, namely The Bible, with answers that must be derived solely from a specific set of restricted verses. The biblical domain was selected as the central theme due to its immense cultural impact and the large amount of information available online, meaning that any capable LLM should be able to answer basic questions about it. Nonetheless, this domain is vast enough to allow very precise questions that cannot be answered with general knowledge alone. The questions themselves were created so that they must be answered on the basis of very particular verses, and on inferences drawn from them, as described below.
This study not only deepens our understanding of the current capabilities of LLMs but also raises crucial questions about how these models might be designed or modified to effectively handle selective information. Such considerations have significant implications for the design of artificial intelligence systems that must operate in restricted environments and handle sensitive information. This is particularly important given the increasing amounts of industry-level applications and personal assistants that use LLMs.
The work is structured as follows. Section 1 introduces the background and motivation for this study. Section 2 reviews related work in question answering and retrieval-augmented generation (RAG) systems. Section 3 provides a detailed description of the models employed in this research. Section 4 outlines the characteristics of the corpus and the questionnaire used. Section 5 details the methodology used throughout the study. Section 6 presents the results, followed by a discussion of these findings in Section 7. Finally, Section 8 summarizes the key conclusions.
3. Models
In this section, we describe the models used for our experiment. The RAG implementation is divided into the two standard stages: retrieval and generation. For retrieval, we apply the BGE-M3 embedding model [15] to our database and to the questions; the semantic search is carried out on the basis of these embeddings. For the second stage, answer generation, we use three state-of-the-art LLMs: Llama 2, in its fine-tuned 13B chat version [16]; GPT, in its 3.5 version [17]; and PaLM [18].
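To make the retrieval stage concrete, the following is a minimal sketch of semantic search over chapter texts with BGE-M3 dense embeddings, assuming the FlagEmbedding package; the chapter texts, the example question, and the defensive normalization are illustrative placeholders rather than the exact setup used in this work.

```python
# Minimal sketch of the retrieval stage: rank chapters by dense-embedding similarity.
# Chapter texts and the example question are placeholders for illustration only.
import numpy as np
from FlagEmbedding import BGEM3FlagModel

model = BGEM3FlagModel("BAAI/bge-m3", use_fp16=True)

chapters = ["Éxodo 14 ...", "1 Samuel 17 ...", "Juan 18 ..."]   # placeholder chapter texts
question = "¿Cómo mató David a Goliat?"

chapter_vecs = model.encode(chapters)["dense_vecs"]
question_vec = model.encode([question])["dense_vecs"][0]

# Normalize defensively so the dot product equals the cosine similarity.
chapter_vecs = chapter_vecs / np.linalg.norm(chapter_vecs, axis=1, keepdims=True)
question_vec = question_vec / np.linalg.norm(question_vec)

scores = chapter_vecs @ question_vec
best = int(np.argmax(scores))                 # index of the best-matching chapter
print(chapters[best], float(scores[best]))
```

In the actual experiment, this semantic search is performed over the embedded chapters of the corpus described in the next section.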
4. Corpus and Questions
The Bible stands as the most widely read text in Western history. Additionally, its narrative ranks among the most renowned and globally significant, with a huge number of translations. Consequently, many datasets include this collection of texts in their repositories, which means most LLMs should have knowledge of biblical writings. The Bible is also employed as a tool for the creation of parallel corpora for automatic translation [21], and it is frequently used as a reference corpus for paraphrase detection [22], since some languages have several versions and translations of the text. With this in mind, and taking into account the accessibility of the text, we consider The Bible a good option for evaluating the LLMs mentioned above.
The digital edition we selected was La Biblia—Latinoamérica [23], taken from the corpus created by Sierra et al. [24]. This version has labels for each verse, which facilitates the organization of the corpus since the information is clearly separated. In addition, each verse is manually annotated and revised to avoid any misleading information. This structure is computationally convenient, particularly for NLP techniques, since no additional segmentation has to be carried out for most tasks.
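As an illustration of how such verse labels make chapter-level organization straightforward, the sketch below groups labeled verses into chapter texts that can serve as retrieval units; the one-verse-per-line "Book Chapter:Verse<TAB>text" format and the helper name are hypothetical, not the actual markup of the corpus.

```python
# Hypothetical sketch: group verse-labeled lines into chapter-level texts for retrieval.
# The "Book Chapter:Verse<TAB>text" input format is an assumption for illustration only.
from collections import defaultdict

def chapters_from_verses(lines):
    chapters = defaultdict(list)
    for line in lines:
        label, text = line.split("\t", maxsplit=1)
        book, chap_verse = label.rsplit(" ", maxsplit=1)   # e.g. "Éxodo", "14:21"
        chapter_number = chap_verse.split(":")[0]
        chapters[f"{book} {chapter_number}"].append(text.strip())
    return {key: " ".join(verses) for key, verses in chapters.items()}

sample = [
    "Éxodo 14:21\tMoisés extendió su mano sobre el mar...",
    "Éxodo 14:22\tLos israelitas entraron en medio del mar...",
]
print(chapters_from_verses(sample))   # {'Éxodo 14': 'Moisés extendió ... Los israelitas ...'}
```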
To evaluate the performance and behavior of the LLMs, we wrote a series of questions about The Bible; some of the questions can be answered based on a specific passage of The Bible and some cannot. In total, 36 questions were selected for the final version of the experiment. The majority of the questions concern specific events, names, moments, and places. In general, the information necessary to answer a question is available entirely within one chapter, which facilitates the retrieval of information. The questions were created in this manner in order to limit each LLM’s answer to factual information only: the focus of the experiment is to test an LLM’s factual reasoning, not the religious interpretation of the text. These questions require the LLMs to treat The Bible as a source of information, not as religious guidance or as information that contradicts the model’s pretrained knowledge.
The questions formulated for the examination were categorized into two distinct groups:
Those that could be addressed using solely the information furnished by The Bible (Table 1).
Those that necessitated additional information beyond that provided by the scriptural text for resolution (Table 2).
The first category (question group 1) typically proves simpler to answer, as the required responses are directly available within a single chapter, or across multiple chapters, of The Bible.
Conversely, answering the second type of question (question group 2) requires more information than that provided by the context. For these questions, it is expected that the LLMs may not be able to provide answers, since the necessary information is not explicitly presented in the books of The Bible; it usually consists of interpretations and opinions formulated by scholars or ecclesiastical authorities.
As can be seen in the questionnaire, the questions focus on particular individuals with particular occupations, locations, and actions. These details are expressed purely through lexical elements (the names of the individuals, actions, etc.). Since dense embeddings work with (varying) context windows, we run the risk of mixing up or losing certain named entities, given that they can all be clustered into a generic biblical category. Using lexical embeddings, or a weighted sum that takes this lexical factor into consideration, might prove to be a better retrieval method for very precise queries, as sketched below.
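The following is a hedged sketch of such a weighted combination, mixing BGE-M3’s dense similarity with its lexical (sparse) matching score via the FlagEmbedding interface; the 0.6/0.4 weights and the example texts are arbitrary assumptions, not values tuned or used in our experiments.

```python
# Sketch of a weighted dense + lexical retrieval score with BGE-M3.
# Weights and example texts are illustrative assumptions, not values used in this study.
from FlagEmbedding import BGEM3FlagModel

model = BGEM3FlagModel("BAAI/bge-m3", use_fp16=True)

query = "¿Quién se quedó sin oreja la noche que murió el maestro?"
chapter = "Juan 18 ..."   # placeholder chapter text

q = model.encode([query], return_dense=True, return_sparse=True)
c = model.encode([chapter], return_dense=True, return_sparse=True)

# Dense (semantic) similarity and lexical (exact-token) matching score.
dense_score = float(q["dense_vecs"][0] @ c["dense_vecs"][0])
lexical_score = model.compute_lexical_matching_score(
    q["lexical_weights"][0], c["lexical_weights"][0]
)

# The lexical term rewards exact matches of names and other rare tokens.
w_dense, w_lexical = 0.6, 0.4   # assumed weights
hybrid_score = w_dense * dense_score + w_lexical * lexical_score
print(hybrid_score)
```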
Table 1.
Questions that can be answered with information contained in The Bible.
# | Question | Expected Answer |
---|---|---|
1 | ¿Qué mar fue abierto por Moisés y para qué? | El mar Rojo, para escapar de los egipcios. |
| Which sea was parted by Moses, and for what purpose? | The Red Sea, to escape from the Egyptians. |
2 | ¿Qué ídolo erróneamente veneran los israelitas? | Un becerro de oro. |
| What idol did the Israelites mistakenly worship? | A golden calf. |
3 | ¿Cómo mató David a Goliat? | Con una piedra de su honda. |
| How did David kill Goliath? | With a stone from his sling. |
4 | ¿Quién tuvo el sueño de las vacas gordas y las vacas flacas? | El faraón de Egipto. |
| Who had the dream of fat cows and lean cows? | The Pharaoh of Egypt. |
5 | ¿A quién le dijo Rut las palabras: “donde tú vayas, iré yo; y donde tú vivas, viviré yo; tu pueblo será mi pueblo y tu Dios será mi Dios”? | A su suegra Noemí. |
| To whom did Ruth say the words: “Where you go, I will go; and where you stay, I will stay. Your people will be my people and your God my God”? | To her mother-in-law Naomi. |
6 | ¿Cómo se llamaba el jefe del ejército a quien derrotaron los israelitas bajo el mando de la jueza Débora? | Sísera. |
| What was the name of the army chief defeated by the Israelites under the command of the judge Deborah? | Sisera. |
7 | ¿Quién mató a Holofernes? ¿Cómo? | Judith. Lo decapitó. |
| Who killed Holofernes and how? | Judith. She decapitated him. |
8 | ¿Quién le cortó el cabello a Sansón y por qué? | Su esposa, Dalila, para que perdiera su fuerza. |
| Who stripped Samson of his hair and why? | His wife, Delilah, so that he would lose his strength. |
9 | ¿Qué comían los israelitas en el desierto? | Maná. |
| What did the Israelites eat in the desert? | Manna. |
10 | ¿Quién era Nabucodonosor? | El rey de Babilonia o Asiria. |
| Who was Nebuchadnezzar? | The king of Babylon or Assyria. |
11 | ¿Qué oficio tenía Melquisedec? | Sacerdote. |
| What profession did Melchizedec have? | Priest. |
12 | ¿Qué construyó Noé? ¿De qué escapaba? | Un arca. De un diluvio. |
| What did Noah build? What was he escaping from? | An ark. From a flood. |
13 | ¿De dónde era Ciro? | Ciro era de Persia. |
| Where was Cyrus from? | Cyrus was from Persia. |
14 | ¿Quién tentó a Jesús en el desierto? | Satanás. |
| Who tempted Jesus in the desert? | Satan. |
15 | ¿En qué monte fue crucificado Jesús? | Gólgota. También llamado Calvario. |
| On what mount was Jesus crucified? | Golgotha. Also called Calvary. |
16 | ¿Qué pareja acompañó a Pablo en algunos de sus viajes? | Aquila y Priscila. |
| Which couple accompanied Paul on some of his travels? | Aquila and Priscilla. |
17 | ¿Quién se quedó sin oreja la noche que murió el maestro? | Malco. |
| Who lost his ear the night that Jesus died? | Malchus. |
18 | ¿Quién estaba siendo juzgado junto con Jesús por los romanos? | Barrabás. |
| Who was being tried alongside Jesus by the Romans? | Barabbas. |
19 | ¿Qué le hizo Juan Bautista a Jesús? | Lo bautizó. |
| What did John the Baptist do to Jesus? | He baptized him. |
20 | ¿Por cuántas monedas Judas traiciona a Jesús? | Treinta piezas de monedas de plata. |
| How many coins did Judas betray Jesus for? | Thirty pieces of silver. |
21 | ¿Quién hizo que decapitaran a Juan Bautista? | La hija de Herodías. |
| Who caused John the Baptist to be beheaded? | The daughter of Herodias. |
22 | ¿Qué amigo le escribe dos cartas a Timoteo? | Pablo de Tarso. |
| Which friend wrote two letters to Timothy? | Paul of Tarsus. |
23 | ¿Quién niega a Jesús? ¿Cuántas veces? | Pedro. Tres. |
| Who denies Jesus? How many times? | Peter. Three times. |
24 | ¿A qué hora murió Jesús? | A las tres de la tarde. |
| At what time did Jesus die? | At three in the afternoon. |
25 | ¿En qué ciudad hizo Pablo de Tarso su discurso “Al Dios desconocido”? | Atenas. |
| In what city did Paul of Tarsus give his speech “To the Unknown God”? | Athens. |
26 | ¿En qué fueron grabados los diez mandamientos y cuáles son esos? | Fueron dados en dos tablas de piedra y son: 1. Amarás a Dios sobre todas las cosas. Sólo existe un Dios, creador y todopoderoso, al que adorar. 2. No tomarás el nombre de Dios en vano. 3. Santificarás las fiestas. 4. Honrarás a tu padre y a tu madre. 5. No matarás. 6. No cometerás actos impuros. 7. No robarás. 8. No darás falso testimonio ni mentirás. 9. No consentirás pensamientos ni deseos impuros. 10. No codiciarás los bienes ajenos. |
| On what were the Ten Commandments engraved, and what are they? | They were given on two stone tablets, and they are: 1. You shall love God above all things. There is only one God, creator and almighty, who is to be worshipped. 2. You shall not take the name of God in vain. 3. You shall keep holy the feast days. 4. You shall honor your father and your mother. 5. You shall not kill. 6. You shall not commit impure acts. 7. You shall not steal. 8. You shall not bear false witness nor lie. 9. You shall not consent to impure thoughts or desires. 10. You shall not covet your neighbor’s goods. |
27 | ¿Cuántos hijos tuvo Jacob? ¿Cómo se llamaban? | 12. Rubén, Simeón, Leví, Judá, Dan, Neftalí, Gad, Aser, Isacar, Zabulón, José, Benjamín. |
| How many sons did Jacob have? What were their names? | 12. Reuben, Simeon, Levi, Judah, Dan, Naphtali, Gad, Asher, Issachar, Zebulun, Joseph, Benjamin. |
28 | ¿Cuál era el oficio de Mateo antes de unirse a los seguidores de Jesús? ¿Y de Pedro? | Recaudador de impuestos (publicano), pescador. |
| What was Matthew’s occupation before joining Jesus’ followers? And Peter’s? | Tax collector (publican). Fisherman. |
29 | ¿Cuántos candeleros hay en Apocalipsis y a qué se refiere? | Siete. A las 7 iglesias. |
| How many lampstands are there in Revelation, and what do they refer to? | Seven. To the seven churches. |
30 | ¿A quién se tragó el pez grande? | Jonás. |
| Who was swallowed by the big fish? | Jonah. |
31 | ¿A quién le fue revelado el libro del Apocalipsis? | A Juan. |
| To whom was the book of the Apocalypse revealed? | To John. |
Table 2.
Questions that require more information than that provided by the text to be resolved.
# | Question | Expected Answer |
---|---|---|
32 | ¿Qué libro de la Biblia narra el amor de los esposos? | El Cantar de los Cantares. |
| Which book of The Bible tells of the love between spouses? | The Song of Songs. |
33 | ¿Quién es considerado el autor de los salmos? | El rey David. |
| Who is considered the author of the Psalms? | King David. |
34 | ¿Qué son Isaías, Jeremías, Ezequiel y Daniel? | Profetas. Los profetas mayores. |
| What are Isaiah, Jeremiah, Ezekiel, and Daniel? | Prophets. The major prophets. |
35 | ¿Qué profeta escribió el libro de las Lamentaciones? | Jeremías. |
| Which prophet wrote the book of Lamentations? | Jeremiah. |
36 | ¿Cuál era el más escéptico de los discípulos de Jesús? | Tomás. |
| Who was the most skeptical of Jesus’ disciples? | Thomas. |
5. Methodology/RAG
In this section, the precise workflow of the system is described. In order to correctly implement the aforementioned LLMs, corpus, and questionnaire, a system capable of integrating these components simultaneously, in a standardized and comparable manner, is necessary. As one might expect, a Retrieval-Augmented Generation (RAG) [8] approach was selected, and we explore two of the question-answering tasks from the original RAG paper: open-domain question answering and abstractive question answering.
These experiments evaluate both the LLMs’ question-answering behavior and the question answering itself. By behavior we mean whether the model sticks only to the extracted information and reasons based solely on it, whereas question answering corresponds to the actual answer given by the model. It is important to note that a correct answer should only be obtained from a correct information extraction; if the model bypasses the extracted information and uses prior knowledge, then the behavior is incorrect.
In accordance with this, certain questions in the questionnaire are designed to require a deeper degree of understanding and reasoning rather than just extracting the correct answer from the context. For some questions, the presented context can never yield a correct answer; nonetheless, a context has been extracted and used in order to evaluate behavior in this scenario. The evaluation is described in detail in the following section.
In order to provide a consistent ground for comparison and to remove the noise generated by each model’s own retrieval process, the retrieval stage was conducted manually with standardized embeddings. Questions and chapters were embedded using the same embedding model, after which a similarity-based retrieval was carried out. Using the obtained texts, each LLM was asked to use the retrieved context to answer the corresponding question.
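A minimal sketch of this last step is shown below: the retrieved chapter is wrapped in an instruction that asks the model to answer only from that context. The prompt wording, the generate_answer placeholder, and the example chapter/question pair are assumptions made for illustration, not the exact prompt used in the experiment.

```python
# Sketch of the generation step: constrain the LLM to the retrieved chapter.
# The prompt wording and generate_answer() are hypothetical placeholders; the same
# prompt would be sent to each LLM (Llama 2 Chat, GPT 3.5, PaLM) through its own API.
def build_prompt(question: str, context: str) -> str:
    return (
        "Answer the following question using only the context provided below. "
        "If the context does not contain the answer, say that you cannot answer.\n\n"
        f"Context:\n{context}\n\n"
        f"Question: {question}\nAnswer:"
    )

retrieved_chapter = "Juan 18 ..."   # placeholder: chapter returned by the retrieval stage
question = "¿Quién se quedó sin oreja la noche que murió el maestro?"

prompt = build_prompt(question, retrieved_chapter)
# answer = generate_answer(llm, prompt)   # hypothetical call to the chosen LLM's API
print(prompt)
```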
Figure 1 shows the workflow of the RAG system.
The rest of this section will describe the previous paragraph and diagram in detail, starting with an analysis of the dataset, followed by the text preprocessing, the embedding creation, the embedding evaluation, the answer generation and the answer evaluation.
7. Discussion
The discussion of the results considers two different aspects: retrieval and answer generation. Afterward, an analysis of the answers provided by the system is carried out. In this study, we do not use F1 or accuracy measures for quantitative evaluation; the aspects we aimed to assess, namely the performance and behavior of the models, cannot be evaluated with these metrics.
In the retrieval part, the weighted combination of the three different embedding vectors that BGE-M3 creates retrieves passages containing chapters with relevant information for 23 of the 36 questions.
It is worth noting that the system is capable of retrieving a relevant chapter for question 36. This question is one of the five questions marked as unanswerable (question group 2), since the answer requires information that goes beyond the scope of the text. This suggests that the BGE-M3 model is capable of recognizing relevant information even in truncated scenarios.
Although the chapters themselves do not contain the exact answer, they contain something relevant to the question. The similarities might be lexical (lexical overlap between chapters of The Bible is not rare) or lie in the mathematical representation of the embedding itself; regardless, this embedding model is capable of identifying and pairing them, most of the time appropriately.
The next paragraphs show some examples where this pairing is incorrect and the model falls short; this may be due to the retrieval model’s need for exact matches or to the model’s interpretations.
In question number 2, “¿Qué ídolo erróneamente veneran los israelitas?” (What idol did the Israelites mistakenly worship?), the answer is a golden calf. The retrieved chapter, Isaiah 44, mentions false gods made out of wood, as well as the Israelites, but it does not contain anything about a golden calf. The embedding model is susceptible to interpreting wood as the key element of the chapter and prioritizes an incorrect chapter in this context. We marked the answer to this question as Good for Llama and PaLM because the models’ answers were based on the retrieved context.
This also occurs in question number 17, “¿Quién se quedó sin oreja la noche que murió el maestro?” (Who lost his ear the night that Jesus died?), whose answer is “Malchus”. Our system retrieved Mark 14, which describes the night on which Jesus died and the incident in which “the humble servant of the highest priest loses an ear”; however, it does not mention his name, whereas John 18, in addition to the previous description, does mention the name Malchus. Information discrepancies like this affect the LLM’s answer in an unwanted way. We also marked the answer to this question as Good for Llama and PaLM because the models’ answers were based on the retrieved context.
Some questions are prone to be answered poorly by the models because even an adequate retrieval of a chapter still leaves out relevant information. As an example, consider question number 27, “¿Cuántos hijos tuvo Jacob? ¿Cómo se llamaban?” (How many sons did Jacob have? What were their names?). The chapter that the system retrieved mentions only the first four of Jacob’s children. In this case, Llama 2 Chat answers with only the four children given in the context, whereas GPT 3.5 and PaLM resort to their prior knowledge to answer correctly, which is why we marked them as “Deviated”.
As a final example regarding retrieval, in question 36, “¿Cuál era el más escéptico de los discípulos de Jesús?” (Who was the most skeptical of Jesus’ disciples?), which belongs to the second group, the system retrieved John 20, which recounts the skepticism of Thomas, despite the existence of other chapters in The Bible that depict followers of Jesus expressing skepticism, for instance, Peter when he denies Jesus three times or when he hesitates to walk on water.
Furthermore, there were questions for which the retrieved chapter was correct but the models were nonetheless unable to infer the response from it. This happened to Llama 2 Chat in questions 18, 20, and 21, to GPT 3.5 in questions 18, 28, and 36, and to PaLM only in question 28.
Regarding answer generation, we observe a significant difference in the size of the responses and the vocabulary used by the different models. Naturally, this can be explained by the nature of the models: Llama 2 Chat is optimized for dialogue use cases, unlike PaLM, whose base model is intended only for text generation.
We also observed that some of the responses provided by Llama 2 Chat, characterized by its neutral behavior, were contradictory: the model initially provided the correct answer but subsequently stated that it did not possess the information.
From Table 6 we can observe that providing the model with a relevant chapter to answer the question does not necessarily increase the total number of correct answers. Llama improved from 19 to 23 correct answers. Surprisingly, GPT 3.5 and PaLM do not improve when provided with the correct chapter: GPT 3.5 gives 21 correct answers without chapter information, which drops to 19 with it, while PaLM gives 30 correct answers without chapter information (although seven of those responses were incomplete) and 24 with it.
Figure 2 shows the performance of the models using the RAG methodology, and Figure 3 shows the performance of the models’ neutral behavior.
8. Conclusions
In this work, we employ the RAG methodology to leverage the prior knowledge and capabilities of large language models, specifically Llama 2 Chat, GPT 3.5, and PaLM.
In this experiment, we utilized The Bible as it is a widely known text and is highly likely to be part of the training data of the models. This choice aligns with the research objective of observing model behavior when given a simple instruction, and it facilitates verification of whether the model’s response is based on the provided context, as instructed, or relies on its prior knowledge.
Furthermore, this methodology enables the transfer of knowledge from LLMs to specific domains that require strict adherence to specific information, such as the legal [5], medical, or financial domains, without the need to train or fine-tune a new model. This presents a significant advantage, as training one of these models incurs substantial economic and computational costs. Additionally, fine-tuning requires a sufficiently large and high-quality dataset, as well as considerable computational power, to achieve effective results.
One key insight from this study is the demonstration of these models’ ability to respond to questions based on a given context. The variance in correct answers between models with and without context depends on factors such as information retrieval, model size, and data availability. This inconsistency is one of the reasons why precision and accuracy measures are not particularly helpful in this type of task, and why a different kind of evaluation is recommended.
It should be noted that we cannot guarantee the absence of The Bible from the pretraining data of any of the three models, and this likely explains why, for some of the questions, the models tend to rely on their prior knowledge instead of following the given instructions.
In future work, we aim to test newer models, as well as implement the use of multiple passages within the models, since we observed that some questions require more than one passage or necessitate inferences from multiple passages. Additionally, we plan to utilize a corpus that we can confirm does not belong to the model’s prior knowledge to determine if this approach allows the model to behave according to the provided instructions.
It is important to highlight the significance of having open-source models, since the computational and economic cost of building such models is accessible only to a few of the largest companies. With the RAG methodology, we can leverage these systems while avoiding the need to train a model of this size.