1. Introduction
Open-Domain Question Answering (OpenQA) aims to answer questions without pre-specified domains. OpenQA systems fall into two categories: end-to-end systems, which train the whole system jointly, and pipeline-based systems, which decompose the system into multiple components that are trained separately. Pipeline-based systems are currently more common; most systems developed after 2017 follow the two-stage Retriever-Reader (R2) architecture proposed by Chen et al. [1]. This architecture first retrieves a small subset of passages based on the query, and the reader then extracts or generates answers from the retrieved passages. Consequently, the more relevant the retrieved passages are, the more likely the reader is to extract the correct answer and the better the OpenQA system performs. However, queries are often short, semantically incomplete, and ambiguous, so the model cannot fully understand the query's intent; it then retrieves irrelevant passages, and the reader extracts or generates incorrect answers.
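For concreteness, the R2 pipeline reduces to a two-step loop. The sketch below is a minimal, hypothetical illustration; `retriever` and `reader` are placeholder callables, not components of any particular system.

```python
def retrieve_then_read(query, retriever, reader, k=50):
    """Two-stage Retriever-Reader (R2) pipeline in miniature: the retriever
    narrows the corpus to k candidate passages, then the reader extracts or
    generates an answer from those passages."""
    passages = retriever(query, k)   # stage 1: passage retrieval
    return reader(query, passages)   # stage 2: answer extraction/generation
```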
To improve retrieval effectiveness, we use query expansion. Query expansion reformulates a given query to improve retrieval performance in information retrieval, especially in the context of query comprehension: the initial query is expanded so that it matches additional relevant passages. Traditional query expansion methods include (1) linguistics-based, (2) corpus-based, (3) search-log-based, and (4) network-based approaches. All of these select expansion terms that are semantically similar to the original terms from existing knowledge resources or large corpora. Another important method is Pseudo-Relevance Feedback (PRF), which assumes that the top-ranked passages from the retrieval stage contain relevant signals and leverages these signals to modify the initial query, reducing the impact of lexical mismatch between queries and passages and improving search effectiveness. BERT [2], T5 [3], and other Transformer-based deep language models combined with PRF techniques have significantly improved performance on different search tasks [4,5] compared with traditional PRF methods such as Rocchio, the association model, RM3, and KL expansion models.
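To make the PRF idea concrete, here is a minimal sketch (not the method used later in this paper): it assumes a generic `search_fn` retriever and expands the query with the most frequent content words of its top-ranked passages, in the spirit of RM3-style expansion.

```python
from collections import Counter

def prf_expand(query, search_fn, k_docs=3, k_terms=5, stopwords=frozenset()):
    """Naive pseudo-relevance feedback: treat the top-ranked passages as
    relevant and append their most frequent terms to the query.
    search_fn(query, k) stands in for any first-stage retriever (e.g., BM25)."""
    passages = search_fn(query, k_docs)        # top-k pseudo-relevant passages
    counts = Counter()
    for text in passages:
        counts.update(tok for tok in text.lower().split()
                      if tok.isalpha() and tok not in stopwords)
    expansion = [term for term, _ in counts.most_common(k_terms)]
    return query + " " + " ".join(expansion)   # reformulated query
```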
Query expansion hinges on two choices: the type of expansion terms and the method for selecting them. (1) The expansion terms extracted by the above methods tend to be semantically close to the terms in the original query. However, choosing only semantically related words as expansion terms is insufficient for retrieving the most relevant passages; diverse contextual information can help a query retrieve relevant passages from different, complementary angles. (2) As for the selection method, both traditional statistical methods and modern neural networks can be used to select terms from relevant passages. Recently, generative models have played a significant role in OpenQA, performing exceptionally well as readers and as query expansion generators [6,7]. However, because these models generate in a closed-book manner, the generation process can be challenging. Providing reference materials to the generative model, thereby turning generation "open-book", can speed up the generation process and improve the quality of the generated results [8,9].
In this paper, we propose HTGQE (https://github.com/XY2323819551/TGQE_code, accessed on 27 April 2023), a Hybrid Text Generation-based Query Expansion method that improves retrieval effectiveness. We improve the quality of query expansion terms in two ways: the types of expansion contexts generated and the way they are generated. We use a generation model as the query expansion generator, combining PRF techniques from information retrieval with text generation technologies. By providing "reference materials" to the generation model during text generation, HTGQE, compared with closed-book text generation, reduces the difficulty of generation, accelerates the model's convergence, and improves the quality of the generated text. Furthermore, diverse expansion contexts are highly beneficial for query expansion. Following Mao et al. [10], HTGQE generates three types of expansion contexts for the query: the answer to the question, a sentence containing the answer, and the title of the passage in which the answer is located. In the following, we denote these three contexts as Answer, Sentence, and Title, respectively. Finally, the per-context retrieval results are combined to form the final retrieval results. Comprehensive experiments on NQ and Trivia demonstrate that HTGQE performs well in both passage retrieval and passage reading tasks.
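The overall flow can be summarized in pseudocode. The sketch below is schematic, with hypothetical function names (it is not the interface of the released code): each of the three generators consumes the query concatenated with its pseudo-relevant passages and emits one expansion context; fusion of the per-context runs is covered in Section 3.3.

```python
def htgqe_expand_and_retrieve(query, search_fn, generators, k_prf=3, k=500):
    """Schematic HTGQE flow: (1) fetch pseudo-relevant passages as "reference
    materials", (2) have each generator produce one expansion context
    (Answer / Sentence / Title), (3) retrieve once per expanded query."""
    prf = " ".join(search_fn(query, k_prf))          # top pseudo-relevant passages
    runs = {}
    for name, generate in generators.items():        # {"answer": ..., "sentence": ..., "title": ...}
        context = generate(query + " [SEP] " + prf)  # "open-book" generation
        runs[name] = search_fn(query + " " + context, k)
    return runs                                      # one ranked passage list per context
```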
HTGQE increases the lexical overlap between the expanded query and candidate passages, significantly improving sparse retrieval performance. Furthermore, we combine the enhanced sparse retrieval results with dense retrieval results; because the two are complementary, this yields better retrieval accuracy and question-answering performance. Moreover, HTGQE is a plug-and-play method that can be applied to any existing R2 architecture without altering the structure of the retriever or reader. Because the method is independent of the specific retriever and reader structures, the benefits of query expansion can be obtained without fine-tuning or retraining either component.
Contributions: (1) HTGQE integrates large language models with PRF techniques, providing "reference materials" to the generation model, which makes generation faster and the generated content higher in quality. (2) HTGQE uses the query's Answer, Title, and answer-bearing Sentence as diversified contexts, expanding the initial query from different perspectives so that the retrieval results complement each other. (3) Compared to the closed-book text generation method GAR, HTGQE achieves improved EM scores on both NQ and Trivia with either extractive or generative readers.
The Materials and Methods section introduces the framework of the HTGQE method, the datasets, evaluation metrics, related model structures, and hyperparameter settings for the experiments. The Experimental Results and Analysis section presents detailed experiments that validate HTGQE, including passage retrieval experiments, passage reading experiments, a discussion of fusion strategies, an investigation of the optimal number of expansion terms, and case analyses. The Discussion section covers some limitations of the study. Finally, the Conclusions section concludes the paper.
3. Experimental Results and Analysis
This section presents comprehensive experiments to validate the proposed HTGQE. Section 3.1 deals with the passage retrieval task based on the HTGQE approach; Section 3.2, building upon the retrieval results of Section 3.1, investigates the advantages of HTGQE in OpenQA; Section 3.3 explores and discusses various fusion strategies for different retrieval results; Section 3.4 examines the impact of the number of generated expansion terms on retrieval accuracy; finally, to better illustrate the process and rationale of HTGQE, Section 3.5 presents and analyzes several case studies.
3.1. Passage Retrieval with HTGQE
This section highlights the effectiveness of HTGQE. In addition to BM25 and BM25+RM3 [10], we compare GAR [10] with HTGQE. As shown in Table 2, we examine the Top-k retrieval accuracy of both sparse and hybrid retrieval methods. In sparse retrieval, the GAR result is a fusion of GAR(Answer), GAR(Sentence), and GAR(Title); the specific fusion method is described in Section 3.3. HTGQE follows the same principle.
For sparse retrieval, GAR and HTGQE significantly outperform the initial BM25 and BM25+RM3, demonstrating the effectiveness of query expansion based on generative models. Under all conditions, HTGQE's Top-k retrieval accuracy surpasses that of GAR, indicating that with proper "reference materials," the accuracy of open-domain generative models can be significantly improved; HTGQE uses the top three pseudo-relevant passages from the initial retrieval results as these reference materials. On NQ/Trivia, GAR's Top-5 and Top-20 accuracy are 14.7%/2.4% and 9.7%/1.3% higher than BM25, respectively, while HTGQE's Top-5 and Top-20 retrieval accuracy are 8.8%/5.5% and 3.6%/1.9% higher than GAR, respectively. Furthermore, regardless of how many passages are retrieved, HTGQE consistently outperforms GAR, and the gap between them narrows as k increases. In contrast, the classic QE method RM3 improves slightly over plain BM25 but does not perform comparably to HTGQE or GAR. According to Table 2, the effectiveness of the three query contexts varies, with two of the contexts being consistently more effective than the third. However, the fused results are higher than any individual retrieval result, suggesting that merging the query contexts yields better retrieval. The complementary nature of the different generation-enhanced queries is also discussed in Section 3.4.
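For reference, the Top-k retrieval accuracy reported in Table 2 follows the standard formulation: a question counts as a hit if any of its top-k retrieved passages contains a gold answer string. A sketch, assuming passages and answers are plain strings:

```python
def top_k_accuracy(runs, gold_answers, k):
    """Fraction of questions whose top-k retrieved passages contain at least
    one gold answer string. runs[i] is the ranked passage list for question i;
    gold_answers[i] is the list of acceptable answers."""
    hits = sum(
        any(ans.lower() in passage.lower()
            for passage in passages[:k] for ans in answers)
        for passages, answers in zip(runs, gold_answers)
    )
    return hits / len(gold_answers)
```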
For hybrid retrieval, as seen in Table 2, the retrieval accuracy of HTGQE+DPR is higher than that of GAR+DPR and BM25+DPR. On NQ, HTGQE+DPR's Top-5/Top-20 accuracy are 9.4%/4.5% and 3.5%/1.5% higher than BM25+DPR and GAR+DPR, respectively. On Trivia, HTGQE+DPR's Top-5/Top-20 accuracy are 3.1%/2.0% and 0.9%/0.3% higher than BM25+DPR and GAR+DPR, respectively. The specific fusion method is described in Section 3.3.
3.2. Passage Reading with HTGQE
To evaluate whether the HTGQE approach improves OpenQA, we conducted experiments comparing the open-book HTGQE with the closed-book GAR model using the retrieval results from Section 3.1. Since the strong performance of GAR was already established in a previous study [10], this section aims to verify that HTGQE is more effective than GAR. It is important to note that neither the extractive nor the generative reader used in this paper has been retrained, so they do not reach the results reported in the GAR paper [10].
As shown in Table 3, the top 50 candidate passages from the retrieval results are selected as inputs to the reader. HTGQE outperforms GAR in both extractive and generative settings. With the extractive reader, the former surpasses the latter by 4.57% and 6.35% on NQ and Trivia, respectively; with the generative reader, HTGQE exceeds GAR by 14.34% and 17.09%, respectively. This further demonstrates that text generation incorporating PRF techniques is more effective, because useful reference materials are available during the generation process. Moreover, the diverse contextual environments also contribute significantly to the experimental results.
Furthermore, although HTGQE focuses on improving sparse retrieval, the new sparse retrieval results yield EM scores that outperform many other neural methods, even those that take 50, 100, or more passages as reader input. However, since we directly use existing reader checkpoints and do not train the readers on the new retrieval results, the readers have not yet learned from those results, so a gap remains between our method and SOTA methods. Retraining the reader on the new retrieval results would be expected to bring a significant further improvement.
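For clarity, the EM scores above follow the standard SQuAD-style exact-match definition. A common implementation is sketched below, under the usual normalization assumptions (lowercasing, punctuation and article removal, whitespace collapsing).

```python
import re
import string

def normalize(text):
    """SQuAD-style answer normalization."""
    text = text.lower()
    text = "".join(ch for ch in text if ch not in string.punctuation)
    text = re.sub(r"\b(a|an|the)\b", " ", text)
    return " ".join(text.split())

def exact_match(prediction, gold_answers):
    """EM = 1 if the normalized prediction equals any normalized gold answer."""
    return float(any(normalize(prediction) == normalize(g) for g in gold_answers))
```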
3.3. Fusion Strategies for Retrieval Results
In this section, we explore how best to fuse the three types of retrieval results to achieve improved retrieval accuracy. We experimented with four fusion methods: Answer+Sentence, Answer+Title, Sentence+Title, and Answer+Sentence+Title (for example, Answer+Sentence denotes the combination of the results retrieved using Answer and using Sentence as query expansion terms). According to Table 4, the overall performance is best when fusing the results of all three types, with one of the pairwise fusions ranking second. This shows that the retrieval results obtained using Answer, Sentence, and Title as query expansion terms are complementary.
Furthermore, this paper employs a uniform fusion approach, in which an equal number of candidate passages is extracted from each retrieval result to form a new candidate passage set. For instance, in a two-context fusion such as Answer+Sentence, for each query we extract the top 500 candidate passages from the results retrieved with the Answer-expanded query and the top 500 from the results retrieved with the Sentence-expanded query, and intermix the two sets to form the final retrieval results. This fusion approach is relatively simple; other fusion methods, such as reciprocal rank fusion, could also be considered.
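Both options can be sketched as follows: `uniform_fuse` mirrors the interleaving described above, and `rrf_fuse` illustrates the reciprocal-rank alternative (the constant `c=60` is a conventional choice from the RRF literature, not a value from this paper).

```python
def uniform_fuse(runs, k_each=500):
    """Uniform fusion: take the same number of candidates from each run and
    interleave them, keeping only the first occurrence of each passage."""
    fused, seen = [], set()
    for tier in zip(*(run[:k_each] for run in runs)):
        for passage in tier:
            if passage not in seen:
                seen.add(passage)
                fused.append(passage)
    return fused

def rrf_fuse(runs, c=60):
    """Reciprocal rank fusion: score each passage by the sum of 1/(c + rank)
    over every run in which it appears, then sort by score."""
    scores = {}
    for run in runs:
        for rank, passage in enumerate(run, start=1):
            scores[passage] = scores.get(passage, 0.0) + 1.0 / (c + rank)
    return sorted(scores, key=scores.get, reverse=True)
```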
3.4. Exploring the Optimal Number of Expansion Terms
In this paper, we use three query expansion generators to produce different contextual information for the initial query. However, there is no definitive answer regarding the optimal number of expansion terms per context; the previous experiments all used a default of one expansion term. This section explores the optimal number of expansion terms for the various contexts, with experiments setting the number to 1, 2, 3, 4, and 5.
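One common way to obtain n expansion outputs per context is beam search with multiple return sequences. The sketch below assumes a Hugging Face-style seq2seq generator and is illustrative only, since the paper does not specify its decoding configuration.

```python
def generate_n_contexts(query_with_prf, tokenizer, model, n=5):
    """Generate n candidate expansion contexts from one generator via beam
    search (num_return_sequences must not exceed num_beams)."""
    inputs = tokenizer(query_with_prf, return_tensors="pt", truncation=True)
    outputs = model.generate(
        **inputs,
        num_beams=max(n, 5),
        num_return_sequences=n,
        max_new_tokens=64,
    )
    return [tokenizer.decode(seq, skip_special_tokens=True) for seq in outputs]
```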
On NQ, as shown in Figure 2, we investigated the impact of different numbers of expansion terms on Top-k retrieval accuracy. The four graphs in Figure 2 show a fairly consistent trend: for one of the contexts, performance is better with 2 expansion terms (orange line); for a second, performance is relatively better with 1 expansion term (light blue line); and for the third, performance is better with 1 or 5 expansion terms (light blue or blue line). Lastly, we fused the three individual retrieval results at each expansion-term count to obtain a mixed result; overall, the best result is achieved with 5 expansion terms.
As shown in Figure 3, we observed a fairly consistent trend on Trivia: for one of the contexts, performance is relatively good with 2 or 5 expansion terms (orange and blue lines, respectively); for a second, performance is comparatively better with 1 expansion term (light blue line); and for the third, the optimal number of expansion terms is 1 or 5. Lastly, fusing the three individual retrieval results at each expansion-term count achieves the overall best performance with 5 expansion terms. Although the number of expansion terms noticeably affects the retrieval accuracy of each individual context, the gap is greatly alleviated after combining the three contexts, further highlighting their complementarity.
In summary, the best results are achieved with 5 expansion terms, although performance with 1, 2, 3, or 4 terms is close to optimal. Additionally, too many expansion terms lead to an excessively long query, which can reduce retrieval speed. The choice between 1 and 5 expansion terms therefore depends on the application: to maximize accuracy, set the number of expansion terms to 5; if inference speed is the priority, a single expansion term is the better choice.
3.5. Case Study
To facilitate a better understanding of the HTGQE process and its underlying ideas, we present three examples in Table 5. Correct answers are highlighted in green, incorrect ones in red, and the correct reference answers are enclosed in blue brackets. For each query, we concatenate it with the top three pseudo-relevant passages from the initial retrieval results and feed them into the different query expansion generators, which produce the three expansion contexts Answer, Sentence, and Title. Finally, we use the three generated contexts as expansion terms for the initial query, conduct retrieval, and fuse all retrieval results. In the first example, the target appears in only one of the three contexts; in the second example, one context misses the target but, fortunately, the other two provide it; and in the final example, again only a single context provides the correct target. While not every context provides the correct target, using three contexts increases the probability of finding it, and all three remain relevant to the query. This demonstrates that the different query contexts complement each other, reducing noise during query context generation and improving the final retrieval accuracy.
5. Conclusions
In this work, we proposed HTGQE, which improves retrieval effectiveness by combining generation models with PRF techniques to generate diverse query expansion contexts. The pseudo-relevant passages serve as reference material for the generation model, enriching the model's input while reducing the burden on the generative model and improving the quality of the generated terms. The various query expansion contexts complement each other, reducing noise during query context generation and improving the final retrieval accuracy. Notably, compared to the closed-book text generation method GAR, HTGQE achieves EM score improvements of 4.57% and 6.35% on NQ and Trivia, respectively, with extractive readers, and of 14.34% and 17.09%, respectively, with generative readers. In addition, HTGQE is a plug-and-play approach that can be applied to existing R2 systems independently of the specific structures of the retriever and reader models, allowing easy and quick integration. Our method still has limitations, however: adding expansion terms to the initial query improves the retrieval effectiveness of the QA system compared to a standard R2 system, but it also slows inference, albeit within an acceptable range.