1. Introduction
In the field of information retrieval (IR), the core task is to efficiently and accurately mine information that aligns with user needs from vast document collections. However, achieving this goal in practical applications presents several challenges. One key challenge is that users tend to issue short, direct queries, which makes it difficult for traditional retrieval models to fully capture the user’s true intent.
Pseudo-Relevance Feedback (PRF) [1,2] methods perform automatic query expansion (QE): they extract keywords from feedback documents to serve as expansion terms, effectively alleviating vocabulary mismatch. These methods have been shown to play a significant role in sparse retrieval models [3,4,5,6,7]. However, traditional PRF approaches select expansion terms primarily by statistical frequency or term occurrence counts within documents and overlook semantic information. This limitation restricts their performance in complex retrieval tasks.
In recent years, the rise of large language models (LLMs) and dense retrieval models has provided new solutions to these challenges. LLMs, leveraging their vast knowledge and context-aware capabilities, can generate multiple optimized queries, modeling the importance of terms and enhancing retrieval performance. At the same time, dense retrieval models (such as ColBERT [8], ANCE [9], ColBERTv2 [10], and ColBERT-PRF [11]) utilize pre-trained language models (e.g., BERT [12]) to capture deep semantic relationships between queries and documents. By mapping both queries and documents into the same high-dimensional vector space, these models enable a precise understanding of complex query requirements. Dense retrieval models are also proficient in utilizing contextual information, which allows them to perform well in handling long documents and multi-modal information retrieval tasks.
One current research focus involves leveraging LLMs for QE. These methods improve retrieval performance by analyzing the original query and generating synonym expansions, hypernym/hyponym expansions, and semantically related query variants. However, Breuer [13] found that generating multiple query variants with LLMs may lead to query drift, resulting in the loss of some semantic information, which in turn affects the comprehensiveness and accuracy of retrieval results.
To address this issue, inspired by Pan et al. [5], who utilized kernel functions to handle word co-occurrence frequencies and long document processing, we propose a Large Language Model-based QE and Gaussian Kernel Semantic-Enhanced Dense Retrieval Model (LSDRGs). The model combines optimized queries generated by LLMs with a Gaussian kernel semantic space to capture deep semantic relationships between queries. It further integrates the semantic distribution of query relevance, thereby enhancing the semantic consistency between the optimized and original queries and improving the comprehensiveness and precision of retrieval performance.
Our work is driven by the following key objectives:
(1) Leveraging LLMs for Query Expansion: We systematically investigate how LLMs utilize their extensive knowledge bases and contextual understanding to generate multiple optimized queries. This process enriches the original query and enhances retrieval performance, particularly in complex or ambiguous query scenarios.
(2) Developing a Dense Retrieval Model Enhanced with a Gaussian Kernel: We design and implement a dense retrieval model that incorporates Gaussian kernel functions to capture deep semantic relationships, mitigating query drift and improving QE effectiveness.
(3) Evaluating the Effectiveness of LSDRGs: We conduct comprehensive experiments to assess the performance of the LSDRGs model across different datasets, validating the advantages of LSDRGs in improving retrieval precision and relevance.
The main contributions of this paper are as follows:
(1) Integrating LLMs with Dense Retrieval Technology: This approach combines the knowledge base and context-aware capabilities of LLMs with the deep learning techniques of dense retrieval models. By generating multiple optimized queries, it enriches the initial query and enhances retrieval performance. While the process of query generation involves certain computational resources, the method’s effectiveness in improving retrieval quality makes it a valuable extension to traditional retrieval systems.
(2) Introducing a Novel Query Enhancement Method: The LSDRGs model utilizes Gaussian kernel functions to construct a semantic space that captures deep semantic relationships between the original and optimized queries, addressing issues such as semantic drift and information loss that may arise during query generation.
(3) Practical Deployment of the Model: In the practical deployment of the model, this method only extends the query without requiring task-specific fine-tuning of the LLMs, thus avoiding redundant document computations. This strategy significantly reduces inference time and improves system efficiency, providing a more effective solution for large-scale information retrieval tasks.
2. Related Work
2.1. Dense Retrieval
Dense retrieval models differ from traditional BERT-based re-rankers using “cross-encoders” [14,15,16] in that they typically adopt a BERT-based “dual encoder” architecture, offering significant advantages in retrieval efficiency and scalability. In a dual encoder architecture, queries and documents are encoded separately into dense vector representations, enabling the efficient use of vector search algorithms during retrieval. Dense retrieval models are generally classified into two categories: single-representation dense retrieval models and multi-representation dense retrieval models [17]. Within the single-representation paradigm, models such as DPR [18] and ANCE [9] encode each query or document as a single dense vector. Because document representations can be pre-computed, single-representation models can quickly locate relevant documents via efficient nearest-neighbor search (e.g., retrieval frameworks based on vector indexing technologies). This approach offers significant advantages in retrieval speed but may have limitations in capturing complex semantic relationships due to its reliance on a single vector representation.
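As a minimal sketch of the single-representation idea, the following Python snippet ranks pre-computed document vectors by cosine similarity to one query vector using an exhaustive search. The function names and the toy three-dimensional vectors are illustrative assumptions; production systems would typically replace the exhaustive loop with an approximate nearest-neighbor index.

```python
import math

def dot(u, v):
    """Inner product of two equal-length vectors."""
    return sum(a * b for a, b in zip(u, v))

def normalize(v):
    """Scale a vector to unit length (zero vectors are returned unchanged)."""
    norm = math.sqrt(dot(v, v))
    return [x / norm for x in v] if norm > 0 else list(v)

def top_k(query_vec, doc_vecs, k=2):
    """Rank pre-computed document vectors by cosine similarity to the query."""
    q = normalize(query_vec)
    scored = [(dot(q, normalize(d)), doc_id) for doc_id, d in doc_vecs.items()]
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return [doc_id for _, doc_id in scored[:k]]

# Toy vectors, illustrative values only (not taken from any real encoder).
index = {"d1": [1.0, 0.0, 0.0], "d2": [0.7, 0.7, 0.0], "d3": [0.0, 0.0, 1.0]}
ranking = top_k([0.9, 0.1, 0.0], index, k=2)
```

Since each document is a single vector, the whole relevance judgment collapses into one similarity value per document, which is the source of both the speed advantage and the limited granularity discussed above.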
In contrast, multi-representation dense retrieval models differ from single-representation models by encoding each token within the query and document as separate dense vectors, enabling the capture of finer-grained semantic information. For example, ColBERT [8] performs an approximate nearest-neighbor search on each embedding in the query and document, followed by precise scoring to achieve efficient and high-precision retrieval performance. This “late interaction” mechanism effectively balances computational efficiency and semantic capture capability. As an improved version of ColBERT, ColBERTv2 [10] adopts more advanced training methods, including optimized contrastive learning strategies and model fine-tuning techniques, further enhancing retrieval performance. Additionally, ColBERTv2 introduces residual compression technology to significantly reduce storage costs, making it more practical and efficient for large-scale retrieval tasks. Given its strong performance in both semantic expression and system efficiency, we have chosen ColBERTv2 as the foundational retrieval model to validate the effectiveness of the method proposed in this study.
2.2. Query Expansion
QE [19] is a popular paradigm for improving effectiveness in IR and is widely adopted in IR applications [20]: the original query is expanded with additional context so that it better matches target documents. Methods such as PRF are widely used to mitigate vocabulary mismatch through QE. Early studies used the initially retrieved documents as pseudo-relevance feedback [2,3,4,5,6,7,21], extracting relevant content as supplementary information. However, the effectiveness of these methods is limited by the quality of the initial retrieval. Recent advancements in generative language models have demonstrated their ability to produce relevant responses based on given prompts.
2.3. Large Language Models
Recently, LLMs and prompt engineering [22,23] have made significant progress, as exemplified by models such as LLAMA [24]. LLM-enhanced information retrieval [25,26,27,28,29] has become a prominent area of research, where LLMs are used to generate query expansions by leveraging their inherent knowledge. For example, HyDE [30] uses LLMs to directly generate hypothetical documents answering the query, which are then used to retrieve similar actual documents through their embeddings. Query2Doc [31] improves the quality of QE by providing LLMs with a few examples. Jagerman et al. [32] also explored chain-of-thought prompting as a method for QE. To address the potential lack of domain-specific knowledge in LLMs, Shen et al. [33] proposed a retrieval-enhanced method that generates QE using LLMs and fine-tunes them using pre-trained domain-specific models. Breuer [13] highlighted that excessive QE could lead to a decrease in retrieval effectiveness for certain topics and pointed out the potential topic drift caused by synthetic queries. Therefore, this paper primarily focuses on addressing the issue of potential topic drift in QE.
2.4. Kernel Function
In early studies, De Kretser and Moffat [34] proposed a locality-based similarity measure that utilized four contribution functions (i.e., triangle, cosine, circular, and arc functions) to evaluate the similarity between each query term and other positions. Subsequently, some kernel functions were employed to estimate the influence of query term occurrences [35]. Specifically, kernels that satisfy certain properties, such as Gaussian, triangular, cosine, circular, quartic, Epanechnikov, and Triweight functions, were introduced. These studies proposed that when two query terms are closer, they have higher co-occurrence values. Based on this theory, Pan et al. [5] introduced a kernel co-occurrence framework that uses kernel functions to capture the relationships between query terms and expanded terms.
Inspired by this line of work, we propose a dense retrieval method that combines QE based on LLMs and Gaussian kernel-based semantic enhancement. In the semantic vector space, we leverage the Gaussian semantic space to pull together semantically similar optimized queries and push away queries with potential topic drift, effectively addressing the issue of latent topic shift.
3. Proposed Method
In this section, we present the innovative IR method proposed in this study, which integrates LLMs (such as LLAMA 3 8B [36]), Gaussian kernel functions, and traditional dense retrieval techniques. The core advantage of this approach lies in its ability to leverage the semantic understanding of queries provided by LLMs while effectively addressing the common issue of topic drift during multi-query retrieval by incorporating the Gaussian kernel function. This method enables the construction of optimized queries with multi-dimensional and rich semantics, further enhancing retrieval performance. Experimental results show that, when tested on two TREC datasets, the proposed method demonstrates significant performance improvements across various evaluation metrics.
Figure 1 illustrates the overall workflow of integrating LLMs and kernel-based semantic strategies into the ColBERTv2 retrieval framework. The process consists of the following key stages:
(1) Query Expansion: First, LLAMA 3 8B is used to expand the original query by generating a series of optimized queries with supplemental semantics. These expanded queries aim to capture multiple facets of the original query’s meaning.
(2) Query Encoding: Using the Query Encoder module built into the ColBERTv2 framework, all the generated optimized queries are transformed into their corresponding vector representations, resulting in query vectors.
(3) Similarity Computation: The Euclidean distance between the original query and its optimized queries’ vector representations is computed. This step quantifies the similarity between the queries in high-dimensional space, helping assess the level of alignment between the expanded queries and the original intent. Unlike cosine similarity, which focuses on vector directionality, Euclidean distance captures the absolute differences between the queries, which is more effective in this context for evaluating the semantic shifts during query expansion.
(4) Kernel-based Weighting: The computed distances are then mapped to a kernel function, which is used to explore deeper semantic relationships between the queries. This approach captures more complex inter-query relationships and enhances the semantic consistency between the generated and original queries, thereby improving the quality and accuracy of the retrieval results.
(5) Retrieval and Ranking: The weighted query vectors are then used by ColBERTv2’s late interaction mechanism to retrieve and rank relevant documents from the corpus. The kernel-based weighting enhances both semantic fidelity and retrieval accuracy.
3.1. LLM-Based Query Expansion
The process begins by providing the initial query $q$ to LLAMA 3 8B. Based on fine-tuned instructions, the LLM then generates a series of $n$ independent rewritten versions, aiming to enhance the semantic depth and breadth of the original query through diverse formulations, ultimately forming a set of optimized queries. The method for generating optimized queries is given by Equation (1):
\[ \{q_1, q_2, \ldots, q_n\} = \mathrm{LLM}(q, p) \tag{1} \]
where $\mathrm{LLM}(q, p)$ refers to the process of inputting the original query $q$ into the LLAMA 3 8B model in order to generate a series of optimized queries $\{q_1, q_2, \ldots, q_n\}$. Here, $p$ represents the prompt specifically set for the task, determining the number of optimized queries $n$ to be generated, thereby enriching the semantic coverage of the original query. Each optimized query $q_i$ is a full-text natural language sentence. We adopt this formulation because our downstream retrieval model ColBERTv2 requires each query input to be in natural language format for effective contextualized embedding.
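The expansion step can be sketched as follows. This is a minimal illustration, not the authors' implementation: the instruction text in `build_prompt` is a hypothetical stand-in for the actual prompt, `generate` is an assumed callable that wraps whatever LLM is used, and `fake_llm` exists only so that the snippet runs offline.

```python
from typing import Callable, List

def build_prompt(query: str, n: int) -> str:
    # Hypothetical instruction text; the real prompt template is not reproduced here.
    return (
        f"Rewrite the following search query in {n} different ways, "
        f"one natural-language sentence per line.\nQuery: {query}"
    )

def expand_query(query: str, n: int, generate: Callable[[str], str]) -> List[str]:
    """Return up to n distinct rewritten queries produced by `generate`."""
    raw = generate(build_prompt(query, n))
    rewrites: List[str] = []
    for line in raw.splitlines():
        # Drop list numbering such as "1. " or "- " (this would also strip
        # leading digits from a rewrite that genuinely starts with a number).
        text = line.strip().lstrip("0123456789.-) ").strip()
        if text and text.lower() != query.lower() and text not in rewrites:
            rewrites.append(text)
    return rewrites[:n]

def fake_llm(prompt: str) -> str:
    # Stand-in for a real model call; returns a numbered list with one duplicate.
    return (
        "1. What causes ocean tides?\n"
        "2. Why do sea levels rise and fall daily?\n\n"
        "2. Why do sea levels rise and fall daily?"
    )

variants = expand_query("tides cause", n=10, generate=fake_llm)
```

Deduplicating and discarding verbatim copies of the original query keeps the optimized set informative; fewer than n variants may be returned when the model repeats itself.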
3.2. ColBERTv2 Encoding
After obtaining the optimized query set, we use the pre-trained ColBERTv2 query and document encoders to encode the queries and documents separately. The ColBERTv2 query and document encoders share weights but are differentiated by the special prefix tokens [Q] and [D]. For an input query $q$, as shown in Equation (2), the query encoder encodes it into a list of query embeddings, each of dimension $m$. If the original query is shorter than 32 tokens, the “[MASK]” embedding is used to pad the input query to a length of 32. For a document $d$, as shown in Equation (3), the document encoder encodes it into a list of document embeddings, each of dimension $m$, where $|d|$ represents the length of the document.
\[ E_q = \mathrm{Encoder}_Q\big([Q], q_1, q_2, \ldots, q_l, [\mathrm{MASK}], \ldots, [\mathrm{MASK}]\big) \tag{2} \]
\[ E_d = \mathrm{Encoder}_D\big([D], d_1, d_2, \ldots, d_k\big) \tag{3} \]
where $E_q$ represents the query vector with a dimension of $32 \times m$ ($l \le 32$), and $E_d$ represents the document vector with a dimension of $|d| \times m$. The tokens [Q] and [D] are special prefix tokens. $q_1, q_2, \ldots, q_l$ represent the constituent words of query $q$, and, similarly, $d_1, d_2, \ldots, d_k$ represent the constituent words of document $d$. $l$ refers to the number of words in the query, and $k$ refers to the number of words in the document.
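The input layout described above can be illustrated with a small sketch. It assumes whitespace-level tokens rather than real WordPiece tokens, and the helper names `prepare_query_tokens` and `prepare_doc_tokens` are hypothetical; only the layout (prefix marker, then tokens, then [MASK] padding for queries) follows the description in this section.

```python
QUERY_MAXLEN = 32  # fixed query length used by the query encoder

def prepare_query_tokens(tokens, maxlen=QUERY_MAXLEN):
    """Prefix the query with [Q] and pad it with [MASK] up to maxlen tokens."""
    seq = ["[CLS]", "[Q]"] + list(tokens)
    seq = seq[: maxlen - 1] + ["[SEP]"]  # truncate long queries, keep room for [SEP]
    return seq + ["[MASK]"] * (maxlen - len(seq))

def prepare_doc_tokens(tokens):
    """Prefix the document with [D]; documents are not padded with [MASK]."""
    return ["[CLS]", "[D]"] + list(tokens) + ["[SEP]"]

query_input = prepare_query_tokens(["what", "causes", "tides"])
doc_input = prepare_doc_tokens(["tides", "are", "caused", "by", "the", "moon"])
```

Because every query is brought to the same length, the query embedding list always has 32 rows, while the number of document rows varies with document length.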
3.3. Gaussian Kernel Semantic Space
We acknowledge that, although the original query best reflects the user’s initial intent, some of the optimized queries generated through QE techniques may exhibit a certain degree of semantic shift. To address this issue, this study introduces the concept of a Gaussian kernel semantic space.
In the encoding process described in Section 3.2, we convert the QE set into a set of query vectors. We then reduce each pair of query embeddings to a scalar Euclidean distance, as shown in Equation (4), and apply the Gaussian kernel function presented in Equation (5) to quantify the similarity of these vectors in high-dimensional space. To provide an intuitive understanding of these formulations, Figure 2 offers a graphical illustration that corresponds to Equations (4) and (5). Moreover, to evaluate whether the Gaussian kernel semantic enhancement method actually addresses the query drift problem, we run a comparative variant, denoted LSDRD, that replaces the Gaussian kernel with a simple descent function [37], as described in Equation (6).
\[ d_i = \sqrt{\sum_{j=1}^{L} \left\lVert E_{q_i}^{(j)} - E_{q}^{(j)} \right\rVert^{2}} \tag{4} \]
\[ K(u) = \exp\!\left(-\frac{u^{2}}{2\sigma^{2}}\right) \tag{5} \]
where $d_i$ represents the Euclidean distance between the optimized query $q_i$ and the original query $q$, and $K(\cdot)$ is the kernel function, with $u$ denoting the distance. In this paper, Euclidean distance is used, and $L$ refers to the embedding length of the query, which does not exceed 32. $\sigma$ is a hyperparameter used to balance the semantic relationship between the original query and the optimized queries.
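A minimal sketch of the distance-then-kernel step is given below. It assumes the standard Gaussian form exp(-u^2 / (2 sigma^2)) and a Euclidean distance taken over all entries of two equally shaped token-embedding matrices; the function names and the toy embeddings are illustrative assumptions.

```python
import math

def euclidean_distance(a, b):
    """Distance between two equally shaped token-embedding matrices (lists of rows)."""
    return math.sqrt(
        sum((x - y) ** 2 for row_a, row_b in zip(a, b) for x, y in zip(row_a, row_b))
    )

def gaussian_kernel(u, sigma):
    """Standard Gaussian kernel: 1.0 at distance 0, decaying as u grows."""
    return math.exp(-(u ** 2) / (2.0 * sigma ** 2))

def kernel_weights(original, variants, sigma):
    """One weight per optimized query, based on its distance to the original query."""
    return [gaussian_kernel(euclidean_distance(original, v), sigma) for v in variants]

# Toy 2x2 "embeddings" (illustrative values only).
original = [[0.0, 0.0], [0.0, 0.0]]
variants = [
    [[0.0, 0.0], [0.0, 0.0]],  # identical to the original -> distance 0
    [[3.0, 4.0], [0.0, 0.0]],  # distance 5
    [[6.0, 8.0], [0.0, 0.0]],  # distance 10
]
weights = kernel_weights(original, variants, sigma=5.0)
```

Variants that sit far from the original query in embedding space receive weights close to zero, which is how the kernel suppresses rewrites that have drifted; a larger sigma flattens the curve and lets distant variants contribute more.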
3.4. Retrieval
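Here, we use the method proposed by ColBERTv2. Based on the obtained query embedding $E_q$ and document embedding $E_d$, the final similarity score $s(q, d)$ between the query and the document is calculated as the sum, over all query embeddings, of the highest cosine similarity between that query embedding and the document embeddings, as shown in Equation (7):
\[ s(q, d) = \sum_{i=1}^{|E_q|} \max_{j = 1, \ldots, |E_d|} E_{q_i}^{\top} E_{d_j} \tag{7} \]
where $\max$ represents the highest cosine similarity (the embeddings are L2-normalized, so the inner product equals the cosine similarity). $E_{q_i}$ denotes an embedding of a query term within the query $q$, and, similarly, $E_{d_j}$ denotes an embedding of a document term within the document $d$.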
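The late-interaction score can be written in a few lines. The sketch below assumes that all embeddings are already unit-normalized, so that the inner product equals the cosine similarity; the function names and the toy two-dimensional vectors are illustrative.

```python
def dot(u, v):
    """Inner product of two equal-length vectors."""
    return sum(a * b for a, b in zip(u, v))

def maxsim_score(query_embs, doc_embs):
    """Sum, over query embeddings, of the best-matching document embedding similarity."""
    return sum(max(dot(q, d) for d in doc_embs) for q in query_embs)

# Unit-length toy embeddings (illustrative values only).
query_embs = [[1.0, 0.0], [0.0, 1.0]]
doc_embs = [[1.0, 0.0], [0.6, 0.8]]
score = maxsim_score(query_embs, doc_embs)
```

Here the first query embedding matches the first document embedding exactly (similarity 1.0) and the second is best matched by the second document embedding (similarity 0.8), so the score is their sum.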
3.5. Semantic-Enhanced Dense Retrieval
To address the issues of query and topic drift, we calculated the distance relationships between the original query embedding and the optimized query embeddings produced by ColBERTv2. Additionally, we introduced a Gaussian kernel function to balance the semantic relationship between the original and optimized queries. Subsequently, the ColBERTv2 model was used to perform dense retrieval. The specific computation for our proposed method, LSDRGs, is given by Equation (8). The specific computation for the comparison experiment, LSDRD, is provided in Equation (9).
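To make the role of the kernel weights concrete, the following sketch shows one plausible way of combining per-query retrieval scores. It assumes a kernel-weighted sum in which the original query contributes with weight 1 and each optimized query with its kernel weight; this combination rule, the function name `fuse_scores`, and the toy scores are assumptions for illustration, and the exact formulation is the one defined by Equations (8) and (9).

```python
def fuse_scores(original_scores, variant_scores, weights):
    """Combine document scores from the original and optimized queries.

    original_scores: {doc_id: score} for the original query (weight 1, assumed).
    variant_scores:  list of {doc_id: score}, one dict per optimized query.
    weights:         one kernel weight per optimized query.
    Returns document ids sorted by fused score, best first.
    """
    fused = dict(original_scores)
    for weight, scores in zip(weights, variant_scores):
        for doc_id, score in scores.items():
            fused[doc_id] = fused.get(doc_id, 0.0) + weight * score
    return sorted(fused, key=fused.get, reverse=True)

# Toy scores (illustrative values only).
ranking = fuse_scores(
    original_scores={"d1": 1.0, "d2": 0.9},
    variant_scores=[{"d2": 1.0}, {"d3": 2.0}],
    weights=[0.5, 0.01],  # second variant has drifted, so its weight is tiny
)
```

In this toy case the drifted variant scores d3 highly, but its small weight keeps d3 at the bottom of the fused ranking, while the well-aligned variant lifts d2 above d1.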
4. Experimental Settings
4.1. Selection of the Large Language Model
In this study, we selected the LLAMA 3 8B model for its balance between computational efficiency and retrieval performance. Alternatives such as GPT-4, GPT-3, GPT-3.5 [38], and Mistral 7B [39] are available, and the larger of these models offer more advanced capabilities, but their increased number of parameters significantly impacts inference speed and computational cost. For tasks requiring high retrieval efficiency, the added complexity of these models can result in delays and excessive resource consumption, as illustrated in Table 1.
The LLAMA 3 8B, with 8 billion parameters and an 8k token context length, provides sufficient capacity to handle retrieval tasks effectively while maintaining low inference costs and computational demands. This makes it an ideal choice for our study, where the goal is to maximize retrieval performance without sacrificing efficiency.
While models with larger parameter sizes, such as GPT-4 (170 billion parameters), offer superior performance, they come with a significant trade-off in terms of resource consumption. In contrast, LLAMA 3 8B achieves a strong balance between performance and resource efficiency, ensuring it can handle complex queries while remaining within a reasonable computational budget. This efficiency is essential for the scalability of the retrieval tasks addressed in our study, making LLAMA 3 8B a highly suitable choice for our experiments.
4.2. Datasets and Evaluation Metrics
In the experimental setup, the datasets used are the TREC Deep Learning 2019 [40] (abbreviated as TREC DL 2019) and 2020 [41] (abbreviated as TREC DL 2020) passage retrieval datasets. The TREC DL 2019 test set includes 43 queries, and the TREC DL 2020 passage retrieval test set includes 54 queries. The relevance judgments for both datasets are rated on a scale from 0 (irrelevant) to 3 (highly relevant). During the evaluation phase, we followed the official evaluation standards of each track and reported the key performance metrics on the TREC 2019 and TREC 2020 query sets. These metrics include Mean Reciprocal Rank at 10 (MRR@10), Recall at 1000 (Recall@1000), Normalized Discounted Cumulative Gain at 10 (NDCG@10), and Mean Average Precision (MAP). To ensure consistency and rigor in the evaluation, we adopted the same approach as in previous studies [11], where passages with a relevance label of 1 are treated as irrelevant when computing the binary-relevance metrics.
It is important to note that both TREC DL 2019 and TREC DL 2020 are built upon the MS MARCO passage ranking dataset, where the queries are collected from real-world search logs and formulated in natural language. As such, they can be categorized as fully-semantic queries, which do not include explicit logical operators such as AND, OR, or NOT.
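For readers who want to see how the graded labels enter the metrics, the sketch below computes the reciprocal rank at 10 with labels binarized at 2 (so label 1 counts as irrelevant) and NDCG@10 on the graded labels. Linear gain is assumed for NDCG (some tools use an exponential gain instead), and the function names and the toy ranking are illustrative. MRR@10 is the mean of the per-query reciprocal ranks.

```python
import math

def reciprocal_rank_at_k(ranked_labels, k=10, min_rel=2):
    """1/rank of the first result whose label is at least min_rel, else 0."""
    for rank, label in enumerate(ranked_labels[:k], start=1):
        if label >= min_rel:
            return 1.0 / rank
    return 0.0

def dcg(labels):
    """Discounted cumulative gain with linear gain and a log2(rank + 1) discount."""
    return sum(label / math.log2(rank + 1) for rank, label in enumerate(labels, start=1))

def ndcg_at_k(ranked_labels, judged_labels, k=10):
    """DCG of the ranking divided by the DCG of the ideal ordering of all judged labels."""
    ideal = dcg(sorted(judged_labels, reverse=True)[:k])
    return dcg(ranked_labels[:k]) / ideal if ideal > 0 else 0.0

# Relevance labels of the returned passages, in rank order (toy example).
run = [1, 0, 3, 2]
rr = reciprocal_rank_at_k(run)
ndcg = ndcg_at_k(run, judged_labels=run)
```

In the toy ranking, the passage at rank 1 carries label 1 and is therefore ignored by the reciprocal rank, which is determined by the label-3 passage at rank 3; NDCG, by contrast, still credits the label-1 passage with a small gain.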
4.3. Hyperparameter Settings
In this paper, we selected the ColBERTv2 pre-trained model as the dense retrieval backbone and followed the configuration standards reported in the original implementation [10]. Specifically, we utilized the checkpoint at training step 150,000 from the official release and retained the architectural settings defined in the model’s configuration file. The key hyperparameters of this configuration are 12 transformer layers, each with 12 attention heads, a hidden size of 768, an intermediate size of 3072, and dropout rates for attention and hidden layers both set to 0.1. The model uses GELU activation and absolute positional embeddings, and supports a maximum input length of 512 tokens. These settings are aligned with BERT-base and ensure compatibility with ColBERT’s late interaction design.
To handle tasks that involve complex and detailed text generation, particularly in scenarios requiring precise query rewriting, we introduce LLAMA 3 8B. For this model, we specifically adjusted the maximum sequence length (Max Length) to 2096. Figure 3 shows the prompt $p$ supplied to the large language model, where we refer to Ivica Kostric’s research [42] to set n to 10. The prompt template was carefully designed and iteratively refined through a multi-stage prompt engineering process, involving comparative testing of different instruction phrasings and output formats. We evaluated various prompt variants in terms of semantic fidelity, structural consistency, and retrieval effectiveness, and selected the most robust prompt that consistently guided the model to generate semantically enriched queries.
To ensure stability and controllability in the generated text, we set the temperature parameter to a very low value of 0.001. This configuration minimizes randomness in the generated content, ensuring that the output remains stable and closely aligned with the input context. Additionally, recognizing the differences between various documents and datasets, we acknowledge that static smoothing parameters (such as commonly used fixed values like {1, 10, 100}) may not be the most suitable for all cases. Therefore, we adopted a dynamic adjustment strategy, varying the smoothing parameter $\sigma$ from $0.1\mu$ to $\mu$ with a step size of $0.1\mu$, where $\mu$ represents the maximum Euclidean distance.
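The dynamic range for the smoothing parameter can be generated as in the sketch below. It assumes that the grid runs from 0.1 times the maximum Euclidean distance up to that maximum, in ten equal steps; the function name `sigma_grid` and the example distances are illustrative.

```python
def sigma_grid(mu, steps=10):
    """Candidate smoothing values 0.1*mu, 0.2*mu, ..., mu for a given maximum distance mu."""
    return [round(mu * i / steps, 10) for i in range(1, steps + 1)]

# Example: distances between the original query and its optimized variants.
distances = [0.4, 1.3, 2.0]
grid = sigma_grid(max(distances))
```

Tying the grid to the observed maximum distance means the same relative settings (for example, 0.7 of the maximum) remain comparable across queries and datasets whose embedding distances differ in scale.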
5. Experimental Results Analysis
5.1. Comparison with the Sparse Models
To validate the effectiveness of the proposed method, this paper designs a series of comparative experiments, comparing it with various sparse models. The goal is to comprehensively evaluate the performance of LSDRGs across different scenarios. Specifically, we have selected the following representative models as baseline comparisons:
(a) BM25 [43]: A classical sparse retrieval model, widely used in information retrieval tasks, known for its simplicity and efficiency.
(b) BM25 + RM3: A QE method based on relevance language models. It estimates a term probability distribution from pseudo-relevant documents and expands the query with the highest-probability terms.
(c) BM25 + Rocchio: A QE method based on the vector space model (VSM). It generates new query representations by adjusting the direction and length of the query vector, combining features from the initial query and pseudo-relevant documents.
(d) BM25 + BERT: Combines the preliminary retrieval results of BM25 with the re-ranking capability of BERT to improve retrieval accuracy.
(e) BM25 + ColBERT [11]: Based on the preliminary retrieval of BM25, it further optimizes the ranking using the ColBERT model.
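As an illustration of the vector-space expansion used by the Rocchio baseline, the sketch below moves the query vector toward the centroid of the pseudo-relevant documents. It shows the common positive-feedback-only form; the function name and the values of `alpha` and `beta` are illustrative assumptions rather than the settings used in the experiments.

```python
def rocchio(query_vec, feedback_vecs, alpha=1.0, beta=0.75):
    """Return alpha * query + beta * centroid of the pseudo-relevant document vectors."""
    dim = len(query_vec)
    centroid = [sum(v[i] for v in feedback_vecs) / len(feedback_vecs) for i in range(dim)]
    return [alpha * q + beta * c for q, c in zip(query_vec, centroid)]

# Toy term-weight vectors over a three-term vocabulary (illustrative values only).
expanded = rocchio([1.0, 0.0, 0.0], [[0.0, 1.0, 0.0], [0.0, 1.0, 1.0]], alpha=1.0, beta=0.5)
```

The expanded vector gains weight on terms that occur in the feedback documents but not in the original query, which is how the method addresses vocabulary mismatch, and also why a poor initial retrieval can pull the query off topic.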
We conducted a comprehensive comparison of the LSDRGs model with classical sparse retrieval models using the official TREC metrics, including MAP, NDCG@10, MRR@10, and Recall@1000. Using BM25 as the baseline model, we further quantified the performance improvement of other models relative to BM25. The experimental results in Table 2 clearly demonstrate that re-ranking methods based on pre-trained language models exhibit significant advantages across several key metrics.
The traditional BM25 sparse retrieval model efficiently ranks relevant documents through exact keyword matching between queries and documents. This approach has the advantage of speed and efficiency when applied to large document collections. However, its reliance on lexical matching limits its ability to understand semantics. For example, BM25 struggles to capture synonyms or contextual semantic relationships, which constrains the comprehensiveness and accuracy of the retrieval results. Furthermore, when dealing with long-tail queries or low-frequency terms, the performance of sparse models can degrade significantly, making it difficult to effectively bridge the semantic gap between queries and documents. Our proposed model provides an effective solution to these challenges.
As shown in Table 2, compared to the baseline model BM25, both LSDRD and LSDRGs achieved significant improvements across various metrics on the TREC 2019 and TREC 2020 datasets. In particular, for the LSDRGs model, the improvements in MAP, NDCG@10, and Recall@1000 were 83.41%, 52.65%, and 19.86% (TREC 2019), and 87.96%, 57.71%, and 15.77% (TREC 2020), respectively. Additionally, LSDRD and LSDRGs outperformed other re-ranking models in MRR@10. Notably, while LSDRD performed well across all metrics, it did not achieve the same overall performance as LSDRGs, further validating the superiority of the Gaussian kernel function in capturing deeper semantic information.
In summary, the LSDRGs model not only significantly improves the precision of retrieval results but also demonstrates strong capabilities in handling complex semantic queries. By combining the query rewriting power of LLMs with Gaussian kernel semantic enhancement techniques, LSDRGs achieves efficient semantic retrieval in dense vector space. This approach effectively retains the semantic dimensions of the original user query, expands the query’s semantics using LLMs, and mitigates query drift by introducing the Gaussian kernel function, resulting in a significant enhancement in retrieval performance.
5.2. Comparison with Dense Retrieval Models
To validate the effectiveness of the proposed method, we designed a series of comparative experiments to compare LSDRGs with various dense retrieval models. The goal is to comprehensively evaluate the performance of LSDRGs across different scenarios. Specifically, we selected the following representative models as baselines for comparison:
(a) ColBERT E2E [8]: The end-to-end dense retrieval version of ColBERT, which directly generates query and document representations from raw text, avoiding the limitations of traditional two-stage retrieval.
(b) ANCE [9]: A dense retrieval model with a single representation that optimizes query and document representations through adaptive negative sample selection.
(c) SBERT: Sentence-BERT (SBERT) [44,45] enhances BERT-based models by generating semantically rich sentence embeddings using a Siamese network architecture.
(d) uniCOIL [43]: A framework combining the pre-trained language model (doc2query-T5) and sparse representations, capable of capturing both lexical and semantic relevance.
(e) ColBERTv2 [10]: An improved version of the ColBERT model, which incorporates more advanced training strategies and model architecture, further enhancing retrieval performance.
(f) ANCE-PRF [44]: ANCE-PRF is a method that combines PRF with the dense retrieval model ANCE. By leveraging the powerful semantic understanding and generation capabilities of pre-trained language models, ANCE-PRF improves upon traditional PRF methods, demonstrating higher accuracy and robustness when handling ambiguous or complex queries.
(g) DistilBERT Balanced Average [44]: A retrieval method based on the lightweight DistilBERT model that improves retrieval stability and accuracy by balancing the importance of different features and performing weighted averaging on multiple embedding vectors.
(h) DistilBERT Balanced Rocchio [44]: This method combines DistilBERT with the classic Rocchio algorithm, dynamically adjusting the weights of query vectors to optimize query representation.
(i) CWPRF-AAAT [46]: Contextualized Word Pseudo-Relevance Feedback with Adaptive Attention and Transformation (CWPRF-AAAT) is a method that combines contextual awareness, PRF, adaptive attention mechanisms, and transformation techniques.
(j) CWPRF-OAAT [46]: An improved version of CWPRF-AAAT, emphasizing optimal attention mechanisms and global optimization transformation techniques.
(k) ColBERT-PRF Ranker [11]: A ranking method that combines the ColBERT model with PRF.
We conducted a comprehensive comparison of the LSDRGs model against classic dense retrieval models based on TREC official evaluation metrics, including MAP, NDCG@10, MRR@10, and Recall@1000. Using ColBERT E2E as the baseline model, we further quantified the performance improvements of other models relative to ColBERT E2E. Analyzing the results in Table 3, we observed a significant trend: the ColBERTv2 dense retrieval method demonstrated superior performance compared to other dense retrieval methods like ANCE and ColBERT, mainly due to its stronger semantic capture ability.
Our proposed LSDRGs model leverages this powerful semantic capture capability and combines it with large language models and Gaussian kernel functions. The experimental results show that, compared to the baseline model, LSDRGs and LSDRD achieve significant improvements across all metrics on the TREC 2019 and TREC 2020 datasets. Specifically, for the LSDRGs model, improvements in MAP, NDCG@10, MRR@10, and Recall@1000 reached 27.98%, 11.35%, 9.98%, and 13.93% (TREC 2019), and 15.34%, 10.09%, 1.38%, and 10.41% (TREC 2020), respectively.
Compared to other pseudo-relevance feedback dense retrieval models (e), (f), (g), (i), and (j), our proposed LSDRGs and LSDRD models also showed significant improvements across all metrics. Notably, since our model operates with only a single round of retrieval, both time efficiency and retrieval accuracy have been enhanced. The main reason for this phenomenon lies in the fact that we enhance the semantic dimensions of the original query through LLMs, providing semantic expansion, and combine this enriched semantic information with a Gaussian kernel function. This approach effectively avoids the loss of semantic information. The enhanced query representation not only improves retrieval recall but also significantly boosts the relevance of the retrieval results.
The enhanced query representation retains the semantic intent of the original query while integrating more relevant background knowledge and potential semantics, making the retrieval results more aligned with user needs. Moreover, this method effectively addresses the potential semantic loss and query drift that might occur when generating queries with LLMs, further improving the accuracy and effectiveness of the retrieval.
5.3. Parameter Sensitivity Analysis
To evaluate the robustness of the proposed model, we conducted an in-depth analysis of the key factors influencing its performance. This section focuses on the impact of the hyperparameter σ on the model’s performance. In the experimental setup, we vary σ starting from 0.1 with a step size of 0.1.
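To make the role of this hyperparameter concrete, the sketch below evaluates a Gaussian kernel between two query embeddings for several values of σ. It rests on assumptions: this section does not restate the kernel formula, so the standard Gaussian (RBF) form exp(-||x - y||² / (2σ²)) is assumed, σ is taken to be the kernel bandwidth, and the function name and the toy three-dimensional embeddings are hypothetical.

```python
import math
from typing import Sequence


def gaussian_kernel(x: Sequence[float], y: Sequence[float], sigma: float) -> float:
    """Standard Gaussian (RBF) kernel: exp(-||x - y||^2 / (2 * sigma^2))."""
    if sigma <= 0:
        raise ValueError("sigma must be positive")
    if len(x) != len(y):
        raise ValueError("vectors must have the same dimension")
    sq_dist = sum((a - b) ** 2 for a, b in zip(x, y))
    return math.exp(-sq_dist / (2.0 * sigma ** 2))


# Hypothetical 3-dimensional embeddings, for illustration only.
original_query = [0.2, 0.5, 0.1]
expanded_query = [0.4, 0.1, 0.3]

# Values of sigma drawn from the range discussed in this section.
for sigma in (0.1, 0.6, 0.7, 0.8):
    k = gaussian_kernel(original_query, expanded_query, sigma)
    print(f"sigma={sigma:.1f}  kernel={k:.4f}")
```

With this form of the kernel, a larger σ makes the similarity decay more slowly with distance, so expanded queries that lie farther from the original query keep more weight. This is one plausible reading of why overly large values could admit query drift.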
Figure 4 presents the detailed evaluation results of the proposed method on the TREC 2019 and TREC 2020 datasets, covering four key metrics: MAP, MRR@10, Recall@1000, and NDCG@10.
Overall, all metrics exhibit a rise-then-fall trend across both datasets. This trend indicates that our model is effective in capturing high-dimensional semantic information, but, when the value of σ becomes too large, it may lead to query drift, thus affecting the model’s performance. As the value of σ increases, MAP first rises and then declines, with the optimal value of σ around 0.7. This suggests that, at this value, the model can more accurately capture the relevance of documents and effectively mitigate query drift. The trend of MRR@10 is similar to that of MAP, with the optimal value also around 0.7. This indicates that, at this value, the model is able to rank relevant documents more accurately, enhancing the user’s retrieval experience. Recall@1000 reaches its highest value when σ is around 0.8. This shows that, at this value, the model is more capable of recalling relevant documents comprehensively, especially in large-scale document collection scenarios. NDCG@10 performs best when σ is around 0.6. This implies that, at this value, the model can more effectively balance document relevance and ranking order, ensuring that highly relevant documents are ranked higher.
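As a reminder of what two of these rank-sensitive metrics measure, the following minimal sketch computes the reciprocal rank at a cutoff (MRR@10 is its mean over all queries) and NDCG@10 for a single ranked list. It is an illustrative assumption rather than the evaluation code used in the experiments: the function names are hypothetical, and NDCG is shown in its linear-gain form, whereas some evaluation tools use an exponential gain of 2^rel - 1 instead.

```python
import math
from typing import Sequence


def reciprocal_rank_at_k(relevant: Sequence[bool], k: int = 10) -> float:
    """Reciprocal rank of the first relevant result within the top k, else 0."""
    for rank, is_rel in enumerate(relevant[:k], start=1):
        if is_rel:
            return 1.0 / rank
    return 0.0


def dcg_at_k(gains: Sequence[float], k: int = 10) -> float:
    """Discounted cumulative gain with linear gain and a log2(rank + 1) discount."""
    return sum(g / math.log2(rank + 1) for rank, g in enumerate(gains[:k], start=1))


def ndcg_at_k(ranked_gains: Sequence[float],
              all_judged_gains: Sequence[float],
              k: int = 10) -> float:
    """DCG of the ranking divided by the DCG of the ideal ordering of judged gains."""
    ideal = dcg_at_k(sorted(all_judged_gains, reverse=True), k)
    return dcg_at_k(ranked_gains, k) / ideal if ideal > 0 else 0.0


# Hypothetical graded judgments for one query, in the order the system ranked them.
ranked = [1, 3, 0, 2]
print(reciprocal_rank_at_k([g > 0 for g in ranked]))
print(ndcg_at_k(ranked, ranked))
```

Because NDCG@10 rewards placing the most relevant documents at the very top while Recall@1000 only counts how many relevant documents appear anywhere in the list, it is unsurprising that the two metrics can peak at slightly different values of σ.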
In conclusion, the hyperparameter σ has a significant impact on the performance of the LSDRGs method, with slight variations in the optimal value across different evaluation metrics. Based on the analysis above, we recommend setting σ within the range from 0.6 to 0.8 for practical applications to ensure optimal performance across multiple metrics.
5.4. Discussion and Limitations
Building upon the significant performance improvements achieved by the proposed LSDRGs method, we further provide an in-depth discussion from two perspectives: comparative analysis and limitations.
First, based on the experimental results, LSDRGs consistently outperforms both classic sparse retrieval models (e.g., BM25, BM25 + RM3, BM25 + Rocchio) and dense retrieval models (e.g., ColBERT, ANCE, CWPRF) on the TREC 2019 and TREC 2020 benchmark datasets. Notably, our method achieves superior results across four key evaluation metrics: MAP, NDCG@10, MRR@10, and Recall@1000. These results demonstrate that our approach, which integrates LLMs for query expansion and employs Gaussian kernels for semantic enhancement, is capable of more effectively extracting semantic information from queries. Furthermore, LSDRGs significantly improves both the robustness and accuracy of the retrieval system. Compared to using BM25 or traditional PRF methods alone, LSDRGs shows stronger capabilities in handling challenging retrieval scenarios, such as long-tail queries, complex semantic expressions, and low-frequency terms.
Importantly, the types of queries considered in our experiments are predominantly fully semantic in nature, often comprising natural language questions or short keyword-based intents without formal logical structures or connectives. This aligns with the characteristics of the TREC Deep Learning Track query sets, where logical composition (e.g., Boolean operators or nested expressions) is generally absent. Thus, although vector-based semantic models (such as ColBERT and LLM-based expansions) are known to struggle with logical reasoning, their semantic approximation remains suitable and effective for the query types we focus on. In this context, the superiority of LSDRGs can be attributed to its ability to capture nuanced semantic features rather than formal logical relationships. When compared to existing PRF-enhanced dense retrieval models (e.g., CWPRF-OAAT and ColBERT-PRF Ranker), LSDRGs demonstrates a superior ability to capture deep semantic features, owing to the high-quality query expansion texts generated by LLMs.
The proposed method also presents several notable limitations. First, LLMs may introduce content that is not entirely aligned with the user’s original intent during query expansion. Although we mitigate the query drift issue through the use of a Gaussian kernel function, it remains fundamentally difficult to fully eliminate deviations from the intended retrieval target. Moreover, as LLMs are large-scale pretrained models, their output quality is highly dependent on the coverage and quality of the training data. When faced with queries that are highly domain-specific or expressed in extremely sparse language, the generation quality may degrade, potentially reducing the generalizability of the expanded queries. Second, LSDRGs relies on large pretrained models during both the training and inference phases, resulting in higher computational and time costs. This poses challenges for deployment in resource-constrained environments. Finally, while the incorporation of Gaussian kernels improves the precision of semantic matching to a certain extent, the method is highly sensitive to the hyperparameters of the kernel function. Improper parameter settings may diminish the benefits gained from semantic enhancement.
6. Conclusions and Future Work
This paper proposes LSDRGs, a dense retrieval model that integrates LLM-based query expansion with Gaussian kernel semantic enhancement, aiming to alleviate the query drift problem and improve retrieval accuracy and robustness. By leveraging the generative capabilities of LLMs and the fine-grained semantic representation power of the Gaussian kernel, the model enhances the semantic expressiveness of queries while maintaining their alignment with the user’s original intent.
Extensive experiments conducted on the TREC 2019 and TREC 2020 datasets demonstrate the superiority of LSDRGs over both sparse and dense baselines. Compared with the classical sparse retrieval model BM25, LSDRGs achieves improvements of 83.41% in MAP, 52.65% in NDCG@10, and 19.86% in Recall@1000 on TREC 2019, and 87.96% in MAP, 57.71% in NDCG@10, and 15.77% in Recall@1000 on TREC 2020. Furthermore, LSDRGs outperforms recent dense retrieval models, including ColBERTv2, CWPRF-OAAT, and ColBERT-PRF Ranker. For instance, compared to ColBERT, LSDRGs achieves relative improvements of 27.98% in MAP and 13.93% in Recall@1000 on TREC 2019, highlighting its effectiveness in capturing deeper semantic relationships.
These results validate that our method not only significantly enhances retrieval performance across multiple metrics but also maintains strong generalizability across different retrieval scenarios. By alleviating query drift and leveraging the Gaussian kernel semantic space, we effectively quantify the similarity between the original query and its optimized queries, ensuring that optimized queries remain aligned with the user’s true intent, thus improving the relevance and accuracy of retrieval results.
Although the LSDRGs model has demonstrated excellent performance in the current experiments, future research can further enhance its applicability by exploring the following directions:
Optimizing small-scale large language models: Our initial exploration revealed that smaller models, such as LLAMA 3.2 1B and LLAMA 3.2 3B, struggled to generate optimized queries effectively, whereas LLAMA 3 8B exhibited sufficient capability for this task. This suggests that model size plays a crucial role in query optimization. To address this, future research can focus on training and refining smaller parameter models to strike a balance between semantic expressiveness and computational efficiency, making large language model-based retrieval systems more accessible and scalable.
Exploring more complex semantic enhancement mechanisms: While the current Gaussian kernel function effectively captures semantic relationships between queries, it may still have limitations when handling more complex semantic structures. Future work could consider incorporating additional types of kernel functions or integrating other deep learning techniques, such as graph neural networks, to better model the semantic relationships between queries and documents.
Expanding to multimodal retrieval: With the increasing prevalence of multimedia content, information retrieval tasks are gradually extending beyond pure text to encompass images, videos, and other modalities. Future work could apply the LSDRGs model to multimodal retrieval scenarios, integrating visual and textual information to build more comprehensive and intelligent retrieval systems.