Article

Enhancing Large Language Models with RAG for Visual Language Navigation in Continuous Environments

Xiaoan Bao, Zhiqiang Lv and Biao Wu
1 School of Computer Science and Technology (School of Artificial Intelligence), Zhejiang Sci-Tech University, Hangzhou 310018, China
2 The School of Science, Zhejiang Sci-Tech University, Hangzhou 310018, China
* Author to whom correspondence should be addressed.
Electronics 2025, 14(5), 909; https://doi.org/10.3390/electronics14050909
Submission received: 15 January 2025 / Revised: 19 February 2025 / Accepted: 21 February 2025 / Published: 25 February 2025

Abstract

The task of Visual Language Navigation in Continuous Environments (VLN-CE) aims to enable agents to comprehend and execute human instructions in real-world environments. However, current methods frequently face challenges such as insufficient knowledge and difficulties in obstacle avoidance when performing VLN-CE tasks. To address these challenges, this paper proposes a navigation approach guided by a Retrieval-Augmented Generation (RAG) large language model (RAG-Nav). By constructing a navigation knowledge base, we leverage RAG to enhance the inputs to the LLM, enabling the generation of more precise navigation plans. Furthermore, we introduce a Prompt-Enhanced Obstacle Avoidance (PEOA) strategy to improve the flexibility and robustness of agents in complex environments. Experimental results indicate that our method not only increases the navigation accuracy of agents but also enhances their obstacle avoidance capabilities, achieving 2% and 2.32% increases in success rate on the public datasets R2R-CE and RxR-CE, respectively.

1. Introduction

In recent years, Visual Language Navigation (VLN) [1] has emerged as a cutting-edge research area in artificial intelligence, garnering significant attention from the research community. VLN aims to integrate natural language processing (NLP) and computer vision (CV) technologies, enabling agents to recognize environmental information and understand natural language instructions for autonomous navigation. In traditional VLN settings, the environment is often abstracted as a graph structure, where the navigation task involves selecting one node from a set of neighboring nodes at each time step. However, this approach assumes ideal conditions for navigation between locations and nodes, which does not accurately represent real-world scenarios. To address these limitations and create more realistic navigation tasks, ref. [2] introduced the concept of Visual Language Navigation in Continuous Environments (VLN-CE). In this setting, agents operate in a 3D space, navigating to any accessible point by executing low-level navigation actions (e.g., moving forward 0.25 m or rotating 15 degrees). This shift allows for a more accurate representation of real-world navigation challenges.
Recent research on VLN-CE has increasingly focused on leveraging large language models (LLMs), which enhance the agents’ ability to understand complex instructions and effectively guide navigation. Ref. [3] proposed a zero-shot agent based on the Generative Pre-Trained Transformer (GPT), utilizing the capabilities of GPT-4 [4] for navigation decision-making on the R2R dataset. Another study [5] introduced the Navigational Chain of Thought (NavCoT) strategy, which employs parameter-efficient, domain-specific training to enable LLMs to make self-guided navigation decisions, thereby mitigating the domain gap. NavGPT further enhances navigation by taking visual observations, navigation history, and textual descriptions of future navigable directions as inputs to infer the agent’s current state and make informed decisions to approach the target. Despite these advancements, the reliance on LLMs’ internal knowledge alone has resulted in insufficient grounding for developing navigation plans, and there has been limited focus on improving the agents’ obstacle avoidance capabilities.
Previous research has shown that agents often become trapped by obstacles or move too close to walls in complex environments, which not only reduces task completion efficiency but also limits their potential for real-world applications. To enhance agents’ obstacle avoidance abilities, SafeVLN [6] was introduced, employing simulated 2D LiDAR occupancy masks to avoid predicting waypoints in obstacle areas. This method utilizes a “reselect after collision” strategy to prevent agents from becoming trapped in a loop of continuous collisions. Another study [7] presented a new controller that assists agents in escaping deadlocks through a trial-and-error heuristic, effectively minimizing performance losses due to sliding prohibitions. These approaches primarily focus on decision-making after encountering obstacles, with less emphasis on preemptively preventing collisions.
To overcome these limitations, we propose a novel visual language navigation approach for continuous environments that enhances large language models through Retrieval-Augmented Generation (RAG). Our approach improves the accuracy and robustness of agents in continuous environments by constructing a navigation knowledge base and using RAG for knowledge retrieval, which is then combined with the language understanding capabilities of LLMs. Additionally, we introduce a Prompt-Enhanced Obstacle Avoidance Strategy (PEOA), which integrates LLMs with structured prompts to facilitate better obstacle avoidance and generate fine-grained navigation plans.
In summary, our work makes the following key contributions:
1. We introduce the use of RAG to retrieve knowledge from a navigation knowledge base, enhancing LLMs to generate more accurate navigation plans.
2. We introduce a Prompt-Enhanced Obstacle Avoidance Strategy that leverages LLMs with structured prompts to improve obstacle avoidance and fine-grained navigation.
3. Extensive experiments on publicly available datasets (R2R-CE and RxR-CE) demonstrate the superiority of our approach, achieving success rate improvements of at least 2% and 2.32% over baseline models, respectively.

2. Related Work

2.1. Vision and Language Navigation

Research on Visual Language Navigation (VLN) [1,8,9,10] has garnered increasing attention from scholars. Datasets such as R2R [1] and RxR [11] provide researchers with rich resources to explore the interaction between natural language and visual environments, thereby advancing technological progress in AI regarding the understanding of complex instructions, precise localization, and autonomous navigation. Early studies primarily focused on discrete environments, such as those simulated using the Matterport3D simulator [12]. In these settings, agents observe panoramic RGB images and can teleport step-by-step to target nodes based on natural language instructions. Basic agents typically incorporate cross-modal attention modules for cross-modal alignment, along with LSTM and transformer-based networks to model contextual history and decode sequences of navigation actions. They are trained using a hybrid approach that combines reinforcement learning and imitation learning [13,14,15].
To achieve more precise navigation planning in continuous environments, researchers have proposed a series of methods that improve navigation from various perspectives [5,6,7,16,17,18]. Among these, integrating contemporary large language model (LLM) technology with VLN has emerged as a particularly popular approach. Ref. [5] presents a strategy known as the Navigational Chain-of-Thought (NavCoT) to address the domain gap issue encountered when utilizing LLMs for VLN tasks. The core idea of NavCoT is to enable LLMs to perform self-guided navigation reasoning through parameter-efficient, domain-specific training, thereby simplifying the decision-making process and enhancing interpretability. Additionally, ref. [18] introduces a novel planning framework called AO-Planner, which aims to tackle some of the challenges and gaps faced by LLMs in VLN tasks.
Multimodal transformers, such as VL-BERT [19] and Flamingo [20], have also shown remarkable success in VLN tasks by jointly modeling visual and textual information. VL-BERT integrates visual and linguistic features through a unified transformer architecture, enabling the model to perform cross-modal reasoning effectively. Flamingo leverages large-scale pretraining on multimodal datasets to achieve state-of-the-art performance in tasks requiring visual and textual understanding. However, these methods often rely heavily on large-scale pretraining, and may struggle with generalization to unseen environments or instructions. In contrast, retrieval-augmented generation (RAG)-based methods, such as ours, focus on dynamically retrieving and utilizing task-specific knowledge, providing greater flexibility and adaptability in complex navigation scenarios. For example, our method outperforms VL-BERT and Flamingo in handling ambiguous instructions by leveraging external knowledge to disambiguate user intent.

2.2. Large Language Models

Since the introduction of BERT by Google in 2018 [21], Large Language Models (LLMs) such as GPT-2 [22] and GPT-3 [23] have demonstrated remarkable language understanding and generation capabilities through unsupervised learning on vast amounts of text data, achieving outstanding results on various NLP benchmarks. Simultaneously, the application of LLMs in VLN is flourishing, with researchers exploring the integration of language understanding with visual perception, enabling agents to perform navigation tasks in complex environments based on natural language instructions. Related studies have investigated the use of LLMs to generate executable action plans, and LLM-based planning methods have been proposed for few-shot planning in embodied agents [23]. Furthermore, the Navigational Chain-of-Thought (NavCoT) strategy aids LLMs in self-guiding navigation decisions through efficient in-domain training, effectively mitigating domain gap issues [5].
Recent advancements have explored the application of LLMs in robotic navigation. For instance, study [24] demonstrates the use of pre-trained models for navigation tasks. Similarly, ref. [25] investigates the feasibility of using LLMs for path planning. However, these studies primarily focus on the general application of LLMs in navigation tasks and do not delve deeply into the specific challenges of obstacle avoidance in continuous environments.
Currently, the application of RAG [26] in LLMs has become a prominent trend. RAG models process queries and utilize techniques such as semantic search to retrieve relevant materials from associated text databases. They then leverage generative models to integrate this information with the input, providing richer contextual support for navigation tasks. Consequently, RAG-enhanced LLMs exhibit tremendous potential in visual language navigation within continuous environments, further improving the accuracy and effectiveness of navigation strategies.

2.3. Collision in Autonomous Navigation

In the field of robotic navigation, collision avoidance research has garnered significant attention [27,28,29]. For instance, imitation learning, a popular approach in this domain, has been employed in visual navigation [30] to guide agents with expert experience in effectively avoiding collisions. However, in the VLN-CE context, the challenges posed by longer navigation paths, complex environments, and additional requirements for instruction alignment hinder the acquisition of expert actions necessary for collision handling, thus limiting the applicability of imitation learning methods. ETPNav [7] represents a recent study that addresses collisions within VLN-CE. It designs a trial-and-error heuristic controller that prevents agents from becoming trapped in collision zones by randomly selecting actions from a predefined action set. Ref. [6] introduces a classification of collisions in VLN-CE for the first time, defining three types: waypoint collisions, navigation collisions, and dynamic collisions. It also proposes the Safe-VLN algorithm, which includes two key components, a waypoint predictor and a re-selection navigator, to enhance the collision avoidance capabilities of the agent.
There are various approaches to enable effective obstacle avoidance for agents. Inspired by the application of prompts in large language models (LLMs), this paper primarily focuses on the role of prompts in enhancing obstacle avoidance capabilities and introduces the PEOA module to further improve this ability.

3. Proposed Approach

In this paper, we propose a navigation method guided by the RAG large language model, referred to as RAG-Nav. The entire process is illustrated in Figure 1. First, we construct a navigation knowledge base containing general conceptual knowledge related to areas and objects. Next, we utilize LLM_ext to analyze the input instructions I, extracting two sets of parameterized data, O_n and R_n, where O_n represents object-related parameters and R_n represents room-related parameters. These are combined with the most relevant knowledge K retrieved from the knowledge base through the RAG mechanism, which can be formulated as:
K = Retrieve(I, Knowledge Base),
where Retrieve is the retrieval function that returns the most relevant knowledge K based on the input instruction I and the knowledge base. The extracted parameters O_n and R_n are obtained as:
(O_n, R_n) = LLM_ext(I).
The retrieved knowledge K, together with O_n, R_n, and the scene-graph description S_t generated by CLIP, is input to LLM_pla to enhance the accuracy of the large language model when planning the coarse-grained navigation goal G_t. This process is expressed as:
G_t = LLM_pla(I, K, O_n, R_n, S_t).
Figure 1. The core of this framework consists of three large language models with different roles: LLM_ext, LLM_pla, and LLM_oa. The main architecture is divided into two primary modules. The first half analyzes the current RGB image using the instructions and CLIP, and then combines this analysis with RAG to generate a plan for the next action. The second half comprises the PEOA module, which generates fine-grained actions for the agent at each step.
At the fine-grained execution level, we use PEOA. The module takes the coarse-grained goal G_t and the scene description S_t as input, together with depth-map data as an auxiliary input. Its key role is to judge, from the coarse-grained action and the depth-map data, which obstacles may be encountered and at what distances. These inputs are passed to LLM_oa, which generates fine-grained, collision-free decisions in the form of a set of parameterized discrete actions A_t. The fine-grained action generation is formulated as:
A_t = LLM_oa(G_t, S_t, Depth Map),
where LLM_oa is the LLM responsible for generating discrete actions with obstacle avoidance. A concrete LLM example is shown in Figure 2, and pseudocode for the overall method is given in Table 1.
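To make the flow of these formulas concrete, the following minimal Python sketch wires the three LLM roles together for a single decision step. The helper callables (llm_extract, retrieve, clip_describe, llm_plan, llm_avoid) are illustrative placeholders rather than the exact interfaces of our implementation.

from dataclasses import dataclass

@dataclass
class StepResult:
    goal: str       # coarse-grained sub-goal G_t produced by LLM_pla
    actions: list   # fine-grained discrete actions A_t produced by LLM_oa

def rag_nav_step(instruction, rgb_obs, depth_obs, knowledge_base,
                 llm_extract, llm_plan, llm_avoid, clip_describe, retrieve):
    # 1. LLM_ext parses the instruction into object/room parameters (O_n, R_n).
    objects, rooms = llm_extract(instruction)
    # 2. RAG: fetch the most relevant knowledge K for the instruction.
    knowledge = retrieve(instruction, knowledge_base)
    # 3. CLIP produces a scene-graph description S_t of the current view.
    scene_desc = clip_describe(rgb_obs)
    # 4. LLM_pla plans the coarse-grained goal G_t.
    goal = llm_plan(instruction, knowledge, objects, rooms, scene_desc)
    # 5. PEOA: LLM_oa turns G_t into obstacle-aware discrete actions A_t.
    actions = llm_avoid(goal, scene_desc, depth_obs)
    return StepResult(goal=goal, actions=actions)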

3.1. Task Setup

We focus on a practical setting, VLN-CE [6]. Specifically, our task is to enable an agent to effectively navigate from a starting point to a target location within a 3D environment constructed from the Matterport3D dataset, guided by natural language instructions W = {w_i}_{i=1}^{L}. At each location, the agent receives panoramic RGB-D observations O = {I_RGB, I_d}, which include 12 RGB images and 12 depth images captured at 12 evenly spaced horizontal orientations (0°, 30°, …, 330°). The action space comprises parameterized discrete actions, such as moving forward (0.25 m), rotating left or right (15°), and stopping. The agent greedily selects the navigation action with the highest predicted probability. Navigation terminates when the agent triggers the STOP action or exceeds the maximum number of prediction steps.
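As a concrete illustration of this setup, the snippet below encodes the discrete action space, the panoramic observation layout, and the greedy action selection described above; the constant and function names are ours and not part of any simulator API.

# Sketch of the VLN-CE action space and panoramic observation layout.
FORWARD_STEP_M = 0.25                  # metres covered by one "forward" action
TURN_ANGLE_DEG = 15                    # degrees per "turn left"/"turn right" action
ACTIONS = ("STOP", "MOVE_FORWARD", "TURN_LEFT", "TURN_RIGHT")

# 12 evenly spaced headings for the panoramic RGB-D observation.
PANO_HEADINGS_DEG = [30 * i for i in range(12)]    # 0, 30, ..., 330

def select_action(action_probs):
    """Greedy policy: pick the discrete action with the highest predicted probability."""
    return max(action_probs, key=action_probs.get)

# Example: the agent would turn left here.
print(select_action({"STOP": 0.1, "MOVE_FORWARD": 0.2, "TURN_LEFT": 0.5, "TURN_RIGHT": 0.2}))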
Figure 2. A simple example for the LLM.
Table 1. Pseudocode for the RAG-Nav Method.
Step | Description | Functions/Inputs
1 | Construct navigation knowledge base | KnowledgeBase = BuildKnowledgeBase()
2 | Analyze input instructions | O_n, R_n = LLM_ext(Instructions)
3 | Retrieve relevant knowledge | RelevantKnowledge = RAG(KnowledgeBase, O_n, R_n)
4 | Generate coarse-grained navigation goal G_t | G_t = LLM_pla(O_n, R_n, RelevantKnowledge, S_t)
5 | Generate scene-graph description S_t | S_t = CLIP(InputImage)
6 | Define obstacle avoidance strategy | PEOA = DefineObstacleAvoidanceStrategy()
7 | Fine-grained action execution | A_t = PEOA(G_t, S_t, DepthMap)
8 | Execute actions | ExecuteActions(A_t)
9 | Evaluate success rate | SuccessRate = EvaluatePerformance()

3.2. Building a Navigation Knowledge Base

The knowledge base is fundamentally designed to integrate and structure information for efficient retrieval and localization. Specifically, the navigation knowledge base [32] is tailored for VLN, aiming to enhance the agent’s comprehension and navigation accuracy.
This paper leverages scene understanding auxiliary tasks to explore adjacency relationships between regions and objects in discrete navigation environments. We first segment the scene into independent regions to ensure accurate connectivity statistics, where the connectivity between regions is based on the adjacency of nodes. For example, if region A contains node C, and node D in region B is adjacent to node C, then regions A and B are considered connected. Object proximity is determined by assuming adjacency when objects appear within the same view, and their proximity count is incremented accordingly.
After statistical analysis, we obtain region and object proximity count matrices. Outliers in these matrices are addressed, and they are then converted into proximity probability matrices using normalization to reduce the impact of extreme values. We employ the following normalization formula:
P_{i,j} = 0.95 × (N_{i,j} − N_{i,min}) / (N_{i,max} − N_{i,min}),
where P_{i,j} represents the proximity probability of the region or object pair (i, j), and N_{i,min} and N_{i,max} represent the minimum and maximum values of the i-th row of the matrix N, respectively. This process not only enhances the agent’s cognitive capabilities regarding the layout of the environment but also lays the groundwork for the subsequent retrieval processes of the RAG, thereby improving the agent’s ability to navigate accurately and efficiently in complex environments.
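This normalization can be implemented directly as a row-wise min-max scaling of the count matrix. The NumPy sketch below is a minimal version; the guard against zero-range rows is our addition and is not part of the formula above.

import numpy as np

def proximity_probabilities(counts, scale=0.95):
    """Row-wise min-max normalization of a proximity count matrix N into probabilities P."""
    counts = np.asarray(counts, dtype=float)
    row_min = counts.min(axis=1, keepdims=True)
    row_max = counts.max(axis=1, keepdims=True)
    span = np.where(row_max > row_min, row_max - row_min, 1.0)  # avoid division by zero
    return scale * (counts - row_min) / span

# Example: adjacency counts between three regions.
N = [[0, 12, 3],
     [12, 0, 7],
     [3, 7, 0]]
print(proximity_probabilities(N))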

3.3. Coarse-Grained Planning with RAG

In VLN systems, to enhance the global planning capabilities of agents during navigation tasks, we employ RAG techniques to extract relevant knowledge from our constructed navigation knowledge base, thereby supporting the global planning and decision-making processes of LLMs.

3.3.1. Knowledge Base Retrieval

For each navigation task, the agent uses a retrieval model (e.g., DPR or BM25) to extract relevant knowledge from the knowledge base based on the given instructions. For example, with the instruction “Go to the restroom and get a basin”, the agent extracts keywords like “restroom” and “basin” using the extraction model LLM_ext and retrieves related entries, such as {restroom, living room, t_1} and {basin, sink, t_2}, where t_1 and t_2 represent association probabilities. The top-k most relevant pieces of knowledge are then selected.
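A minimal sketch of this retrieval step is shown below. It substitutes a simple keyword-overlap score for the DPR/BM25 retriever mentioned above, and the (entity_a, entity_b, probability) triple format mirrors the example entries; all names are illustrative.

def retrieve_top_k(keywords, knowledge_base, k=3):
    """Rank (entity_a, entity_b, probability) entries by keyword overlap, then probability."""
    def score(entry):
        entity_a, entity_b, prob = entry
        hits = sum(kw in (entity_a, entity_b) for kw in keywords)
        return (hits, prob)
    return sorted(knowledge_base, key=score, reverse=True)[:k]

kb = [("restroom", "living room", 0.81),
      ("basin", "sink", 0.74),
      ("bedroom", "hallway", 0.66)]
print(retrieve_top_k(["restroom", "basin"], kb, k=2))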

3.3.2. Enhanced Generation

Subsequently, the retrieved relevant knowledge is combined with the navigation instructions and the scene description of the current view S_t, and input into the planning model LLM_pla. This allows us to leverage its powerful textual parsing capability to interpret the information and convert it into sub-goal descriptions G_t. Furthermore, to mitigate the inaccuracies introduced by the static planner LLM_pla, we incorporate the Room and Object Awareness Scene Perceptor (ROASP) proposed in [33].
At each time step t, the agent perceives the environment and obtains a panoramic visual observation O = {I_RGB, I_d}. The image features are extracted using the CLIP image encoder as follows:
f_v = E_cv(o_i),
where E_cv(·) denotes the CLIP image encoder. For each room category C_r and each object category C_o, the text features are extracted using the pre-trained CLIP text encoder as follows:
f_r^T = E_ct(T_r),
f_o^T = E_ct(T_o),
where E_ct(·) denotes the CLIP text encoder. Finally, the similarity score S_r between the image feature f_v and the text feature f_r^T, as well as the similarity score S_o between the image feature f_v and the text feature f_o^T, are computed as:
S_r = S(f_v, f_r^T),
S_o = S(f_v, f_o^T).
We average the S_r scores and select the highest-scoring prediction as C_r. The score S_o increases with the proportion of the view occupied by the object. We then select the top-k matching objects with the highest scores as auxiliary environmental feedback. This approach allows ROASP not only to describe and predict the current environment but also to provide the large language model with crucial contextual information, thereby enabling the generation of more accurate sub-goals G_t.
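A simplified NumPy version of this scoring is sketched below. The encoder outputs are assumed to be precomputed CLIP features for the 12 panoramic views, and max-pooling the object scores over views is an illustrative choice rather than the exact aggregation used in ROASP.

import numpy as np

def cosine_sim(img_feat, text_feats):
    """Cosine similarity between one image feature and a bank of text features."""
    img_feat = img_feat / np.linalg.norm(img_feat)
    text_feats = text_feats / np.linalg.norm(text_feats, axis=-1, keepdims=True)
    return text_feats @ img_feat

def roasp_perceive(view_feats, room_text_feats, obj_text_feats, room_names, obj_names, top_k=3):
    """Predict the current room and the top-k visible objects from the 12 view features."""
    room_scores = np.mean([cosine_sim(f, room_text_feats) for f in view_feats], axis=0)
    obj_scores = np.max([cosine_sim(f, obj_text_feats) for f in view_feats], axis=0)
    room_pred = room_names[int(np.argmax(room_scores))]
    top_objs = [obj_names[i] for i in np.argsort(obj_scores)[::-1][:top_k]]
    return room_pred, top_objs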

3.4. Prompt-Enhanced Obstacle Avoidance Strategy

In exploring the obstacle avoidance strategies for agents in 3D continuous environments, we propose the PEOA strategy, which involves designing formatted prompts to guide the agents’ obstacle avoidance behavior and fine-grained planning. The obstacle avoidance performance is enhanced by optimizing the prompt structure, ultimately resulting in the planning of a parameterized discrete action A_t.
In this context, we have designed various formats for prompts aimed at providing agents with richer and more specific environmental information and task instructions. For instance, a typical prompt may contain the following content:
“Nearby obstacles: {(wall, 0.60 m), (table, 0.36 m), (chair, 0.30 m, 31°–32°), (door, 0.2 m)}; sub-goal: {living room}.”
This format not only clarifies the target task but also details the potential obstacles that may be encountered along the planned route, as well as the distances D_t to these obstacles obtained from depth images. By providing this information to the obstacle avoidance model LLM_oa, we can generate parameterized discrete actions that ensure each movement progresses toward the goal while minimizing deviations and errors.
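Such prompts can be assembled mechanically from the depth-derived obstacle list and the current sub-goal. A minimal formatting helper is sketched below; the (name, distance[, bearing]) tuple layout is our assumption for illustration.

def build_peoa_prompt(obstacles, sub_goal):
    """Format obstacle tuples and the sub-goal into the structured PEOA prompt for LLM_oa."""
    parts = []
    for obs in obstacles:
        name, dist = obs[0], obs[1]
        bearing = f", {obs[2]}" if len(obs) > 2 else ""
        parts.append(f"({name}, {dist:.2f} m{bearing})")
    return (f"Nearby obstacles: {{{', '.join(parts)}}}; "
            f"sub-goal: {{{sub_goal}}}.")

print(build_peoa_prompt(
    [("wall", 0.60), ("table", 0.36), ("chair", 0.30, "31°–32°"), ("door", 0.20)],
    "living room"))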

4. Experimental Findings and Interpretation

4.1. Evaluation Metrics

For evaluation, we adopted several navigation metrics based on prior work. These include:
  • Trajectory Length (TL): measuring average path length;
  • Navigation Error (NE): the average distance between the final and target positions;
  • Success Rate (SR): the proportion of paths with NE less than 3 m;
  • Oracle Success Rate (OSR): SR with the Oracle stopping strategy;
  • Success weighted by Path Length (SPL): SR weighted by the ratio of shortest-path length to actual path length, assessing navigation efficiency;
  • Normalized Dynamic Time Warping (NDTW): to evaluate path fidelity;
  • SDTW: penalizing NDTW by SR.
In the R2R-CE dataset, SR and SPL are the primary metrics, while RxR-CE emphasizes path fidelity, focusing on NDTW and SDTW.
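For reference, the distance-based metrics above can be computed as in the following sketch (NDTW and SDTW require the full reference path and are omitted here); the episode field names are our own convention.

import numpy as np

def navigation_metrics(episodes, success_radius=3.0):
    """Compute NE, SR and SPL from per-episode records with fields
    'final_dist' (metres from goal at stop), 'path_len' (agent trajectory length)
    and 'shortest_len' (geodesic shortest-path length)."""
    ne = np.mean([e["final_dist"] for e in episodes])
    success = np.array([e["final_dist"] < success_radius for e in episodes], dtype=float)
    sr = success.mean()
    spl = np.mean([s * e["shortest_len"] / max(e["path_len"], e["shortest_len"])
                   for s, e in zip(success, episodes)])
    return {"NE": ne, "SR": sr, "SPL": spl}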

4.2. Datasets

This study evaluates performance on the R2R-CE and RxR-CE datasets. R2R-CE contains 5611 shortest path trajectories, with each trajectory paired with approximately three English instructions, averaging 9.89 m in length and 32 words per instruction. Val Seen includes novel paths and instructions within known scenes, while Val Unseen introduces novel paths, instructions, and scenes.
In contrast, RxR-CE is larger and more challenging, with instructions averaging 120 words and paths averaging 15.23 m. Additionally, agents in RxR-CE face stricter constraints, including a larger chassis radius (0.18 m) and a prohibition on sliding along obstacles, increasing the risk of collisions.

4.3. Implementation Details

Our model was trained on a Linux server equipped with two NVIDIA RTX 3090 GPUs (NVIDIA Corporation, Santa Clara, CA, USA). During training, we set the batch size to 8 and the learning rate to 2 × 10^−5, training for a total of 20,000 iterations. The optimal model was selected based on performance on the unseen validation split.
For the LLM, we utilized the Llama-3-8B-Instruct [34] model for navigation planning. For the scene perception module, we retained the top three object prediction results for each location. We built our agent using the Habitat [35] platform.
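As an illustration, the planning LLM can be loaded through the Hugging Face transformers API roughly as follows; the model identifier is the public Llama-3-8B-Instruct checkpoint, and the generation settings are illustrative defaults rather than our exact configuration.

from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "meta-llama/Meta-Llama-3-8B-Instruct"
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(model_id, device_map="auto")

def llm_generate(prompt, max_new_tokens=128):
    # Tokenize the prompt, generate a continuation, and strip the prompt tokens.
    inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
    output = model.generate(**inputs, max_new_tokens=max_new_tokens)
    return tokenizer.decode(output[0][inputs["input_ids"].shape[1]:], skip_special_tokens=True)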

4.4. Comparison with SOTA Methods

4.4.1. R2R-CE

Table 2 shows that our proposed method demonstrates superior performance compared to state-of-the-art models, particularly in two critical metrics: SR and SPL. Our method achieved an SPL of 60 and an SR of 68 on the Val Seen set, outperforming all other models, including Reborn and ETPNav. Additionally, our method achieved the lowest NE of 3.66 and the highest OSR of 74 on the Val Seen set, demonstrating exceptional navigation precision and potential for success under optimal conditions. On the Val Unseen set, we obtained SPL and SR scores of 51 and 58, respectively, along with a low NE of 4.34 and a high OSR of 66, indicating strong generalization capabilities. Furthermore, on the Test Unseen set, we maintained SPL and SR scores of 50 and 57, respectively, with a competitive NE of 4.85 and an OSR of 65, highlighting robust performance in challenging unseen environments. While our TL is slightly longer compared to some models, it remains competitive with state-of-the-art methods, balancing path efficiency with task success. These results affirm the effectiveness and robustness of our method in navigation tasks, establishing it as a leading approach in the field.

4.4.2. RxR-CE

Table 3 presents a performance comparison of our model with several other methods under different evaluation conditions. The Seq2Seq model employs a sequence-to-sequence architecture, typically used for prediction and generation tasks involving sequential data. CWP-CMA and CWP-RecBERT utilize conditional weighting models and contextual information to optimize navigation decisions and improve success rates, while ETPNav significantly enhances navigation capabilities through an augmented task planning mechanism. On the unseen validation set, ETPNav achieved an NE of 5.64 and an SR of 54.79, whereas our model attained 4.58 and 56.85 on the corresponding metrics, demonstrating a superior success rate and better performance in unknown environments.
Table 2. Comparison with state-of-the-art methods on R2R-CE dataset.
Model | Val Seen (TL / NE / OSR / SR / SPL) | Val Unseen (TL / NE / OSR / SR / SPL) | Test Unseen (TL / NE / OSR / SR / SPL)
Seq2Seq [1] | 9.26 / 7.12 / 46 / 37 / 35 | 8.64 / 7.37 / 40 / 32 / 30 | 8.85 / 7.91 / 36 / 28 / 25
SASRA [36] | 8.89 / 7.71 / - / 36 / 34 | 7.89 / 8.32 / - / 24 / 22 | - / - / - / - / -
HPN [37] | 8.54 / 5.48 / 53 / 46 / 43 | 7.62 / 6.31 / 40 / 36 / 34 | 8.02 / 6.65 / 37 / 32 / 30
CM2 [38] | 12.05 / 6.10 / 51 / 43 / 35 | 11.54 / 7.02 / 42 / 34 / 28 | 13.90 / 7.70 / 39 / 31 / 24
WS-MGMAP [39] | 10.12 / 5.65 / 52 / 47 / 43 | 10.00 / 6.28 / 48 / 39 / 34 | 12.30 / 7.11 / 45 / 35 / 28
CWP-RecBERT [40] | 12.50 / 5.02 / 59 / 50 / 44 | 12.23 / 5.74 / 53 / 44 / 39 | 13.51 / 5.89 / 51 / 42 / 36
Sim2Sim [41] | 11.18 / 4.67 / 61 / 52 / 44 | 10.69 / 6.07 / 52 / 43 / 36 | 11.43 / 6.17 / 52 / 44 / 37
Reborn [42] | 10.29 / 4.34 / 67 / 59 / 56 | 10.06 / 5.40 / 57 / 50 / 46 | 11.47 / 5.55 / 57 / 49 / 45
ETPNav [7] | 11.78 / 3.95 / 72 / 66 / 59 | 11.99 / 4.71 / 65 / 57 / 49 | 12.87 / 5.12 / 63 / 55 / 48
Ours | 12.11 / 3.66 / 74 / 68 / 60 | 11.69 / 4.34 / 66 / 58 / 51 | 12.66 / 4.85 / 65 / 57 / 50
On the seen validation set, our model achieved an SPL of 52.76, slightly higher than ETPNav’s 50.83. On the unseen test set, it achieved an NE of 5.72 and an SR of 53.53, compared with ETPNav’s 6.99 and 51.21, respectively. Overall, our model exhibited outstanding performance across various evaluation conditions, particularly in terms of success rate and navigation efficiency. This demonstrates its effectiveness and robustness in complex environments, further validating its advantages in task execution.
Table 3. Comparison with state-of-the-art methods on RxR-CE dataset. The up arrow (↑) indicates that higher values are better, while the down arrow (↓) indicates that lower values are better.
Methods | Val Seen (NE↓ / SR↑ / SPL↑ / NDTW↑ / SDTW↑) | Val Unseen (NE↓ / SR↑ / SPL↑ / NDTW↑ / SDTW↑) | Test Unseen (NE↓ / SR↑ / SPL↑ / NDTW↑ / SDTW↑)
Seq2Seq [1] | - / - / - / - / - | - / - / - / - / - | 12.10 / 13.93 / 11.96 / 30.86 / 11.01
CWP-CMA [40] | - / - / - / - / - | 8.76 / 26.59 / 22.16 / 47.05 / - | 10.40 / 24.08 / 19.07 / 37.39 / 18.65
CWP-RecBERT [40] | - / - / - / - / - | 8.98 / 27.08 / 22.65 / 46.71 / - | 10.40 / 24.85 / 19.61 / 37.30 / 19.05
ETPNav [7] | 5.03 / 61.46 / 50.83 / 66.41 / 51.28 | 5.64 / 54.79 / 44.89 / 61.90 / 45.33 | 6.99 / 51.21 / 39.86 / 54.11 / 43.30
Ours | 4.14 / 63.27 / 52.76 / 68.43 / 53.87 | 4.58 / 56.85 / 46.34 / 63.08 / 47.64 | 5.72 / 53.53 / 41.10 / 55.73 / 44.92

4.5. Ablation Study

We conducted a series of ablation experiments on the R2R-CE dataset. All results presented in this section are reported on the validation unseen split.
Table 4 presents the contributions of the main modules of the model. The first row represents the Transformer-based framework without any of the proposed modules. The second row utilizes RAG to extract knowledge from the knowledge base, resulting in a 3.6% increase in success rate, indicating that the knowledge effectively guides navigation decisions. The third row adds the ROASP module, yielding results superior to those of the previous two rows, demonstrating that ROASP enables the agent to better perceive the environment. Finally, the last row incorporates an intelligent obstacle avoidance strategy, significantly enhancing model performance, particularly in terms of SPL, which improved by 2.5%. This indicates that the strategy facilitates efficient fine-grained navigation planning.
Table 4. Performance Metrics with Different Configurations.
RAG | ROASP | PEOA | TL | NE | OSR | SR | SPL
× | × | × | 10.31 | 5.59 | 56.63 | 49.86 | 45.32
✓ | × | × | 11.64 | 5.03 | 59.76 | 53.42 | 47.16
✓ | ✓ | × | 11.79 | 4.80 | 62.59 | 55.83 | 48.67
✓ | ✓ | ✓ | 11.69 | 4.34 | 65.83 | 58.32 | 51.09
Table 5 examines the impact of the number of relevant pieces of information retrieved from RAG on model performance. It is evident that the success rate tends to decrease as the number of retrieved items increases. However, the SPL performance is optimal when Top-K is set to 3. A lower number of retrieved items may result in the omission of the most useful relevant information, while an excessive number can lead to a decline in overall performance.
Table 5. Top-K Ablations.
Top-K | TL | NE | OSR | SR | SPL
1 | 11.92 | 4.38 | 65.97 | 58.44 | 50.80
3 | 11.69 | 4.34 | 65.83 | 58.32 | 51.09
5 | 11.63 | 4.45 | 65.02 | 57.10 | 50.43
10 | 12.43 | 4.60 | 65.81 | 57.68 | 49.18
Table 6 presents the results of our method implemented on different LLMs. We selected three of the most popular models currently available. The data indicates that, overall, Llama3 consistently achieves the best performance, although there is a slight decrease in the success rate (SR) compared to GPT-3.0. This suggests that Llama3 exhibits superior reasoning capabilities in this domain.
Table 6. Different LLM Ablations.
LLM | TL | NE | OSR | SR | SPL
GPT-3.0 | 11.77 | 4.43 | 64.77 | 58.34 | 50.96
Qwen2 | 11.91 | 4.52 | 65.21 | 57.66 | 49.74
Llama3 | 11.69 | 4.34 | 65.83 | 58.32 | 51.09
Table 7 compares different prompt structures for PEOA. Structure 1 is a simple query: “There are obstacles, and the goal is to reach the living room. What should you do?” Structure 2 provides more details: “Environmental information: Obstacles {wall, table, chair, door}. Target location: living room. Note: Distance to obstacles is 0.2 m–0.6 m. Suggested actions: {move forward, turn left, turn right}.” Structure 3 is the most detailed: “Current environment: Nearby obstacles {(wall, 0.60 m), (table, 0.36 m), (chair, 0.30 m, 31°–32°), (door, 0.2 m)}. Target: Sub-goal: living room.” The results show that Structure 3 achieves the highest SPL and performs well across the other metrics. Despite a slightly lower OSR than Structure 2, Structure 3 is preferred for its concise yet informative content, enhancing the agent’s environmental understanding.
Table 7. Performance Metrics for Different Prompt Structures.
Prompt | TL | NE | OSR | SR | SPL
Structure 1 | 11.76 | 4.50 | 65.72 | 57.92 | 50.62
Structure 2 | 11.79 | 4.41 | 65.92 | 58.02 | 50.77
Structure 3 | 11.69 | 4.34 | 65.83 | 58.32 | 51.09

4.6. Qualitative Examples

As depicted in Figure 3, the agent successfully navigated the environment utilizing RAG-enhanced LLM decision-making and PEOA obstacle avoidance strategies. Beginning at the initial point, the agent proceeded to point A, where it paused adjacent to a potted plant. Confronted with two potential paths, the agent inferred a stronger association between the bathroom and bedroom compared to the staircase, prompting a right turn towards the bedroom.
Employing the PEOA strategy, the agent effectively processed wall proximity and angular data (wall, 0.3 m, 30°–68°), thereby avoiding collisions and entering the bedroom, where it halted at point B after detecting a television. Upon failing to locate the bathroom en route to point C, the agent retraced its trajectory to point B, subsequently navigating to point D, and ultimately entering the bathroom.
Figure 3. Example of navigation using RAG-Nav.

5. Conclusions

This paper introduces RAG-Nav, a novel navigation method that integrates Retrieval-Augmented Generation (RAG) with large language models to enhance navigation accuracy and robustness in complex, continuous 3D environments. By retrieving relevant information from a constructed knowledge base, RAG-Nav augments the generative capabilities of the language model for precise coarse-grained planning. Additionally, it incorporates an intelligent obstacle avoidance strategy using structured prompts for fine-grained planning, effectively mitigating collision risks.
Experimental results on the public datasets R2R-CE and RxR-CE demonstrate that RAG-Nav achieves 2% and 2.32% increases in navigation success rate, respectively, confirming its effectiveness in improving both navigation accuracy and obstacle avoidance capabilities.

Author Contributions

Conceptualization, X.B. and Z.L.; methodology, X.B. and Z.L.; software, Z.L. and Z.L.; validation, Z.L. and B.W.; writing—original draft preparation, Z.L.; writing—review and editing, Z.L.; funding acquisition, resources, X.B. and Z.L.; supervision, X.B. All authors have read and agreed to the published version of the manuscript.

Funding

This work was supported by the Key Research and Development Program of Zhejiang Province (2020C03094), and the General Scientific Research Project of the Department of Education of Zhejiang Province (Y202250677, Y202250706, Y202250679).

Data Availability Statement

The original contributions presented in the study are included in the article; further inquiries can be directed to the corresponding author.

Conflicts of Interest

The authors declare no conflicts of interest.

References

  1. Anderson, P.; Wu, Q.; Teney, D.; Bruce, J.; Johnson, M.; Sünderhauf, N.; Reid, I.; Gould, S.; Hengel, A.V.D. Vision-and-language navigation: Interpreting visually-grounded navigation instructions in real environments. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Salt Lake City, UT, USA, 18–22 June 2018; pp. 3674–3683. [Google Scholar]
  2. Krantz, J.; Wijmans, E.; Majumdar, A.; Batra, D.; Lee, S. Beyond the nav-graph: Vision-and-language navigation in continuous environments. In Proceedings of the European Conference on Computer Vision, Glasgow, UK, 23–28 August 2020; pp. 104–120. [Google Scholar]
  3. Zhou, G.; Hong, Y.; Wu, Q. NavGPT: Explicit reasoning in vision-and-language navigation with large language models. arXiv 2023, arXiv:2305.16986. [Google Scholar] [CrossRef]
  4. OpenAI. GPT-4 technical report. arXiv 2023, arXiv:2303.08774.
  5. Lin, B.; Nie, Y.; Wei, Z.; Chen, J.; Ma, S.; Han, J.; Xu, H.; Chang, X.; Liang, X. NavCoT: Boosting LLM-Based Vision-and-Language Navigation via Learning Disentangled Reasoning. arXiv 2024, arXiv:2403.07376. [Google Scholar]
  6. Yue, L.; Zhou, D.L.; Xie, L.; Zhang, F.; Yan, Y.; Yin, E. Safe-VLN: Collision Avoidance for Vision-and-Language Navigation of Autonomous Robots Operating in Continuous Environments. IEEE Robot. Autom. Lett. 2024, 9, 4918–4925. [Google Scholar] [CrossRef]
  7. An, D.; Wang, H.Q.; Wang, W.G.; Wang, Z.; Huang, Y.; He, K.J.; Wang, L. ETPNAV: Evolving topological planning for vision-language navigation in continuous environments. arXiv 2023, arXiv:2304.03047. [Google Scholar] [CrossRef]
  8. Wang, X.; Huang, Q.; Celikyilmaz, A.; Gao, J.; Shen, D.; Wang, Y.F.; Wang, W.Y.; Zhang, L. Reinforced cross-modal matching and self-supervised imitation learning for vision-language navigation. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Long Beach, CA, USA, 15–20 June 2019; pp. 6629–6638. [Google Scholar]
  9. Hong, Y.; Wu, Q.; Qi, Y.; Rodriguez-Opazo, C.; Gould, S. VLN BERT: A recurrent vision-and-language BERT for navigation. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Nashville, TN, USA, 19–25 June 2021; pp. 1643–1653. [Google Scholar]
  10. Chen, S.; Guhur, P.L.; Schmid, C.; Laptev, I. History aware multimodal transformer for vision-and-language navigation. Adv. Neural Inf. Process. Syst. 2021, 34, 5834–5847. [Google Scholar]
  11. Ku, A.; Anderson, P.; Patel, R.; Ie, E.; Baldridge, J. Room-across-room: Multilingual vision-and-language navigation with dense spatio-temporal grounding. In Proceedings of the Conference on Empirical Methods in Natural Language Processing, Online, 16–20 November 2020; pp. 4392–4412. [Google Scholar]
  12. Chang, A.; Dai, A.; Funkhouser, T.; Halber, M.; Niebner, M.; Savva, M.; Song, S.; Zeng, A.; Zhang, Y. Matterport3D: Learning from RGBD data in indoor environments. In Proceedings of the International Conference on 3D Vision, Qingdao, China, 10–12 October 2017; pp. 667–676. [Google Scholar]
  13. Landia, F.; Baraldi, L.; Cornia, M.; Corsini, M.; Cucchiara, R. Multimodal attention networks for low-level vision-and-language navigation. Comput. Vis. Image Underst. 2021, 210, 103255. [Google Scholar] [CrossRef]
  14. Li, J.; Tan, H.; Bansal, M. Improving cross-modal alignment in vision language navigation via syntactic information. In Proceedings of the 2021 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Online, 6–11 June 2021; pp. 1041–1050. [Google Scholar]
  15. Ma, C.; Lu, J.; Wu, Z.; AlRegib, G.; Kira, Z.; Socher, R.; Xiong, C. Self-monitoring navigation agent via auxiliary progress estimation. In Proceedings of the International Conference on Learning Representations, New Orleans, LA, USA, 6–9 May 2019. [Google Scholar]
  16. Taioli, F.; Rosa, S.; Castellini, A.; Natale, L.; Bue, A.D.; Farinelli, A.; Cristani, M.; Wang, Y. I2EDL: Interactive Instruction Error Detection and Localization. arXiv 2024, arXiv:2406.05080. [Google Scholar]
  17. Zhan, Z.; Yu, L.; Yu, S.; Tan, G. MC-GPT: Empowering Vision-and-Language Navigation with Memory Map and Reasoning Chains. arXiv 2024, arXiv:2405.10620. [Google Scholar]
  18. Chen, J.; Lin, B.; Liu, X.; Liang, X. Affordances-Oriented Planning using Foundation Models for Continuous Vision-Language Navigation. arXiv 2024, arXiv:2407.05890. [Google Scholar]
  19. Su, W.; Zhu, X.; Cao, Y.; Li, B.; Lu, L.; Wei, F.; Dai, J. VL-BERT: Pre-training of Generic Visual-Linguistic Representations. arXiv 2022, arXiv:1908.08530. [Google Scholar]
  20. Alayrac, J.B.; Donahue, J.; Luc, P.; Miech, A.; Barr, I.; Hasson, Y.; Lenc, K.; Mensch, A.; Millican, K.; Reynolds, M.; et al. Flamingo: A Visual Language Model for Few-Shot Learning. arXiv 2022, arXiv:2204.14198. [Google Scholar]
  21. Devlin, J.; Chang, M.W.; Lee, K.; Toutanova, K. BERT: Pre-training of deep bidirectional transformers for language understanding. In Proceedings of the Annual Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Minneapolis, MN, USA, 2–7 June 2019; pp. 4171–4186. [Google Scholar]
  22. Brown, T.B.; Mann, B.; Ryder, N.; Subbiah, M. Language models are few-shot learners. arXiv 2020, arXiv:2005.14165. [Google Scholar]
  23. Huang, W.; Abbeel, P.; Pathak, D.; Mordatch, I. Language models as zero-shot planners: Extracting actionable knowledge for embodied agents. In International Conference on Machine Learning; PMLR: Birmingham, UK, 2022. [Google Scholar]
  24. Shah, D.; Osiński, B.; Levine, S. LM-Nav: Robotic Navigation with Large Pre-Trained Models of Language, Vision, and Action. arXiv 2022, arXiv:2207.04429. [Google Scholar]
  25. Latif, E. 3P-LLM: Probabilistic Path Planning using Large Language Model for Autonomous Robot Navigation. arXiv 2024, arXiv:2403.18778. [Google Scholar]
  26. Lewis, P.; Perez, E.; Piktus, A.; Petroni, F.; Karpukhin, V.; Goyal, N.; Küttler, H.; Lewis, M.; Yih, W.; Rocktäschel, T.; et al. Retrieval-augmented generation for knowledge-intensive NLP tasks. Adv. Neural Inf. Process. Syst. 2020, 33, 9459–9474. [Google Scholar]
  27. Han, R.; Wang, S.; Wang, S.; Zhang, Z.; Zhang, Q.; Eldar, Y.C.; Hao, Q.; Pan, J. RDA: An accelerated collision-free motion planner for autonomous navigation in cluttered environments. IEEE Robot. Autom. Lett. 2023, 8, 1715–1722. [Google Scholar] [CrossRef]
  28. Zhang, Z.; Zhang, Y.; Han, R.; Zhang, L.; Pan, J. A generalized continuous collision detection framework of polynomial trajectory for mobile robots in cluttered environments. IEEE Robot. Autom. Lett. 2022, 7, 9810–9817. [Google Scholar] [CrossRef]
  29. Hirose, N.; Shah, D.; Sridhar, A.; Levine, S. EXAUG: Robot-conditioned navigation policies via geometric experience augmentation. In Proceedings of the International Conference on Robotics and Automation, London, UK, 29 May–2 June 2023; pp. 4077–4084. [Google Scholar]
  30. Du, H.; Yu, X.; Zheng, L. Learning object relation graph and tentative policy for visual navigation. In Proceedings of the European Conference on Computer Vision, Glasgow, UK, 23–28 August 2020; pp. 19–34. [Google Scholar]
  31. Huang, C.; Mees, O.; Zeng, A.; Burgard, W. Visual language maps for robot navigation. In Proceedings of the 2023 IEEE International Conference on Robotics and Automation (ICRA), London, UK, 29 May–2 June 2023; pp. 10608–10615. [Google Scholar]
  32. Li, X.; Wang, Z.; Yang, J.; Wang, Y.; Jiang, S. KERM: Knowledge enhanced reasoning for vision-and-language navigation. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Vancouver, BC, Canada, 17–24 June 2023; pp. 2583–2592. [Google Scholar]
  33. Qiao, Y.; Qi, Y.; Yu, Z.; Liu, J.; Wu, Q. March in chat: Interactive prompting for remote embodied referring expression. In Proceedings of the IEEE/CVF International Conference on Computer Vision, Paris, France, 2–6 October 2023; pp. 15758–15767. [Google Scholar]
  34. Cui, Y.; Yang, Z.; Yao, X. Efficient and effective text encoding for Chinese Llama and Alpaca. arXiv 2023, arXiv:2304.08177. [Google Scholar]
  35. Savva, M.; Kadian, A.; Maksymets, O.; Zhao, Y.; Wijmans, E.; Jain, B.; Straub, J.; Liu, J.; Koltun, V. Habitat: A platform for embodied AI research. In Proceedings of the IEEE/CVF International Conference on Computer Vision, Seoul, Republic of Korea, 27 October–2 November 2019; pp. 9339–9347. [Google Scholar]
  36. Irshad, M.Z.; Mithun, N.C.; Seymour, Z.; Chiu, H.P.; Samarasekera, S.; Kumar, R. SASRA: Semantically-aware spatio-temporal reasoning agent for vision-and-language navigation in continuous environments. In Proceedings of the International Conference on Pattern Recognition, Montreal, QC, Canada, 21–25 August 2022; pp. 4065–4071. [Google Scholar]
  37. Krantz, J.; Lee, S. Waypoint models for instruction-guided navigation in continuous environments. In Proceedings of the IEEE/CVF International Conference on Computer Vision, Montreal, QC, Canada, 11–17 October 2021; pp. 15162–15171. [Google Scholar]
  38. Georgakis, G.; Schmeckpeper, K.; Wanchoo, K.; Dan, S.; Miltsakaki, E.; Roth, D.; Daniilidis, K. Cross-modal map learning for vision and language navigation. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, New Orleans, LA, USA, 18–22 June 2022; pp. 15460–15470. [Google Scholar]
  39. Chen, P.; Ji, D.; Lin, K.; Zeng, R.; Li, T.H.; Tan, M.; Gan, C. Weakly-supervised multi-granularity map learning for vision-and-language navigation. Adv. Neural Inf. Process. Syst. 2022, 35, 38149–38161. [Google Scholar]
  40. Hong, Y.; Wang, Z.; Wu, Q.; Gould, S. Bridging the gap between learning in discrete and continuous environments for vision-and-language navigation. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, New Orleans, LA, USA, 18–24 June 2022; pp. 15439–15449. [Google Scholar]
  41. Krantz, J.; Lee, S. Sim-2-sim transfer for vision-and-language navigation in continuous environments. In Proceedings of the European Conference on Computer Vision, Tel Aviv, Israel, 23–27 October 2022; pp. 588–603. [Google Scholar]
  42. An, D.; Wang, Z.; Li, Y.; Wang, Y.; Hong, Y.; Huang, Y.; Wang, L.; Shao, J. 1st place solutions for RXR-Habitat vision-and-language navigation competition. arXiv 2022, arXiv:2206.11610. [Google Scholar]