1. Introduction
Transformer models have driven remarkable progress in machine learning in general and natural language processing (NLP) in particular [1,2,3]. Their self-attention mechanism effectively models long-range dependencies and works well with sequence-structured data. Structuring data domains as sequences enables their application to various fields, such as computer vision [4,5] and decision-making [6], as well as their combinations in multimodal tasks [7].
Under these conditions, transformer-based language models are promising candidates for robotics and decision-making. Instruction-based systems are often required in robotic scenarios involving close interaction with humans, and the use of natural language, whether spoken or written, makes such systems accessible to people. Many real-world tasks additionally require visual observations. In particular, vision-and-language navigation (VLN), which involves guiding robots to complete tasks in real-world environments by interpreting visual data and natural language instructions, has gained significant attention [8,9] as a fundamental task supporting various robotic operations.
In many real-world environments, the decision-making process relies on previously observed states and performed actions. For example, consider a navigation task given to a mobile robot agent: “Go outside of the office, but if the door is closed, go back to the starting point”. The agent must retain a history of performed actions or observed states to complete the task. The history-aware multimodal transformer (HAMT) [10] was proposed to encode sequences of visual and language data together with a history of observations, effectively replacing long short-term memory (LSTM) [11] for low-level vision-and-language navigation tasks [12].
Despite progress in vision-and-language navigation, several challenges remain. Low-level navigation requires gathering demonstrations of skills such as optimal pathfinding and obstacle avoidance, which leads to large and diverse datasets. Training learned systems to acquire these skills is impractical, especially given the availability of reliable semiautomatic LIDAR-based routing systems, which are widely used in mobile robotics. Low-level navigation also requires collecting action and observation history, including visual observations, at a fixed rate, which results in inefficient memory utilization and constraints on history retention. Finally, in mobile applications, computational efficiency is always a priority, so the use of a single graphical processing unit (GPU) is a practical constraint.
The core idea of this work is the integration of a semiautomatic LIDAR-based path-searching system with a vision–language model (VLM) pretrained on visual and textual data, leveraging the rich cross-modal knowledge acquired in tasks such as visual question answering (VQA) for high-level decision-making. In our approach, prior to navigation, we prepare action options with vision–language multimodal annotations and incorporate dynamic action space adaptiveness. A dynamic action space is necessary for high-level navigation because navigational targets form a unique set in each environment, differing in both description and size.
The key contributions of this work are as follows:
Integration of a vision–language model (VLM) with a LIDAR-based planner to enhance high-level navigation decision-making.
A novel dynamic action space approach using an attention score classifier, which enables multiple choices from a set of describable options through a transformer-inspired learnable layer.
As a result, our approach reduces the need for large training datasets while maintaining effective navigation skills. Additionally, our state–action history grows at a slower rate due to high-level decision-making, leveraging both the deep world knowledge of the vision–language model (VLM) and the real-time reliability of the semiautomatic LIDAR-based planner. Furthermore, a dynamic and describable action space allows future advanced smart systems to transition seamlessly not only between different navigational environments but also across various task domains.
2. Related Work
The introduction of large-scale transformer-based models has significantly impacted the field of natural language processing. In “Language models are few-shot learners” [13], the authors demonstrate that language models can generalize to various tasks without extensive task-specific fine-tuning, given only a few demonstration examples in few-shot settings. They show the ability to combine, in a single prompt, the task description, all possible action descriptions, and a few examples of correct decisions. Text-based descriptions of possible actions are crucial for switching models between different action spaces and varying numbers of action options. A multimodal (vision–language) approach for describing both the task and action options within a single prompt incorporating images is presented by Dai et al. [14].
The ability to build state–action sequence policies from such models is limited for tasks where every state of the environment contains an image. To produce a single action, all previously taken actions and every observed state, including images, must be retained as context, leading to rapid memory growth.
In “Do as I can, not as I say: grounding language in robotic affordances” [15], such operations are performed by a large language model (LLM) using a predefined list of robotic skills. The skills are described textually, and the decision-making LLM decomposes task instructions into a sequence of text-described skills. The probability that a skill option makes progress toward task completion is estimated from the LLM’s aggregated probability of generating its textual description. The paper demonstrates a describable approach to action annotation and the ability to select from a varying number of action options by estimating them one by one. However, large language models (LLMs) are primarily unimodal, pretrained on language data, and disconnected from visual representations, making them limited for multimodal tasks; this approach does not extrapolate to a vision-and-language framework in terms of multimodal action representation.
In contrast, our approach estimates all possible options simultaneously in a single forward pass of the vision–language multimodal model.
Some approaches to multimodal history-aware policies address the issue of growing history [16,17]. These approaches target policies for a 7-DoF gripper manipulator robot in the RLBench [18] virtual environment and in real-world implementations. Specifically, the InstructRL [17] framework uses a moment-wise encoder to convert every state into a rich representation embedding and a history-aware decoder to predict the next action from the context of all previous states and actions. For effective gripper manipulation, inverse kinematics automation is utilized, enabling action prediction at a higher level that produces target joint positions, which are executed semiautomatically. The macro-step approach [19] reduces the frequency of decision-making actions and limits the growth of history context sequences.
While utilizing history awareness, these policies have a fixed action space fine-tuned to the robot’s inner structure. They do not generalize to navigational tasks, and macro-steps remain a separate consideration for such tasks.
Therefore, for mobile robot navigation tasks, our work proposes shifting to macro-steps: actions represented as target coordinates on a LIDAR-generated map instead of real-time driving signals. Reaching the target position on the map is removed from the responsibility of the decision-making system. Moreover, in our work, target points are predefined and described, shifting the task focus from low-level vision–language navigation (VLN) to high-level decision-making.
In our research, we develop compressed history-awareness with a dynamic action space and action annotation capability.
3. Materials and Methods
3.1. Problem Definition
Multimodal Representation. For decision-making policies in navigation tasks, we use a moment-wise multimodal representation $m = (i, p)$, adapted for processing by a vision–language multimodal transformer. It contains an image $i$ from the robot’s front-facing camera and a prompt $p$ containing natural language text. The multimodal representation is applied both to states ($s_t$) and to actions ($a_k$):
State representation: $s_t = (i_t, p_t)$, where:
$i_t$ represents the robot’s current front-facing camera observation and local LIDAR sensor observation;
$p_t$ is a copy of the global navigation task prompt given to the robot at the beginning of the experiment.
Action representation: $a_k = (i_k, p_k)$, where:
$i_k$ provides a visual reference for the action option;
$p_k$ contains a textual description of the action option.
Predefined environment data. Before navigation begins, we define static environment data: an occupancy grid map $M$ and an action vocabulary $V_A = \{a_1, \dots, a_K\}$, where each action option $a_k$ is defined by:
$(x_k, y_k, \theta_k)$: coordinates and orientation on the map $M$;
$m_k = (i_k, p_k)$: the multimodal representation of the action option, serving as a reference for the decision-making model.
Multimodal representations of action options in the action vocabulary are not tied to any timestamp. The visual part serves as an example of the surrounding environment, while the textual prompt describes the relative location in the environment. Thus, selecting action $a_k$ means forming a command for the navigation planner to move to the corresponding coordinates $(x_k, y_k, \theta_k)$.
Decision-Making as a Sequence Task. In addition to the predefined occupancy map and action vocabulary, in a dynamic environment, observable data are split into either a state $s_t$ or an action $a_t$ related to timestamp $t$. States and actions are resolved in pairs attached to the same timestamp, implying minimal observation change during the time between the captured state and action. We define autonomous navigation as a sequential decision-making task in which, at timestamp $t$, the multimodal representation of the current state $s_t$ and the history of all previous states and actions in the episode, denoted as $h_t = \{s_1, a_1, \dots, s_{t-1}, a_{t-1}\}$, are accessible. Each episode starts with the submission of a textual navigation task, which triggers the first state capture, and ends with the selection of an end-of-sequence action option. The number of states and actions in an episode is equal, and the end-of-sequence token has no multimodal representation.
Policy Objective. The goal of our policy is to map the current state and history into navigation actions, dynamically selecting the most appropriate action from the given action vocabulary:

$$a_t = \pi_\theta(h_t, s_t, V_A),$$

where:
$\pi_\theta$ is the history-aware decision-making policy parametrized by $\theta$;
$h_t = \{s_1, a_1, \dots, s_{t-1}, a_{t-1}\}$ is the state–action history up to time $t$ in multimodal representation;
$s_t$ is the current state’s multimodal representation;
$V_A$ is the action vocabulary;
$a_t \in V_A$ is the selected action option, with a multimodal representation used by the policy and with the coordinates and orientation required by the navigation planner, supplied by the occupancy grid map.
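As a reading aid, the definitions above map naturally onto a few container types. The sketch below is our illustration, not the paper’s code: all names are illustrative, and the policy itself is left abstract.

```python
from dataclasses import dataclass
from typing import Any, List, Tuple

@dataclass
class Multimodal:
    """Image-prompt pair m = (i, p); the shared format for states and actions."""
    image: Any   # front camera frame (states also append the LIDAR map render)
    prompt: str  # task prompt (state) or location description (action option)

@dataclass
class ActionOption:
    """One entry a_k of the predefined action vocabulary V_A."""
    pose: Tuple[float, float, float]  # (x_k, y_k, theta_k) on occupancy map M
    annotation: Multimodal            # m_k, the reference annotation

def policy_step(history: List[Multimodal],   # h_t = {s_1, a_1, ..., s_{t-1}, a_{t-1}}
                state: Multimodal,           # s_t
                vocabulary: List[ActionOption]) -> int:
    """pi_theta(h_t, s_t, V_A): returns the index of the selected option."""
    raise NotImplementedError  # realized by the model of Section 3.6
```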
3.2. State
Each state is an image–prompt pair, containing an observation of the environment and a copy of the given task prompt. The task prompt is a natural language instruction provided to the robot at the beginning of the episode, specifying the navigation goal: it describes the desired destination or the sequence of actions the robot should take. The image consists of two concatenated parts: an image from the front camera of the mobile robot and a LIDAR point cloud map of the surroundings with an “arrow” anchor clarifying the robot’s relative localization on the map (Figure 1a). Unlike LIDAR, which provides relative localization context and obstacle profiling, the camera enables contextual perception, allowing the robot to interpret visual cues that influence navigation decisions, especially for navigational tasks conditioned on visual observations.
3.3. Actions
Low-level deterministic navigation. The task of LIDAR-equipped navigation of mobile robots is deterministic and well established. It is beneficial to use classical deterministic methods rather than iterative machine learning models for path planning and motion control for several reasons. The classical approach requires neither data collection nor trial-and-error iterations with a reward loop tied to achieving the navigation goal or avoiding obstacles. It reduces the dependency on machine learning training to high-level decision-making, thereby lowering dataset requirements. Sequences of states and actions are shortened through the implementation of high-level decision-making. The independent navigation system performs real-time path routing and driving while avoiding obstacles, with an adjustable execution rate.
LIDAR-based pathfinding and navigation are supported within the ROS (robot operating system) [20]. An occupancy map of the environment is generated by a simultaneous localization and mapping (SLAM) algorithm [21]. The navigation system includes a global planner, a local planner, and a PID (proportional–integral–derivative) controller [22]:
The global planner (the global_planner ROS package) finds a route from the current robot position to the target point on the occupancy map, commonly using the A* algorithm [23]. A* is preferred over Dijkstra’s algorithm [24] due to its reduced search time: Dijkstra’s algorithm visits all nodes while computing the total cost from the starting point, whereas A* visits nodes in order of their heuristic function values, often the Euclidean distance to the goal (see the sketch after this list).
The local planner (dynamic window approach, DWA) [25] evaluates possible translational and rotational velocities within a limited time window and selects the best option for following the global path while avoiding obstacles.
The PID (proportional–integral–derivative) motion controller [22] adjusts these velocity commands to ensure smooth execution and correct trajectory deviations before they are translated into differential drive signals for the robot’s motors.
Detailed considerations of path planning and motion control within the ROS navigation stack can be found in Zhao et al. [26].
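Our work delegates global planning to the ROS global_planner package rather than implementing it. Purely to illustrate the search-order difference discussed above, the following is a minimal grid A* sketch; 4-connected moves, unit step cost, and a Euclidean heuristic are our assumptions, not the package’s exact configuration.

```python
import heapq
import itertools
import math

def astar(grid, start, goal):
    """Minimal A* on a 2D occupancy grid (0 = free, 1 = occupied).
    Expands nodes in order of f = g + h, with h the Euclidean distance."""
    h = lambda p: math.dist(p, goal)
    counter = itertools.count()               # tiebreaker for heap comparisons
    frontier = [(h(start), next(counter), 0.0, start, None)]
    parent, best_g = {}, {start: 0.0}
    while frontier:
        _, _, g, node, prev = heapq.heappop(frontier)
        if node in parent:                    # already expanded at lower cost
            continue
        parent[node] = prev
        if node == goal:                      # walk parents back to the start
            path = [node]
            while parent[path[-1]] is not None:
                path.append(parent[path[-1]])
            return path[::-1]
        x, y = node
        for nxt in ((x + 1, y), (x - 1, y), (x, y + 1), (x, y - 1)):
            if (0 <= nxt[0] < len(grid) and 0 <= nxt[1] < len(grid[0])
                    and grid[nxt[0]][nxt[1]] == 0):
                ng = g + 1.0                  # unit cost per grid move
                if ng < best_g.get(nxt, float("inf")):
                    best_g[nxt] = ng
                    heapq.heappush(frontier,
                                   (ng + h(nxt), next(counter), ng, nxt, node))
    return None                               # goal unreachable
```

Replacing h with a constant zero recovers Dijkstra’s uniform expansion, which makes the runtime difference between the two planners visible even on small grids.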
High-level decision-making, where an action is selected as a target waypoint on the map and low-level path planning and motion control are delegated, brings the advantages of a modular approach. One such advantage is the flexibility of selecting the low-level algorithm for special needs such as instability of control signals and nonlinear environmental disturbances. Advanced nonlinear robust control techniques, such as the multilayer neurocontroller [27] and state-filtered disturbance rejection control (SFDRC) [28], can replace the classical PID controller in future work, enhancing the stability and robustness of robotic navigation systems.
High-level annotated actions. While it is possible to simultaneously search for a route and build the occupancy map using a LIDAR sensor, we prepare prior annotations for our experiments. Thus, we define actions as waypoints on the LIDAR-based occupancy map, annotated with an image and a textual description, ensuring consistency between state and action representations. The distinction is that actions are annotated prior to any experiment, whereas states are observed during the experimental episode.
3.4. Datasets
The reinforcement learning (RL) approach is widely used for robotic tasks involving agent–environment interactions. In this approach, the agent learns by exploring the environment, performing actions, and observing states, along with receiving rewards to estimate action quality.
However, in many robotic applications, the focus shifts to a supervised approach, where the objective is to predict the exact patterns of demonstrated behavior. This approach requires a dataset of state–action sequence episodes in which every action is considered correctly executed. We adopt this approach in the form of behavioral cloning [29]. Considering every action in the demonstration dataset correct and a prediction target, we do not provide any other quality estimation, such as a reward function.
In our case, the dataset consists of episodes of environment states in the form of a front camera image, a LIDAR point cloud map, and a task as a natural language prompt. Each state in an episode is paired with an action choice from the predefined annotated list. A state always consists of a single bundle of a camera image, a LIDAR map, and a prompt gathered at the moment the next action is expected. Semiautomatic navigation actions may take varying amounts of time, during which no data are gathered.
Gathering such a dataset in practice involves a preparation phase and the collection of demonstration episodes. During preparation, an occupancy map of the surrounding area is collected using a LIDAR sensor, covering the entire area available for navigation. In this environment, a set of points of interest is manually defined as potential navigation targets, with their relative coordinates and poses specified on the occupancy map. Each point of interest (action option) is described using a short textual prompt and visited by the mobile robot once to collect an image from the front camera view.
The collection of dataset episodes starts with sending a task prompt message and capturing the first state’s image from the front camera along with the relative pose from the occupancy map. The most appropriate action option is chosen from the described action vocabulary and recorded in the sequence. After performing a navigation step, the robot captures the next state and the process repeats until the end of the sequence, where a special action option is chosen in the case of successful episode completion. The robot moves to the starting point for the next episode, which begins with receiving a task prompt. During dataset gathering, the performed actions are considered correct with respect to the current observation, previously performed actions, prior observations, and action option annotations.
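For concreteness, a demonstration episode collected this way could be stored as shown below; the field and file names are illustrative, not the paper’s actual storage format. The task prompt is one of the examples used in Section 5.

```python
# Illustrative schema for one recorded demonstration episode.
episode = {
    "task_prompt": "Go to the fridge, then go to the right hall outside the lab.",
    "steps": [
        {
            "camera_image": "ep012/step00_cam.png",  # front camera frame
            "lidar_map":    "ep012/step00_map.png",  # occupancy map + pose arrow
            "pose": [1.4, -0.3, 1.57],               # (x, y, theta) on the map
            "action_id": 2,                          # index into action vocabulary
        },
        # ... one entry per state-action pair, captured when an action is chosen
    ],
    "final_action": "end_of_sequence",  # special option closing the episode
}
```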
3.5. Dynamic Action Space
For navigation approaches that strongly rely on environment annotation, we must state an additional requirement: dynamic action space adaptivity for our model. This is necessary for updating the annotation of an existing environment, such as adding new action options to the existing action vocabulary. Similarly, different environments should have different annotated action vocabularies with corresponding possible poses to achieve within them. More generally, the ability of a model to flexibly adapt to switching between action spaces is a promising feature for multidomain and multitask learners.
3.6. Model
Our model consists of the following components:
1. Moment-Wise Multimodal Transformer Encoder. This accepts a prompt–image pair and produces a rich representation embedding as a single token. It works uniformly with states and actions represented as image–prompt pairs. From annotated action options, it creates embeddings and forms a so-called action vocabulary (Figure 2a).
Training demonstration episodes are structured as sequences of states and actions. States consist of camera images and surrounding occupancy point clouds captured during semiautomatic navigation to predefined target points on the map, while actions are identifiers of those points in the action annotation list. The moment-wise encoder processes a dataset of state–action sequences and annotated action options to generate embeddings for every action option and state in the episode, representing each as a single token embedding.
We adopt vision–language transformers (ViLT) [30] and FLAVA [31], pretrained for visual question answering (VQA), as baseline models.
2. History-Aware Causal Decision-Making Transformer. Based on the history of all previous states and actions (with future states and actions masked), this predicts the next action, agnostic of action options, in the form of a transformer token of the same shape (Figure 2b).
This component consists of self-attention transformer layers with a causal mask, which hides future (later-in-sequence) tokens in attention computation at every position. It is initialized without pretrained checkpoints, and the number of attention layers is treated as a training hyperparameter. After passing through the layers, each token is generally considered a prediction of the next one; in our case, we use only the predictions of the next actions. We utilize the GPT (generative pretrained transformer) architecture [32] as the causal decision-making transformer.
3. Attention Score Classifier. From the action option-agnostic prediction, this selects the most relevant token in the action vocabulary. Although this is still a classification task, it differs from traditional methods because the number of action classes is dynamic (Figure 2c). Every action prediction and every action option from the vocabulary is a numeric vector: a transformer-produced embedding. We propose using attention scores with the action-agnostic prediction as a query (Q) and the stacked action vocabulary embeddings as a key matrix (K), as in Equation (3). Attention scores are part of the transformer’s attention mechanism [1] and enable self-attention to work with various input sequence lengths; we utilize this ability to handle choices among varying numbers of options. The attention scores, a vector whose length equals the number of key vectors, can be used as logits for classification. This approach is trainable if the queries (Q) and keys (K) have trainable parameters (weights), and it is scalable, allowing the insertion of self-attention layers [1] before the attention score classifier:

$$\text{AttentionScore}(Q, K) = \frac{QK^\top}{\sqrt{d_k}} \quad (3)$$

Here:
Q = query matrix;
K = key matrix;
$d_k$ = dimensionality of the key vectors (a scalar).

Practical implementation of the dynamic action space in the proposed framework includes separate on-demand calls to the moment-wise encoder to form single-vector embeddings for action options from their multimodal vision–language annotations. A call for new action option embeddings may be triggered by any change in the action vocabulary, including modification of an action option annotation, the addition or removal of action options, or a full replacement of the action annotations when switching to a new environment. During the training phase, action embeddings may be recomputed at an arbitrary frequency to update the representations. This approach provides flexibility in the usage of the framework: changes in the action space introduce neither architectural changes to the model nor new trainable parameters, demonstrating its adaptability to dynamic action spaces.
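To make components 2 and 3 concrete, the sketch below shows one way the causal decision transformer and the attention score classifier could be wired in PyTorch. It is a simplified reading of the paper, not the authors’ code: the moment-wise encoder is assumed to already produce (batch, t, d) token sequences, the GPT block is approximated with standard PyTorch layers plus a causal mask, and d = 768 follows the paper.

```python
import torch
import torch.nn as nn

D = 768  # token dimension, matching the ViLT/FLAVA baselines

class AttentionScoreClassifier(nn.Module):
    """Scores an action-agnostic prediction against a variable-size vocabulary.
    Logits are scaled dot products QK^T / sqrt(d_k), as in Equation (3); only
    the Q/K projections are trainable, so the vocabulary may grow or shrink
    without any architectural change."""
    def __init__(self, d: int = D):
        super().__init__()
        self.q_proj = nn.Linear(d, d)   # trainable query projection
        self.k_proj = nn.Linear(d, d)   # trainable key projection
        self.scale = d ** -0.5

    def forward(self, pred: torch.Tensor, vocab: torch.Tensor) -> torch.Tensor:
        # pred: (batch, d) action prediction; vocab: (n_options, d) embeddings
        q = self.q_proj(pred)                     # (batch, d)
        k = self.k_proj(vocab)                    # (n_options, d)
        return (q @ k.T) * self.scale             # (batch, n_options) logits

class DecisionPolicy(nn.Module):
    """Causal decoder over the state-action token sequence, then the classifier."""
    def __init__(self, n_layers: int = 6, d: int = D):
        super().__init__()
        layer = nn.TransformerEncoderLayer(d, nhead=12, dim_feedforward=4 * d,
                                           batch_first=True)
        self.decoder = nn.TransformerEncoder(layer, n_layers)
        self.classifier = AttentionScoreClassifier(d)

    def forward(self, seq: torch.Tensor, vocab: torch.Tensor) -> torch.Tensor:
        # seq: (batch, t, d) interleaved state/action embeddings produced by
        # the moment-wise encoder; the causal mask hides future positions.
        t = seq.size(1)
        mask = torch.triu(torch.full((t, t), float("-inf"), device=seq.device),
                          diagonal=1)
        out = self.decoder(seq, mask=mask)        # (batch, t, d)
        return self.classifier(out[:, -1], vocab) # logits for the next action
```

A call such as `policy(seq, vocab_embeddings)` is then repeated with freshly re-encoded vocabulary embeddings whenever the action set changes, which is what keeps the action space dynamic without retraining.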
4. Experimental Setup
Navigation experiments were conducted in two domains.
Gazebo virtual environment—a simulation of the robot and the student office. In the ROS-based virtual environment engine Gazebo, we built a robot simulation that includes simplified body joints, differential drives, a camera, and a LIDAR sensor (Figure 3). For the environment, we used a LIDAR occupancy grid gathered in the student office to represent inelastic walls. The environment utilized the default physics engine of Gazebo and was modified with standard library visual and physical objects, such as a fridge and a shelf.
A real-world environment—the student office room. In this domain (Figure 4), the Xiaor Geek mobile robot (Figure 5), equipped with a camera and LIDAR and running ROS (robot operating system), was used. The robot was connected to a remote server via Wi-Fi, sending sensor observations (camera image stream, LIDAR point cloud occupancy map, and odometry measurements) and receiving control signals as differential drive voltages.
The attention score classifier output logits are treated as multiclass classification logits and optimized by minimizing the categorical cross-entropy loss function $\mathcal{L}_{CE}$, defined as:

$$\mathcal{L}_{CE} = -\sum_{c=1}^{C} y_c \log \hat{y}_c,$$

where:
$y_c$ is 1 for the true label and 0 otherwise (one-hot encoding);
$\hat{y}_c$ is the predicted probability of class $c$;
$C$ is the total number of classes.
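In behavioral cloning terms, the demonstrated action index serves as the class label. A minimal PyTorch rendering of this loss follows; the tensor contents are placeholders, with five options chosen to match the real-world setup described later.

```python
import torch
import torch.nn.functional as F

# Attention-score logits over the current action vocabulary; the number of
# options may change between environments without altering any model weights.
logits = torch.randn(4, 5, requires_grad=True)  # (batch, n_options), stand-in
target = torch.tensor([2, 0, 4, 1])             # demonstrated action indices
loss = F.cross_entropy(logits, target)          # categorical cross-entropy
loss.backward()
```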
For machine learning computations, we utilized two Nvidia RTX A4500 GPU units. For training with the AdamW optimizer, we used cycles of learning-rate cosine annealing with a decay factor of 10 over 100 epochs, followed by a constant learning rate. These cycles are repeatable, with the initial learning rate dropping by a factor of 10 between cycles. The warmup phase uses a linear increment to the initial learning rate over 10 epochs. For the baseline model, we used the multimodal encoder ViLT [30] with a token dimension (d) of 768. For the state–action decoder, we used GPT [32] without pretraining. State and action tokens were assigned separate modality-encoding vectors and distinct 1D positional encodings.
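One way to realize the described schedule (linear warmup over 10 epochs, then cosine annealing with a factor-of-10 decay over 100 epochs) with standard PyTorch schedulers is sketched below; the initial learning rate of 1e-4 is purely illustrative, not the paper’s value.

```python
import torch

model = torch.nn.Linear(768, 768)  # stand-in for the full model
opt = torch.optim.AdamW(model.parameters(), lr=1e-4)  # illustrative initial rate

warmup = torch.optim.lr_scheduler.LinearLR(
    opt, start_factor=0.1, end_factor=1.0, total_iters=10)  # 10-epoch warmup
cosine = torch.optim.lr_scheduler.CosineAnnealingLR(
    opt, T_max=100, eta_min=1e-5)  # anneal by a factor of 10 over 100 epochs
schedule = torch.optim.lr_scheduler.SequentialLR(
    opt, schedulers=[warmup, cosine], milestones=[10])

for epoch in range(110):
    # ... one training pass over the demonstration episodes ...
    schedule.step()  # epoch-wise schedule update
```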
5. Results
In the Gazebo virtual environment, we gathered a dataset of 100 episodes of simple navigational tasks such as “go to the fridge.” The virtual environment was designed to closely resemble the real-world domain, with the LIDAR occupancy map demonstrating an almost-perfect match to the real-world environment, while visual complexity was kept minimal. In this domain, two points of interest were defined as action options: “position facing the fridge and rack” and “position outside the lab facing the hallway.”
The primary objective in this initial stage of experimentation was to achieve 100% accuracy in both training and validation, which served as a benchmark for potential baseline model selection. This goal was intended to establish the feasibility of a small GPT-based decision-making policy in controlled, low-variability conditions before extending the approach to more complex real-world environments.
For the experiments, we used two baseline encoders trained on visual question answering (VQA):
ViLT (Vision-and-Language Transformer) [30]—a lightweight and efficient vision–language transformer that processes image–text pairs directly, without region-based feature extraction: 16×16-pixel image patches embedded with trainable linear projections are processed together with BERT-like language embeddings within a single lightweight transformer.
FLAVA (Foundational Language and Vision Alignment) [31]—a multimodal transformer that learns strong representations from both unimodal (unpaired images and text) and multimodal (image–text pairs) data. FLAVA consists of separate image and text encoders, both based on transformer architectures, followed by a multimodal transformer encoder that fuses the unimodal representations for integrated reasoning.
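For reference, a single-token multimodal embedding of the kind used by the moment-wise encoder can be obtained from an off-the-shelf ViLT checkpoint roughly as follows; the checkpoint name, file name, and the use of the pooled output are our assumptions, and the paper fine-tunes its own encoder weights.

```python
from PIL import Image
from transformers import ViltProcessor, ViltModel

# Assumed public checkpoint, used here only to illustrate the interface.
processor = ViltProcessor.from_pretrained("dandelin/vilt-b32-mlm")
model = ViltModel.from_pretrained("dandelin/vilt-b32-mlm")

image = Image.open("fridge_option.png")         # hypothetical annotation image
prompt = "position facing the fridge and rack"  # option description (Section 5)

inputs = processor(images=image, text=prompt, return_tensors="pt")
outputs = model(**inputs)
embedding = outputs.pooler_output               # (1, 768) single-vector embedding
```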
In the real-world domain, we prepared five distinct action options corresponding to target points of interest and a dataset of 77 episodes of complex navigation tasks, such as “Go to the fridge, then go to the right hall outside the lab. If there is a pink cone, turn left into the hall.” For both baseline encoders, we added our decision-making causal transformer and attention score classifier with random weight initialization. The results for different numbers of self-attention layers in the decision-making GPT are shown in Table 1.
To investigate the potential of sim-to-real transfer to achieve better convergence in the real-world setting, we trained our model in the same real-world domain after pretraining it in the virtual environment. The results of this approach are shown in Table 2.
We measured GPU VRAM usage during the training phase for both the ViLT and FLAVA baseline encoders with the same variety of GPT layer counts and a batch size of one (Table 3). Increasing the batch size adds VRAM usage primarily due to additional data size, while the memory required for the model’s weights and gradients remains constant. Adding more GPT layers results in an approximately linear increase in VRAM consumption, as expected (Table 3). While VRAM usage can be estimated theoretically, we conducted empirical measurements with Nvidia’s nvidia-smi tool to capture the combined impact of model activations, attention computations, and memory overhead, ensuring accurate insight into the practical feasibility of training on resource-constrained GPUs. The measured video memory usage demonstrates that the proposed training pipeline can be executed on systems with a single 8 GB GPU, in line with widely available consumer-grade GPUs.
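For reproducibility, such measurements can be taken during training by combining PyTorch’s allocator statistics with the nvidia-smi query interface; a minimal sketch:

```python
import subprocess
import torch

# PyTorch's view of peak memory held by its caching allocator on GPU 0:
peak_gib = torch.cuda.max_memory_reserved(0) / 2**30
print(f"peak reserved by this process: {peak_gib:.2f} GiB")

# Device-wide usage as reported by the driver (the tool used for Table 3):
result = subprocess.run(
    ["nvidia-smi", "--query-gpu=memory.used", "--format=csv,noheader"],
    capture_output=True, text=True)
print(result.stdout.strip())
```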
6. Ablation Study
The classical multilayer perceptron (MLP) classification approach is not appropriate for tasks with a changing number of classes, because its weights are tied to specific class outputs, making it unable to adapt dynamically. The closest comparison to our proposed attention score method is the cosine similarity metric, which also relies on a dot product between vectors. In our case, this dot product occurs between a query vector and a key matrix, where each key represents a candidate action embedding.
The key difference lies in how the query and keys are formed. While cosine similarity uses fixed embeddings, the attention score mechanism uses trainable parameters to generate both the query and key vectors. This potentially allows the attention score method to adaptively learn relationships between predictions and dynamically changing action options, which is not achievable with static similarity metrics. To validate this, we performed an ablation study replacing the attention score mechanism with cosine similarity on the same classification task using the ViLT baseline model (Table 4). The results clearly demonstrate the limitations of cosine similarity: while it achieves acceptable performance in the simpler virtual environment with only two action options, its performance drops significantly on the real-world dataset with five action options.
7. Discussion
7.1. Trainable Parameter Distribution
To estimate the trainable parameters in our proposed model, we use a conservative approximation focused on the quadratic scaling of parameters in a standard transformer layer. Specifically, the dominant part of the parameter count can be approximated as $12d^2$ per layer, where $d$ is the hidden dimension size:
$4d^2$ accounts for the self-attention mechanism (query, key, and value projections, along with the output projection);
$8d^2$ represents the parameters of the two fully connected layers in the position-wise feedforward network, whose hidden layer typically uses a dimension of $4d$.
For simplicity, we ignore smaller contributions, such as the 2d parameters for layer normalization and potential projections of image embeddings or token embeddings. Additionally, optimizations like parameter sharing or mixed precision computations are not included in this estimation.
Both encoder models, ViLT and FLAVA, utilize BERT-style vocabulary projections from textual tokens into d-dimensional embeddings, but only FLAVA’s projections are trainable. To reflect this, we include an additional term of $V \cdot d$, where $V$ is the vocabulary size of the language embeddings, in FLAVA’s trainable parameter estimate.
In our experiments, the hidden dimension size ($d$) is 768 and the vocabulary size ($V$) is 30,000. A dimension size of 768 is an established empirical compromise between computational cost and the representational capacity of embeddings in transformer encoders [1]. The same dimension size $d$ is used in both FLAVA and ViLT and was adopted for our causal transformer and attention score classifier to ensure consistency. The vocabulary size of 30,000 tokens derives from BERT’s original WordPiece [33] tokenization, ensuring compatibility with pretrained models.
Using our conservative parameter estimation method, we calculate the trainable parameters for the baseline models and our proposed components as follows:
ViLT: a transformer with 12 attention layers (Table 5).
FLAVA: three components—12 layers for vision, 12 layers for natural language, and 6 multimodal layers, together with vocabulary projections.
Decision-Making GPT: For our history-aware GPT decision-making component, six-, seven-, and eight-layer options were used.
Attention Score Classifier: The attention score classifier adds a lightweight $2d^2$ parameter contribution.
The estimation of trainable parameters highlights the computational efficiency of our proposed framework. While the baseline models ViLT and FLAVA require 84.9 M and 235.3 M parameters, respectively, our decision-making GPT remains lightweight at 42.5 M–56.6 M parameters, depending on the number of layers. The attention score classifier adds a compact overhead of 1.2 M parameters, keeping its resource impact minimal while contributing significantly to the results.
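These totals follow directly from the stated approximation; a few lines of arithmetic reproduce them (rounding aside):

```python
d, V = 768, 30_000
per_layer = 12 * d * d            # 4d^2 attention + 8d^2 feedforward

vilt = 12 * per_layer                        # ~84.9 M
flava = (12 + 12 + 6) * per_layer + V * d    # ~235 M incl. V*d projections
gpt = {n: n * per_layer for n in (6, 7, 8)}  # ~42.5 M / ~49.5 M / ~56.6 M
classifier = 2 * d * d                       # ~1.2 M (Q and K projections)

for name, count in [("ViLT", vilt), ("FLAVA", flava), ("GPT-6", gpt[6]),
                    ("GPT-8", gpt[8]), ("classifier", classifier)]:
    print(f"{name:>10}: {count / 1e6:6.1f} M")
```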
7.2. Inference Time and Real-Time Feasibility
We conducted an empirical analysis of inference latency in the real-world environment during the forward pass of the framework, focusing on its real-time deployment feasibility. For ViLT-based models, the forward pass time was consistently under 0.23 s, while for FLAVA-based models, it remained under 0.35 s. Notably, increasing the number of GPT layers (from six to eight, as tested in our other experiments) did not lead to significant changes in latency. Similarly, increasing the state–action sequence length to 100 elements in embedding form resulted in no measurable slowdown. However, if full multimodal representations (image + text) were retained instead of compressed embeddings, a linear increase in latency was observed, approximately 0.07 s per additional state–action pair. For example, with the ViLT encoder, a history of 10 such multimodal elements results in a forward pass time that can exceed 1 s. Since full multimodal retention is used during dataset collection and not required during policy inference, this overhead does not impact deployment. These results demonstrate the computational efficiency of our approach for real-time use in robotics settings.
7.3. Scalability
The scalability of our approach is relevant for both scaling up and scaling down the model to fit different computational requirements. Scaling up allows for the use of larger, more advanced vision–language models as baselines, which improves world knowledge and overall quality. Scaling down is essential for deployment in onboard mobile robot equipment, where computational efficiency is a priority. Our key addition, the attention score classifier, is inherently scalable due to its use of the transformer attention mechanism. Since the classification logic is based on attention, it maintains flexibility with respect to the dimension size of the transformer. This ensures that changes to the baseline model’s size or complexity are naturally absorbed by the classifier. The number of GPT layers can also be adjusted.
7.4. Limitations
The experiments presented in the paper used relatively small datasets, with 77 episodes in the real-world environment and 100 episodes in the virtual domain. We consider the results as a proof of concept, and future work will require larger datasets to validate the stability of the performance and support cross-domain generalization.
Our model is designed for instruction-based decision-making, but it is limited in its ability to support two-way interaction, such as answering with text or forming a dialogue. This limitation comes from our choice of a vision–language multimodal encoder-only model as the baseline. The main reason for this design choice is the need to encode both states and actions as single tokens, which simplifies building a sequential state–action history. However, most vision–language models capable of generating text responses, such as encoder–decoder or decoder-only models, use multiple token embeddings instead of a single token. These embeddings are usually comparable in number to the encoder’s input sequence length, making them difficult to fit into our single-token approach.
Another limitation of our approach is the manual description of action options in terms of their number, position on the map, and text descriptions. This process requires human input during the preparation stage. In future work, this limitation could be addressed by enabling the system to autonomously identify points of interest during the exploration phase, as the robot gathers the LIDAR occupancy map. This limitation is specific to the applied task of mobile robot navigation and does not necessarily affect more general decision-making policies, where action options and their dynamic changes are handled differently.
8. Conclusions
Our work introduces a scalable framework for decision-making tasks with dynamically varying action spaces. It adopts a widely pretrained vision–language multimodal encoder to form rich moment-wise single-embedding representations and adds a history-aware causal transformer with the proposed attention score classifier block. Its efficiency is demonstrated in the context of instruction-based navigation for autonomous mobile robots. To address the challenge of dynamic class numbers, we designed the attention score mechanism, a trainable classification layer capable of efficiently selecting actions based on adaptive relationships between predictions and dynamically annotated options.
The experiments in both simulated and real-world environments validate the effectiveness of our approach. Comparative evaluations using ViLT and FLAVA as baseline multimodal transformers highlight the superior performance of FLAVA within the constraints of a single 8 GB VRAM system, showcasing the method’s practical feasibility. Additionally, our ablation study reveals the critical role of trainable attention scores in achieving robust performance over static similarity metrics such as cosine similarity. We believe that the approach of describable and dynamic action spaces is not limited to our application and is crucial for developing multitasking or domain-switching systems.