Article

Mamba-VNPS: A Visual Navigation and Positioning System with State-Selection Space

1 College of Air Traffic Management, Civil Aviation Flight University of China, Guanghan 618307, China
2 School of Electronic Information Engineering, Beihang University, Beijing 100191, China
* Author to whom correspondence should be addressed.
These authors contributed equally to this work.
Drones 2024, 8(11), 663; https://doi.org/10.3390/drones8110663
Submission received: 5 October 2024 / Revised: 30 October 2024 / Accepted: 7 November 2024 / Published: 10 November 2024
(This article belongs to the Section Innovative Urban Mobility)

Abstract

This study addresses the challenge of autonomous UAV navigation in urban air mobility environments where GPS is unavailable. Unlike traditional localization methods that rely heavily on GPS and pre-mapped routes, Mamba-VNPS leverages a self-supervised learning framework and advanced feature-extraction techniques to achieve robust real-time localization without dependence on external signals. The results show that Mamba-VNPS significantly outperforms traditional methods across multiple metrics, including localization error. These innovations provide a scalable and effective solution for UAV navigation, enhancing operational efficiency in complex spaces. This study highlights the urgent need for adaptive positioning systems in urban air mobility (UAM) and provides a methodology for future research on autonomous navigation technologies in both aerial and ground applications.

1. Introduction

Autonomous driving has experienced significant growth, largely driven by advancements in artificial intelligence, sensor fusion, and real-time data processing [1]. While much of the early development has focused on ground vehicles—particularly cars, which must navigate through complex urban and rural environments—autonomous driving as a concept extends far beyond automobiles. In fact, the foundational technologies that power self-driving cars, such as computer vision, sensor integration, and advanced navigation algorithms, are laying the groundwork for future applications in areas like unmanned aerial vehicles (UAVs) [2].
In ground-based autonomous systems, precise and reliable localization is critical for safe navigation. Traditional localization methods, such as GPS, LiDAR, and high-definition maps, have been integral to this effort [3]. However, these approaches face inherent challenges: GPS signals can be obstructed or degraded in urban areas with tall buildings, tunnels, or dense vegetation, while LiDAR and pre-mapped environments can struggle to cope with dynamic scenarios like road construction, unpredictable objects, or rapid environmental changes [4]. These limitations are amplified in highly dynamic environments, prompting a shift toward more versatile solutions.
As autonomous technology extends from cars to UAVs, new challenges emerge. Unlike ground vehicles that rely on two-dimensional navigation, UAVs must navigate complex three-dimensional environments. This includes dynamic airspaces filled with other vehicles, tall buildings, and ever-changing weather conditions. Localization for UAVs faces even greater demands; precise positioning is critical not only for safe navigation but also for obstacle avoidance in real-time flight. Traditional GPS systems become even less reliable in aerial environments where signal interference is common, while ground-based localization methods like LiDAR or pre-mapped routes are impractical for airborne navigation [5]. LiDAR proves impractical for airborne localization due primarily to its weight, which contradicts the design principles of UAVs aimed at lightness and agility. The payload capacity of most UAVs is limited to maintain maneuverability, and the inclusion of heavy sensors like LiDAR would increase both energy consumption and operational constraints. This weight-induced limitation not only undermines the UAV’s intended efficiency but also restricts flight time.
Urban air mobility (UAM), the future vision of urban transportation, further amplifies these challenges. In UAM, autonomous air vehicles will be expected to transport passengers and goods through dense urban airspaces, where traditional navigation technologies face severe limitations [6]. The necessity of real-time, vision-based localization in these aerial environments is paramount, but UAVs must overcome additional obstacles: rapidly shifting perspectives, significant changes in scale due to altitude variations, and environments cluttered with both stationary and moving obstacles. Unlike autonomous cars, which operate on relatively stable ground, UAVs must adapt to constantly changing viewpoints and unstructured surroundings [7].
In urban air mobility (UAM) environments, shielding by buildings and multipath interference often distort GPS signals and degrade localization accuracy, making it difficult for GPS-reliant drones to meet high-precision navigation requirements. Vision-based localization systems, which use onboard cameras to interpret the surrounding environment, are therefore emerging as a more adaptive approach. These systems reduce dependence on external signals and offer flexibility in handling dynamic situations. However, even vision-based methods face hurdles such as varying lighting conditions, repetitive visual patterns (e.g., long stretches of highway), and low-feature-density landscapes. Existing vision-based localization techniques, such as SLAM and VO, often struggle in these conditions. SLAM, for instance, requires significant computational resources, which can be prohibitive for the lightweight onboard systems commonly used in UAVs. Moreover, both SLAM and VO are sensitive to dynamic lighting conditions and unstructured obstacles, making them less reliable in unpredictable aerial environments [8]. Similarly, feature-extraction methods such as RepVGG, while effective at recognizing complex patterns, struggle with the computational demands of long-sequence data processing, particularly in real-time localization tasks [9]. Finally, offline map-training approaches, in which satellite maps are compressed and dimensionality-reduced to optimize memory usage, often depend on GPS corrections to handle outliers and thus lack the robustness needed for GPS-denied or signal-compromised scenarios [10].
In the context of autonomous navigation, each traditional method demonstrates utility, yet substantial limitations arise in large-scale, real-time environments, particularly within urban air mobility (UAM). With UAVs increasingly positioned as a cornerstone of future urban transportation, precise and efficient navigation within complex three-dimensional spaces has become imperative. UAVs must contend with dynamic conditions—constantly shifting perspectives, fluctuating lighting, and intricate aerial obstacles—underscoring the inadequacies of conventional vision-based systems. These systems often struggle to deliver reliable performance in the high-dimensional and dynamic settings that UAM applications demand, highlighting the need for more adaptive, robust localization solutions [11].
In response to these challenges, we propose Mamba-VNPS (A Visual Navigation and Positioning System with State-Selection Space), a novel system that integrates and enhances the strengths of these traditional approaches while addressing their limitations. Mamba-VNPS is designed to provide robust, efficient localization for both ground-based vehicles and UAVs in highly dynamic environments.
The key innovations of Mamba-VNPS are as follows:
1. Self-supervised learning framework: Mamba-VNPS introduces a self-supervised visual navigation framework designed specifically for UAVs. By leveraging satellite imagery and pre-learned features, it enables precise localization without requiring extensive pre-mapped environments or reliance on external signals such as GPS.
2. Mamba Block for lightweight and efficient feature extraction: Mamba-VNPS incorporates the Mamba Block, an optimized feature-extraction module that significantly improves efficiency when handling long-sequence image data. By addressing the inefficiencies of traditional neural networks and attention mechanisms, the Mamba Block enables fast, real-time navigation for aerial vehicles.
3. Dual feature-extraction architectures: To balance performance and accuracy, Mamba-VNPS employs two feature-extraction architectures. For large geospatial datasets, the Mamba Block efficiently processes map data, while during real-time inference a residual network maintains high localization precision without sacrificing speed.
4. Efficient feature vector matching: A highly efficient feature vector-matching system reduces the computational load by eliminating redundant matching data. Through sliding interpolation and similarity-matrix comparison, Mamba-VNPS achieves fast and accurate aerial navigation and positioning.
Mamba-VNPS addresses the limitations faced by today’s mainstream navigation and positioning systems for UAVs. By combining their strengths and mitigating their weaknesses, it provides a scalable, real-time positioning solution for a variety of autonomous applications. Whether it is autonomous driving or UAV navigation, Mamba-VNPS paves the way for more efficient, precise, and adaptable operations in increasingly complex environments.

2. Related Work

2.1. Visual Navigation and Positioning

In the domain of autonomous navigation and localization, the use of visual information has emerged as a critical area of research. Vision-based navigation and localization methods can be broadly categorized into two approaches: map-less (or map-free) methods and map-based methods [12]. This section provides a comprehensive review of these approaches, examining their respective techniques and contributions to the field.
Map-less localization methods, such as visual odometry (VO) [13,14] and simultaneous localization and mapping (SLAM) [15,16,17], rely solely on real-time visual data to estimate the position and orientation of the vehicle or unmanned aerial vehicle (UAV) without depending on pre-constructed maps. Visual odometry, for instance, is a technique that computes the trajectory of a moving platform by analyzing sequential images captured via onboard cameras. By tracking feature points across consecutive frames, VO estimates relative motion and navigates the environment. Despite its utility, VO is often challenged by rapid movements, lighting variations, and featureless regions, which can impact its accuracy and robustness [18].
Simultaneous localization and mapping represents a more holistic approach, addressing localization and map construction at the same time. SLAM techniques utilize visual inputs to build and update maps in real time. While SLAM methods are effective in dynamic environments, they face challenges related to high computational demands and difficulties in handling large-scale environments or significant changes in scale and perspective.
In contrast, map-based localization methods leverage preexisting maps or satellite imagery to enhance localization accuracy. This approach is particularly beneficial in scenarios where accurate maps are available. For example, recent work on UAV localization using autoencoded satellite images demonstrates how compressing and dimensionally reducing satellite imagery can address memory constraints and improve localization [19,20]. However, this method remains dependent on GPS information to resolve anomalies, limiting its generalizability across various environments.
Another significant contribution in this area is the integration of image embeddings with map embeddings to achieve global localization [21]. This method combines visual and map features to enhance localization accuracy, but its effectiveness hinges on the quality and resolution of the available map data. Similarly, GPS-denied UAV localization using preexisting satellite imagery focuses on overcoming GPS limitations by utilizing historical imagery for reliable localization. Although this approach provides a solution for GPS-denied scenarios, it is constrained by the accuracy and availability of the satellite images used [22].
The application of deep learning techniques has also made strides in this field. A deep convolutional neural network (CNN)-based framework for enhanced aerial imagery registration has been proposed to improve UAV geolocalization [23]. This framework leverages deep learning to enhance feature extraction and matching, offering more accurate geolocalization. Nevertheless, the reliance on extensive training data and computational resources poses a challenge for practical implementation. Additionally, iSimLoc utilizes a deep convolutional neural network (CNN) to address visual global localization in novel environments by employing simulated imagery to augment training data [24]. However, the reliance on simulated environments introduces potential discrepancies between synthetic and real-world conditions, which could reduce the system’s robustness in unpredictable, real-world UAM scenarios.
Recent studies have also explored edge computing solutions for real-time processing in UAV systems. For instance, a framework for resource-efficient visual localization has been proposed, utilizing edge computing to minimize latency while maintaining accuracy in dynamic environments [25]. However, this approach often faces limitations related to bandwidth constraints, potential latency due to network instability, and the challenge of ensuring consistent data privacy. Additionally, hardware constraints may restrict the processing capabilities at the edge, leading to trade-offs between performance and resource consumption [26].
In addition, recent advancements in deep reinforcement learning (DRL) have introduced promising approaches for visual localization. Techniques leveraging DRL can optimize UAV trajectory design and resource allocation dynamically. However, these methods often struggle with high computational demands and the need for extensive training in diverse environments, which may hinder their applicability in real-world scenarios [27,28]. While these approaches show potential, they remain susceptible to limitations such as dependency on accurate reward structures and the challenge of generalization across varying contexts.
In summary, map-less approaches such as visual odometry (VO) and SLAM provide adaptive, real-time navigation solutions but are often constrained by high computational complexity and vulnerability to environmental variations. In contrast, map-based techniques, which leverage pre-existing maps and satellite imagery, offer improved accuracy and reliability, though they remain limited by the quality and accessibility of these reference materials. This study emphasizes the importance of achieving high localization accuracy and the critical role of high-quality reference maps, particularly in UAV navigation scenarios. Recent advancements in edge computing can improve real-time processing for efficient visual localization under hardware constraints, though issues like bandwidth limitations, network latency, and hardware restrictions must be addressed to fully realize these benefits. Additionally, deep reinforcement learning methods have demonstrated potential in optimizing UAV trajectory design and resource allocation, yet they face substantial challenges due to intensive computational demands and the need for extensive, diverse training data. As the development of these methods progresses, it remains essential to innovate beyond their current limitations, driving forward the field of vision-based navigation and localization.

2.2. Deep Learning-Based Feature Extraction

Deep learning techniques have been widely adopted across various fields, demonstrating their versatility and effectiveness in addressing complex challenges. In image processing, notable architectures such as AlexNet [29], VGG [30], and ResNet [31] have revolutionized the way visual information is analyzed, achieving remarkable performance in tasks ranging from image classification to object detection. In the realm of natural language processing, recent advances include models like BERT [32], GPT-3 [33], and T5 [34], which have significantly improved the understanding and generation of human language. In speech recognition, recent breakthroughs include the Conformer model [35], which combines convolutional neural networks and transformers to enhance performance in automatic speech recognition, as well as manifold regularization-based deep convolutional autoencoders for unauthorized broadcasting identification [36].
Building on this foundation, deep learning-based feature extraction has significantly advanced in the context of autonomous navigation and localization. This evolution has driven innovation in the use of visual information across various platforms, including ground vehicles and unmanned aerial vehicles (UAVs). Convolutional neural networks (CNNs) have historically formed the backbone of feature extraction, leading to impressive performance in numerous applications. However, traditional architectures often struggle with scalability and efficiency, particularly in real-time systems where processing constraints are critical [37]. This challenge has paved the way for more efficient and advanced architectures, such as RepVGG [38], EfficientNet [39], and several variants of transformers [40], which have been successfully applied to feature-extraction tasks in autonomous navigation.
In addition to CNN-based architectures, transformer-based models have gained significant attention for feature extraction due to their ability to capture long-range dependencies in visual data. Vision transformers (ViTs) treat images as sequences of patches and utilize self-attention mechanisms to model global relationships. Unlike CNNs, which primarily focus on detecting local features, ViTs excel at capturing large-scale spatial dependencies, making them ideal for tasks such as UAV navigation in complex and dynamic environments. Recent transformer models have optimized both efficiency and performance. The Swin transformer introduces a hierarchical design with shifted windows, balancing local and global contextual understanding while reducing computational complexity [41,42]. Another notable model, the Convolutional Vision Transformer (CvT), combines local feature extraction from CNNs with global attention mechanisms, improving feature extraction in diverse environments [43,44]. The Data-efficient Image Transformer (DeiT), on the other hand, is tailored for effective training with smaller datasets, which is particularly beneficial for UAV systems that may operate with limited data availability [45,46].
Recent advances in sequence modeling have introduced the Mamba architecture, which addresses the inefficiencies of traditional models in processing long sequences. Built on selective state-space modeling (SSM), Mamba offers significant computational advantages over CNNs and transformers by achieving linear-time complexity, making it highly scalable and efficient for real-time applications like autonomous navigation [47]. Its application to remote sensing has been particularly transformative, improving both computational efficiency and analytical accuracy, while establishing Mamba-based methods as a focal point for handling the unique demands of large-scale remote sensing data [48]. This architecture excels in extracting local and global features with a minimal computational overhead, a critical factor for real-time applications such as UAV-based navigation and environmental monitoring.
In multi-modal data fusion, Mamba demonstrates its versatility with models like FusionMamba, which effectively combines high-resolution spatial images with lower-resolution spectral data, significantly enhancing performance in tasks like hyperspectral imaging [49]. The architecture’s ability to maintain efficiency while improving the quality of fused outputs is particularly beneficial in UAV navigation, where real-time decisions are required based on complex visual data [50]. Moreover, Mamba’s contributions extend to semantic segmentation with the introduction of Samba, a framework built specifically for very high-resolution (VHR) remote sensing imagery. Samba optimizes Mamba’s state-space model for dense prediction tasks, making it particularly suited for land-cover classification and urban mapping while maintaining computational efficiency [51]. The RSMamba variant takes this further by incorporating an omnidirectional selective scanning module (OSSM), allowing for the extraction of large-scale spatial features from multiple directions. This makes RSMamba especially effective in dense prediction tasks for large remote sensing images, such as land-use classification and environmental monitoring, where both efficiency and accuracy are essential [52]. Mamba’s flexibility is exemplified in pan-sharpening tasks through the development of Pan-Mamba, which efficiently fuses high-resolution panchromatic and multispectral data [53].
Overall, the Mamba architecture, with its innovative SSM-based approach, is advancing efficiency and accuracy in remote sensing, paving the way for future research and applications in large-scale image analysis. However, further exploration is needed to investigate its potential applications in UAV visual localization.

3. Problem Formulation

Accurate and efficient visual localization in urban air mobility (UAM) environments is essential for UAVs navigating in dense, GPS-denied urban areas. The core problem is to achieve precise localization by minimizing the root mean square error (RMSE) between the predicted and true positions of the UAV, subject to constraints on feature extraction, vector matching, and dynamic triplet-based learning.
The localization challenge is represented as the task of determining an optimal mapping function f(x), where x denotes the visual input data, with the goal of aligning feature vectors with predefined references while minimizing localization error. This objective is subject to the following three key constraints:
1. Feature (Q-constraint): To ensure robustness and consistency in localization, extracted features must be compressed into a 1 × 1000-dimensional vector. This dimensional constraint enables efficient computation while preserving critical localization details. Formally, letting φ(x) denote the extracted feature vector, the constraint can be expressed as follows:
\phi(x) \in \mathbb{R}^{1 \times 1000}
2. Dynamic triplet-based learning (L-constraint): The learning process should adaptively update feature representations to maintain robustness in changing urban landscapes. By employing a dynamic triplet-based learning approach, the model enhances its capability to distinguish between similar and dissimilar locations based on contextual changes. The triplet constraint can be formulated as follows:
\mathbb{E}\left[ \left\| f(x_a) - f(x_p) \right\|^2 + \delta - \left\| f(x_a) - f(x_n) \right\|^2 \right] \le 0
where x_a, x_p, and x_n denote the anchor, positive, and negative samples, respectively, and δ is the margin separating positive and negative pairs.
3. Vector matching (T-constraint): Given the need for rapid localization updates, each feature-matching operation must be completed within a predefined time limit T_max, allowing the UAV to adjust its position in real time. This ensures that localization calculations do not impede the UAV's responsiveness:
T(f(x)) \le T_{\max}
The localization task can thus be formulated as a constrained optimization problem, where the goal is to minimize the RMSE of the UAV's predicted position p_i relative to the ground-truth position \hat{p}_i:
\min_{f} \sqrt{ \frac{1}{N} \sum_{i=1}^{N} \left\| \hat{p}_i - p_i \right\|^2 }
This formulation aims to balance feature dimensionality, computational efficiency, and learning adaptability, ensuring that Mamba-VNPS can perform high-accuracy UAV localization in real time for UAM applications.
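To make the formulation concrete, the short Python sketch below evaluates the RMSE objective on toy positions and checks the Q- and T-constraints. The array shapes, the 50 ms time budget, and the placeholder extract_features function are illustrative assumptions rather than values taken from this paper.

```python
import time
import numpy as np

def extract_features(image: np.ndarray) -> np.ndarray:
    """Stand-in for the feature extractor phi(x); must return a 1 x 1000 vector (Q-constraint)."""
    rng = np.random.default_rng(0)
    return rng.standard_normal((1, 1000))

def rmse(predicted: np.ndarray, ground_truth: np.ndarray) -> float:
    """Localization objective: RMSE between predicted and ground-truth UAV positions."""
    return float(np.sqrt(np.mean(np.sum((predicted - ground_truth) ** 2, axis=1))))

frame = np.zeros((256, 256, 3))            # dummy onboard frame
t0 = time.perf_counter()
phi = extract_features(frame)
elapsed = time.perf_counter() - t0

assert phi.shape == (1, 1000)              # Q-constraint: 1 x 1000 feature vector
assert elapsed <= 0.05                     # T-constraint: illustrative 50 ms budget per match

pred = np.array([[1.0, 2.0, 60.0], [2.0, 3.0, 61.0]])
true_pos = np.array([[1.1, 2.1, 60.2], [2.0, 2.9, 60.8]])
print(f"RMSE = {rmse(pred, true_pos):.3f} m")
```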

4. Method

In this section, we elaborate on the core components of the Mamba-VNPS (Visual Navigation and Positioning System with State-Selection Space) framework, which addresses the limitations of traditional UAV localization techniques. The proposed system is designed to offer robust, real-time, and precise localization capabilities in highly dynamic environments. By incorporating a self-supervised learning approach, efficient feature extraction modules, and a robust feature-matching strategy, Mamba-VNPS enables UAVs to navigate complex three-dimensional spaces with high accuracy.

4.1. Overview

The self-supervised learning framework of Mamba-VNPS leverages both pre-existing map data and real-time visual inputs to enable accurate localization without relying on external signals, such as GPS. As shown in Figure 1, this dual-phase framework integrates offline baseline data processing with online real-time image analysis, enabling a robust and adaptive localization process for UAVs navigating urban environments.
In the offline phase, global features are extracted from a pre-acquired baseline map, typically derived from satellite imagery. This map undergoes comprehensive feature extraction to yield a set of baseline vectors, denoted as V_base ∈ R^{m×n×1000}, where each feature vector encapsulates key spatial attributes across the m × n grid of the map. Each element v_{i,j} = f(x_{i,j}) represents the 1000-dimensional feature vector at spatial location (i, j), derived through the feature-extraction and dimensionality-reduction function f(·). This function maps high-dimensional data, X, to a lower-dimensional form while retaining essential spatial information, expressed as v_{i,j} = P^T · x_{i,j} + ε, where P is the projection matrix and ε denotes a minor reconstruction error. These baseline vectors are subsequently stored as reference points, optimized for memory efficiency and computational ease, to support real-time UAV localization.
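The following minimal sketch illustrates the offline construction of the baseline tensor V_base, projecting each flattened map tile x_{i,j} to a 1000-dimensional vector via v_{i,j} = P^T x_{i,j}. The grid size, tile dimensionality, and the random projection standing in for the trained extractor are assumptions made for illustration only.

```python
import numpy as np

# Offline baseline construction: one 1000-D reference vector per map grid cell.
m, n = 8, 10                   # map grid (assumed)
d_high = 32 * 32 * 3           # flattened tile dimensionality (assumed)
d_low = 1000                   # target feature dimensionality

rng = np.random.default_rng(42)
P = rng.standard_normal((d_high, d_low)) / np.sqrt(d_high)   # projection matrix (illustrative)

tiles = rng.random((m, n, d_high))             # stand-in for satellite-map tiles x_{i,j}
V_base = np.einsum("mnd,dk->mnk", tiles, P)    # v_{i,j} = P^T x_{i,j}

print(V_base.shape)                            # (8, 10, 1000)
```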
In the online phase, the UAV continuously captures images of its surroundings. These real-time images are processed through an inference-only feature extraction module, which has been optimized for rapid computation, ensuring responsiveness. Unlike offline feature extraction, which processes the entire dataset, this online module selectively extracts features necessary for timely navigation decisions. The extracted features are then matched with the pre-extracted baseline vectors through a sliding-window similarity-comparison mechanism. This process is fundamental for accurate posture and position estimation, as the matching of real-time features with the baseline vectors determines the UAV’s precise localization relative to the map.
Mamba-VNPS’s self-supervised framework autonomously refines its localization by updating the feature-matching process based on real-time input, thereby adapting to environmental variations, such as changes in lighting or occlusions, without external supervision. By incorporating satellite imagery and baseline maps, this system reduces dependence on constant GPS corrections, offering a scalable, robust solution for urban aerial navigation.
This dual-phase architecture ensures that Mamba-VNPS maintains both high localization accuracy and computational efficiency, closing the loop between image feature extraction and navigational decision-making and making it highly suitable for real-time UAV navigation in diverse, complex environments.

4.2. Dual Feature-Extraction Architectures

The Mamba-VNPS framework operates with two distinct feature-extraction architectures, each optimized for different phases of the localization process: the baseline feature extraction during the offline phase, and the real-time image feature extraction during the online phase. These two architectures are intrinsically linked, enabling the system to navigate efficiently by leveraging the strengths of both pre-acquired spatial data and live visual inputs. The fundamental difference between these architectures lies in their approach to processing scale and computational demands, with the baseline extraction focusing on a comprehensive global representation and the real-time extraction prioritizing speed and efficiency for immediate decision-making.
As shown in Figure 2, in the offline phase, the baseline feature extraction network processes a global map, typically sourced from high-resolution satellite imagery, to generate a robust and expansive set of feature vectors representing the essential spatial characteristics of the environment. The architecture integrates Mamba Blocks, optimized to model long-range dependencies through state-space dynamics. The core operation of each Mamba Block at layer l can be represented by a state-space model, where the input feature map x_l is transformed through a convolutional kernel W_l and a non-linear activation function θ(·), expressed as follows:
h_l = \theta(W_l \ast x_l + b_l), \qquad y_l = h_l + \alpha\, h_{l-1}
where h_l denotes the hidden state, y_l denotes the output of the Mamba Block, and α is a learned parameter regulating the influence of previous-layer states. This recursive formulation enables the Mamba Block to capture long-range dependencies effectively. Additionally, residual connections allow the network to maintain gradient flow, enhancing training stability, expressed as follows:
z_l = y_l + x_l
To ensure multi-scale feature extraction, each Mamba Block employs a series of pooling operations, with the feature maps refined at multiple scales. Letting f_s^{(l)} denote the feature map at scale s in layer l, the aggregated multi-scale feature output F_l at layer l can be described as follows:
F_l = \sum_{s=1}^{S} \beta_s \cdot f_s^{(l)}
where β_s represents the weighting factor assigned to each scale, learned through backpropagation to prioritize informative scales. This multi-scale aggregation enables the baseline network to create a global representation of the spatial environment, which is further refined using principal component analysis (PCA) to reduce dimensionality while preserving spatial fidelity. The PCA transformation is given by the following:
Z = X W_{\mathrm{PCA}}
where X is the high-dimensional feature set, W_PCA is the PCA matrix containing the eigenvectors, and Z is the reduced-dimensional representation.
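A compact numerical sketch of the weighted multi-scale aggregation and the subsequent PCA reduction is given below. The number of scales, the fixed weights beta, the feature dimensions, and the use of a plain eigendecomposition for PCA are illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)
S, N, C = 3, 500, 256                            # scales, samples, channels (assumed)
f_scales = [rng.standard_normal((N, C)) for _ in range(S)]
beta = np.array([0.5, 0.3, 0.2])                 # scale weights (fixed here; learned in practice)

# F_l = sum_s beta_s * f_s^(l): weighted multi-scale aggregation
F_l = sum(b * f for b, f in zip(beta, f_scales))

# PCA: Z = X @ W_PCA, with W_PCA the leading eigenvectors of the covariance of X
X = F_l - F_l.mean(axis=0)
cov = X.T @ X / (N - 1)
eigvals, eigvecs = np.linalg.eigh(cov)
W_pca = eigvecs[:, np.argsort(eigvals)[::-1][:64]]   # keep 64 components (assumed)
Z = X @ W_pca
print(Z.shape)                                       # (500, 64)
```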
In contrast, the real-time feature extraction network processes the UAV's in-flight visual data, optimizing for low latency. This lightweight architecture reduces the number of convolutional layers and channel dimensions and, as detailed below, replaces the resource-intensive Mamba Blocks with residual, multi-scale processing to enhance inference speed. The feature map F_RT from this network, extracted at each time step t, is compared with the pre-computed baseline vectors F_baseline using a sliding-window approach. The similarity between the real-time and baseline features is evaluated through a normalized cosine similarity score S as follows:
S = \frac{\sum_{i=1}^{N} F_{\mathrm{RT},i}(t)\, F_{\mathrm{baseline},i}}{\sqrt{\sum_{i=1}^{N} F_{\mathrm{RT},i}(t)^2}\, \sqrt{\sum_{i=1}^{N} F_{\mathrm{baseline},i}^2}}
where N represents the dimensionality of the feature vectors. This similarity metric ensures robust matching under scale and perspective variations, which is critical for accurate localization.
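The normalized cosine similarity can be computed as in the following sketch, which scores one real-time feature vector against a set of flattened baseline vectors and returns the best-matching cell; the vector sizes and toy data are assumptions.

```python
import numpy as np

def cosine_similarity(f_rt: np.ndarray, f_base: np.ndarray) -> float:
    """S = <f_rt, f_base> / (||f_rt|| * ||f_base||)."""
    denom = np.linalg.norm(f_rt) * np.linalg.norm(f_base)
    return float(np.dot(f_rt, f_base) / denom) if denom > 0 else 0.0

rng = np.random.default_rng(1)
f_rt = rng.standard_normal(1000)            # online feature at time t
V_base = rng.standard_normal((80, 1000))    # flattened baseline grid vectors

scores = np.array([cosine_similarity(f_rt, v) for v in V_base])
best = int(np.argmax(scores))
print(f"best-matching baseline cell: {best}, S = {scores[best]:.3f}")
```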
The distinct yet complementary roles of the baseline and real-time feature extraction networks underscore the adaptability and precision of the Mamba-VNPS framework in UAV localization. The baseline network, constructed with Mamba Blocks, leverages state-space modeling and multi-scale feature aggregation to provide a comprehensive spatial representation, serving as a reliable foundation for accurate localization. In contrast, the real-time network achieves lightweight processing by bypassing Mamba Blocks entirely, focusing instead on streamlined multi-scale feature extraction. This skip connection mechanism enhances computational efficiency by directly processing feature maps in multi-scale modules while excluding the resource-intensive Mamba Blocks.
For each layer l in the real-time network, the feature extraction process is represented by a multi-scale aggregation function, where each scale s produces a refined feature map f_s^{(l)}. The multi-scale feature output for the real-time network, F_l, is thus given by the following:
F_l = \sum_{s=1}^{S} \beta_s \cdot \sigma\!\left( W_s^{(l)} \ast f_{l-1} + b_s^{(l)} \right)
Here, σ is a non-linear activation function, W_s^{(l)} and b_s^{(l)} represent the weights and biases specific to each scale, and β_s is a learned parameter that assigns importance to each scale through backpropagation, emphasizing the most informative features.
To achieve lightweight efficiency, a skip connection across each layer l bypasses the Mamba Block entirely. This residual shortcut directly feeds the previous layer's output f_{l-1} into the aggregation function, formulated as follows:
f_l = F_l + \gamma \cdot f_{l-1}
where γ is a scaling factor that modulates the residual influence of f_{l-1} on the current layer, preserving crucial spatial information with minimal computation. This formulation eliminates redundant operations, thereby ensuring rapid feature extraction without sacrificing essential spatial accuracy.
By combining the baseline network’s exhaustive spatial detail with the real-time network’s lightweight, residual-enhanced multi-scale processing, Mamba-VNPS effectively balances computational efficiency and precision, making it highly suitable for UAV navigation in diverse and dynamic environments.
The baseline network, with its comprehensive spatial representation, provides a solid foundation for accurate localization, while the real-time network ensures that the UAV can operate efficiently in dynamic environments without sacrificing the accuracy of its posture and position estimation. The comparative convolution graph in Figure 3 illustrates the performance of the Mamba-VNPS framework against other mainstream image feature-extraction algorithms, highlighting its superior capability in processing complex visual data while maintaining efficiency.

4.3. Mamba Block for Lightweight and Efficient Feature Extraction

The Mamba Block is the cornerstone of our proposed network, driving lightweight and efficient feature extraction. This innovative block is specifically engineered to overcome the limitations of traditional convolutional methods by dynamically incorporating long-sequence dependencies while ensuring computational efficiency. Its underlying architecture utilizes state-space modeling (SSM), residual connections, and multi-dimensional convolution, which together enhance both the feature extraction and representation capabilities of the network.
The architecture of the Mamba Block, illustrated in Figure 2, is composed of several key components: a convolutional layer, the state-space model, and a multi-layer perceptron (MLP), all interwoven with residual connections. The input feature map, denoted as X ∈ R^{H×W×C}, undergoes a convolution operation aimed at reducing spatial dimensionality while maintaining critical information. This operation is followed by an SSM, which plays a pivotal role in capturing long-range dependencies across both spatial and temporal dimensions.
In this process, as shown in Figure 4, the scanning mechanism involves four distinct scanning phases: horizontal (Scan 1), reverse horizontal (Scan 2), vertical (Scan 3), and reverse vertical (Scan 4), each represented by the learnable parameters A, B, C, and D. These scanning phases sequentially extract feature representations across the spatial dimensions, allowing the block to model long-range dependencies. The scanning process is visualized in the corresponding input-scan diagram, where the input is processed in these four scanning directions.
In the Mamba Block, the state-space model (SSM) is designed to efficiently capture long-range dependencies by iteratively updating a state representation at each time step. This approach enables the network to maintain temporal and contextual information, which is critical for spatial localization accuracy. The state at each time step t, denoted by s_t, summarizes the relevant information accumulated up to that point, while the input at time t, x_t, contributes new observations. The state-space model updates s_t and generates an output y_t according to the following:
s_t = A s_{t-1} + B x_t, \qquad y_t = C s_t + D x_t
In these equations, s_t represents the updated state at time t, x_t is the input feature at time t, and y_t is the output feature incorporating both previous state information and current input features. The matrices A, B, C, and D are learnable parameters, facilitating the model's ability to capture dependencies across time steps. Matrix A, the state-transition matrix, defines the influence of the previous state s_{t-1} on the current state s_t, effectively capturing temporal dependencies. Matrix B, the input matrix, modulates the impact of the current input x_t when updating s_t, incorporating new information at each time step. Output matrices C and D then map the state and input to the output y_t, allowing it to retain essential temporal features; C incorporates information from the state s_t, while D introduces a direct influence from the input x_t, ensuring responsiveness to the latest input features.
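The recurrence can be illustrated with a simple sequential rollout, as in the sketch below. The state and input dimensions, the stable toy choice of A, and the random B, C, D matrices are assumptions; in the actual block these parameters are learned.

```python
import numpy as np

rng = np.random.default_rng(0)
d_state, d_in, T = 16, 32, 100                    # state size, input size, sequence length (assumed)

A = 0.9 * np.eye(d_state)                         # state-transition matrix (stable toy choice)
B = rng.standard_normal((d_state, d_in)) * 0.1    # input matrix
C = rng.standard_normal((d_in, d_state)) * 0.1    # output matrix
D = np.eye(d_in)                                  # direct input-to-output path

x = rng.standard_normal((T, d_in))                # scanned input sequence
s = np.zeros(d_state)
outputs = []
for t in range(T):
    s = A @ s + B @ x[t]                          # s_t = A s_{t-1} + B x_t
    outputs.append(C @ s + D @ x[t])              # y_t = C s_t + D x_t
y = np.stack(outputs)
print(y.shape)                                    # (100, 32)
```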
After processing through the SSM, the feature map undergoes further refinement via a multi-layer perceptron (MLP), which applies a two-layer transformation with non-linear activations. Formally, the MLP transformation is expressed as follows:
\mathrm{MLP}(z) = \sigma\!\left( W_2\, \sigma( W_1 z + b_1 ) + b_2 \right)
where W_1 and W_2 are the weight matrices of the first and second layers, respectively, and b_1 and b_2 are bias terms. The non-linear activation function σ, often chosen as ReLU, ensures effective feature transformation through each layer.
The final output of the Mamba Block integrates a residual connection, preserving original input information while enhancing it with global contextual features derived from the SSM. The resulting formulation for the output is as follows:
y_{\mathrm{final}} = x + \mathrm{MLP}(\mathrm{SSM}(x))
Here, x represents the initial input to the Mamba Block, while the residual connection maintains essential local information from the input, enabling the output to combine both local and global contextual information. This integration of SSM and residual connectivity in the Mamba Block balances the need to capture long-range dependencies with computational efficiency, which is particularly advantageous for UAV-localization tasks that demand real-time processing alongside spatial accuracy.
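Assuming a simple sequential SSM as above, the composition y_final = x + MLP(SSM(x)) can be sketched as follows; the layer widths, ReLU activation, and random parameters are illustrative assumptions.

```python
import numpy as np

def ssm(x, A, B, C, D):
    """Sequential scan: s_t = A s_{t-1} + B x_t, y_t = C s_t + D x_t."""
    s = np.zeros(A.shape[0])
    out = []
    for x_t in x:
        s = A @ s + B @ x_t
        out.append(C @ s + D @ x_t)
    return np.stack(out)

def mlp(z, W1, b1, W2, b2):
    """Two-layer perceptron with ReLU: sigma(W2 sigma(W1 z + b1) + b2)."""
    h = np.maximum(z @ W1 + b1, 0.0)
    return np.maximum(h @ W2 + b2, 0.0)

rng = np.random.default_rng(0)
T, d = 50, 32
x = rng.standard_normal((T, d))
A = 0.9 * np.eye(8); B = rng.standard_normal((8, d)) * 0.1
C = rng.standard_normal((d, 8)) * 0.1; D = np.eye(d)
W1 = rng.standard_normal((d, 64)) * 0.1; b1 = np.zeros(64)
W2 = rng.standard_normal((64, d)) * 0.1; b2 = np.zeros(d)

y_final = x + mlp(ssm(x, A, B, C, D), W1, b1, W2, b2)   # residual keeps local input detail
print(y_final.shape)
```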
The efficiency of the Mamba Block comes from replacing traditional convolutional layers with the state-space model, reducing computational complexity. The traditional convolutional complexity is O(HWC^2), whereas the Mamba Block reduces this to O(HWC) by leveraging linear operations in the state-space model.
Additionally, Mamba Block employs a dynamic weighted scanning mechanism, which iterates over spatial dimensions horizontally and vertically, followed by temporal scanning. The recursive update of states is expressed as follows:
z_t = W_4\, \sigma\!\left( W_3\, \sigma\!\left( W_1 x_t + W_2 s_{t-1} + b_1 \right) + b_2 \right)
In this formula, the state update σ(W_1 x_t + W_2 s_{t-1} + b_1) is computed first, followed by another non-linear transformation involving W_3 and W_4. The biases b_1 and b_2 add further flexibility to the learned features, ensuring that the final feature map z_t is compact yet highly expressive.
By integrating these two steps into a single operation, Mamba Block reduces computational overhead while preserving a high capacity for modeling complex temporal and spatial relationships, making it suitable for real-time applications in image feature extraction.

4.4. Dynamic Triplet-Based Learning for Enhanced Localization

The Mamba-VNPS framework incorporates a sophisticated triplet network to train the self-supervised learning model, which is fundamental for ensuring robust UAV localization in dynamic environments. This network is designed to exploit the relationship between different image samples, using both similarity and dissimilarity to refine the positioning model. By comparing feature embeddings across various image inputs, the triplet network enables the UAV to differentiate between its current location and other potential locations, thus enhancing the accuracy of real-time visual navigation.
At the core of this triplet network, each training instance consists of one anchor image, three positive samples, and one negative sample. The anchor image, denoted as x_a, represents the real-time input captured by the UAV during flight. Positive samples x_{p1}, x_{p2}, and x_{p3} are derived from locations near or similar to the anchor, typically from different angles or perspectives on the same environment. These positive samples play a critical role in ensuring that the network captures the local spatial characteristics of the UAV's surroundings. In contrast, the negative sample x_n is selected from a significantly different or distant location, ensuring that its feature embedding is distinctly different from that of the anchor.
The training process of the triplet network is governed by the triplet loss function, which optimizes the relationship between anchor, positive, and negative samples. The primary goal of the loss function is to minimize the feature distance between the anchor and its positive samples while maximizing the distance between the anchor and the negative samples. This approach encourages the network to group images of the same environment while distancing images from different environments.
To address the overfitting problem and enhance the model’s discriminative power, we optimized the treatment of positive samples and the loss function. Due to the inability to label map images, our study developed a self-supervised learning paradigm. In this approach, we further refined positive samples based on their similarity levels. Specifically, positive samples are randomly selected within a defined range around the anchor sample, providing a richer set of comparison options.
Employing contrastive learning among samples at different similarity levels significantly improves the model’s discriminative ability. We constructed a set of sample pairs, which includes high-similarity samples with substantial overlapping regions, low-similarity samples with minimal overlap, and negative samples with no overlapping regions. This strategic grouping enhances the model’s discriminative capability and adds practical significance to the similarity measurements. The triplet loss function is mathematically defined as follows:
L(x_a, x_p, x_n) = \sum_{i=1}^{3} \left[ \left\| f(x_a) - f(x_{p_i}) \right\|_2^2 - \left\| f(x_a) - f(x_n) \right\|_2^2 + \alpha \right]_+
Here, f(·) represents the feature-extraction function, mapping the input images into a lower-dimensional embedding space where spatial similarities and differences are more readily captured. The ‖·‖_2 term denotes the Euclidean distance between two feature vectors in this space, and α is a margin parameter that defines how much farther the negative sample must be from the anchor than the positive samples. Since only distances are relevant to the model, the squared Euclidean norms guarantee non-negative distances between embeddings. The hinge operator [·]_+ ensures that the network only incurs a loss when the negative sample is not sufficiently far from the anchor, enforcing a meaningful margin between positive and negative feature embeddings.
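A minimal numpy version of this loss, assuming one anchor, three positives, one negative, and an arbitrary margin of 0.2, is shown below for illustration.

```python
import numpy as np

def triplet_loss(f_a, f_ps, f_n, margin=0.2):
    """sum_i [ ||f_a - f_p_i||^2 - ||f_a - f_n||^2 + margin ]_+ over the three positives."""
    d_neg = np.sum((f_a - f_n) ** 2)
    loss = 0.0
    for f_p in f_ps:
        d_pos = np.sum((f_a - f_p) ** 2)
        loss += max(d_pos - d_neg + margin, 0.0)   # hinge: zero once the margin is satisfied
    return loss

rng = np.random.default_rng(0)
anchor = rng.standard_normal(1000)
positives = [anchor + 0.05 * rng.standard_normal(1000) for _ in range(3)]   # nearby views
negative = rng.standard_normal(1000)                                        # distant location

print(f"loss = {triplet_loss(anchor, positives, negative):.3f}")
```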
This triplet-based training structure is particularly well suited to environments where the UAV navigates through visually complex or repetitive landscapes. The use of multiple positive samples x_{p1}, x_{p2}, and x_{p3} ensures that the network captures a more comprehensive representation of each environment. Different angles, lighting conditions, or partial occlusions in the positive samples encourage the network to focus on the most consistent and meaningful features of the environment, enhancing robustness. The inclusion of a negative sample x_n, drawn from a distant or unrelated location, prevents the network from collapsing into a trivial solution in which all samples are considered similar. Instead, the network learns to draw sharp distinctions between different areas.
As the network trains, the parameters θ of the feature-extraction module f_θ(·) are updated iteratively using a gradient-based optimization approach. The gradients of the triplet loss function guide the network to adjust its parameters such that the anchor–positive distance is minimized and the anchor–negative distance is maximized:
\theta \leftarrow \theta - \eta\, \nabla_{\theta} L(x_a, x_p, x_n)
In this equation, η represents the learning rate, and ∇_θ L denotes the gradient of the triplet loss with respect to the network parameters θ. By following this optimization process, the triplet network gradually improves its ability to embed spatial information, learning to discern fine-grained differences between locations.
Overall, this triplet-based training structure equips the Mamba-VNPS framework with a powerful self-supervised learning capability. By effectively distinguishing between different spatial locations based on image embeddings, the triplet network allows the UAV to localize itself accurately, even in challenging environments, without the need for external signals such as GPS. The combination of multiple positive samples and a carefully selected negative sample ensures robustness in real-world scenarios, while the mathematical foundation of the triplet loss guarantees that the feature extraction process is both efficient and effective. This advanced training structure is instrumental in enabling the Mamba-VNPS system to provide precise and reliable navigation in diverse aerial environments.

4.5. Efficient Feature Vector Matching

Efficient feature vector matching is a critical step in ensuring the accuracy and speed of UAV-based navigation systems, particularly in time-sensitive operations. Our proposed method addresses the computational challenges of real-time feature matching by employing a sliding window strategy combined with a similarity matrix, both optimized for UAV flight dynamics.
The sliding window mechanism restricts feature vector matching to a fixed range of consecutive frames, reducing the computational burden of searching across all frames in a dataset. This is crucial for UAVs operating in real time, as only the most recent frames are typically relevant to ongoing motion. This window is updated with each new frame, ensuring that only the most relevant, temporally local feature vectors are considered. By limiting the search to a smaller subset of vectors, the UAV can maintain low-latency operations while navigating dynamically changing environments.
To improve matching precision, we introduce a similarity matrix that evaluates the similarity between feature vectors based on a weighted cosine similarity metric. This metric considers both angular differences between feature vectors (cosine similarity) and scale variations (captured through a weighting mechanism), which are crucial in UAV applications where camera perspectives and object sizes can change rapidly due to motion. The weighted cosine similarity thus accounts for both orientation and scale, making the matching process more robust against typical distortions such as changing altitudes, camera rotations, and UAV speed variations.
Figure 5 illustrates the feature-matching process within the proposed framework. The figure visually represents the interaction between the sliding window and the similarity matrix. As depicted, feature vectors from the current frame are compared against vectors from preceding frames within the sliding window. The similarity matrix is constructed for each comparison, and a matching score is calculated. In the event of ambiguity (i.e., multiple high-similarity scores), the algorithm selects the match with the highest weighted similarity to ensure accuracy.
Additionally, Figure 5 demonstrates how the sliding window adapts dynamically to the UAV’s motion. For instance, during periods of stable flight, the window remains narrow, focusing on the most recent frames. However, in scenarios with rapid changes in scene (e.g., when the UAV makes sharp turns or changes altitude), the window can expand to capture a broader temporal range, allowing the system to adjust to the new environment and maintain accurate feature matching.
Below (Algorithm 1) is the algorithm that governs the efficient feature vector-matching process:
Algorithm 1 Efficient feature vector matching using a sliding window and weighted cosine similarity.

Require: Feature vector set F = {f_1, f_2, …, f_n}, sliding window size W, similarity threshold τ
Ensure: Matched feature vector pairs M; final output based on w and p
 1: Initialize the matched-pairs set M ← ∅
 2: for i = 1 to n do
 3:     Define the sliding window W_i = {f_{i−W}, …, f_{i−1}}   ▹ Features in the sliding window
 4:     for each f_j in W_i do
 5:         Compute the weighted cosine similarity S(f_i, f_j)   ▹ Cosine similarity scaled by the vector magnitudes
 6:         if S(f_i, f_j) ≥ τ then
 7:             Add the pair (f_i, f_j) to M   ▹ f_i and f_j are considered a match
 8:         end if
 9:     end for
10: end for
11: Construct the weight matrix w from the similarities in M   ▹ Similarity weight matrix
12: Generate the position matrix p corresponding to w   ▹ p indicates spatial positions
13: Compare w and p to determine the final matching accuracy   ▹ Compare weight and position matrices
14: Generate the final output from the comparison of w and p   ▹ Posture
15: return Posture
This algorithm iterates over each frame and compares the feature vector of the current frame to those in the sliding window. The similarity score is calculated using a weighted cosine similarity function that adjusts for both angular differences and scale changes. If the similarity score exceeds a predefined threshold, the vectors are considered a match. This method ensures that only the most relevant and similar vectors are matched, minimizing errors caused by environmental changes or UAV movements. This approach achieves efficient and precise feature matching, which is essential for real-time UAV navigation and localization tasks.
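For illustration, the following Python sketch mirrors Algorithm 1 with a simple magnitude-ratio weighting standing in for the scale adjustment; the window size, threshold, and weighting scheme are assumptions consistent with the description rather than the paper's exact implementation.

```python
import numpy as np

def weighted_cosine(f_i, f_j):
    """Cosine similarity scaled by a magnitude-ratio weight to penalize scale mismatch."""
    cos = np.dot(f_i, f_j) / (np.linalg.norm(f_i) * np.linalg.norm(f_j) + 1e-12)
    scale = min(np.linalg.norm(f_i), np.linalg.norm(f_j)) / \
            (max(np.linalg.norm(f_i), np.linalg.norm(f_j)) + 1e-12)
    return cos * scale

def match_features(F, window=5, tau=0.8):
    """Return matched index pairs (i, j) with f_j inside the sliding window preceding f_i."""
    matches = []
    for i in range(len(F)):
        for j in range(max(0, i - window), i):
            if weighted_cosine(F[i], F[j]) >= tau:
                matches.append((i, j))
    return matches

rng = np.random.default_rng(0)
frames = rng.standard_normal((20, 1000))
frames[7] = frames[5] + 0.01 * rng.standard_normal(1000)   # near-duplicate frame

print(match_features(frames, window=5, tau=0.95))          # expect (7, 5) to be reported
```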
The joint analysis of feature extraction, dynamic triplet-based learning, and vector matching ensures a theoretically grounded approach that aligns with the constraints outlined in the problem formulation, addressing the challenge of real-time UAV localization in UAM environments. First, feature extraction guarantees the acquisition of distinctive visual features from each frame, which are fundamental to localization. This process leverages state structure space theory, targeting high-contrast areas, edges, or keypoints to obtain robust features resilient to variations in lighting and partial occlusions, supporting positional stability in dynamic environments. Under this constraint, feature vectors are compressed into a one-dimensional vector to optimize computational efficiency while preserving positional characteristics. Second, dynamic triplet-based learning enhances the discrimination of extracted features by adaptively selecting difficult triplets for training, helping the model distinguish subtle differences between similar and dissimilar locations. Rooted in contrastive learning theory, the model effectively learns by focusing on challenging positive and negative pairs, improving robustness against subtle variations due to UAV movement. Lastly, vector matching applies weighted cosine similarity to calculate feature matches, using similarity measurement theory to treat cosine similarity as a scale-invariant metric, thereby accommodating slight variations in viewpoint and scale. Weighted adjustments offset UAV positional shifts, allowing only the most similar feature vectors to exceed a predefined threshold for precise matching, ensuring that the computational efficiency and real-time responsiveness requirements for UAM localization are met.

5. Experiment

In this section, we present the experimental evaluation of our proposed efficient feature-matching approach. Our experiments are divided into four subsections: the description of the self-built dataset, the details of drone flight experiments, ablation studies conducted to investigate the impact of different components, and a performance comparison with other classic methods.

5.1. Self-Built Dataset

To rigorously assess the efficacy of our proposed approach under practical, real-world conditions, we designed and constructed a self-built dataset specifically tailored to the complex visual challenges inherent to low-altitude urban environments. This dataset was collected in a densely populated and visually diverse area located in the urban center of Chengdu, Sichuan, China, spanning an area of approximately 1 km by 800 m. The chosen location is characterized by a multitude of intricate visual features, including high-rise buildings, roads, varying vegetation, and other urban structures. These elements introduce significant occlusions, repetitive textures, and illumination variations, making it an ideal testbed for evaluating the robustness of visual navigation algorithms.
The data acquisition process was conducted using a DJI Mavic Pro 2 drone, which was selected for its superior flight stability and high-quality imaging capabilities. As shown in Figure 6, the drone is equipped with an RK3399 CPU, providing onboard processing power that minimizes latency in capturing and processing high-resolution imagery. This is crucial in maintaining a continuous data acquisition process under dynamic flight conditions. The onboard Mono RGB camera is mounted in a nadir (top-down) orientation, capturing high-resolution images that cover the entire target area. The use of top-down imagery ensures spatial consistency and allows the construction of a reliable visual dataset that effectively captures the structural characteristics of the urban environment.
To provide a more comprehensive visual representation of the environment, Figure 7 showcases both top-down and oblique (side-view) images of the dataset. While the oblique views are not part of the reference data used for feature extraction, they serve to highlight the structural complexity of the urban landscape, where tall buildings, street furniture, and other obstacles create numerous occlusions and ambiguities. These environmental conditions—combined with factors such as dynamic lighting changes, moving objects, and atmospheric disturbances—make this dataset particularly challenging for real-time UAV navigation tasks.
The baseline data collected in this region are subsequently processed through the Mamba-VNPS pipeline, where nadir images are used to extract feature vectors. These vectors serve as reference points in the real-time navigation experiments, facilitating precise posture and position estimation. The self-built dataset is designed to simulate realistic urban flight conditions, including complex environmental changes, which make it ideal for testing visual navigation algorithms. In the subsequent sections, we will discuss how this dataset is employed in drone flight experiments and evaluate the proposed approach’s performance under different experimental conditions.

5.2. Drone Flight Experiments

To evaluate the effectiveness and precision of our proposed Mamba-VNPS algorithm in real-world navigation scenarios, we conducted extensive drone-flight experiments using three pre-planned flight routes. These routes were carefully selected to cover a wide range of complex urban environments, encompassing various challenging visual navigation conditions such as occlusions, sharp turns, and height variations [54]. The experiments were designed to test the accuracy and robustness of Mamba-VNPS in comparison with other state-of-the-art navigation and positioning methods.
The experiments were carried out using a DJI Mavic Pro 2 drone, equipped with a high-resolution Mono RGB camera and onboard processing capabilities. The drone was programmed to take off from a fixed starting point at coordinates (0, 0, 0), and after a rapid ascent to an altitude of 60 m, it proceeded to follow the flight routes at altitudes between 60 and 120 m. This low-altitude range reflects real-world scenarios where drones must operate in urban environments, navigating between buildings and other obstacles.
We designed three distinct flight routes with varying levels of complexity and length to thoroughly test the positioning accuracy of our algorithm under different conditions:
  • Route 1: A relatively short route, approximately 0.8 km in length, focused on testing the system’s performance in dense urban areas with large turns and varying textures.
  • Route 2: A medium-length route spanning 1.2 km, which includes both urban and semi-urban sections, introducing mixed visual cues such as roads, trees, and open spaces.
  • Route 3: A long, complex flight route measuring approximately 6.8 km. This route features a combination of narrow streets, open areas, significant elevation changes, and sharp turns, simulating more diverse and challenging operational environments for UAV-based navigation.
Each of these flight routes was executed in a controlled manner, with the drone performing waypoint-based navigation to ensure consistent experimental conditions across all test runs. Throughout the flight, the drone’s trajectory was continuously monitored and logged, allowing us to analyze the navigation performance and positioning accuracy of the Mamba-VNPS algorithm.
For comparative purposes, we evaluated the performance of Mamba-VNPS against two other widely recognized visual navigation methods: the RepVGG-based positioning algorithm and the AutoEncoder-based navigation system. Additionally, GroundTruth data, representing the exact trajectory of the drone, was used as a reference to measure the deviation and accuracy of each algorithm. The comparison was based on the precision with which each method could estimate the drone’s position along the flight routes, particularly in complex regions such as sharp turns, height variations, and areas with significant visual clutter.
As illustrated in Figure 8, the experimental results for all three flight routes are presented, with key areas of interest highlighted in red boxes and zoomed in for closer inspection. These areas were selected based on the complexity of the environment and the challenges they posed to accurate positioning. Figure 8 showcases the trajectory tracking capabilities of the Mamba-VNPS algorithm in comparison to RepVGG, AutoEncoder, and the GroundTruth reference. In the magnified regions, it is evident that Mamba-VNPS retains precise trajectory alignment even in challenging areas with sharp turns and significant height variations. In contrast, both RepVGG [55] and AutoEncoder [20] exhibit noticeable drift and positional errors, particularly when the drone encounters rapid changes in direction or altitude.
Additionally, in the third column of Figure 8, we highlight specific moments where the AutoEncoder-based algorithm fails to track the drone’s trajectory. These tracking failures, marked by circles around the X, Y, and Z-axis trajectory deviations, indicate moments when the AutoEncoder method loses the drone’s position, particularly after sharp turns or elevation changes. When such tracking loss occurs, the AutoEncoder algorithm erroneously resets the drone’s estimated position to (0, 0, 0), as seen in the circled regions of the trajectory plot.
Across all routes, the AutoEncoder-based algorithm exhibited notable performance degradation during rapid drone movements, particularly when the UAV encountered sharp turns or steep ascents and descents. It often lost track of the drone's trajectory in these situations, leading to substantial positional drift and reduced accuracy. The RepVGG-based method, while performing better than the AutoEncoder in most scenarios, still fell short of the precision offered by Mamba-VNPS, particularly in areas with dense urban features or rapidly changing visual cues.
To provide a quantitative assessment of navigation accuracy, we computed three key metrics for each flight route: the root mean square error (RMSE), the maximum positional error, and the standard deviation of the error. These metrics clearly indicate the performance differences between the algorithms across all flight routes. In addition to RepVGG and the AutoEncoder, we also compared against ORB-SLAM3 and VINS-Mono, in order to contrast learning-based methods with traditional SLAM approaches on the self-collected dataset used in this study. Table 1 summarizes these metrics for each flight route.
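For clarity, the sketch below shows one straightforward way to compute these three metrics from time-aligned estimated and ground-truth trajectories; the alignment procedure and array conventions are assumptions, not a description of the exact evaluation script used here.

```python
import numpy as np

def trajectory_error_metrics(estimated, ground_truth):
    """Error metrics of the kind reported in Table 1.

    estimated, ground_truth : (T, 3) arrays of XYZ positions in metres, assumed time-aligned.
    Returns the RMSE, the maximum positional error, and the standard deviation of the error norm.
    """
    err = np.linalg.norm(estimated - ground_truth, axis=1)   # Euclidean error at each timestep
    rmse = float(np.sqrt(np.mean(err ** 2)))
    return {"rmse_m": rmse, "max_m": float(err.max()), "std_m": float(err.std())}

# usage with synthetic trajectories
gt = np.cumsum(np.random.randn(500, 3), axis=0)
est = gt + np.random.randn(500, 3) * 2.0
print(trajectory_error_metrics(est, gt))
```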
As shown in Table 1, the Mamba-VNPS algorithm consistently achieved lower RMSE values and smaller maximum positional errors compared to the RepVGG and AutoEncoder methods, as well as the traditional SLAM approaches, ORB-SLAM3 and VINS-Mono. The standard deviation of the error was also significantly reduced, demonstrating the stability and reliability of our proposed approach under various flight conditions. Specifically, Route 3, which was the most challenging due to its longer length and complex urban features, exhibited the highest RMSE and maximum error for all algorithms. However, even in this scenario, Mamba-VNPS maintained superior performance, highlighting its robustness in diverse and dynamic environments.

5.3. Ablation Experiments

To evaluate the contributions of each individual component within the Mamba-VNPS system, we conducted three comprehensive ablation studies targeting the following modules: multi-scale feature extraction, dynamic weighted long-sequence extraction, and efficient feature vector matching. Given the complexity of the task, we concentrated our analysis on Route 3, which presents the most challenging navigational conditions, characterized by diverse urban textures and dynamic obstacles. By systematically removing submodules from each of the main components and recording the corresponding performance, measured in root mean square error (RMSE), we aim to quantitatively assess the necessity of each submodule for overall system accuracy.

5.3.1. Multi-Scale Feature Extraction Ablation

The multi-scale feature extraction module is composed of three key submodules: Gaussian pyramid, residual network, and cross-scale fusion. Each submodule plays a critical role in extracting hierarchical, scale-invariant features from the input data, which are indispensable for accurate UAV navigation in complex environments. We conducted ablations on these submodules to quantify their contribution to the system’s performance. The results are detailed in Table 2.
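As a rough illustration of this module's first and last stages, the sketch below builds a Gaussian pyramid with OpenCV and fuses the levels by naive resizing and averaging. In Mamba-VNPS the fusion is learned and the residual network sits between these stages, so this stand-in only conveys the idea of multi-scale aggregation; the frame and parameter values are synthetic placeholders.

```python
import cv2
import numpy as np

def gaussian_pyramid(image, levels=3):
    """Build a Gaussian pyramid: each level is smoothed and downsampled by a factor of 2."""
    pyr = [image]
    for _ in range(levels - 1):
        pyr.append(cv2.pyrDown(pyr[-1]))
    return pyr

def cross_scale_fusion(feature_maps, target_hw):
    """Naive cross-scale fusion: resize every level to a common resolution and average.
    (The actual fusion in Mamba-VNPS is learned; this is only a conceptual stand-in.)"""
    resized = [cv2.resize(f, (target_hw[1], target_hw[0])) for f in feature_maps]
    return np.mean(np.stack(resized, axis=0), axis=0)

frame = np.random.rand(480, 640).astype(np.float32)   # stand-in for a grayscale nadir frame
levels = gaussian_pyramid(frame, levels=3)
fused = cross_scale_fusion(levels, target_hw=frame.shape[:2])
```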
The cross-scale fusion submodule exhibited the most significant impact on system accuracy, with its removal leading to the highest RMSE. Similarly, the Gaussian pyramid submodule contributed substantially to the positioning accuracy, as its exclusion resulted in a notable performance degradation. The residual network learns complex features effectively while reducing computational demands compared to deeper architectures without residual connections; the increase in RMSE following its removal shows that it contributes to feature extraction, even though it is not the primary driver of performance in this context. These results confirm that each submodule within the multi-scale feature extraction module is indispensable for maintaining optimal system accuracy.

5.3.2. Dynamic Weighted Long-Sequence Extraction Ablation

To further explore the influence of the Mamba Blocks and their associated submodules on the dynamic weighted long-sequence extraction module, we performed additional ablations on the long-sequence memory unit, adaptive weighting mechanism, and global context encoder. We also varied the number of Mamba Blocks to assess its effect on RMSE. The results are provided in Table 3.
The results demonstrate that the adaptive weighting mechanism had the most significant impact on performance, as evidenced by an RMSE increase of up to 9.03 m when this mechanism was ablated. This underscores its critical role in balancing long-term dependencies across sequential data. Both the long-sequence memory unit and the global context encoder also contributed meaningfully to the system’s accuracy, with their removal resulting in moderate increases in RMSE. Additionally, reducing the number of Mamba Blocks from 4 to 2 consistently degraded performance, highlighting the importance of maintaining an adequate number of blocks to preserve accuracy.

5.3.3. Efficient Feature Vector Matching Ablation

The efficient feature vector matching module is designed to facilitate the rapid and accurate matching of feature vectors between the UAV’s current observations and pre-stored map data. This module consists of three primary submodules: the sparse matching algorithm, similarity matrix computation, and sliding window optimization. To evaluate the contribution of each submodule, we systematically removed them and observed the impact on RMSE, as presented in Table 4.
The sparse matching algorithm had the most substantial influence on overall performance, as its ablation led to the largest increase in RMSE (up to 9.13 m). This indicates the importance of sparsity-based strategies in maintaining efficient and accurate feature matching, particularly in dynamic environments where rapid changes necessitate a robust response. The algorithm effectively filters irrelevant features by utilizing the similarity matrix, which allows the system to eliminate the vast majority of unrelated vectors from the baseline map, thereby focusing on the most pertinent data for accurate navigation. Both the similarity matrix computation and sliding window optimization submodules also played critical roles in improving the system’s precision, as their removal resulted in significant increases in RMSE.
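A minimal sketch of this two-stage idea is given below: candidates are first restricted to a spatial sliding window around the previous position estimate, and the similarity scores are then pruned to the top-k most similar vectors before the final comparison. The window radius, the value of k, and all names are illustrative assumptions rather than the module's exact parameters.

```python
import numpy as np

def sparse_window_match(query, ref_feats, ref_positions, prev_xy,
                        window_radius_m=100.0, keep_top_k=50):
    """Two-stage matching sketch used for illustration only.

    query         : (D,) feature vector of the current frame
    ref_feats     : (N, D) reference feature vectors
    ref_positions : (N, 2) map coordinates of the reference vectors, in metres
    prev_xy       : (2,) previous position estimate used to centre the sliding window
    """
    # Stage 1: sliding-window candidate selection around the last known position.
    in_window = np.linalg.norm(ref_positions - prev_xy, axis=1) <= window_radius_m
    cand_feats, cand_pos = ref_feats[in_window], ref_positions[in_window]
    if len(cand_feats) == 0:
        return None
    # Stage 2: cosine similarities, then sparse top-k pruning before the final comparison.
    sims = cand_feats @ query / (
        np.linalg.norm(cand_feats, axis=1) * np.linalg.norm(query) + 1e-12)
    top_k = np.argsort(sims)[-keep_top_k:]        # indices of the k most similar candidates
    best = top_k[np.argmax(sims[top_k])]
    return cand_pos[best], float(sims[best])
```

Restricting the search to a spatial window and a sparse top-k subset keeps the per-frame cost roughly constant even as the baseline map grows.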
These ablation studies affirm the necessity of each submodule within the three core components of the Mamba-VNPS system. The observed increases in RMSE upon their removal highlight their collective importance in achieving robust UAV navigation and positioning in complex environments.

5.4. Performance Experiments

In this section, we assess the performance of our proposed Mamba-VNPS model against AutoEncoder [20] and a five-stage RepVGG [38] model across six critical metrics: processing time, memory usage, noise robustness, scalability, generalization capability, and model size [56]. Each of these metrics highlights different aspects of UAV navigation-system performance, with results visually summarized in Figure 9 using a radar chart.
  • Processing time (s): The processing time refers to the time taken for each model to complete feature extraction and navigation estimation per frame; a minimal timing sketch is provided after this list. Efficient processing is critical for real-time UAV applications, where quick responses to dynamic environments are essential. Mamba-VNPS demonstrated an exceptionally fast processing time of just 0.05 s per frame, significantly outperforming both baselines: the AutoEncoder required 0.21 s per frame, while RepVGG was the slowest at 2.98 s per frame. Mamba-VNPS’s processing speed enables rapid adjustments to environmental changes, facilitating smoother and more efficient UAV navigation in real-time scenarios.
  • Memory usage (MB): Memory efficiency is particularly important for UAVs with limited onboard resources. Mamba-VNPS used only 128 MB of memory, substantially lower than RepVGG’s 210 MB and AutoEncoder’s 320 MB. This lightweight memory footprint allows for more efficient resource allocation during operation.
  • Noise robustness (RMSE, m): Noise robustness measures the model’s ability to handle sensor noise, occlusions, and varying lighting conditions. In noisy environments, Mamba-VNPS demonstrated greater resilience, with a root mean square error (RMSE) of 4.51 m, outperforming both RepVGG (6.25 m) and AutoEncoder (7.92 m). This robustness to sensor noise and environmental occlusions is critical for maintaining reliable navigation accuracy.
  • Scalability (error increase per 100 m² area, %): Scalability reflects how well each model scales when exposed to larger areas or more complex environments. Mamba-VNPS exhibited a lower error increase of 2.8% per 100 m², compared to 5.5% for RepVGG and 8.1% for AutoEncoder, indicating its superior ability to handle expansive areas with minimal performance degradation.
  • Generalization capability (accuracy drop on unseen data, %): Generalization capability evaluates the model’s ability to perform in unseen environments or with test conditions that differ from training data. Mamba-VNPS also demonstrated a strong generalization capability, with only a 3.2% accuracy drop when tested on unseen environments, significantly outperforming RepVGG (6.7%) and AutoEncoder (9.3%). This adaptability ensures more reliable performance in new, untrained scenarios.
  • Model Size (MB): The model size reflects the total memory footprint of each model when deployed on a UAV. An efficient model size is critical for real-time onboard applications where memory is limited. The compact size of Mamba-VNPS (53.68 MB) compared favorably to RepVGG (268.29 MB) and AutoEncoder (79.14 MB). Its smaller footprint is highly advantageous for deployment on resource-constrained UAVs, without compromising overall performance.
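The timing sketch referenced in the processing-time item above shows one way such per-frame latency figures could be measured; the pipeline callable and frame sizes are placeholders, not the actual evaluation harness used in this study.

```python
import time
import numpy as np

def benchmark_per_frame(process_frame, frames, warmup=5):
    """Measure mean per-frame processing time for a navigation pipeline.

    process_frame : callable taking one image and returning a position estimate
                    (placeholder for the Mamba-VNPS / RepVGG / AutoEncoder pipelines)
    frames        : iterable of test images
    """
    frames = list(frames)
    for f in frames[:warmup]:                     # warm-up runs excluded from timing
        process_frame(f)
    t0 = time.perf_counter()
    for f in frames:
        process_frame(f)
    return (time.perf_counter() - t0) / len(frames)

# usage with a dummy pipeline and synthetic frames
dummy = lambda img: img.mean(axis=(0, 1))         # stand-in for a real model
test_frames = [np.random.rand(480, 640, 3) for _ in range(50)]
print(f"{benchmark_per_frame(dummy, test_frames):.4f} s/frame")
```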
As illustrated in the radar chart, Mamba-VNPS consistently surpasses its counterparts in multiple dimensions, particularly in processing speed, memory efficiency, and robustness to noise, making it an ideal choice for real-time UAV navigation applications. RepVGG, while providing some advantages in scalability and noise robustness, is hindered by its substantial memory usage and slower processing time. AutoEncoder, though smaller in size, struggles with noise robustness and generalization, making it less reliable in unpredictable environments.

6. Discussion

The findings of this study provide key insights into the future of autonomous navigation, particularly as it transitions from ground-based systems to UAVs in urban air mobility (UAM) contexts. Traditionally, the development of autonomous vehicles has concentrated on ground systems, utilizing localization methods such as GPS, LiDAR, and high-definition maps. However, these technologies are inherently limited in urban environments where signal occlusion and dynamic changes hinder precise localization. For instance, in urban settings, GPS signals can be obstructed by tall buildings, resulting in multipath effects that distort position estimates and compromise UAV navigation capabilities. Moreover, the presence of moving objects—such as pedestrians and vehicles—further complicates localization efforts, especially in the aerial domain, where UAVs must navigate three-dimensional airspaces filled with unpredictable obstacles and rapidly changing conditions.
The proposed Mamba-VNPS (Visual Navigation and Positioning System with State-Selection Space) directly addresses these challenges by offering a robust alternative to conventional localization methods. Unlike traditional systems that rely heavily on pre-mapped environments or GPS signals, Mamba-VNPS utilizes self-supervised learning and satellite imagery to achieve high-precision localization in real time. This approach is particularly advantageous in environments where external signals are unreliable, such as dense urban centers or GPS-denied areas. The results indicate that Mamba-VNPS is well suited to meeting the evolving demands of UAM, enabling UAVs to maintain safe and efficient navigation in increasingly complex environments.
This study introduces an image feature-extraction framework based on a state-space module, enhancing traditional convolutional neural networks and significantly improving the model’s feature extraction capabilities in complex environments. By integrating Gaussian pyramids and a triplet network training approach, the framework effectively mitigates the impact of image scale variations on model performance, thereby enhancing the system’s processing ability across different scales and angles. The sliding window and similarity matrix strategies employed in this study not only accelerate inference speed but also optimize feature-matching logic, providing robust support for real-time localization tasks in UAV applications.
The practical scalability of Mamba-VNPS, particularly in commercial UAV applications, is noteworthy. Its design allows for easy integration with existing UAV platforms, facilitating deployment across various commercial sectors such as delivery services, urban surveillance, and emergency response. The system’s reliance on onboard cameras and efficient feature vector matching ensures compatibility with a wide range of UAV models and operational scenarios, thus enhancing its market adaptability.
However, despite the advancements introduced via Mamba-VNPS, several challenges persist. Vision-based localization systems, while more adaptable, remain susceptible to dynamic lighting conditions and feature-poor environments. Future research should focus on enhancing the robustness of these systems in such contexts. Additionally, while Mamba-VNPS demonstrates promise in navigating the complexities of three-dimensional spaces, further testing in real-world UAM scenarios is essential to validate its scalability and long-term reliability.
In conclusion, Mamba-VNPS marks a significant advancement in autonomous navigation for UAVs operating in UAM environments. Its ability to integrate and enhance existing localization techniques while addressing their limitations provides a scalable solution for real-time, high-precision navigation. The findings suggest that continued exploration of self-supervised learning frameworks and efficient feature extraction methods is crucial for tackling the evolving challenges of autonomous navigation in both ground and aerial systems.

Author Contributions

Conceptualization, L.H.; supervision, L.H.; methodology, Z.W.; software, Z.W.; writing—original draft preparation, Z.W.; investigation, Q.X.; resources, Q.X.; project administration, R.Q.; writing—review and editing, R.Q.; data curation, C.Y.; writing—review and editing, C.L.; funding acquisition, C.L. All authors have read and agreed to the published version of the manuscript.

Funding

This work was supported by the National Natural Science Foundation of China (NSFC)—Joint Fund of Civil Aviation Research (No. U2333214), the Civil Aviation Administration of China Safety Capacity Building Project (No. MHAQ2024033), the Key Laboratory of Civil Aviation Flight Technology and Flight Safety for funding through its Open Project Program (No. FZ2021KF13), Fundamental Research Funds for the Central Universities (No. J2023-045), and the Graduate Innovation Fund of the Fundamental Research Funds for the Central Universities for the year 2024 (No. 24CAFUC10188).

Data Availability Statement

The dataset in this study is available from the corresponding author upon request.

Conflicts of Interest

The authors declare no conflicts of interest.

References

  1. Shi, H.; Chen, J.; Zhang, F.; Liu, M.; Zhou, M. Achieving Robust Learning Outcomes in Autonomous Driving with Dynamic Noise Integration in Deep Reinforcement Learning. Drones 2024, 8, 470. [Google Scholar] [CrossRef]
  2. Couturier, A.; Akhloufi, M.A. A review on absolute visual localization for UAV. Robot. Auton. Syst. 2021, 135, 103666. [Google Scholar] [CrossRef]
  3. Sigala, A.; Langhals, B. Applications of Unmanned Aerial Systems (UAS): A Delphi Study projecting future UAS missions and relevant challenges. Drones 2020, 4, 8. [Google Scholar] [CrossRef]
  4. Puphal, T.; Flade, B.; Probst, M.; Willert, V.; Adamy, J.; Eggert, J. Online and predictive warning system for forced lane changes using risk maps. IEEE Trans. Intell. Veh. 2021, 7, 616–626. [Google Scholar] [CrossRef]
  5. Hill, B.P.; DeCarme, D.; Metcalfe, M.; Griffin, C.; Wiggins, S.; Metts, C.; Bastedo, B.; Patterson, M.D.; Mendonca, N.L. UAM Vision Concept of Operations (CONOPS) UAM Maturity Level (UML) 4; 2020. Available online: https://www.nasa.gov/directorates/armd/aosp/uam-vision-concept-of-operations-conops-uam-maturity-level-uml-4/ (accessed on 1 October 2024).
  6. Straubinger, A.; Rothfeld, R.; Shamiyeh, M.; Büchter, K.D.; Kaiser, J.; Plötner, K.O. An overview of current research and developments in urban air mobility—Setting the scene for UAM introduction. J. Air Transp. Manag. 2020, 87, 101852. [Google Scholar] [CrossRef]
  7. Zhu, P.; Wen, L.; Bian, X.; Ling, H.; Hu, Q. Vision meets drones: A challenge. arXiv 2018, arXiv:1804.07437. [Google Scholar]
  8. Kazerouni, I.A.; Fitzgerald, L.; Dooly, G.; Toal, D. A survey of state-of-the-art on visual SLAM. Expert Syst. Appl. 2022, 205, 117734. [Google Scholar] [CrossRef]
  9. Shu, F.; Lesur, P.; Xie, Y.; Pagani, A.; Stricker, D. SLAM in the field: An evaluation of monocular mapping and localization on challenging dynamic agricultural environment. In Proceedings of the IEEE/CVF Winter Conference on Applications of Computer Vision, Snowbird, UT, USA, 5–9 January 2021; pp. 1761–1771. [Google Scholar]
  10. Erat, O.; Isop, W.A.; Kalkofen, D.; Schmalstieg, D. Drone-augmented human vision: Exocentric control for drones exploring hidden areas. IEEE Trans. Vis. Comput. Graph. 2018, 24, 1437–1446. [Google Scholar] [CrossRef]
  11. Duffy, J.P.; Cunliffe, A.M.; DeBell, L.; Sandbrook, C.; Wich, S.A.; Shutler, J.D.; Myers-Smith, I.H.; Varela, M.R.; Anderson, K. Location, location, location: Considerations when using lightweight drones in challenging environments. Remote Sens. Ecol. Conserv. 2018, 4, 7–19. [Google Scholar] [CrossRef]
  12. Arafat, M.Y.; Alam, M.M.; Moh, S. Vision-based navigation techniques for unmanned aerial vehicles: Review and challenges. Drones 2023, 7, 89. [Google Scholar] [CrossRef]
  13. Zhan, H.; Weerasekera, C.S.; Bian, J.W.; Reid, I. Visual odometry revisited: What should be learnt? In Proceedings of the 2020 IEEE International Conference on Robotics and Automation (ICRA), IEEE, Paris, France, 31 May–4 June 2020; pp. 4203–4210. [Google Scholar]
  14. Engel, J.; Koltun, V.; Cremers, D. Direct sparse odometry. IEEE Trans. Pattern Anal. Mach. Intell. 2017, 40, 611–625. [Google Scholar] [CrossRef] [PubMed]
  15. Gupta, A.; Fernando, X. Simultaneous localization and mapping (slam) and data fusion in unmanned aerial vehicles: Recent advances and challenges. Drones 2022, 6, 85. [Google Scholar] [CrossRef]
  16. Campos, C.; Elvira, R.; Rodríguez, J.J.G.; Montiel, J.M.; Tardós, J.D. Orb-slam3: An accurate open-source library for visual, visual–inertial, and multimap slam. IEEE Trans. Robot. 2021, 37, 1874–1890. [Google Scholar] [CrossRef]
  17. Qin, T.; Li, P.; Shen, S. Vins-mono: A robust and versatile monocular visual-inertial state estimator. IEEE Trans. Robot. 2018, 34, 1004–1020. [Google Scholar] [CrossRef]
  18. Chen, W.; Shang, G.; Ji, A.; Zhou, C.; Wang, X.; Xu, C.; Li, Z.; Hu, K. An overview on visual slam: From tradition to semantic. Remote Sens. 2022, 14, 3010. [Google Scholar] [CrossRef]
  19. Goforth, H.; Lucey, S. GPS-denied UAV localization using pre-existing satellite imagery. In Proceedings of the 2019 International Conference on Robotics and Automation (ICRA), IEEE, Montreal, QC, Canada, 20–24 May 2019; pp. 2974–2980. [Google Scholar]
  20. Bianchi, M.; Barfoot, T.D. UAV localization using autoencoded satellite images. IEEE Robot. Autom. Lett. 2021, 6, 1761–1768. [Google Scholar] [CrossRef]
  21. Samano, N.; Zhou, M.; Calway, A. Global aerial localisation using image and map embeddings. In Proceedings of the 2021 IEEE International Conference on Robotics and Automation (ICRA), IEEE, Xi’an, China, 30 May–5 June 2021; pp. 5788–5794. [Google Scholar]
  22. Russell, J.S.; Ye, M.; Anderson, B.D.; Hmam, H.; Sarunic, P. Cooperative localization of a GPS-denied UAV using direction-of-arrival measurements. IEEE Trans. Aerosp. Electron. Syst. 2019, 56, 1966–1978. [Google Scholar] [CrossRef]
  23. Ghali, R.; Akhloufi, M.A.; Mseddi, W.S. Deep learning and transformer approaches for UAV-based wildfire detection and segmentation. Sensors 2022, 22, 1977. [Google Scholar] [CrossRef]
  24. Yin, P.; Cisneros, I.; Zhao, S.; Zhang, J.; Choset, H.; Scherer, S. isimloc: Visual global localization for previously unseen environments with simulated images. IEEE Trans. Robot. 2023, 39, 1893–1909. [Google Scholar] [CrossRef]
  25. Liu, Z.; Cao, Y.; Gao, P.; Hua, X.; Zhang, D.; Jiang, T. Multi-UAV network assisted intelligent edge computing: Challenges and opportunities. China Commun. 2022, 19, 258–278. [Google Scholar] [CrossRef]
  26. McEnroe, P.; Wang, S.; Liyanage, M. A survey on the convergence of edge computing and AI for UAVs: Opportunities and challenges. IEEE Internet Things J. 2022, 9, 15435–15459. [Google Scholar] [CrossRef]
  27. Ding, R.; Gao, F.; Shen, X.S. 3D UAV trajectory design and frequency band allocation for energy-efficient and fair communication: A deep reinforcement learning approach. IEEE Trans. Wirel. Commun. 2020, 19, 7796–7809. [Google Scholar] [CrossRef]
  28. Wang, C.; Deng, D.; Xu, L.; Wang, W. Resource scheduling based on deep reinforcement learning in UAV assisted emergency communication networks. IEEE Trans. Commun. 2022, 70, 3834–3848. [Google Scholar] [CrossRef]
  29. Yuan, Z.W.; Zhang, J. Feature extraction and image retrieval based on AlexNet. In Proceedings of the Eighth International Conference on Digital Image Processing (ICDIP 2016), SPIE, Vancouver, BC, Canada, 16–18 February 2016; Volume 10033, pp. 65–69. [Google Scholar]
  30. Sengupta, A.; Ye, Y.; Wang, R.; Liu, C.; Roy, K. Going deeper in spiking neural networks: VGG and residual architectures. Front. Neurosci. 2019, 13, 95. [Google Scholar] [CrossRef]
  31. Wu, Z.; Shen, C.; Van Den Hengel, A. Wider or deeper: Revisiting the resnet model for visual recognition. Pattern Recognit. 2019, 90, 119–133. [Google Scholar] [CrossRef]
  32. Devlin, J. Bert: Pre-training of deep bidirectional transformers for language understanding. arXiv 2018, arXiv:1810.04805. [Google Scholar]
  33. Floridi, L.; Chiriatti, M. GPT-3: Its nature, scope, limits, and consequences. Minds Mach. 2020, 30, 681–694. [Google Scholar] [CrossRef]
  34. Raffel, C.; Shazeer, N.; Roberts, A.; Lee, K.; Narang, S.; Matena, M.; Zhou, Y.; Li, W.; Liu, P.J. Exploring the limits of transfer learning with a unified text-to-text transformer. J. Mach. Learn. Res. 2020, 21, 1–67. [Google Scholar]
  35. Gulati, A.; Qin, J.; Chiu, C.C.; Parmar, N.; Zhang, Y.; Yu, J.; Han, W.; Wang, S.; Zhang, Z.; Wu, Y.; et al. Conformer: Convolution-augmented transformer for speech recognition. arXiv 2020, arXiv:2005.08100. [Google Scholar]
  36. Zheng, Q.; Zhao, P.; Zhang, D.; Wang, H. MR-DCAE: Manifold regularization-based deep convolutional autoencoder for unauthorized broadcasting identification. Int. J. Intell. Syst. 2021, 36, 7204–7238. [Google Scholar] [CrossRef]
  37. Lee, T.; Mckeever, S.; Courtney, J. Flying free: A research overview of deep learning in drone navigation autonomy. Drones 2021, 5, 52. [Google Scholar] [CrossRef]
  38. Ding, X.; Zhang, X.; Ma, N.; Han, J.; Ding, G.; Sun, J. Repvgg: Making vgg-style convnets great again. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Nashville, TN, USA, 19–25 June 2021; pp. 13733–13742. [Google Scholar]
  39. Koonce, B.; Koonce, B. EfficientNet. In Convolutional Neural Networks with Swift for Tensorflow: Image Recognition and Dataset Categorization; Apress: Berkeley, CA, USA, 2021; pp. 109–123. [Google Scholar]
  40. Han, K.; Xiao, A.; Wu, E.; Guo, J.; Xu, C.; Wang, Y. Transformer in transformer. Adv. Neural Inf. Process. Syst. 2021, 34, 15908–15919. [Google Scholar]
  41. Panboonyuen, T.; Jitkajornwanich, K.; Lawawirojwong, S.; Srestasathiern, P.; Vateekul, P. Transformer-based decoder designs for semantic segmentation on remotely sensed images. Remote Sens. 2021, 13, 5100. [Google Scholar] [CrossRef]
  42. Yuan, M.; Ren, D.; Feng, Q.; Wang, Z.; Dong, Y.; Lu, F.; Wu, X. MCAFNet: A multiscale channel attention fusion network for semantic segmentation of remote sensing images. Remote Sens. 2023, 15, 361. [Google Scholar] [CrossRef]
  43. Chen, Y.; Gu, X.; Liu, Z.; Liang, J. A fast inference vision transformer for automatic pavement image classification and its visual interpretation method. Remote Sens. 2022, 14, 1877. [Google Scholar] [CrossRef]
  44. Zeng, G.; Wu, Z.; Xu, L.; Liang, Y. Efficient Vision Transformer YOLOv5 for Accurate and Fast Traffic Sign Detection. Electronics 2024, 13, 880. [Google Scholar] [CrossRef]
  45. Yu, M.; Qin, F. Research on the Applicability of Transformer Model in Remote-Sensing Image Segmentation. Appl. Sci. 2023, 13, 2261. [Google Scholar] [CrossRef]
  46. Reedha, R.; Dericquebourg, E.; Canals, R.; Hafiane, A. Transformer neural network for weed and crop classification of high resolution UAV images. Remote Sens. 2022, 14, 592. [Google Scholar] [CrossRef]
  47. Gu, A.; Dao, T. Mamba: Linear-time sequence modeling with selective state spaces. arXiv 2023, arXiv:2312.00752. [Google Scholar]
  48. Chen, K.; Chen, B.; Liu, C.; Li, W.; Zou, Z.; Shi, Z. Rsmamba: Remote sensing image classification with state space model. IEEE Geosci. Remote. Sens. Lett. 2024, 21, 310520. [Google Scholar] [CrossRef]
  49. Peng, S.; Zhu, X.; Deng, H.; Lei, Z.; Deng, L.J. Fusionmamba: Efficient image fusion with state space model. arXiv 2024, arXiv:2404.07932. [Google Scholar]
  50. Ma, W.; Yang, Q.; Wu, Y.; Zhao, W.; Zhang, X. Double-branch multi-attention mechanism network for hyperspectral image classification. Remote Sens. 2019, 11, 1307. [Google Scholar] [CrossRef]
  51. Zhu, Q.; Cai, Y.; Fang, Y.; Yang, Y.; Chen, C.; Fan, L.; Nguyen, A. Samba: Semantic segmentation of remotely sensed images with state space model. Heliyon 2024, 10, e12345. [Google Scholar] [CrossRef] [PubMed]
  52. Zhao, S.; Chen, H.; Zhang, X.; Xiao, P.; Bai, L.; Ouyang, W. Rs-mamba for large remote sensing image dense prediction. arXiv 2024, arXiv:2404.02668. [Google Scholar] [CrossRef]
  53. Wang, L.; Li, D.; Dong, S.; Meng, X.; Zhang, X.; Hong, D. PyramidMamba: Rethinking Pyramid Feature Fusion with Selective Space State Model for Semantic Segmentation of Remote Sensing Imagery. arXiv 2024, arXiv:2406.10828. [Google Scholar]
  54. Andle, J.; Soucy, N.; Socolow, S.; Sekeh, S.Y. The Stanford Drone Dataset Is More Complex Than We Think: An Analysis of Key Characteristics. IEEE Trans. Intell. Veh. 2022, 8, 1863–1873. [Google Scholar] [CrossRef]
  55. Huang, R.; Huang, Z.; Su, S. A Faster, lighter and stronger deep learning-based approach for place recognition. In Proceedings of the CCF Conference on Computer Supported Cooperative Work and Social Computing, Beijing, China, 24–26 August 2022; Springer: Berlin/Heidelberg, Germany, 2022; pp. 453–463. [Google Scholar]
  56. Vujović, Ž.; Hamidović, S.; Kisić, A.; Alihodžić, A. Classification model evaluation metrics. Int. J. Adv. Comput. Sci. Appl. 2021, 12, 599–606. [Google Scholar] [CrossRef]
Figure 1. Mamba-VNPS employs a comprehensive feature-extraction network to process baseline spatial data, generating reference feature vectors for navigation. During real-time operations, the UAV captures images that are processed through a lightweight feature extractor. The extracted features from these real-time images are then compared to the pre-processed reference vectors using a sliding-window similarity-comparison mechanism, facilitating accurate positioning and pose estimation.
Figure 2. The feature extraction workflow in Mamba-VNPS. To enhance feature representation, the system integrates a Pyramid Extraction Block, allowing multi-scale analysis. The feature extraction pipeline includes four consecutive Mamba Blocks, each refining the feature maps through state selection. During the inference phase, residual connections allow the system to skip these Mamba Blocks, maintaining efficiency without compromising accuracy. The final output is a compressed feature vector that will be used for matching and localization in real-time UAV navigation.
Figure 3. Comparison of the Mamba-VNPS feature extraction module with recent benchmark methods using UAV-captured aerial images. The red-framed section highlights Mamba-VNPS’s refined capability to capture localized details essential for accurate navigation. This zoom-in illustrates how the algorithm maintains spatial integrity and clarity.
Figure 4. The scanning process optimizes feature extraction by utilizing selective scanning across both horizontal and vertical axes. In the “Scan Expand” step, the scanning mechanism dynamically adjusts to the spatial structure of the image, allowing for a more focused analysis of important features. Meanwhile, the “Scan Sequence” illustrates how each region of the image is processed in a sequential manner, ensuring continuity and enhancing the system’s ability to capture critical spatial information effectively.
Figure 5. A three-stage process for efficient feature vector matching. The first stage, frame queue management, organizes feature vectors using a sliding window to structure the input data. In the second stage, matching calculation, weighted cosine similarity is computed between vectors, incorporating a scaling factor to adjust for vector magnitudes. The final stage, position comparison, compares a weight matrix and a position matrix to produce the final localization output, ensuring precise matching and spatial positioning of feature vectors.
Figure 6. Configuration of the DJI Mavic Pro 2 drone used for data collection, equipped with an RK3399 CPU for onboard processing and a Mono RGB camera for high-resolution image capture.
Figure 7. An overview of the self-built dataset, showing a top-down view (left) and a side view (right) of the urban environment. The dataset covers an area of 1 km by 800 m in Chengdu, China, providing a comprehensive testbed for low-altitude drone navigation.
Figure 8. Comparison of trajectory tracking results for Mamba-VNPS, RepVGG, AutoEncoder, and GroundTruth across three flight routes. Highlighted red boxes show key challenging regions where Mamba-VNPS significantly outperforms the other methods in terms of positional accuracy.
Figure 9. Radar chart comparing Mamba-VNPS, RepVGG, and AutoEncoder across six performance metrics. Mamba-VNPS consistently outperforms other methods, particularly in terms of memory usage, noise robustness, and inference accuracy.
Table 1. Position error indicators (RMSE, maximum error, and standard deviation) of map-based methods, ORB-SLAM3 and VINS-Mono, and learning-based methods, Mamba-VNPS, RepVGG, and AutoEncoder (before loss) on three routes.

Flight Route | Approaches | RMSE (m) | Max (m) | Stand. (m)
Route 1 | ORB-SLAM3 [16] | 18.50 | 21.12 | 4.23
Route 1 | VINS-Mono [17] | 15.47 | 17.63 | 4.15
Route 1 | AutoEncoder [20] | 14.16 | 17.29 | 4.86
Route 1 | RepVGG [55] | 8.79 | 10.55 | 3.94
Route 1 | Mamba-VNPS | 6.23 | 9.58 | 2.15
Route 2 | ORB-SLAM3 | 18.65 | 21.78 | 4.35
Route 2 | VINS-Mono | 15.34 | 17.42 | 4.05
Route 2 | AutoEncoder | 12.64 | 15.50 | 4.29
Route 2 | RepVGG | 8.94 | 10.67 | 4.01
Route 2 | Mamba-VNPS | 7.14 | 10.23 | 2.94
Route 3 | ORB-SLAM3 | 19.27 | 22.34 | 4.55
Route 3 | VINS-Mono | 15.75 | 16.82 | 3.87
Route 3 | AutoEncoder | 13.62 | 17.45 | 4.91
Route 3 | RepVGG | 9.12 | 11.90 | 4.12
Route 3 | Mamba-VNPS | 7.76 | 12.67 | 3.65
Table 2. Ablation study on multi-scale feature extraction.

Gaussian Pyramid | Residual Network | Cross-Scale Fusion | RMSE (m)
✓ | ✓ | ✓ | 7.76
✗ | ✓ | ✓ | 10.11
✓ | ✗ | ✓ | 10.73
✓ | ✓ | ✗ | 13.05
Table 3. Ablation study on dynamic weighted long-sequence extraction.

Mamba Blocks | Long-Seq. Memory Unit | Adaptive Weighting | Global Context Encoder | RMSE (m)
4 | ✓ | ✓ | ✓ | 7.76
4 | ✗ | ✓ | ✓ | 8.91
4 | ✓ | ✗ | ✓ | 9.34
4 | ✓ | ✓ | ✗ | 9.63
2 | ✓ | ✓ | ✓ | 8.21
2 | ✗ | ✓ | ✓ | 8.97
2 | ✓ | ✗ | ✓ | 9.61
2 | ✓ | ✓ | ✗ | 9.83
1 | ✓ | ✓ | ✓ | 8.91
1 | ✗ | ✓ | ✓ | 9.78
1 | ✓ | ✗ | ✓ | 10.45
1 | ✓ | ✓ | ✗ | 10.73
Table 4. Ablation study on efficient feature vector matching.

Sparse Matching Algorithm | Similarity Matrix Computation | Sliding Window Optimization | RMSE (m)
✓ | ✓ | ✓ | 7.76
✗ | ✓ | ✓ | 9.13
✓ | ✗ | ✓ | 9.22
✓ | ✓ | ✗ | 9.96
Disclaimer/Publisher’s Note: The statements, opinions and data contained in all publications are solely those of the individual author(s) and contributor(s) and not of MDPI and/or the editor(s). MDPI and/or the editor(s) disclaim responsibility for any injury to people or property resulting from any ideas, methods, instructions or products referred to in the content.
