Article

Towards UAV Localization in GNSS-Denied Environments: The SatLoc Dataset and a Hierarchical Adaptive Fusion Framework

1  Guangxi Key Laboratory of Intelligent Control and Maintenance of Power Equipment, School of Electrical Engineering, Guangxi University, Nanning 530004, China
2  Institute of Automation, Chinese Academy of Sciences, Beijing 100190, China
*  Author to whom correspondence should be addressed.
Current address: School of Electrical Engineering, Guangxi University, No. 100, Daxue East Road, Nanning 530004, China.
Remote Sens. 2025, 17(17), 3048; https://doi.org/10.3390/rs17173048
Submission received: 20 June 2025 / Revised: 19 August 2025 / Accepted: 20 August 2025 / Published: 2 September 2025
(This article belongs to the Section Remote Sensing Image Processing)

Abstract

Precise and robust localization for micro Unmanned Aerial Vehicles (UAVs) in GNSS-denied environments is hindered by the lack of diverse datasets and the limited real-world performance of existing visual matching methods. To address these gaps, we introduce two contributions: (1) the SatLoc dataset, a new benchmark featuring synchronized, multi-source data from varied real-world scenarios tailored for UAV-to-satellite image matching, and (2) SatLoc-Fusion, a hierarchical localization framework. Our proposed pipeline integrates three complementary layers: absolute geo-localization via satellite imagery using DinoV2, high-frequency relative motion tracking from visual odometry with XFeat, and velocity estimation using optical flow. An adaptive fusion strategy dynamically weights the output of each layer based on real-time confidence metrics, ensuring an accurate and self-consistent state estimate. Deployed on a 6 TOPS edge computer, our system achieves real-time operation at over 2 Hz, with an absolute localization error below 15 m and effective trajectory coverage exceeding 90 %, demonstrating state-of-the-art performance. The SatLoc dataset and fusion pipeline provide a robust and comprehensive baseline for advancing UAV navigation in challenging environments.

1. Introduction

The widespread application of Unmanned Aerial Vehicles (UAVs) in fields such as surveillance, logistics delivery, and infrastructure inspection heavily relies on their reliable autonomous navigation capabilities. However, in critical operational environments like urban canyons, indoor spaces, or signal-jammed areas, Global Navigation Satellite System (GNSS) signals are susceptible to factors such as building obstruction, multipath effects, and malicious jamming. This leads to a significant decline in localization accuracy and reliability, posing a severe challenge to the autonomous flight of UAVs [1].
To address the challenges of absent or unreliable GNSS signals, visual localization technology, particularly Absolute Visual Localization (AVL), has emerged as a highly promising solution. By matching real-time images captured by the UAV with geo-referenced maps, AVL provides drift-free localization results [2]. Despite its potential, visual localization faces numerous challenges, including variations in illumination, weather, and seasons, as well as significant viewpoint differences between the UAV and the reference map (e.g., satellite imagery) [3]. For small-scale rotorcraft UAVs, size and weight limitations restrict the type and number of sensors that can be carried, while limited onboard computational resources constrain the complexity of algorithms [4]. Furthermore, lower flight altitudes result in substantial discrepancies between the UAV’s aerial perspective and the satellite’s nadir view, increasing the difficulty of visual matching localization [5,6]. Therefore, developing robust autonomous localization technologies for GNSS-denied environments is crucial for expanding the application scenarios of UAVs.
Current research on vision-based UAV localization primarily revolves around cross-view image matching, visual odometry (VO), and multi-source information fusion. Cross-View Geo-Localization (CVGL) aims to determine the geographic location by matching images from different viewpoints, such as a UAV’s aerial view and a satellite’s view [2]. Existing methods can generally be categorized into retrieval-based and direct matching/inference-based approaches. Retrieval-based methods locate the query image by searching for the most similar image in a large geo-tagged reference database [5,7], but this requires extensive database support and incurs high computational costs. In contrast, direct matching methods attempt to infer the location directly or match local features using deep networks [8], offering higher potential accuracy but demanding greater feature quality and matching robustness. Recently, deep learning models like Transformers have shown great potential in handling viewpoint changes due to their powerful global context modeling capabilities [5,9,10]. Moreover, foundation models such as DINOv2 are becoming a new trend for feature extraction, owing to their superior generalization and the advantage of not requiring task-specific fine-tuning [11]. Nevertheless, when faced with the inherent, significant differences in viewpoint, scale, illumination, and seasonality between UAV and satellite imagery, these methods still struggle to guarantee the stability and reliability of localization results in real-time missions.
Concurrently, visual odometry (VO) or Simultaneous Localization and Mapping (SLAM) techniques estimate the relative motion of the UAV through inter-frame feature matching (e.g., using feature points like ORB or SuperPoint [12]). However, their inherent issue of cumulative drift prevents them from being used alone for long-term, accurate localization [2]. To overcome the limitations of a single information source, multi-source fusion has become an essential approach. Researchers commonly employ Kalman filters (e.g., EKF, AKF [13,14]) or optimization-based methods (e.g., graph optimization [15]) to fuse information from visual localization, VO/SLAM, and Inertial Measurement Units (IMUs). A key aspect of this process is to effectively quantify the confidence of different information sources and adaptively adjust the fusion weights based on their reliability, which is crucial for ensuring the final localization accuracy and robustness [13].
Despite these advancements, several key bottlenecks persist. Firstly, compared to the abundance of public datasets in the field of ground-based visual localization (e.g., Pitts30k, MSLS) [16,17], there is a scarcity of large-scale public datasets specifically for UAV-satellite image matching, especially those targeting low-altitude, small-scale rotorcraft UAVs in diverse real-world scenarios [18]. Existing datasets like University-1652 [19,20] and SUES-200 mostly focus on singular scenes such as urban or campus environments, making it difficult to comprehensively evaluate the geographical adaptability of algorithms. Secondly, current research often fails to adequately address the coupled challenges of robust and real-time cross-view matching, the cumulative drift of visual odometry, and the effective fusion of multi-modal information.
In response to the aforementioned problems and challenges, this paper introduces the SatLoc dataset, specifically designed for the research of matching localization between small-scale rotorcraft UAVs and satellite imagery in low-altitude (100–300 m) complex environments, along with its accompanying baseline fusion localization framework. The SatLoc dataset features scene diversity, multi-source information synchronization, and high precision, covering various terrain types and weather conditions, thereby providing a benchmark for evaluating algorithm robustness. Our proposed fusion localization framework adopts a novel three-layer algorithm design and a confidence-adaptive fusion method, aimed at addressing the challenges of severe onboard resource constraints and high variability in visual information availability. Through the synergy of a three-layer algorithm encompassing satellite–UAV image matching, inter-frame feature point matching, and optical flow combined with altitude measurements, the framework significantly enhances localization accuracy and robustness in complex environments. Experiments demonstrate that the framework can achieve a real-time computation rate of no less than 2 Hz on an airborne computer with 6 TOPS of processing power, with an absolute localization error of less than 15 m. Its performance surpasses existing methods, providing a competitive benchmark for subsequent related research.

2. Materials and Methods

2.1. SatLoc Dataset

Although various localization algorithms (e.g., feature extraction, visual odometry) have made progress on specific datasets, achieving sustained robustness under diverse real-world conditions remains a pervasive challenge [1]. This “robustness gap” is the core problem that this paper’s multi-layer fusion framework and the diverse SatLoc dataset aim to address. Many studies claim SOTA performance on specific datasets [21], but [22] points out that supervised methods tend to overfit region-specific cues, exhibiting limited generalization capabilities in new regions. Ref. [3] also mentions that even SOTA methods encounter difficulties in perceptually aliased or less structured environments, as well as when facing viewpoint and illumination changes. This indicates that high performance on a single dataset does not equate to robustness in practical applications. This work addresses this challenge by creating a more diverse dataset (SatLoc) for more comprehensive generalization assessment and by designing a multi-layer framework to cope with situations where a single method might fail. The core design principles of SatLoc include the following:
  • Targeted Platform Selection: Data acquisition using representative commercial DJI Mavic UAVs and custom-built small-scale rotorcraft UAVs to reflect the flight and imaging characteristics of platforms in practical applications.
  • Geomorphological Scene Diversity: Covering typical geographical environments for low-altitude operations of small-scale UAVs, including urban, suburban, rural, water bodies, and industrial areas. Efforts were made to include different lighting (noon/morning/dusk) and weather conditions to ensure sufficient testing of algorithm generalization.
  • Dynamic Diversity: Incorporating diverse flight modes, such as different altitudes, speeds, and maneuvers (hovering, straight-line flight, turns, etc.), to approximate the flight conditions of real mission scenarios as closely as possible.
  • Data Quality and Completeness: Providing high-precision trajectories as ground truth. In addition to the core UAV images and satellite images, necessary sensor intrinsic and extrinsic parameters, as well as other sensor data that may be used for fusion (including IMU and barometer), are provided, ensuring precise temporal synchronization of all sensor data with the ground truth trajectory.

2.1.1. Data Acquisition

Considering the ubiquity and diversity of quadrotor flight platforms in real-world applications, we selected two types of flight platforms: the commercially available and widely accepted DJI Mavic series aircraft and a custom-built quadrotor aircraft. The image sensors on these two platforms differ in imaging clarity and stability. The latter’s imagery may contain more motion blur and noise, a characteristic deliberately retained to better approximate the vibration and image-clarity issues encountered in actual low-altitude flight environments. This also facilitates the evaluation of matching algorithms’ robustness to image noise. Furthermore, hardware synchronization trigger signals are used on the custom-built platform to ensure the temporal alignment quality of data from various sensors. The appearance and configuration of the two flight platforms are shown in Figure 1.
UAV Platforms
  • An off-the-shelf DJI Mavic Air 2 with a take-off weight of 520 g and a 320 mm wheelbase, equipped with a three-axis stabilized gimbal capable of providing high-quality, stable imagery, as well as a built-in GNSS time synchronization function. Its visible-light imaging component comprises a 1/2-inch CMOS sensor with a 24 mm equivalent focal length, an 84-degree wide field of view, an f/2.8 aperture, a downward-facing shooting mode, and an original image resolution of 1920 × 1080.
  • A custom-built multi-rotor flight platform using the PX4 flight control system, equipped with a miniature three-axis electro-optical pod. The visible light imaging component has a focal length of 2.4–9.1 mm, and an original image resolution of 1920 × 1080.
Sensors
  • GNSS device: The Mavic Air 2 uses a built-in GNSS positioning module with a horizontal positioning accuracy of 1.5 m. The custom-built flight platform uses a u-blox M8030-KT with a positioning accuracy of 1.5 m. Note that GNSS module data is only used to record ground truth trajectories and is not used in the evaluation algorithms.
  • IMU: The Mavic Air 2 uses its built-in IMU. The custom-built flight platform uses a BMI-055 as the onboard IMU, providing three-axis gyroscope and three-axis accelerometer data at a sampling frequency of 500 Hz.
  • Barometer/Altimeter: The custom-built flight platform uses an MS5611 as a pressure sensor, and its data is fused with IMU data to calculate relative altitude, with an elevation measurement accuracy of 1.5 m.
Time Synchronization
For the custom-built flight platform, hardware synchronization trigger signals guarantee the temporal alignment of the image, GNSS positioning, IMU, and barometer data streams. The Mavic Air 2 platform relies on its built-in image coordinate alignment function.
Route Planning
A total of 50 flight trajectories were collected, with a total mileage of 395 km and a flight altitude range of 100–300 m, covering the typical flight envelope of small-scale rotorcraft.
Environment
The SatLoc dataset covers a rich variety of environmental features, including urban buildings, road networks, lakes and wetlands, mountainous forests, and rural farmland. The coverage area is approximately 136.3 square kilometers.
Satellite Imagery
Satellite images were obtained from public data sources such as Google Earth and Siwei Earth. Level-18 tile data corresponding to the coordinate range covered by the route planning was downloaded; the imagery was acquired between 2022 and 2024. Each satellite tile is annotated with the geographic coordinates of its upper-left and lower-right corners, so the geographic coordinates of any pixel can be obtained by interpolation.
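As an illustration, the minimal sketch below maps a pixel in such a corner-annotated, north-up tile to geographic coordinates by linear interpolation; the function name and the example corner values are purely illustrative and are not part of the released dataset tooling.

```python
def pixel_to_geo(px, py, width, height, ul_lat, ul_lon, lr_lat, lr_lon):
    """Return (lat, lon) of pixel (px, py); (0, 0) is the upper-left corner.
    Assumes an axis-aligned (north-up) tile, so linear interpolation between
    the annotated corners is adequate at tile scale."""
    lon = ul_lon + (px / (width - 1)) * (lr_lon - ul_lon)
    lat = ul_lat + (py / (height - 1)) * (lr_lat - ul_lat)
    return lat, lon

# Example with placeholder corner coordinates: centre pixel of a 256x256 tile
lat_c, lon_c = pixel_to_geo(128, 128, 256, 256,
                            ul_lat=22.8600, ul_lon=108.2800,
                            lr_lat=22.8555, lr_lon=108.2845)
```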
The details of data collection are summarized in Table 1.

2.1.2. Dataset Characteristics

A comparison of the SatLoc dataset with existing visual localization benchmarks is provided in Table 2. The primary design goal of SatLoc is to establish a comprehensive and challenging benchmark to advance UAV visual localization in GNSS-denied environments, with a specific focus on evaluating the environmental robustness of satellite image retrieval and matching algorithms. Through meticulously planned data acquisition, we aim to deliver a resource that reflects real-world complexity and diversity, thereby addressing the shortcomings of current datasets. SatLoc emphasizes deliberately curated diversity across key variables over mere data quantity. For instance, we deliberately expanded the seasonal span of the satellite imagery because, as the introduction correctly notes, real-world visual localization must cope with “illumination, weather conditions, seasonal changes, and viewpoint differences.” The temporal gap between satellite and UAV imagery—for example, matching a UAV image from a lush summer flight with a satellite map from a sparsely vegetated spring—directly embodies this challenge. This makes the dataset a more difficult and, therefore, more realistic benchmark for evaluating algorithm robustness. Consequently, the SatLoc dataset can serve as a “stress test” specifically designed for cross-view localization algorithms. Samples of captured images and their corresponding reference satellite imagery are shown in Figure 2.
Furthermore, we have categorized the dataset into three difficulty levels (easy, medium, hard) based on scene texture richness, as shown in Figure 3. For example, scenes with sparse global features, such as mountainous jungles and lakes, are classified as ‘hard’, as they may require stitching multiple images to construct an effective global descriptor. Conversely, scenes with distinct features like roads, buildings, and landmarks are labeled ‘easy’, as their nadir views facilitate straightforward matching with satellite imagery. We introduce a quantitative method for classifying scene difficulty based on the richness of landmark features critical for global matching. For each grayscale UAV image I, we compute a difficulty index, $D_{index}$, derived from two metrics: image entropy and feature density. First, the Shannon entropy $H(I) = -\sum_i p(i) \log_2 p(i)$ is calculated to measure textural complexity, where $p(i)$ is the probability of pixel intensity i. Concurrently, the feature density $F(I)$ is determined by normalizing the count of detected FAST corners by the image area to quantify the prevalence of salient keypoints. These two metrics are normalized across the dataset into $H_{norm}(I)$ and $F_{norm}(I)$ and then linearly combined into a unified Richness Score, $R(I) = w_H \cdot H_{norm}(I) + w_F \cdot F_{norm}(I)$. As difficulty is inversely proportional to feature richness, the final index is defined as $D_{index}(I) = (R(I) + \epsilon)^{-1}$, where $\epsilon$ is a small constant to prevent division by zero. To classify an entire trajectory, we first calculate the mean difficulty index, $\bar{D}_{traj}$, over all images within that trajectory. We then establish two thresholds, $T_{easy}$ and $T_{hard}$, based on the 33rd and 66th percentiles of the $\bar{D}_{traj}$ values across all 50 trajectories in the SatLoc dataset, and classify each trajectory as ‘easy’, ‘medium’, or ‘hard’ by applying these thresholds to its mean $\bar{D}_{traj}$.
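The snippet below sketches this difficulty index for a single image using OpenCV’s FAST detector; the weights $w_H$, $w_F$ and the per-image normalization ranges are illustrative placeholders, whereas the paper normalizes the two metrics across the whole dataset.

```python
import cv2
import numpy as np

def difficulty_index(gray, w_h=0.5, w_f=0.5, eps=1e-6,
                     h_range=(0.0, 8.0), f_range=(0.0, 2e-3)):
    """Sketch of D_index = 1 / (R + eps) for an 8-bit grayscale image.
    h_range/f_range stand in for the dataset-wide normalisation used in the paper."""
    # Shannon entropy of the grey-level histogram
    hist = np.bincount(gray.ravel(), minlength=256).astype(np.float64)
    p = hist / hist.sum()
    p = p[p > 0]
    H = -np.sum(p * np.log2(p))

    # FAST-corner density (keypoints per pixel)
    fast = cv2.FastFeatureDetector_create()
    F = len(fast.detect(gray, None)) / float(gray.size)

    # Min-max normalisation into [0, 1]
    H_n = np.clip((H - h_range[0]) / (h_range[1] - h_range[0]), 0.0, 1.0)
    F_n = np.clip((F - f_range[0]) / (f_range[1] - f_range[0]), 0.0, 1.0)

    R = w_h * H_n + w_f * F_n          # Richness Score
    return 1.0 / (R + eps)             # difficulty is inverse richness
```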
Through these features, SatLoc provides a more balanced and challenging testbed for algorithmic research and helps address the common machine learning problem of insufficient coverage for ‘long-tail’ scenarios [24].

2.2. Methodology: Three-Layer Adaptive Fusion Pipeline

As previously mentioned, over-reliance on a single visual localization mode (e.g., solely depending on satellite image matching or visual odometry) is prone to failure in complex environments. For instance, in texture-sparse regions, inter-frame matching for visual odometry may fail, while during adverse weather conditions or poor satellite image quality, the success rate of satellite matching significantly drops. This provides the motivation for the multi-layer fusion framework proposed in this paper.
As a baseline evaluation method for the SatLoc dataset, this paper proposes SatLoc-Fusion, a novel three-layer estimation pipeline designed to ensure that visual localization copes effectively with variations in image information availability in complex environments. The method drives three core visual localization and velocity estimation layers to provide position and velocity estimates, which are then fused using a classic Kalman filtering approach to generate final, robust, and self-consistent state estimates. The code for the proposed SatLoc-Fusion pipeline is available at https://github.com/ameth64/SatLoc-Fusion (to be made accessible after 1 October 2025).

2.2.1. System Overview

The goal of the SatLoc-Fusion system is to estimate the UAV’s position p R 3 in real-time. Its overall architecture is shown in Figure 4. Inputs include the following: UAV camera image sequences, relative altitude sensor readings, IMU data, and a reference satellite map. The output is a fused state vector x T with uncertainty estimation. The pipeline consists of three parallel localization layers, whose outputs are fed into an adaptive fusion module based on a Kalman filter framework for final state estimation. The confidence of each layer’s estimation result is mapped to process noise and then fed into the Kalman filter framework.
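Conceptually, each layer hands the fuser an estimate plus a scalar confidence that is later mapped to a noise variance. A minimal sketch of such an interface is given below; the class and function names are illustrative and are not the released implementation.

```python
from dataclasses import dataclass
import numpy as np

@dataclass
class LayerOutput:
    """One layer's estimate as consumed by the fuser (illustrative interface)."""
    value: np.ndarray    # z1: [x, y] m, z2: [dx, dy] m, z3: [vx, vy] m/s
    confidence: float    # C in [0, 1]
    valid: bool = True   # False when the layer declares failure

def confidence_to_variance(conf: float, sigma_base_sq: float, eps: float = 1e-3) -> float:
    """Map a confidence C to a noise variance, sigma^2 = sigma_base^2 / (C + eps),
    as used for R_k (Layer 1) and the process noise (Layers 2-3) in Section 2.2."""
    return sigma_base_sq / (conf + eps)
```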

2.2.2. Layer 1: Absolute Localization (Aerial-Satellite Image Matching)

  • Objective: Provide global absolute pose estimation by matching the current UAV view with a satellite map to correct accumulated drift and anchor the localization result in a global coordinate system.
  • Input: UAV downward-facing camera image $I_{uav}(t)$ at the current time t, and a set of corresponding satellite map tiles $I_{sat}(t)$ retrieved based on a coarse prior position or the estimated position from the previous time step.
  • Output: Estimated position $z_1 = [x, y]^T$ in the global coordinate system.
  • Processing Flow: Use a pre-trained DinoV2 model (ViT architecture [8]) to extract features from $I_{uav}(t)$ and $I_{sat}(t)$, respectively. Similarity is determined through criteria of cosine similarity and neighborhood spatial consistency. Finally, the uncertainty of this layer is calculated based on the distribution of similarity scores and pixel scale similarity.
  • Matching Strategy: Our localization is achieved through a hierarchical matching process that begins with global image retrieval and concludes with local patch-level refinement, as shown in Figure 5. The process leverages a database of reference satellite image tiles, $\{I_i^S\}$, prepared offline from a large-scale image, $I_{ini}^S$. For each tile, a global descriptor, $f_i^S$, is pre-computed using a backbone network and optimal transport aggregation [25]. During online operation, a query image, $I_t^U$, is captured by the UAV, preprocessed for size and resolution consistency, and encoded into an equivalent descriptor, $f_t^U$. Coarse localization is first performed by retrieving the best-matching satellite tile, $I_{best}^S$, which maximizes the cosine similarity with the query descriptor, as defined in (1):
    s_{t,i} = \frac{f_t^U \cdot f_i^S}{\lVert f_t^U \rVert \, \lVert f_i^S \rVert}
This is followed by a fine localization step. Given that the UAV camera is nadir-pointing, we use the image center as the UAV’s projected position. A multi-scale pyramid is generated from the central patch of $I_t^U$ and matched against the candidate tile $I_{best}^S$ to determine the final, precise absolute position.
  • Uncertainty Metric ($C_1$): The reliability of this absolute localization result is quantified using the following two indicators (a code sketch of the resulting confidence computation is given after this list):
    • Distribution Concentration of Matching Scores: The candidate matching scores are normalized, and the confidence is determined by the distinctiveness between the best score and the average of the sub-optimal scores; the more distinctly the best score stands out, the higher the confidence. Let $S_{best}$ be the optimal matching score and $S_{sub\text{-}opts} = \mathrm{Mean}\{s_1, s_2, \ldots, s_M\}$ be the mean of a set of M sub-optimal scores, with both normalized to $[0, 1]$, and let $Thrs_{diff}$ be the score difference threshold. The confidence from the score distribution is expressed as:
      C_{score} = 1 - \mathrm{sigmoid}\big( Thrs_{diff} - (S_{best} - S_{sub\text{-}opts}) \big)
    • Pixel Scale Similarity: When the Ground Sampling Distance (GSD) of the UAV image is much smaller than that of the satellite image, the UAV’s field of view is too small, making global feature extraction difficult. Therefore, a smaller ratio between their GSDs indicates lower pixel scale similarity and thus lower confidence. Let $GSD_{UAV}$ be the GSD of the UAV image (meters/pixel) and $GSD_{SAT}$ be the GSD of the satellite image (meters/pixel); the ratio is $Ratio_{GSD} = GSD_{UAV} / GSD_{SAT}$. As the UAV’s flight altitude directly determines its $GSD_{UAV}$, this metric allows our framework to dynamically assess and adapt to the challenges of scale variation during matching. Let the acceptable ratio threshold be $Thrs_{ratio}$ (determined empirically based on the matching algorithm’s performance, set to 0.8 in this paper). The resulting confidence expression is:
      C_{res\_lower} = \mathrm{sigmoid}\big( Ratio_{GSD} - Thrs_{ratio} \big)
The final expression for $C_1$ is:
C_1 = w_{1,s} \cdot C_{score} + w_{1,r} \cdot C_{res\_lower}
where $w_{1,s} + w_{1,r} = 1$.
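A minimal sketch of the Layer 1 retrieval and confidence computation follows, assuming descriptors are plain NumPy vectors; the number of sub-optimal scores M, the score-difference threshold, and the weights take illustrative values (only $Thrs_{ratio} = 0.8$ is taken from the text).

```python
import numpy as np

def _sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def retrieve_tile(f_query, tile_descs):
    """Coarse retrieval by cosine similarity, Eq. (1); rows of tile_descs are
    the pre-computed tile descriptors f_i^S."""
    q = f_query / np.linalg.norm(f_query)
    D = tile_descs / np.linalg.norm(tile_descs, axis=1, keepdims=True)
    scores = D @ q                      # s_{t,i} for all tiles
    return int(np.argmax(scores)), scores

def layer1_confidence(scores, gsd_uav, gsd_sat, m_sub=5,
                      thr_diff=0.1, thr_ratio=0.8, w_s=0.5, w_r=0.5):
    """C1 from the score-distribution and pixel-scale terms. Scores are
    assumed already normalised to [0, 1] and to contain several candidates."""
    s = np.sort(scores)[::-1]
    s_best, s_sub = s[0], float(np.mean(s[1:1 + m_sub]))
    c_score = 1.0 - _sigmoid(thr_diff - (s_best - s_sub))
    c_res = _sigmoid(gsd_uav / gsd_sat - thr_ratio)     # Ratio_GSD - Thrs_ratio
    return w_s * c_score + w_r * c_res
```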

2.2.3. Layer 2: Relative Pose Estimation (Visual Odometry)

  • Objective: To accurately estimate the high-frequency relative motion (pose change) of the UAV between consecutive frames.
  • Input: Consecutive UAV camera images $I_{uav}(t)$ and $I_{uav}(t-1)$.
  • Output: Inter-frame relative displacement $z_2 = T$.
  • Processing Flow: Feature matching and relative displacement estimation are performed using XFeat. XFeat is a lightweight and efficient feature detector and matcher, particularly suitable for edge computing platforms, and it performs well in pose estimation and homography estimation tasks [26]. The relative displacement estimation process is as follows:
    • Use XFeat to detect keypoints and extract descriptors on $I_{uav}(t)$ and $I_{uav}(t-1)$.
    • Match the descriptors between the two image frames and use the Direct Linear Transform (DLT) algorithm with RANSAC to determine the homography matrix $H$.
    • Decompose the normalized homography matrix according to the form:
      H_{norm} = R + t\, n^T
      where $R$ is the rotation matrix and $t$ is the unscaled relative translation vector. Then, combined with the relative altitude measurement h, calculate the scale-recovered translation vector:
      T = h \cdot t
  • Confidence/Uncertainty Metric ($C_2$): Referencing common principles in multi-view geometry [27], the reliability of the relative pose estimation is quantified using the following metrics (see the sketch after this list):
    • Number/Proportion of Inliers: The number or proportion of matched point pairs retained after RANSAC estimation. The more inliers, the higher the confidence. Let $N_{inliers}$ be the number of matched inliers and $T_{inliers}$ be the inlier count threshold. The inlier count confidence is thus given by:
      Conf_{inliers} = \min(1, N_{inliers} / T_{inliers})
    • Distribution of Matched Points: Evaluate the spatial distribution of matched points across the image. A more uniform and widespread distribution indicates a more reliable estimation. To this end, the image is divided into k grid cells, and the ratio of non-empty cells (containing inliers) to the total number of cells is calculated. Let $k_{fill}$ be the number of non-empty cells. Then:
      Conf_{uniform} = k_{fill} / k
    The final confidence $C_2$ is expressed as:
    C_2 = w_{2,inlier} \cdot Conf_{inliers} + w_{2,uniform} \cdot Conf_{uniform}
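The following sketch reproduces the Layer 2 displacement and confidence computation with standard OpenCV routines standing in for the XFeat matcher; the inlier threshold, grid size, and weights are illustrative, and the selection among the up-to-four homography decompositions is omitted for brevity.

```python
import cv2
import numpy as np

def layer2_displacement_and_confidence(pts_prev, pts_curr, K, altitude,
                                       img_shape, grid=(8, 8),
                                       t_inliers=60, w_in=0.5, w_un=0.5):
    """pts_prev/pts_curr: matched pixel coordinates (Nx2, float32) from
    frames t-1 and t; K: 3x3 camera intrinsics; altitude: relative height h."""
    H, mask = cv2.findHomography(pts_prev, pts_curr, cv2.RANSAC, 3.0)
    inliers = mask.ravel().astype(bool)

    # Decompose H (given intrinsics K) into candidate R, t, n; take one candidate
    _, Rs, ts, ns = cv2.decomposeHomographyMat(H, K)
    t_unscaled = ts[0].ravel()
    T = altitude * t_unscaled[:2]        # scale-recovered horizontal displacement

    # Confidence term 1: inlier count saturated at the threshold
    c_inl = min(1.0, inliers.sum() / float(t_inliers))

    # Confidence term 2: fraction of occupied grid cells (spatial spread)
    h, w = img_shape[:2]
    gy = np.clip((pts_curr[inliers, 1] * grid[0] / h).astype(int), 0, grid[0] - 1)
    gx = np.clip((pts_curr[inliers, 0] * grid[1] / w).astype(int), 0, grid[1] - 1)
    c_uni = len(set(zip(gy.tolist(), gx.tolist()))) / float(grid[0] * grid[1])

    return T, w_in * c_inl + w_un * c_uni
```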

2.2.4. Layer 3: Velocity Estimation

  • Objective: To provide a direct estimate of the UAV’s velocity vector.
  • Input: Consecutive UAV camera images $I_{uav}(t)$ and $I_{uav}(t-1)$, and the synchronized relative altitude measurement $h(t)$ (from a barometer or altimeter).
  • Output: Estimated velocity vector $z_3 = [V_x, V_y]^T$ (in the world frame).
  • Processing Flow: Fuse optical flow with altitude information to estimate the UAV’s motion velocity. Optical flow directly estimates pixel motion [28,29], and when combined with altitude information, it can be used to derive the UAV’s velocity. Considering the computational constraints of edge computing hardware [30,31], the pyramidal Lucas-Kanade (LK) sparse optical flow method is adopted [32,33]. The processing flow is as follows:
    • Calculate the sparse optical flow field $(\delta u, \delta v)$ from image $I_{uav}(t-1)$ to $I_{uav}(t)$. Combined with Harris corner detection, optical flow is computed only for highly distinctive feature points to reduce redundant calculations.
    • Using the camera intrinsics and the current altitude $h(t)$, convert the pixel velocities $(\delta u / \Delta t, \delta v / \Delta t)$ to motion velocity on the dominant plane (assumed to be the ground here), thereby estimating the UAV’s horizontal velocity $(V_x, V_y)$.
  • Confidence/Uncertainty Metric ($C_3$): The reliability of the velocity estimation is quantified using the following metrics (see the sketch after this list):
    • Forward–Backward Consistency Error: For each feature point, calculate the forward optical flow from $t-1 \to t$ and the backward optical flow from $t \to t-1$. The discrepancy between the original position and the back-tracked position, $E_{fb}$, is taken as the forward–backward consistency error. A smaller value indicates higher confidence. Therefore, we define:
      Conf_{fb} = \exp(-k_{fb} \cdot E_{fb})
    • Photometric Consistency Error: The average Sum of Squared Differences (SSD) of brightness between the tracked feature point patch and the original patch; a smaller error indicates higher confidence. Specifically, the SSD is computed between the neighborhood $W(P_t)$ of a feature point in frame $I_{uav}(t)$ and its corresponding tracked neighborhood $W(P_{t+1})$ in frame $I_{uav}(t+1)$:
      \mathrm{SSD} = \sum_{i,j \in W} \big( I_t(P_t + u_{ij}) - I_{t+1}(P_{t+1} + u_{ij}) \big)^2
      where $u_{ij}$ represents the relative coordinates within the neighborhood. A smaller SSD corresponds to higher confidence. Therefore, we have:
      C_{photo} = \exp(-k_{ssd} \cdot \mathrm{SSD})
    The final confidence is expressed as:
    C_3 = w_{3,fb} \cdot Conf_{fb} + w_{3,photo} \cdot Conf_{photo}
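A compact sketch of Layer 3 using OpenCV’s pyramidal LK tracker is shown below; for brevity it implements only the forward–backward term of $C_3$ (the photometric SSD term follows the same pattern), and the detector settings and the decay constant are illustrative.

```python
import cv2
import numpy as np

def layer3_velocity(prev_gray, curr_gray, fx, fy, altitude, dt, k_fb=2.0):
    """Pyramidal LK flow on Harris corners, scaled to metric ground-plane
    velocity via altitude and focal length (nadir camera, flat ground)."""
    p0 = cv2.goodFeaturesToTrack(prev_gray, maxCorners=200, qualityLevel=0.01,
                                 minDistance=10, useHarrisDetector=True)
    p1, st, _ = cv2.calcOpticalFlowPyrLK(prev_gray, curr_gray, p0, None)
    p0_back, st_b, _ = cv2.calcOpticalFlowPyrLK(curr_gray, prev_gray, p1, None)

    good = (st.ravel() == 1) & (st_b.ravel() == 1)
    fb_err = np.linalg.norm((p0 - p0_back).reshape(-1, 2)[good], axis=1)

    # Mean pixel displacement -> horizontal velocity on the ground plane
    flow = (p1 - p0).reshape(-1, 2)[good]
    du, dv = flow.mean(axis=0)
    vx = altitude * du / (fx * dt)
    vy = altitude * dv / (fy * dt)

    conf_fb = float(np.exp(-k_fb * fb_err.mean()))   # forward-backward term of C3
    return np.array([vx, vy]), conf_fb
```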

2.2.5. Adaptive Fusion Strategy

This is the core of the SatLoc-Fusion pipeline, used to integrate information from the three localization modules to generate a final, robust, and self-consistent state estimate. Candidate filtering or optimization frameworks include (extended) Kalman filter (KF/EKF) [34,35] or factor graph-based optimizers. While factor graph optimization can offer superior accuracy by retaining and re-linearizing past states over a sliding window, this comes with higher implementation complexity and greater computational demands, which are often prohibitive for our target edge hardware. In contrast, filter-based approaches like the Kalman filter (KF) marginalize out past states, making them computationally efficient. However, this means that if a significant linearization error is introduced, it cannot be easily corrected later. As a baseline method, this paper selects the computationally less expensive and mature KF as the fusion framework, as it offers the best balance of performance, efficiency, and implementation maturity for real-time edge applications.
In the KF algorithm design, we reference the EKF framework commonly used in UAV systems [36] and partition functionalities based on the nature of the information sources. Data sources 2 and 3 provide updates in an incremental/integral form, similar to an IMU, and are therefore used as state prediction information; data source 1 provides an absolute position reference for cumulative error correction, similar to GNSS positioning information, and is thus used as the measurement in the state update step.
  • State Vector Definition: As previously mentioned, the core outputs of the three localization modules are all related to horizontal displacement information. Therefore, the state vector can be defined as the horizontal position of the UAV at time step k:
    x_k = [p_x, p_y]^T
  • Auxiliary Velocity Estimation: Since the data sources of Layers 2 and 3 are related to relative displacement or its derivative (i.e., relative velocity), we use a weighted fusion of the two as the velocity estimate for the current time step. This is used to build the process model, with the absolute localization information from data source 1 subsequently used as the observation for the measurement update. Specifically, at each time step k, we first attempt to estimate an instantaneous velocity $\hat{v}_{k-1}^{fused}$ (representing the average velocity from $k-1$ to k) from the data sources of Layers 2 and 3.
  • Layer 2 (XFeat Relative Displacement): Velocity is estimated from the measurement $\Delta z_{2,k-1} = [\Delta p_{x,rel}, \Delta p_{y,rel}]^T$:
    v_{S2,k-1} = \Delta z_{2,k-1} / \Delta t
The variance is calculated based on the layer’s confidence $C_2$:
\sigma_{v,S2}^2 = \big( \sigma_{2,base}^2 / (C_2 + \epsilon) \big) / (\Delta t)^2
  • Layer 3 (Optical Flow Velocity): The measurement representing the velocity between $k-1$ and k is used directly:
    v_{S3,k-1} = [v_{x,lk}, v_{y,lk}]^T
    And the variance is calculated based on the layer’s confidence $C_3$:
    \sigma_{v,S3}^2 = \sigma_{3,base}^2 / (C_3 + \epsilon)
    The fused velocity $\hat{v}_{k-1}^{fused}$ is estimated as follows:
    • If either data source 2 or 3 is available and its confidence is above a threshold, an inverse-variance weighted average is used to fuse the two velocity estimates. The fused velocity is then:
      \hat{v}_{k-1}^{fused} = \frac{ w_{v2}\, v_{S2,k-1} + w_{v3}\, v_{S3,k-1} }{ w_{v2} + w_{v3} }
      where the weights are $w_{v2} = 1/\sigma_{v,S2}^2$ and $w_{v3} = 1/\sigma_{v,S3}^2$.
    • If both are unavailable, $\hat{v}_{k-1}^{fused}$ can be set to zero, or the fused velocity from the previous time step can be used.
Assuming $v_{S2}$ and $v_{S3}$ are independent, the covariance of the fused velocity $\hat{v}_{k-1}^{fused}$ is given by:
\mathrm{Cov}\big( \hat{v}_{k-1}^{fused} \big) = \mathrm{diag}\left( \frac{1}{w_{v2} + w_{v3}}, \frac{1}{w_{v2} + w_{v3}} \right)
  • Process Model (Prediction Step): Using a constant velocity model, with the velocity provided by $\hat{v}_{k-1}^{fused}$, the state transition equation is as follows:
    \hat{x}_k^- = F_k x_{k-1}^+ + B_k u_k = x_{k-1}^+ + \hat{v}_{k-1}^{fused} \cdot \Delta t
    where the state transition matrix is:
    F_k = I = \begin{bmatrix} 1 & 0 \\ 0 & 1 \end{bmatrix}
    the control input matrix is:
    B_k = \Delta t \cdot I = \begin{bmatrix} \Delta t & 0 \\ 0 & \Delta t \end{bmatrix}
    and the control input is:
    u_k = \hat{v}_{k-1}^{fused} = [\hat{v}_{x,k-1}^{fused}, \hat{v}_{y,k-1}^{fused}]^T
    For the a priori covariance prediction, we have:
    P_k^- = F_k P_{k-1}^+ F_k^T + Q_k
Here, $Q_k = B_k\, \mathrm{Cov}\big( \hat{v}_{k-1}^{fused} \big)\, B_k^T$ represents the uncertainty of the process model, which is introduced by the velocity estimate $\hat{v}_{k-1}^{fused}$ itself.
If $\hat{v}_{k-1}^{fused}$ is unavailable (e.g., both sources 2 and 3 fail), the model degrades to a random walk:
\hat{x}_k^- = x_{k-1}^+
P_k^- = P_{k-1}^+ + Q_{rw}
where $Q_{rw}$ is a random walk process noise with a larger magnitude.
  • Measurement Model (Update Step): After completing the prediction step using data sources 2 and 3, the measurement update step mainly relies on data source 1 (absolute localization) to correct the position.
    • Measurement of Layer 1:
      z_k = z_{1,k} = [p_{x,abs}, p_{y,abs}]^T
    • Measurement Matrix:
      H_k = H_1 = \begin{bmatrix} 1 & 0 \\ 0 & 1 \end{bmatrix}
    • Measurement Noise Covariance $R_k$ (calculated as before, based on $C_1$):
      R_k = \begin{bmatrix} \sigma_1^2 & 0 \\ 0 & \sigma_1^2 \end{bmatrix}
      where $\sigma_1^2 = \sigma_{1,base}^2 / (C_1 + \epsilon)$.
  • Kalman Filter Update Steps (Standard):
    • Calculate the measurement residual (innovation):
      y_k = z_k - H_k \hat{x}_k^-
    • Calculate the residual covariance (innovation covariance):
      S_k = H_k P_k^- H_k^T + R_k
    • Calculate the Kalman gain:
      K_k = P_k^- H_k^T S_k^{-1}
    • Update the state estimate:
      \hat{x}_k^+ = \hat{x}_k^- + K_k y_k
    • Update the covariance estimate:
      P_k^+ = (I - K_k H_k) P_k^-
The final output $\hat{x}_k^+$ serves as the current UAV position estimate.
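To make the fusion cycle concrete, the sketch below implements the 2-D position filter described above: Layer 2/3 outputs are fused by inverse-variance weighting into the prediction, and Layer 1 provides the position update. The base variances, the random-walk noise, and $\epsilon$ are illustrative values, not the tuned parameters of SatLoc-Fusion.

```python
import numpy as np

class SatLocKF:
    """Minimal 2-D position filter following Section 2.2.5."""
    def __init__(self, p0, sigma1_sq=25.0, sigma2_sq=1.0,
                 sigma3_sq=0.5, q_rw=100.0, eps=1e-3):
        self.x = np.asarray(p0, dtype=float)   # state [p_x, p_y]
        self.P = np.eye(2) * 10.0              # initial covariance
        self.s1, self.s2, self.s3 = sigma1_sq, sigma2_sq, sigma3_sq
        self.q_rw, self.eps = q_rw, eps

    def predict(self, dz2, c2, v3, c3, dt):
        """Fuse Layer 2/3 into a velocity by inverse-variance weighting and
        propagate the constant-velocity model; dz2 or v3 may be None."""
        weights, vels = [], []
        if dz2 is not None:
            var2 = (self.s2 / (c2 + self.eps)) / dt**2
            weights.append(1.0 / var2); vels.append(np.asarray(dz2) / dt)
        if v3 is not None:
            var3 = self.s3 / (c3 + self.eps)
            weights.append(1.0 / var3); vels.append(np.asarray(v3))
        if weights:
            w = np.asarray(weights)
            v_fused = np.average(np.stack(vels), axis=0, weights=w)
            self.x = self.x + v_fused * dt
            Q = np.eye(2) * dt**2 / w.sum()    # B Cov(v_fused) B^T
        else:
            Q = np.eye(2) * self.q_rw          # random-walk fallback
        self.P = self.P + Q                    # F = I

    def update(self, z1, c1):
        """Layer 1 absolute-position update (H = I); skipped when z1 is None."""
        if z1 is None:
            return
        R = np.eye(2) * (self.s1 / (c1 + self.eps))
        S = self.P + R
        K = self.P @ np.linalg.inv(S)
        self.x = self.x + K @ (np.asarray(z1) - self.x)
        self.P = (np.eye(2) - K) @ self.P
```

In each cycle, predict() is called with whatever Layers 2 and 3 produced and update() with the Layer 1 fix, so a layer that reports low confidence is automatically downweighted rather than hard-switched off.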
This adaptive fusion strategy enables the system to intelligently rely on the most reliable information source under different environmental and dynamic conditions, thereby achieving better overall performance and robustness than any single method or static fusion method.

3. Results

This section details the specific implementation of the SatLoc-Fusion pipeline, experimental setup, evaluation metrics, and performance on the SatLoc dataset.

3.1. Implementation Details

  • Hardware Platform: The edge computing platform used for experiments is the Rockchip RK3588Pro, which features up to 6 TOPS of AI performance and 32 GB of memory. The platform ran the SatLoc-Fusion pipeline algorithm in both a ground-based hardware-in-the-loop simulation (based on playback of pre-collected data) and an actual drone flight environment.
  • Software Environment: The operating system is Ubuntu 20.04, equipped with the SDK corresponding to the Rockchip RK3588 platform. Core libraries include PyTorch 1.13.1, OpenCV 4.10, and RKNN-Toolkit2 1.3.0 for model acceleration. The main programming languages are Python and C++.
  • Algorithm Parameters:
    • DinoV2: The model is initialized with weights pre-trained on the UAV-VisLoc dataset. For our training process, positive pairs are constructed by matching each UAV image from UAV-VisLoc with its corresponding satellite patch via GPS coordinates. Negative pairs are formed by randomly associating UAV images with non-corresponding satellite regions.
    • XFeat: We fine-tune the model starting from its publicly available pre-trained weights. To do this, we generate an augmented dataset by applying random geometric and photometric transformations to the UAV images from the UAV-VisLoc dataset.
  • Optimization: To achieve real-time performance on edge devices, techniques such as model quantization and compression were employed. After INT8 quantization on the RKNN processing pipeline, the mean inference time for DinoV2 is 482 ms. XFeat can achieve 30 FPS (at 640 × 480 resolution) after acceleration with the RKNN library.
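For reference, a typical RKNN-Toolkit2 INT8 conversion flow is sketched below; the model file, calibration list, and preprocessing constants are placeholders, and the exact configuration arguments can differ between toolkit versions.

```python
from rknn.api import RKNN

rknn = RKNN()
# Preprocessing constants and target platform (values are placeholders).
rknn.config(mean_values=[[123.675, 116.28, 103.53]],
            std_values=[[58.395, 57.12, 57.375]],
            target_platform='rk3588')
rknn.load_onnx(model='xfeat_640x480.onnx')                       # hypothetical ONNX export
rknn.build(do_quantization=True, dataset='./calib_images.txt')   # list of calibration images
rknn.export_rknn('./xfeat_int8.rknn')
rknn.release()
```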

3.2. Evaluation Metrics

We used two metrics from [3] to evaluate UAV localization accuracy: Mean Localization Error (MLE) and Success Rate. Trajectory segments with a localization error of less than 25 m are considered successful localizations, while those with an error exceeding 50 m are considered to have drifted.
  • Mean Localization Error (MLE): Calculates the Root Mean Square Error (RMSE) between the estimated positions of the entire trajectory and the ground truth, in meters (m).
  • Trajectory Localization Success Rate: The percentage of the total trajectory length occupied by segments where the localization error remains below a preset threshold (25 m), after excluding segments that have drifted.
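Both metrics can be computed directly from the estimated and ground-truth trajectories; the sketch below evaluates them per sample (the paper weights by trajectory length), with the 25 m and 50 m thresholds taken from the definitions above.

```python
import numpy as np

def mle_and_success_rate(est, gt, success_thr=25.0, drift_thr=50.0):
    """est, gt: (N, 2) arrays of horizontal positions in metres.
    Returns (MLE as RMSE in m, success rate in %), evaluated per sample."""
    err = np.linalg.norm(np.asarray(est) - np.asarray(gt), axis=1)
    mle = float(np.sqrt(np.mean(err ** 2)))
    kept = err[err <= drift_thr]                  # exclude drifted segments
    success = 100.0 * float(np.mean(kept < success_thr)) if kept.size else 0.0
    return mle, success
```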

3.3. Experimental Setup

Datasets: We evaluate our proposed framework primarily on our new SatLoc dataset and the public AerialVL dataset, as they best represent the target application scenarios. This selection was a principled decision aligned with our primary research objectives. A core contribution of this paper is the introduction of the SatLoc dataset itself, which was meticulously curated to address the documented scarcity of benchmarks for small-scale rotorcraft operating in diverse, real-world geomorphological scenes. Our framework, SatLoc-Fusion, is therefore presented as a strong baseline to demonstrate the utility and challenges of this new resource. To situate our framework’s performance within the existing literature, we use the public AerialVL dataset to provide a direct and fair comparison against the published state-of-the-art, thereby validating its competitiveness. For these reasons, other datasets reviewed in Section 2.1—such as VDUAV (simulation-based), CVUSA (ground-view), and University-1652 (campus scenes)—were considered less suitable. These datasets either lack real-world data from small rotorcraft platforms, have insufficient scene diversity, or assume unrealistic matching conditions, which limits their suitability for testing the specific contributions and robustness of our aerial-to-satellite matching approach. For our experiments, we selected three trajectories from SatLoc (ranging from 3.7 to 11 km in length) and two trajectory groups from AerialVL, which correspond to the two “combined methods” presented in [3]. The deep learning networks within our SatLoc-Fusion pipeline, specifically for the DinoV2 framework, were trained on the UAV-VisLoc dataset for both coarse and fine localization. The UAV-VisLoc dataset contains 6742 UAV images from 11 regions across China, with each image geo-tagged (latitude, longitude, altitude) and paired with a corresponding reference satellite image.
Overall Performance Comparison Baselines: To validate the overall efficacy of our proposed framework, we compare its performance against the two “combined methods” reported in [3] on their respective trajectories from the AerialVL dataset. Since the source code for these methods is not public, we could not re-evaluate them on other trajectories. To further assess the effectiveness of our hierarchical matching architecture, we also conduct comparative experiments against several representative visual odometry (VO) based methods from prior works [3,37,38], including ORB-SLAM3 [39], DSO [40], and the Farneback optical flow method [38].

3.4. Experimental Results

The comprehensive evaluation of our proposed framework is conducted through two primary modalities: desktop hardware-in-the-loop (HIL) simulations and real-world UAV flight tests.
  • For quantitative analysis, where consistency and the elimination of uncontrollable environmental variables are paramount for fair comparison, we primarily employ the HIL simulation platform. In this setup, pre-recorded image sequences and sensor data streams from our SatLoc dataset are fed into the edge computing hardware running our algorithm. This approach is crucial as it facilitates benchmarking in a highly controlled and repeatable environment, ensuring that performance comparisons against baseline methods are consistent and unbiased.
  • All other evaluations, particularly qualitative analysis and system robustness assessments, are predominantly conducted through real-world flight tests. These field experiments allow for the validation of the system’s practical efficacy and utility under authentic and unpredictable operational conditions.
In the qualitative analysis, results from these flight tests are presented through representative visualizations. For instance, a successful matching result is illustrated in Figure 6: the left panel displays the geo-referenced satellite image retrieved from the database, while the right panel shows the corresponding image captured in real-time by the UAV’s onboard camera.

3.4.1. Quantitative Analysis

Table 3 and Table 4 show the comparison results of the proposed method with other SOTA methods on the SatLoc and AerialVL datasets, respectively. Table 5 presents the ablation study results for the SatLoc-Fusion pipeline. It is evident that the absence of any layer or the fusion strategy leads to a decline in overall system performance, indicating that the proposed method is effective in enhancing algorithm robustness and task completion capability. Table 6 shows the processing time cost (latency) of each component in our pipeline. As expected, Layer 1 takes the longest time to run and is the bottleneck for the system’s real-time performance. This is related to the nature of its algorithm and also means that SatLoc-Fusion or its upstream navigation system needs to carefully handle errors caused by delays.

3.4.2. Qualitative Analysis

Figure 7 shows a trajectory comparison plot (top-down view) in a challenging scenario from the SatLoc dataset. This trajectory covers scenes relatively rich in global features, such as towns and roads, as well as large areas with monotonous features, such as jungles and fields. It is evident that, compared to other baseline methods, the trajectory corresponding to our method (solid green line in the figure) adheres more closely to the ground truth.
Furthermore, for certain scenes lacking semantic or global features, the proposed three-layer fusion method effectively achieves continuous localization. As shown in Figure 8, in areas with relatively monotonous features such as lakes and farmlands, when the first-layer global matching fails and even when the inlier distribution in the second layer is extremely uneven, the relative displacement estimation from the third layer effectively compensates for the gaps in the localization process, ensuring a high trajectory localization success rate and demonstrating the robustness of this baseline method.

3.4.3. Analysis of System Robustness and Component Failure

This section first qualitatively analyzes the system’s behavior under simulated component failures and discusses how the adaptive fusion framework, through its inherent design, handles such events. We consider a key scenario: a failure in Layer 1 (absolute positioning). In the real world, this can occur when a drone flies into an area with extremely poor satellite imagery or a lack of global features, such as dense forests or lakes. In this case, DinoV2’s matching score will remain persistently low, resulting in a confidence score $C_1 \approx 0$. Our adaptive fusion framework is designed to robustly handle this degradation. When $C_1$ is detected to be persistently low, the fusion algorithm (the Kalman filter) effectively downweights the measurement updates from the first layer. Instead of crashing, the system automatically and smoothly switches to a pure dead reckoning mode relying on Layer 2 (XFeat relative displacement) and Layer 3 (optical flow velocity estimation). Importantly, the state covariance matrix P in the Kalman filter is designed to grow appropriately under these conditions. As shown in Figure 9, relying solely on visual relative positioning inevitably introduces cumulative drift due to the lack of absolute correction. The expansion of the covariance matrix accurately reflects the increased uncertainty in the system’s position estimate. Quantifying this uncertainty is crucial for downstream path planning and safety monitoring. Once the drone escapes the fault zone and regains high-quality satellite matches ($C_1$ returns to normal), a high-confidence absolute position correction effectively shrinks the covariance and brings the trajectory back toward the true value. This analysis demonstrates that our system can fail over in a predictable and safe manner, which is crucial for ensuring real-world availability. Furthermore, because we employ an adaptive weighting scheme based on continuous confidence scores, our framework is inherently more robust to confidence thresholds than methods using hard-coded switching thresholds. Hard-threshold methods are prone to jitter and unstable switching at critical points, while our weighted fusion scheme provides a smoother transition. However, we acknowledge that a detailed sensitivity analysis of the internal parameters of our confidence model (such as the mapping function from matching scores to confidence scores) would be a valuable avenue for future research.

4. Discussion

This section will delve into the analysis of experimental results, discuss the advantages and limitations of the method, and look forward to future research directions.

4.1. Results Interpretation

Performance Analysis: The quantitative results in Table 3 and Table 4 clearly indicate that SatLoc-Fusion’s comprehensive performance on the SatLoc dataset (MLE < 15 m, success rate > 90 %, update rate of no less than 2 Hz) is significantly superior to the various baseline methods. This is primarily attributed to the following aspects: (1) the robust global localization capability provided by DinoV2 effectively suppresses the cumulative drift of pure VO methods; (2) the high-frequency, precise relative motion estimation from XFeat and optical flow ensures local trajectory smoothness and continuity, providing a direct basis for state prediction; (3) building upon this, the adaptive fusion strategy, using basic Kalman filtering, can dynamically adjust information weights based on the real-time confidence of each layer’s output, effectively fusing the advantages of each layer and maintaining the stability and accuracy of the overall system when a single module’s performance degrades.
Ablation Study Analysis: The ablation study results in Table 5 further confirm the contribution of each component. Removing any localization layer leads to a significant performance drop, proving the necessity of the three-layer architecture. Notably, the performance after removing the adaptive fusion mechanism (using static fusion instead) is inferior to the complete system, highlighting the value of the adaptive strategy in dealing with dynamic environments and sensor uncertainty.

4.2. In-Depth Comparison with SOTA Methods

Compared to existing SOTA methods, SatLoc-Fusion has the following advantages:
  • Compared to pure CVGL methods: Many CVGL methods primarily focus on one-time place recognition or retrieval and may not provide continuous, high-frequency position-velocity output, making them difficult to use directly for UAV control. SatLoc-Fusion, by combining high-frequency VO and velocity estimation, provides continuous state estimates that meet control requirements.
  • Compared to pure VO/VIO methods: Pure VO/VIO methods inevitably suffer from long-term cumulative drift [3]. SatLoc-Fusion, by introducing an absolute localization layer based on satellite maps (DinoV2), can periodically eliminate cumulative errors, significantly improving global localization accuracy.

4.3. Limitations Analysis

Although SatLoc-Fusion performs excellently, some limitations still exist:
  • Environmental Factors: (1) Extreme Weather/Lighting: The current system has been primarily validated during daytime and under good weather conditions. In adverse weather (heavy rain, heavy snow, dense fog) or nighttime conditions, the performance of visual sensors will severely degrade, potentially leading to system failure. (2) Seasonal Changes: The appearance of ground and satellite images can change significantly with seasons (vegetation, snow cover, etc.), which may affect the matching performance of DinoV2.
  • A quantitative analysis of the system’s error characteristics and failure modes can be derived from our ablation study (Table 5). The results demonstrate the framework’s capacity for graceful degradation rather than catastrophic failure when individual components are compromised. For instance, in challenging scenarios characterized by a lack of distinct global features or sparse local textures, the absolute localization (Layer 1) may become unreliable. The ablation study simulates this condition by removing Layer 1 entirely, which results in a significant degradation of system performance: the Mean Localization Error (MLE) increases from 14.05 m to 27.84 m, and the trajectory localization success rate drops from 94 % to 55 % . This provides a concrete, quantitative measure of how a single component’s failure impacts the overall error rate. In such situations, the adaptive fusion core correctly shifts its reliance to the relative pose estimation (Layer 2) and velocity estimation (Layer 3). While this strategy prevents total localization failure, these layers are inherently susceptible to cumulative drift. This explains why trajectories through these difficult areas, while often remaining below the 50 m failure threshold, are the primary contributors to the increase in the overall MLE, thereby defining the operational performance boundaries of the system under the most adverse conditions.
  • Sensor Failure: Although adaptive fusion can handle performance degradation in a single layer, if a critical sensor (like the downward-facing camera) completely fails or is occluded for an extended period, the system performance will severely degrade or even fail. Detection of sensor failures and more graceful degradation handling need to be strengthened.
  • Computational Resources and System Latency: A notable limitation of the current framework is the significant processing latency of the absolute localization layer, which is approximately 485 ms. This performance bottleneck mainly comes from the power consumption and computing resource limitations of the edge computing hardware. This delay introduces a temporal misalignment, as the position measurement derived from an image captured at time t is only available for fusion at a later time, t + Δ t delay. Applying this delayed measurement directly to the state estimate can introduce significant errors, particularly during dynamic maneuvers. However, this effect can be effectively mitigated. By leveraging the high-frequency data from the onboard Inertial Measurement Unit (IMU), the UAV’s displacement during the Δ t delay interval can be estimated through inertial integration. This estimated displacement can then be used to propagate the delayed position measurement forward in time, yielding a corrected observation that is temporally consistent with the current filter state. While the implementation of this measurement compensation is a standard technique in advanced sensor fusion systems and represents a clear direction for future work, it was considered beyond the scope of this baseline study.

4.4. Future Work

Based on the above discussion, future research directions may include the following:
  • Improving Environmental Adaptability: Research domain adaptation techniques or continual learning methods to enhance the system’s robustness to changes in lighting, weather, and seasons. Explore the fusion of other sensors (such as thermal imaging cameras) to enhance all-weather operational capabilities.
  • Enhancing Sensor Fault Tolerance: Develop more sophisticated sensor fault detection and isolation mechanisms, and achieve smoother performance degradation when sensors fail.
  • Online Map Updating and Validation: Investigate methods for online updating or validating reference maps using real-time UAV observation data to cope with environmental changes.
  • Uncertainty Quantification and Fusion Optimization: Explore more advanced uncertainty representation methods (such as non-Gaussian distributions) and fusion algorithms (such as particle filters, more complex factor graph optimization) to further improve accuracy and robustness.
  • Multi-UAV Cooperative Localization: Extend this framework to multi-UAV systems, utilizing inter-vehicle communication and relative measurements to achieve more accurate and robust cooperative localization [36].
  • Broader Platform and Scene Testing: Conduct tests on more types of small-scale UAV platforms with different payloads, and validate system performance in broader and more challenging real-world environments.

5. Conclusions

This paper addresses the urgent need for high-precision, real-time, and robust localization of small-scale rotorcraft UAVs in GNSS-denied environments by proposing an innovative solution called SatLoc-Fusion. Key contributions include: the construction of a dedicated, diverse, real-world SatLoc dataset; the design of a three-layer localization pipeline combining absolute localization (DinoV2), relative localization (XFeat), and velocity estimation (optical flow); the proposal of an adaptive fusion strategy based on real-time confidence metrics to effectively integrate multi-source information and enhance robustness; and verification of the system’s ability to achieve real-time throughput (>2 Hz) on edge computing hardware with 6 TOPS of computational power.
Experimental results demonstrate that SatLoc-Fusion achieves SOTA performance on the SatLoc dataset, with an absolute localization error of less than 15 m, a valid trajectory localization rate exceeding 90 % , and an operating frequency of no less than 2 Hz. The system effectively overcomes the limitations of single methods, showcasing significant potential for reliable autonomous navigation in complex, dynamic, GNSS-denied environments. The SatLoc dataset and the SatLoc-Fusion pipeline together provide an important benchmark and a practical solution for research in this field.

Author Contributions

Conceptualization, X.Z. (Xiang Zhou) and F.S.; methodology, X.Z. (Xiang Zhou) and X.Z. (Xiangkai Zhang); software, X.Z. (Xiang Zhou) and X.Z. (Xiangkai Zhang); validation, X.Z. (Xiang Zhou) and X.Z. (Xiangkai Zhang); formal analysis, X.Y. and J.Z.; investigation, X.Z. (Xiang Zhou); resources, X.Z. (Xiang Zhou); data curation, X.Z. (Xiang Zhou); writing—original draft preparation, X.Z. (Xiang Zhou) and X.Z. (Xiangkai Zhang); writing—review and editing, J.Z. and X.Y.; visualization, X.Z. (Xiang Zhou); supervision, Z.L. and F.S.; project administration, X.Z. (Xiang Zhou). All authors have read and agreed to the published version of the manuscript.

Funding

This research was funded by the National Natural Science Foundation of China, grant number 62206065, and Guangxi Young Elite Scientist Sponsorship Program (GXYESS2025158).

Data Availability Statement

The datasets presented in this article will be openly available in Github at https://github.com/ameth64/SatLoc, accessed on 14 August 2025. The source code for the SatLoc-Fusion framework will be available on GitHub at https://github.com/ameth64/SatLoc-Fusion, before 1 October 2025.

Conflicts of Interest

The authors declare no conflicts of interest.

Abbreviations

The following abbreviations are used in this manuscript:
VO | Visual Odometry
UAV | Unmanned Aerial Vehicle
CVGL | Cross-View Geo-Localization
SLAM | Simultaneous Localization and Mapping
EKF/KF | (Extended) Kalman Filter
GSD | Ground Sampling Distance (in m/pixel)
IMU | Inertial Measurement Unit
GNSS | Global Navigation Satellite System

References

  1. Jarraya, I.; Al-Batati, A.; Kadri, M.B.; Abdelkader, M.; Ammar, A.; Boulila, W.; Koubaa, A. GNSS-Denied Unmanned Aerial Vehicle Navigation: Analyzing Computational Complexity, Sensor Fusion, and Localization Methodologies. Satell. Navig. 2025, 6, 9. [Google Scholar] [CrossRef]
  2. Yao, Y.; Sun, C.; Wang, T.; Yang, J.; Zheng, E. UAV Geo-Localization Dataset and Method Based on Cross-View Matching. Sensors 2024, 24, 6905. [Google Scholar] [CrossRef] [PubMed]
  3. He, M.; Chen, C.; Liu, J.; Li, C.; Lyu, X.; Huang, G.; Meng, Z. AerialVL: A Dataset, Baseline and Algorithm Framework for Aerial-Based Visual Localization with Reference Map. IEEE Robot. Autom. Lett. 2024, 9, 8210–8217. [Google Scholar] [CrossRef]
  4. Akhihiero, D.; Olawoye, U.; Das, S.; Gross, J. Cooperative Localization for GNSS-Denied Subterranean Navigation: A UAV–UGV Team Approach. NAVIGATION J. Inst. Navig. 2024, 71, navi.677. [Google Scholar] [CrossRef]
  5. Durgam, A.; Paheding, S.; Dhiman, V.; Devabhaktuni, V. Cross-View Geo-Localization: A Survey. IEEE Access 2024, 12, 192028–192050. [Google Scholar] [CrossRef]
  6. Arandjelovic, R.; Gronat, P.; Torii, A.; Pajdla, T.; Sivic, J. NetVLAD: CNN Architecture for Weakly Supervised Place Recognition. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Las Vegas, NV, USA, 27–30 June 2016; pp. 5297–5307. [Google Scholar]
  7. Chen, Z.; Yang, Z.X.; Rong, H.J. Without Paired Labeled Data: An End-to-End Self-Supervised Paradigm for UAV-View Geo-Localization. arXiv 2025, arXiv:2502.11381. [Google Scholar]
  8. Oquab, M.; Darcet, T.; Moutakanni, T.; Vo, H.; Szafraniec, M.; Khalidov, V.; Fernandez, P.; Haziza, D.; Massa, F.; El-Nouby, A.; et al. DINOv2: Learning Robust Visual Features without Supervision. arXiv 2023, arXiv:2304.07193. [Google Scholar]
  9. Fan, J.; Zheng, E.; He, Y.; Yang, J. A Cross-View Geo-Localization Algorithm Using UAV Image and Satellite Image. Sensors 2024, 24, 3719. [Google Scholar] [CrossRef]
  10. Wei, G.; Liu, Y.; Yuan, X.; Xue, X.; Guo, L.; Yang, Y.; Zhao, C.; Bai, Z.; Zhang, H.; Xiao, R. From Word to Sentence: A Large-Scale Multi-Instance Dataset for Open-Set Aerial Detection. arXiv 2025, arXiv:2505.03334. [Google Scholar]
  11. Huang, G.; Zhou, Y.; Zhao, L.; Gan, W. CV-Cities: Advancing Cross-View Geo-Localization in Global Cities. IEEE J. Sel. Top. Appl. Earth Obs. Remote Sens. 2024, 18, 1592–1606. [Google Scholar] [CrossRef]
  12. DeTone, D.; Malisiewicz, T.; Rabinovich, A. SuperPoint: Self-Supervised Interest Point Detection and Description. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition Workshops, Salt Lake City, UT, USA, 18–22 June 2018; pp. 224–236. [Google Scholar]
  13. Huang, Z.; Xu, Q.; Sun, M.; Zhu, X.; Fan, S. Adaptive Kalman Filtering Localization Calibration Method Based on Dynamic Mutation Perception and Collaborative Correction. Entropy 2025, 27, 380. [Google Scholar] [CrossRef]
  14. Zhan, Q.; Shen, R.; Mao, Y.; Shu, Y.; Shen, L.; Yang, L.; Zhang, J.; Sun, C.; Guo, F.; Lu, Y. Adaptive Federated Kalman Filtering with Dimensional Isolation for Unmanned Aerial Vehicle Navigation in Degraded Industrial Environments. Drones 2025, 9, 168. [Google Scholar] [CrossRef]
  15. Dellaert, F.; Kaess, M. Factor Graphs for Robot Perception. In Foundations and Trends® in Robotics; Now Foundations and Trends: Norwell, MA, USA, 2017; Volume 6, pp. 1–139. [Google Scholar]
  16. Warburg, F.; Hauberg, S.; Lopez-Antequera, M.; Gargallo, P.; Kuang, Y.; Civera, J. Mapillary Street-Level Sequences: A Dataset for Lifelong Place Recognition. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Seattle, WA, USA, 13–19 June 2020; pp. 2626–2635. [Google Scholar]
  17. Weyand, T.; Araujo, A.; Cao, B.; Sim, J. Google Landmarks Dataset V2-a Large-Scale Benchmark for Instance-Level Recognition and Retrieval. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Seattle, WA, USA, 13–19 June 2020; pp. 2575–2584. [Google Scholar]
  18. Ji, Y.; He, B.; Tan, Z.; Wu, L. Game4Loc: A UAV Geo-Localization Benchmark from Game Data. In Proceedings of the AAAI Conference on Artificial Intelligence, Virtual Event, 20–27 February 2024; Volume 39, pp. 3913–3921. [Google Scholar]
  19. Chu, M.; Zheng, Z.; Ji, W.; Wang, T.; Chua, T.S. Towards Natural Language-Guided Drones: GeoText-1652 Benchmark with Spatial Relation Matching. In Proceedings of the European Conference on Computer Vision; Springer: Berlin/Heidelberg, Germany, 2024; pp. 213–231. [Google Scholar]
  20. Deuser, F.; Mansour, W.; Li, H.; Habel, K.; Werner, M.; Oswald, N. Temporal Resilience in Geo-Localization: Adapting to the Continuous Evolution of Urban and Rural Environments. In Proceedings of the Winter Conference on Applications of Computer Vision, Tucson, AZ, USA, 28 February–4 March 2025; pp. 479–488. [Google Scholar]
  21. Li, H.; Xu, C.; Yang, W.; Mi, L.; Yu, H.; Zhang, H.; Xia, G.S. Unsupervised Multi-View UAV Image Geo-Localization via Iterative Rendering. IEEE Trans. Geosci. Remote Sens. 2025, 63, 5625015. [Google Scholar] [CrossRef]
  22. Guan, F.; Zhao, N.; Fang, Z.; Jiang, L.; Zhang, J.; Yu, Y.; Huang, H. Multi-Level Representation Learning via ConvNeXt-Based Network for Unaligned Cross-View Matching. Geo-Spat. Inf. Sci. 2025, 1–14. [Google Scholar] [CrossRef]
  23. Zheng, Z.; Wei, Y.; Yang, Y. University-1652: A Multi-View Multi-Source Benchmark for Drone-Based Geo-Localization. In Proceedings of the 28th ACM International Conference on Multimedia, Seattle, WA, USA, 12–16 October 2020; pp. 1395–1403. [Google Scholar]
  24. Zhao, D.; Andrews, J.; Papakyriakopoulos, O.; Xiang, A. Position: Measure Dataset Diversity, Don’t Just Claim It. In Proceedings of the International Conference on Machine Learning, Vienna, Austria, 21–27 July 2024; pp. 60644–60673. [Google Scholar]
  25. Izquierdo, S.; Civera, J. Optimal Transport Aggregation for Visual Place Recognition. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Seattle, WA, USA, 16–22 June 2024; pp. 17658–17668. [Google Scholar]
  26. Potje, G.; Cadar, F.; Araujo, A.; Martins, R.; Nascimento, E.R. XFeat: Accelerated Features for Lightweight Image Matching. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Seattle, WA, USA, 16–22 June 2024; pp. 2682–2691. [Google Scholar]
  27. Qiu, Y.; Chen, Y.; Zhang, Z.; Wang, W.; Scherer, S. MAC-VO: Metrics-Aware Covariance for Learning-Based Stereo Visual Odometry. arXiv 2024, arXiv:2409.09479. [Google Scholar]
  28. Dong, Q.; Cao, C.; Fu, Y. Rethinking Optical Flow from Geometric Matching Consistent Perspective. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Vancouver, BC, Canada, 17–24 June 2023; pp. 1337–1347. [Google Scholar]
  29. Chen, Y.H.; Wu, C.T. Reynoldsflow: Exquisite Flow Estimation via Reynolds Transport Theorem. arXiv 2025, arXiv:2503.04500. [Google Scholar] [CrossRef]
  30. Sun, D.; Yang, X.; Liu, M.Y.; Kautz, J. PWC-Net: CNNs for Optical Flow Using Pyramid, Warping, and Cost Volume. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Salt Lake City, UT, USA, 18–23 June 2018; pp. 8934–8943. [Google Scholar]
  31. Teed, Z.; Deng, J. RAFT: Recurrent All-Pairs Field Transforms for Optical Flow. In Proceedings of the European Conference on Computer Vision, Glasgow, UK, 23 August 2020; pp. 402–419. [Google Scholar]
  32. Ma, D.; Imamura, K.; Gao, Z.; Wang, X.; Yamane, S. Hierarchical Motion Field Alignment for Robust Optical Flow Estimation. Sensors 2025, 25, 2653. [Google Scholar] [CrossRef]
  33. Shi, K.; Miao, Y.; Li, X.; Li, W.; Nie, S.; Wang, X.; Li, D.; Sheng, Y. Fast Recurrent Field Transforms for Optical Flow on Edge GPUs. Meas. Sci. Technol. 2025, 36, 035409. [Google Scholar] [CrossRef]
  34. Wikipedia. Kalman Filter. Available online: https://en.wikipedia.org/wiki/Kalman_filter (accessed on 17 June 2025).
  35. Wikipedia. Extended Kalman Filter. Available online: https://en.wikipedia.org/wiki/Extended_Kalman_filter (accessed on 17 June 2025).
  36. PX4 Development Team. PX4 Autopilot Software, Version 1.14.0. 2024. Available online: https://github.com/PX4/PX4-Autopilot (accessed on 17 June 2025).
  37. Teed, Z.; Deng, J. DROID-SLAM: Deep Visual SLAM for Monocular, Stereo, and RGB-D Cameras. In Proceedings of the 35th International Conference on Neural Information Processing Systems, Virtual, 6–14 December 2021; pp. 16558–16569. [Google Scholar]
  38. Hoshino, Y.; Rathnayake, N.; Dang, T.L.; Rathnayake, U. Flow Velocity Analysis of Rivers Using Farneback Optical Flow and STIV Techniques with Drone Data. In Proceedings of the International Symposium on Information and Communication Technology; Springer: Singapore, 2024; pp. 17–26. [Google Scholar]
  39. Campos, C.; Elvira, R.; Rodríguez, J.J.G.; Montiel, J.M.M.; Tardós, J.D. ORB-SLAM3: An Accurate Open-Source Library for Visual, Visual–Inertial, and Multimap SLAM. IEEE Trans. Robot. 2021, 37, 1874–1890. [Google Scholar] [CrossRef]
  40. Engel, J.; Koltun, V.; Cremers, D. Direct Sparse Odometry. IEEE Trans. Pattern Anal. Mach. Intell. 2017, 40, 611–625. [Google Scholar] [CrossRef]
  41. Lowe, D.G. Distinctive Image Features from Scale-Invariant Keypoints. Int. J. Comput. Vis. 2004, 60, 91–110. [Google Scholar] [CrossRef]
  42. Xia, Z.; Alahi, A. FG2: Fine-Grained Cross-View Localization by Fine-Grained Feature Matching. In Proceedings of the Computer Vision and Pattern Recognition Conference, Nashville, TN, USA, 11–15 June 2025; pp. 6362–6372. [Google Scholar]
Figure 1. UAV platforms used for data acquisition.
Figure 2. The overall area distribution and sample aerial images in SatLoc: (a) overall trajectory distribution; (b) different types of geographic scenes.
Figure 3. Sample images from the dataset showing the difficulty classifications: (a) the “hard” category, with a difficulty index above the 66th percentile; (b) the “normal” category, with a difficulty index between the 33rd and 66th percentiles; and (c) the “easy” category, with a difficulty index below the 33rd percentile.
Figure 4. SatLoc-Fusion system block diagram, showing the three localization layers and the adaptive fusion module.
Figure 5. Diagram of the image matching stage: the red arrows indicate the feature extraction and storage process in the offline stage, while the green arrows represent the UAV instant image retrieval process in the online stage.
Figure 6. Sample images from the flight tests. In the first row, the green dot in the left image represents the projection of the center point of the drone image on the right onto the satellite image; the location of this dot allows the drone’s geographic coordinates to be estimated. The green circles and the lines connecting them in the second row represent the feature point pairs matched by XFeat.
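To make the center-point projection described in the Figure 6 caption concrete, the following minimal sketch estimates a homography from XFeat-style keypoint matches and projects the UAV image center into satellite-tile pixel coordinates. The match arrays, inlier threshold, and helper name are illustrative assumptions, and converting the projected pixel to geographic coordinates still requires the tile’s georeference.

```python
import cv2
import numpy as np

def project_uav_center(uav_pts, sat_pts, uav_w, uav_h):
    """Project the UAV image center into satellite-tile pixel coordinates.

    uav_pts, sat_pts: (N, 2) arrays of matched keypoints (e.g., from XFeat).
    Returns the center's (u, v) in the satellite tile, or None if matching fails.
    """
    if len(uav_pts) < 4:
        return None
    H, inliers = cv2.findHomography(uav_pts.astype(np.float32),
                                    sat_pts.astype(np.float32),
                                    cv2.RANSAC, ransacReprojThreshold=3.0)
    if H is None or inliers.sum() < 8:
        return None
    center = np.array([[[uav_w / 2.0, uav_h / 2.0]]], dtype=np.float32)
    projected = cv2.perspectiveTransform(center, H)
    # Pixel coordinates in the tile; map to lat/lon via the tile's georeference.
    return projected[0, 0]
```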
Figure 7. Comparison of localization trajectories of different methods in SatLoc dataset scenarios.
Figure 8. An illustration of the fused output from the hierarchical algorithm framework on a local trajectory. The different colors of the trajectory positioning points indicate the algorithm layer with the dominant confidence in the corresponding segment: green = Layer 1 absolute positioning layer, orange = Layer 2 XFeat relative positioning layer, and red = Layer 3 optical flow layer. As can be seen, in the feature-poor scene highlighted by the red box, low-level pixel information can still be utilized for horizontal motion estimation. This compensates for the gaps where absolute localization fails, ensuring a high trajectory localization success rate.
Figure 9. Time series curves of confidence, state covariance, and position estimation. When the confidence of Layer 1 falls below the threshold, the fusion method continues to use the estimation results of Layer 2 and Layer 3, whose uncertainty is reflected in the state covariance and uncertainty ellipse.
Table 1. SatLoc dataset specifications.
Feature | Specification
Platform and Sensors
Primary Platform | DJI Mavic Air 2
Secondary Platform | Custom PX4-based Quadrotor
UAV Image Resolution | 1920 × 1080 pixels
UAV Image Format | JPEG
Satellite Imagery Source | Google Earth/Siwei Earth (Level-18 tiles)
Dataset Scale and Scope
Total Trajectories | 50
Total Trajectory Length | 395 km
Total UAV Images | 48,162
Total Dataset Size | ∼529.1 GB
Geographic Coverage | 136.3 sq km
Flight and Imaging Parameters
Flight Altitude Range | 100 m–300 m
UAV GSD Range | ∼0.38 m/px–∼0.57 m/px
Satellite GSD | ∼0.52 m/px (for Level-18)
Temporal Variation Analysis
UAV Data Collection Period | September 2024–February 2025
Satellite Imagery Vintage | 2022–2024
Typical Temporal Gap | 3–18 months
Seasonal Mismatches | Yes (∼40% of trajectories feature significant seasonal differences, e.g., green foliage vs. bare trees, to test robustness)
The dataset can be accessed at https://github.com/ameth64/SatLoc (accessed on 14 August 2025).
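For context on the Satellite GSD row, the nominal resolution of a Web-Mercator tile pyramid at a given zoom level and latitude can be approximated as below. The 256-pixel tile size and the assumption that the imagery providers follow the standard Web-Mercator scheme are ours, so the result is only indicative of the ∼0.52 m/px figure in Table 1.

```python
import math

def web_mercator_gsd(latitude_deg, zoom, tile_size=256):
    """Approximate ground resolution (m/px) of a Web-Mercator tile pyramid."""
    earth_circumference = 2 * math.pi * 6378137.0  # metres at the equator
    return earth_circumference * math.cos(math.radians(latitude_deg)) / (tile_size * 2 ** zoom)

# At zoom level 18 the resolution is ~0.60 m/px at the equator and
# ~0.55 m/px around 23° N, broadly consistent with the ~0.52 m/px in Table 1.
print(round(web_mercator_gsd(23.0, 18), 3))
```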
Table 2. Comparison of the SatLoc dataset with existing related datasets.
Feature | SatLoc (This Paper) | VDUAV | University-1652 | AerialVL
Platform Focus | Small-scale rotorcraft UAV | General UAV (virtual) | UAV/Ground/Satellite | UAV (rotorcraft)
Data Source | Real-world | Digital twin (simulation) [2] | Real-world/Simulated [23] | Real-world [3]
Environment Diversity | High (urban/rural/lake and wetland/mountainous forest/road network, etc., different times, different weather) | High (city/plain/hills, etc.) [2] | Medium (mainly campus) [23] | Medium (urban/farmland/road network, different times) [3]
Scale | 395 km | 12.4k images (virtual-reality scenes) | 1652 location types [23] | ~70 km [3]
Ground Truth | GPS/Fusion (accuracy 1.5 m) | Virtual coordinate mapping (sub-meter) [2] | GPS tags/Simulated [23] | GPS/Fusion (accuracy 1.5 m) [3]
Satellite Map | Provides corresponding tiles and reference full map | Provides corresponding tiles [2] | Provides corresponding images [23] | Provides reference map database [3]
Main Limitations Addressed | Real small-scale platform data, diverse real scenes | Low simulation cost, easy scene expansion [2] | Multi-view data [23] | Large-scale real trajectory framework comparison [3]
Table 3. Overall performance comparison of SatLoc-Fusion with SOTA methods on the SatLoc dataset.
Method | Metrics | Traj. 1 | Traj. 2 | Traj. 3 | Avg.
SatLoc-Fusion (ours) | MLE (m) | 14.05 | 14.12 | 16.57 | 14.91
SatLoc-Fusion (ours) | Succ. Rate (%) | 96 | 97 | 87 | 93
ORB-SLAM3 [41] | MLE (m) | 23.57 | 23.51 | 24.72 | 23.93
ORB-SLAM3 [41] | Succ. Rate (%) | 60 | 55 | 54 | 56
DSO [40] | MLE (m) | 24.44 | 23.98 | 24.78 | 24.40
DSO [40] | Succ. Rate (%) | 58 | 54 | 51 | 54
Farneback [42] | MLE (m) | 30.08 | 29.86 | 30.56 | 30.17
Farneback [42] | Succ. Rate (%) | 32 | 34 | 31 | 32
Table 4. Overall performance comparison of the SatLoc-Fusion method with SOTA methods on the AerialVL dataset.
Method | Short Trajectory MLE (m) ↓ | Short Trajectory Succ. Rate (%) ↑ | Long Trajectory MLE (m) ↓ | Long Trajectory Succ. Rate (%) ↑
SatLoc-Fusion (ours) | 18.32 | 92.1 | 14.05 | 90.5
ORB-SLAM3 [41] | 28.93 | 38.4 | 27.39 | 42.5
DSO [40] | 27.25 | 46.4 | 25.72 | 48.5
Farneback [42] | 26.53 | 50.4 | 27.31 | 46.0
AerialVL Comb. Method 1 [3] | 20.01 | 71.0 | 22.41 | 55.5
AerialVL Comb. Method 2 [3] | 22.27 | 80.0 | 15.86 | 85.5
Note: ↓ indicates smaller is better and ↑ indicates larger is better; the same applies to the tables below.
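For reproducibility, the MLE and Succ. Rate columns in Tables 3 and 4 can be computed along the lines of the sketch below; the per-sample success threshold used here is an assumed illustrative value, not necessarily the one used in our evaluation protocol.

```python
import numpy as np

def trajectory_metrics(est_xy, gt_xy, success_threshold_m=20.0):
    """Mean localization error (MLE) and success rate over one trajectory.

    est_xy, gt_xy: (N, 2) arrays of estimated and ground-truth positions (m),
    time-aligned sample by sample. The threshold value is illustrative only.
    """
    errors = np.linalg.norm(est_xy - gt_xy, axis=1)   # per-sample Euclidean error
    mle = float(np.mean(errors))                      # mean localization error (m)
    success_rate = 100.0 * float(np.mean(errors < success_threshold_m))
    return mle, success_rate
```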
Table 5. SatLoc-Fusion ablation study results (on the SatLoc dataset).
Configuration | ATE (m) ↓ | Success Rate (%) ↑ | Frequency (Hz) ↑
Full Pipeline | 14.05 | 94 | 2.03
Remove Layer 1 (Absolute Loc.) | 27.84 | 55 | 24.5
Remove Layer 2 (Relative Loc.) | 18.81 | 57 | 2.8
Remove Layer 3 (Velocity Est.) | 16.42 | 74 | 2.23
Remove Adaptive Fusion (Static Fusion) | 17.85 | 85 | 2.11
Table 6. Runtime performance of each layer in SatLoc-Fusion.
Configuration | Time Cost (ms) | Remark
Layer 1 (Absolute Loc.) | 482 | -
Layer 2 (Relative Loc.) | 28 | Input image resized to 640 × 480
Layer 3 (Velocity Est.) | 20 | Uses pyramidal Lucas–Kanade (LK) optical flow
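As a hedged illustration of the Layer 3 remark in Table 6, the sketch below estimates a planar ground velocity from two consecutive frames with OpenCV’s pyramidal Lucas–Kanade tracker. The feature-selection parameters, the assumption of a nadir camera at roughly constant altitude, and the GSD/frame-interval inputs are illustrative rather than the exact Layer 3 configuration.

```python
import cv2
import numpy as np

def lk_ground_velocity(prev_gray, curr_gray, gsd_m_per_px, dt_s, max_corners=200):
    """Estimate planar ground velocity (m/s) from two consecutive grayscale frames
    using pyramidal Lucas-Kanade optical flow (nadir view, roughly constant altitude)."""
    p0 = cv2.goodFeaturesToTrack(prev_gray, maxCorners=max_corners,
                                 qualityLevel=0.01, minDistance=10)
    if p0 is None:
        return None
    p1, status, _err = cv2.calcOpticalFlowPyrLK(prev_gray, curr_gray, p0, None,
                                                winSize=(21, 21), maxLevel=3)
    good = status.ravel() == 1
    if good.sum() < 10:
        return None
    # Median flow is robust to a few bad tracks; sign convention depends on camera mounting.
    flow_px = np.median((p1[good] - p0[good]).reshape(-1, 2), axis=0)
    return flow_px * gsd_m_per_px / dt_s  # image-plane displacement converted to m/s
```

The median flow keeps the estimate usable even when a minority of tracks drift, which is one reason pyramidal LK remains attractive for the low-cost velocity layer.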