3.1. Dataset and Pre-Processing Analysis
This study evaluates GeoNR-PSW using the public Raymobtime ray-tracing channel dataset. Specifically, this work utilizes scenario s007, as it uniquely provides data for two carrier frequencies (2.8 GHz and 60 GHz), includes modeling of ten car-mounted receivers, and represents a dense urban boulevard environment in Beijing.
Each simulation episode (e0–e49) contains 40 sequential scenes captured at a cadence of 1 Hz. The episodes are spaced five seconds apart, and each receiver measurement is uniquely identified by its episode, scene, and receiver index.
The analysis begins by examining the dynamic movement patterns of the receivers, including the resulting diversity in azimuth angles and transmitter–receiver ranges. To illustrate this,
Figure 2 depicts the trajectories of receivers for three selected simulation episodes (Episode 0, Episode 15, and Episode 30) along a fan-shaped elevated road. The scatter plots reveal a wide range of movement directions and distances, showcasing the varied spatial relationships between the receivers and the transmitter. Specifically, the trajectories range from horizontal movements near the base of the road to vertical shifts in the middle and consistent horizontal paths at higher elevations. Consequently, receivers encounter a broad spectrum of angles and distances relative to the transmitter.
This study explores how receiver heights and environmental shadowing influence signal quality, specifically focusing on transitions between line-of-sight (LOS) and non-line-of-sight (NLOS) conditions.
Figure 3 visualizes this phenomenon, presenting a three-dimensional distribution of received power. Within this visualization, color-coded trajectories indicate varying power levels across different elevations and positions.
The visualization clearly highlights the impact of environmental occlusions. Distinct transitions from LOS to NLOS conditions are observable as receivers move through regions with differing levels of shadowing. Furthermore, the received power levels, indicated by the color bar, notably decrease when receivers move to higher elevations or encounter obstructions, underscoring the critical role of spatial positioning in signal quality.
This section provides a comprehensive overview of the Raymobtime dataset. It details all available data fields, which include both metadata and detailed channel-path characteristics. Metadata encompasses spatial coordinates, episode identifiers, and receiver indices. Channel-path characteristics cover received power, time of arrival, angles of departure and arrival, and line-of-sight indicators. This detailed breakdown offers essential context for understanding the dataset’s structure and the specific parameters used in the analysis of receiver trajectories and signal quality. A full list of these data fields is presented in
Table 1.
A quantitative analysis of multipath characteristics at 2.8 GHz and 60 GHz reveals distinct differences between the two frequencies.
Figure 4a presents a histogram of the valid path count. This histogram indicates that 60 GHz links have significantly fewer valid paths than 2.8 GHz, with a higher concentration of frames observed at lower path counts for 60 GHz.
Figure 4b displays the cumulative distribution function (CDF) of the power of the ten strongest rays. Here, the 60 GHz curve rises more sharply at lower power levels compared to the gradual increase for 2.8 GHz. This sharper CDF and reduced path count confirm that 60 GHz links are sparser and characterized by fewer and weaker multipath components. These path statistics are depicted in
Figure 4.
The angular-domain energy distribution for 2.8 GHz and 60 GHz is investigated to understand their spatial characteristics.
Figure 5a and
Figure 5b present heat-maps of the angle of departure (AoD) in the azimuth-elevation domain for 2.8 GHz and 60 GHz, respectively. The 60 GHz heat-map shows a more focused energy distribution within narrower angular sectors. In contrast, the 2.8 GHz heat-map reveals broader angular coverage. Additionally, line-of-sight (LOS) conditions across episodes are examined.
Figure 5c plots the episode-wise LOS percentage, demonstrating that 60 GHz channels consistently exhibit fewer LOS frames compared to 2.8 GHz. These findings highlight the constrained angular spread and reduced LOS availability at 60 GHz. These spatial characteristics are illustrated in
Figure 5.
The filtering process exclusively removes links flagged as invalid in the Raymobtime metadata, including entries with placeholder coordinates. Notably, LOS/NLOS labels and path statistics are not employed as filtering criteria. While the count of receiver–scene pairs decreases (e.g., to approximately 45% after filtering), each retained link keeps its ten strongest rays, thereby preserving principal multipath energy and geometric information.
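A minimal sketch of this metadata-driven filter, assuming hypothetical field names (`val_flag`, `xyz`) and a placeholder-coordinate convention; the actual Raymobtime schema may differ:

```python
import numpy as np

def keep_link(meta: dict) -> bool:
    """Metadata-driven validity filter (sketch). The field names 'val_flag'
    and the (-1, -1, -1) placeholder convention are assumptions."""
    if meta.get("val_flag") != "V":  # drop links not flagged valid
        return False
    if tuple(meta.get("xyz", ())) == (-1.0, -1.0, -1.0):  # placeholder coords
        return False
    return True

def top_k_rays(ray_power_dbm: np.ndarray, k: int = 10) -> np.ndarray:
    """Keep the k strongest rays of one link, preserving principal energy."""
    order = np.argsort(ray_power_dbm)[::-1]  # strongest first
    return ray_power_dbm[order[:k]]
```

Note that no channel statistic (LOS label, path count) enters `keep_link`, mirroring the metadata-only criterion described above.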
Post-filter distributions remain aligned with the raw dataset. Path count histograms and per-ray power CDFs continue to show the expected FR1/FR2 contrast. Similarly, AoA/AoD heat-maps maintain their angular structure. Coupled with a near-zero missing-path percentage after filtering, this indicates that diversity is preserved while label quality improves for supervision. As a practical future direction, methods to better utilize lower-quality frames will be explored. This includes noise-aware weighting and semi-supervised pretraining on low-confidence frames, all under the same split protocol.
Because the filter is metadata-driven rather than channel-statistic–driven, the retained set preserves the LOS/NLOS composition across episodes. Truncating to the top-10 rays retains the majority of link power while preserving angular trends. Terminal orientation is not explicitly annotated in Raymobtime; robustness to orientation variation is approximated through the natural angular diversity induced by the drive trajectories. This topic is further addressed in
Section 5, where a planned field-validation protocol using IMU-conditioned features is outlined.
The pre-processing pipeline’s impact on multipath ray data is evaluated to enhance data quality and retain essential information.
Figure 6a,
Figure 6b and
Figure 6c present power matrices for raw, filtered, and normalized data, respectively. The raw data exhibits significant variability with weak rays in low-power regions. Filtering addresses this by removing these weak rays, thereby reducing noise and enhancing clarity. Subsequently, normalization stabilizes variance, leading to a consistent power distribution.
Following filtering,
Figure 6d and
Figure 6e display the mean angle of departure (AoD) and angle of arrival (AoA) in the azimuth direction. These indicate that central paths align closely with the line of sight, while outer paths exhibit greater angular deviations. Furthermore,
Figure 6f illustrates that the missing path percentage approaches 0% for most episodes and paths post-filtering, confirming the pipeline’s effectiveness. Overall, the pre-processing retains over 95% of principal information, significantly improving data quality.
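The variance-stabilizing step can be illustrated with a per-feature z-score; the paper does not specify the exact normalization, so this scheme is an assumption:

```python
import numpy as np

def normalize_power(p_dbm: np.ndarray) -> np.ndarray:
    """Per-feature z-score normalization of ray powers (sketch; the exact
    normalization used in the pipeline is an assumption).
    p_dbm: (n_samples, n_features) array of filtered ray powers."""
    mu = np.nanmean(p_dbm, axis=0)
    sigma = np.nanstd(p_dbm, axis=0) + 1e-8  # guard against zero variance
    return (p_dbm - mu) / sigma
```

After this step, every feature column has approximately zero mean and unit variance, which is consistent with the "consistent power distribution" observed in the normalized matrix.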
Initially, the dataset comprises 20,000 receiver–scene combinations; after filtering out invalid entries (retaining only those flagged Val = V), 9078 valid samples (45.4%) remain. Each link retains the ten strongest multipath rays, yielding a 70-dimensional channel vector (ten paths × seven attributes per path), to which the receiver’s coordinates are concatenated.
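The channel-vector construction can be sketched as follows; the attribute ordering and array layout are assumptions for illustration:

```python
import numpy as np

N_PATHS, N_ATTRS = 10, 7  # ten strongest rays, seven attributes per ray

def build_feature(rays: np.ndarray, rx_xyz: np.ndarray) -> np.ndarray:
    """Flatten the (10, 7) ray matrix into a 70-dim channel vector and
    concatenate the receiver coordinates (ordering is an assumption)."""
    assert rays.shape == (N_PATHS, N_ATTRS)
    return np.concatenate([rays.reshape(-1), rx_xyz])

vec = build_feature(np.zeros((N_PATHS, N_ATTRS)), np.zeros(3))
assert vec.shape == (73,)  # 70 channel features + 3 receiver coordinates
```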
The episodes are divided into training, validation, and testing subsets as detailed in
Table 2, ensuring temporal independence between subsets. Results presented primarily focus on the more challenging 2.8 GHz frequency due to its richer multipath environment.
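An episode-level split of this kind might be sketched as below; the exact episode assignments of Table 2 are not reproduced here, and the split ratio shown is an assumption:

```python
# Episode-level split sketch: whole episodes go to one subset, so no scene
# of a test episode appears during training (temporal independence).
# The 60/20/20 ratio is an assumption, not the ratio used in Table 2.
def split_episodes(episodes, train=0.6, val=0.2):
    n = len(episodes)
    n_train, n_val = int(n * train), int(n * val)
    return (episodes[:n_train],
            episodes[n_train:n_train + n_val],
            episodes[n_train + n_val:])

tr, va, te = split_episodes(list(range(50)))
# The three subsets are pairwise disjoint at episode granularity.
assert not (set(tr) & set(va)) and not (set(va) & set(te)) and not (set(tr) & set(te))
```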
3.3. Overall Model Performance
Accurate three-dimensional (3D) localization is the ultimate objective of this work; therefore, a thorough examination of the training dynamics and the final inference quality is indispensable. To provide a comprehensive benchmark that later few-shot and down-sampling studies can rely on, this study evaluates two frequency regimes—2.8 GHz and 60 GHz—using the full Raymobtime data split.
The evolution of the training loss together with the validation mean, median, and root-mean-square errors exhibits a steep decline during the first 25 epochs, followed by a plateau where all curves fluctuate within a narrow band—evidence that the networks have converged without pronounced overfitting, as shown in
Figure 7. Owing to the more directive multipath structure and the larger antenna aperture available at millimeter-wave frequencies, the 60 GHz configuration attains noticeably lower error values than its 2.8 GHz counterpart.
A visual juxtaposition of predicted and ground-truth coordinates confirms that the model output is unbiased along each Cartesian axis; this is indicated by the substantial overlap between the predicted and actual point clouds, as presented in
Figure 8. The predictions at 60 GHz form a distinctly tighter cluster, reflecting the improved numerical performance of this frequency band. Both frequency models achieve their lowest mean errors at epoch 171. Nevertheless, a limited number of outliers persist. These are primarily located at extended distances and higher elevations, corresponding to scenarios with diminished signal-to-noise ratios.
A concise quantitative overview of the best validation and test results is presented in
Table 3. The model trained and evaluated at 60 GHz achieves a validation mean 3D error of 4.718 m, significantly improving upon the 8.746 m error observed at 2.8 GHz. This improvement is consistently reflected across median, root-mean-square, and 90th-percentile error metrics. Similar trends are observed in the test performance, where the 60 GHz model maintains substantially lower localization errors compared to the 2.8 GHz configuration.
Conversely, cross-frequency testing demonstrates markedly degraded accuracy. Applying the trained 2.8 GHz model weights directly to the 60 GHz test set yields a mean error of 58.208 m, while transferring the 60 GHz model to the 2.8 GHz test set results in a mean error of 101.776 m. These findings highlight the limited transferability of learned spatial features between different frequency domains, underscoring the necessity of dedicated training for each targeted operating frequency.
The proposed model demonstrated competitive localization accuracy and robustness, consistently outperforming baseline methods across both the 2.8 GHz and 60 GHz bands. At 2.8 GHz, the model achieved lower mean and 90th percentile errors, indicating a more effective approach for handling complex signal propagation compared to structured models like graph neural networks (GNNs).
This advantage was particularly pronounced at 60 GHz, where the proposed model exhibited a balanced performance. This contrasts sharply with kNN, which, despite its low median error, showed a high root-mean-square error (RMSE). This discrepancy suggests that while kNN performs adequately on familiar patterns, it struggles with generalization to unseen data—a critical limitation effectively addressed by the proposed model. The improved performance, as detailed in
Table 4, is attributed to the translation of signal measurements into pseudo-signal word sequences, which enables the large language model (LLM) to leverage contextual reasoning and generalization capabilities beyond those of traditional models.
3.4. Few-Shot Adaptation
In a realistic deployment, collecting a full-scale training set for every new site is impractical. This study therefore investigates few-shot adaptation: the network is exposed to only k spatial reference frames per episode, after which the prediction head is fine-tuned. The remaining samples are kept for evaluation. All experiments are conducted independently on the sub-6 band (2.8 GHz) and the mmWave band (60 GHz).
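As a simplified stand-in for the head-only adaptation step, the sketch below refits a linear prediction head on the k anchor frames by ridge regression; the actual system fine-tunes the prediction head of the LLM pipeline, so this is an illustrative assumption:

```python
import numpy as np

def adapt_head(feats_k: np.ndarray, coords_k: np.ndarray, lam: float = 1e-6):
    """Refit a linear prediction head on k anchor frames via ridge regression.
    feats_k: (k, d) frozen backbone features; coords_k: (k, 3) positions.
    Simplified stand-in for the head fine-tuning used in the paper."""
    d = feats_k.shape[1]
    A = feats_k.T @ feats_k + lam * np.eye(d)
    return np.linalg.solve(A, feats_k.T @ coords_k)  # (d, 3) head weights

# Toy demonstration: k = 4 anchor frames with d = 3 features.
X = np.array([[1., 0., 0.], [0., 1., 0.], [0., 0., 1.], [1., 1., 1.]])
W_true = np.eye(3)
Y = X @ W_true
W = adapt_head(X, Y)
assert np.allclose(X @ W, Y, atol=1e-3)  # head reproduces anchor positions
```

The backbone stays frozen in this picture; only the small head is re-estimated from the k in situ calibration points, which is why a single reference frame can already shift the error budget substantially.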
This work reports five localization error metrics—mean, median, RMSE, 90th-percentile (p90), and maximum (max) distance error—averaged over 40 training episodes. The baseline (no adaptation) corresponds to the cross-scene zero-shot setting. All errors are measured in three-dimensional Euclidean distance (meters).
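The five statistics can be computed directly from per-sample 3D Euclidean distances, e.g.:

```python
import numpy as np

def loc_metrics(pred: np.ndarray, true: np.ndarray) -> dict:
    """Five 3D localization error statistics used throughout this section.
    pred, true: (n, 3) arrays of predicted and ground-truth positions (m)."""
    err = np.linalg.norm(pred - true, axis=1)  # 3D Euclidean distance (m)
    return {
        "mean": float(err.mean()),
        "median": float(np.median(err)),
        "rmse": float(np.sqrt(np.mean(err ** 2))),
        "p90": float(np.percentile(err, 90)),
        "max": float(err.max()),
    }
```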
The results of few-shot adaptation experiments are visually summarized in
Figure 9, which comprises two subfigures: (a) plots the mean and median localization errors (in meters) against the number of reference frames
k on a log-scale, and (b) illustrates the relative percentage change in all five error metrics compared to the
baseline, with blue indicating improvements (lower is better).
Table 5 provides a detailed numerical breakdown of these metrics for both frequency bands across the range of
k values. Three key observations emerge.
Incorporating just one reference frame (k = 1) yields substantial improvements. For the 2.8 GHz band, the mean error decreases by 64.6% (from 24.70 m to 8.75 m), and the RMSE drops by 73.0% (from 53.06 m to 14.34 m). For the 60 GHz band, the reductions are 52.9% (from 10.01 m to 4.72 m) for the mean and 49.3% (from 13.85 m to 7.02 m) for the RMSE. This demonstrates that a single in situ calibration point can effectively halve the error budget, underscoring the potential of few-shot learning to achieve rapid adaptation with minimal data.
Additional reference frames continue to enhance performance up to a moderate k, with a notable 70.6% RMSE reduction at 2.8 GHz. Beyond this point, however, the benefits diminish and slight regressions are observed at the largest k values (e.g., the mean error for 2.8 GHz increases from 11.38 m to 15.25 m). This plateau effect suggests mild overfitting to the sparse anchor points, indicating an optimal adaptation range at small-to-moderate k.
The 60 GHz band exhibits a lower initial median error (6.91 m) than 2.8 GHz (13.09 m), reflecting its narrower angular spread and reduced multipath richness. However, the relative improvement diminishes as k increases, with percentage changes stabilizing (e.g., ΔMedian moves from −52.97% to −20.20%). This highlights a band-specific response to additional reference frames, where mmWave benefits less from extra data due to its inherent signal properties.
These findings robustly validate the central hypothesis of the system: accurate localization is achievable with minimal site-specific data. The ability to significantly reduce localization errors with as few as one or two reference frames is a critical attribute, enabling scalable and efficient low-altitude navigation solutions. This adaptability not only enhances deployment flexibility but also supports the practical implementation of the framework in resource-constrained environments. Diminishing returns emerge as k increases, with occasional degradation at the largest k. Longer prompts may introduce heterogeneous or conflicting anchors that dilute in-context signals and increase the risk of overfitting to sparse anchor geometry. Under a fixed prompt budget, a small anchor set selected via validation, typically only a few frames, is advisable to stabilize performance.
3.5. Training Data Efficiency
Accurate navigation in dense low-altitude airspaces must remain robust when only a small fraction of the available measurements is labeled. To quantify this requirement, localization models are trained with five progressively smaller portions of the labeled set (train ratios from 1.0 down to 0.2). Two carrier frequencies are evaluated independently: 2.8 GHz and 60 GHz. For every ratio, the network is trained until validation performance saturates; the epoch with the lowest validation median error is then used to compute all statistics on a disjoint test set.
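The progressively smaller labeled portions can be drawn as in the sketch below; the paper does not specify its sampling scheme, so uniform random subsampling is an assumption:

```python
import numpy as np

def subsample_train(indices: np.ndarray, ratio: float, seed: int = 0) -> np.ndarray:
    """Keep a fixed fraction of the labeled training samples (sketch; the
    paper's exact sampling scheme is an assumption here)."""
    rng = np.random.default_rng(seed)
    n_keep = max(1, int(round(len(indices) * ratio)))
    return rng.choice(indices, size=n_keep, replace=False)

# One subset per train ratio used in the study.
for r in (1.0, 0.8, 0.6, 0.4, 0.2):
    kept = subsample_train(np.arange(1000), r)
    assert len(kept) == int(round(1000 * r))
```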
Localization error distributions for five key statistics (mean, median, root-mean-square error, 90th percentile, and maximum) are depicted in
Figure 10. A box plot summarizes 2.8 GHz, whereas a violin plot illustrates the density for 60 GHz. Medians and inter-quartile ranges contract as the training ratio increases, confirming that additional supervision reduces both central tendency and spread. Dispersion is consistently smaller for 60 GHz, reflecting the sparser, more directive propagation at that band.
Empirical cumulative distribution functions (CDFs) of the three-dimensional positional error for all training ratios are presented in
Figure 11. As more labeled data are provided, the curves move steadily leftwards; however, beyond a ratio of 0.6, the additional gain is marginal. For 60 GHz, the curve obtained with only 40% of the training data nearly overlaps the full-data curve up to the 95th percentile, indicating that the high-frequency model is markedly data-efficient.
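The empirical CDF curves are straightforward to reproduce from per-sample errors:

```python
import numpy as np

def empirical_cdf(errors: np.ndarray):
    """Empirical CDF of 3D positional errors, as plotted per training ratio.
    Returns sorted errors x and cumulative probabilities y."""
    x = np.sort(errors)
    y = np.arange(1, len(x) + 1) / len(x)
    return x, y

x, y = empirical_cdf(np.array([3.0, 1.0, 2.0]))
assert list(x) == [1.0, 2.0, 3.0] and y[-1] == 1.0
```

A leftward shift of the curve with more labeled data corresponds directly to lower errors at every percentile.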
Table 6 lists the best-epoch statistics extracted from
Figure 10. At 2.8 GHz the mean error rises from 8.75 m to 16.77 m (+92%) when the ratio is reduced from 1.0 to 0.2, while the corresponding increase at 60 GHz is from 4.72 m to 8.19 m (+74%). Maximum errors for 2.8 GHz remain between 142 m and 182 m across all ratios, whereas the largest errors for 60 GHz drop sharply once the ratio exceeds 0.2. High-frequency sensing therefore not only improves average accuracy but also suppresses extreme outliers under limited supervision.
This study confirms that the 60 GHz model achieves near-optimal performance with as little as 40% of the labeled data, whereas the 2.8 GHz model experiences notable degradation under the same constraint. In scenarios where annotation is costly, prioritizing high-frequency measurements and semi-supervised strategies emerges as a promising pathway towards efficient large-scale deployment.
3.6. Model Depth Ablation
The preceding analyses imply that the transformer backbone and the graph neural network (GNN) refinement blocks exert complementary effects on the localization task. This work therefore examines how the depth of each branch influences accuracy on two frequency regimes. For GPT-2, depth corresponds to the number of transformer layers; for the GNN branch, depth denotes the number of message-passing blocks inserted at pipeline positions “0”, “1/4”, or “2/4”, encoded as #LLM*#pos_GNN*#pos_L_GNN in the log. All remaining hyper-parameters are held constant so that performance differences arise solely from architectural depth.
Each depth combination is trained to convergence and assessed with four navigation-relevant metrics: mean error, median error, root-mean-square error (RMSE), and 90th-percentile error (p90). Lower values indicate superior localization. The complete raw scores are listed in
Table 7.
The dual-polarity radar plots in
Figure 12 summarize the outcomes.
Figure 12a groups results by GNN depth and marginalizes over transformer layers, whereas
Figure 12b groups by GPT-2 depth and marginalizes over GNN configurations.
Both panels exhibit a distinct U-shaped trend: shallow models lack sufficient receptive field, whereas overly deep models suffer from optimization difficulties and over-smoothing, which elevates the RMSE and p90. A moderate configuration (four GPT-2 layers combined with four GNN blocks inserted at the second quartile of the pipeline, 4*0_2_4*0_2_4) attains the lowest median error (3.25 m) and the best p90 (8.48 m) at 60 GHz, while maintaining competitive accuracy at 2.8 GHz. These findings indicate that moderate depth paired with mid-pipeline insertion offers the best balance between accuracy and computational cost.
From an architectural perspective, mid-pipeline insertion lets message-passing over the path/anchor graph inject geometry-aware constraints after the transformer has formed local token aggregates but before global decoding, avoiding early-stage washout and late-stage under-propagation. Empirically, input-only or output-only placements underperform: omitting the input-stage GNN degrades accuracy, whereas coupling an input-stage GNN (for initial context alignment) with a mid-stage GNN (for refinement) yields the most consistent gains; placing GNNs only near the head risks over-smoothing and limited receptive depth. Consequently, this configuration is adopted as the default architecture in subsequent experiments.