**4. Discussion**

Autonomous, high-precision, real-time localization under GNSS and other navigation-module failures is essential for the safe, all-scenario application of UAVs. By matching aerial images against satellite maps, vision-based UAV localization can acquire precise coordinates without auxiliary information. However, aerial images collected in real time are degraded by non-ideal effects such as shadow occlusion, targets that occupy only a few pixels, and edges blurred by flight jitter, so vision-based localization remains challenging. The experimental results show that M-O SiamRPN with the weight-adaptive joint multiple-intersection-over-union loss achieves accurate, purely visual localization, reaching a precision of 0.974 and a success rate of 0.732.

Using ResNet50 embedded with second-order information as the feature-extraction backbone significantly improves the feature representation of the framework. The gains in precision and success rate can be attributed to two aspects. (1) Edges are essentially abrupt changes in pixel values, and second-order information is more sensitive to such changes; this is consistent with the effectiveness of second-order pooling [45] and the second-order features of bilinear CNNs [46] in fine-grained classification. In addition, the proposed spatial continuity criterion for evaluating first-order features selects feature maps that retain more adequate local information. (2) Fusing multi-order features enhances the overall nonlinearity of the network at the feature-map level, allowing it to fit the data better.

In the optimization phase of the framework, we designed a new loss function to address sample imbalance. For the classification branch of the Siamese framework, the penalty coefficient of the cross-entropy loss is adjusted automatically by comparing the distribution of positive samples within each batch to the ideal balanced distribution, which increases the contribution of minority samples to the loss. In the regression branch, constraining both the non-overlapping area and the diagonal of the anchor boxes improves behavior when the predicted box and the ground-truth box do not overlap, as well as when one box encloses the other.
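To make these two components concrete, the following minimal PyTorch sketches illustrate (a) covariance-style second-order pooling of a backbone feature map and (b) a batch-adaptive classification loss paired with a regression loss constrained by the non-overlapping area and the enclosing-box diagonal. The function names, the re-weighting rule, and the exact penalty terms (GIoU- and DIoU-style) are illustrative assumptions, not the exact implementation used in the framework.

```python
import torch

def second_order_pooling(feat: torch.Tensor) -> torch.Tensor:
    """Covariance (second-order) pooling of a backbone feature map.

    feat: (B, C, H, W) first-order feature map, e.g., from ResNet50.
    Returns a (B, C, C) channel covariance matrix; because covariance
    captures co-variation of channel responses, it reacts more strongly
    to abrupt local changes (edges) than first-order average pooling.
    """
    b, c, h, w = feat.shape
    x = feat.reshape(b, c, h * w)           # flatten spatial positions
    x = x - x.mean(dim=2, keepdim=True)     # center each channel
    return torch.bmm(x, x.transpose(1, 2)) / (h * w - 1)
```

The loss sketch below re-weights the cross-entropy penalty from the observed positive-sample ratio in each batch and adds two penalties to the IoU term: the fraction of the enclosing box not covered by the union (effective when the boxes do not overlap) and the normalized center distance over the enclosing-box diagonal (effective when one box encloses the other).

```python
import torch
import torch.nn.functional as F

def adaptive_cls_loss(logits, labels, ideal_pos_ratio=0.5, eps=1e-6):
    """Cross-entropy with a batch-adaptive penalty on positive samples.

    The positive-class weight grows as the observed positive ratio in the
    batch falls below the ideal balanced ratio, raising the contribution
    of the minority samples to the loss (illustrative rule).
    """
    pos_ratio = labels.float().mean().clamp(min=eps)
    pos_weight = (ideal_pos_ratio / pos_ratio).detach()
    return F.binary_cross_entropy_with_logits(
        logits, labels.float(), pos_weight=pos_weight)


def joint_iou_loss(pred, target, eps=1e-7):
    """IoU loss with non-overlap-area and diagonal penalties.

    Boxes are (x1, y1, x2, y2). The enclosing-box area term handles
    non-overlapping boxes; the center-distance / diagonal term handles
    the case where one box fully encloses the other.
    """
    # Intersection and union
    ix1 = torch.max(pred[:, 0], target[:, 0])
    iy1 = torch.max(pred[:, 1], target[:, 1])
    ix2 = torch.min(pred[:, 2], target[:, 2])
    iy2 = torch.min(pred[:, 3], target[:, 3])
    inter = (ix2 - ix1).clamp(min=0) * (iy2 - iy1).clamp(min=0)
    area_p = (pred[:, 2] - pred[:, 0]) * (pred[:, 3] - pred[:, 1])
    area_t = (target[:, 2] - target[:, 0]) * (target[:, 3] - target[:, 1])
    union = area_p + area_t - inter
    iou = inter / (union + eps)

    # Smallest box enclosing both prediction and ground truth
    ex1 = torch.min(pred[:, 0], target[:, 0])
    ey1 = torch.min(pred[:, 1], target[:, 1])
    ex2 = torch.max(pred[:, 2], target[:, 2])
    ey2 = torch.max(pred[:, 3], target[:, 3])
    enclose_area = (ex2 - ex1) * (ey2 - ey1)

    # Penalty 1: fraction of the enclosing box not covered by the union
    non_overlap = (enclose_area - union) / (enclose_area + eps)

    # Penalty 2: squared center distance over the enclosing-box diagonal
    c_dist = ((pred[:, 0] + pred[:, 2]) - (target[:, 0] + target[:, 2])) ** 2 / 4 \
           + ((pred[:, 1] + pred[:, 3]) - (target[:, 1] + target[:, 3])) ** 2 / 4
    diag = (ex2 - ex1) ** 2 + (ey2 - ey1) ** 2
    diag_pen = c_dist / (diag + eps)

    return (1.0 - iou + non_overlap + diag_pen).mean()
```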

Although the proposed model has proven to be accurate and efficient, it does have limitations. Analysis of the failure samples shows that the precision and success rate of localization are lower in densely vegetated mountainous areas. This is because of the small tree spacing and dense canopy in mountainous and forested regions, where even different tree species have similar canopy shapes and colors. In addition, terrain slopes are not clearly reflected in the aerial images in the absence of other salient references, which further increases the difficulty of localization. Currently, UAV vision relies mainly on visible-light images. In some special applications, it can be supplemented with information from other wavelength bands, for example by adding multispectral or hyperspectral cameras. With this additional spectral information, the UAV can obtain both flight-altitude and terrain-slope features during visual localization, enriching the information contained in aerial images of complex environments. In particular, the reconstruction of targets from multispectral information and the effective fusion of multi-source features are worth further investigation.
