**1. Introduction**

UAVs are widely used in national defense, agriculture, mapping, and other industries, owing to advantages such as autonomous flight, high flexibility, extensible functionality, and low cost [1,2]. Precise overhead localization is the foundation for UAV navigation and other extended functions [3]. Currently, the dominant UAV localization and navigation technologies are inertial navigation, Global Navigation Satellite Systems (GNSS), and combinations of the two [4]. Inertial navigation is commonly regarded as an auxiliary navigation technology because of its high short-term accuracy but large long-term errors [5]. GNSS is susceptible to the electromagnetic environment and to interference attacks, suffering from unstable signals and poor autonomy [6,7]. Scene matching-based UAV visual localization, which does not rely on external information, offers strong independence and anti-interference ability and has therefore become a research hotspot [8,9].

Visual localization is realized by matching the aerial image collected by a UAV in real time against a pre-stored reference image (usually a satellite image); the coordinates of the matched block in the reference image give the location of the UAV [10]. The key to visual localization is image feature extraction and matching, whose performance directly determines that of the localization and navigation system. The combination of local features with classifiers or clustering enables visual localization.

Liu et al. [11] designed a visual compass based on point and line features for UAV high-altitude orientation estimation, exploiting the appearance and geometric structure of point and line features in remote sensing images. Majdik et al. [12] employed a textured three-dimensional model and similar discrete places in a topological map to address the air–ground matching problem. Evidently, local features must be designed for specific tasks with specific invariants in mind, so experienced researchers and considerable time are indispensable. Therefore, deep learning-based visual localization has been proposed to automate feature extraction and detection end-to-end [13,14]. In particular, convolutional neural network (CNN)-based architectures with strong feature extraction capability have shown excellent performance in tasks such as target detection and image retrieval, and have been applied as general-purpose feature extractors for visual localization [15,16]. Wu et al. [17] introduced information-theoretic regularization into reinforcement learning for visual navigation and improved the success rate by 10% over some state-of-the-art models. Bertinetto et al. [18] first proposed the fully convolutional Siamese network (SiamFC), converting target matching into similarity learning. Subsequently, a series of CNN-based Siamese networks has been proposed because of their simple structure, end-to-end training, and high matching efficiency and speed. Improved variants such as DSiam [19] and SA-Siam [20] have achieved excellent performance, typically using AlexNet [21], ResNet [22], or DenseNet [23] as the backbone for feature extraction and correlation. Among them, Li et al. [24] constructed SiamRPN by incorporating the region proposal network (RPN) [25] into the Siamese network, which outputs the location and prediction score of the target by box regression and improves matching accuracy and speed simultaneously.
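
As a rough illustration of the similarity-learning idea behind SiamFC-style matching (not the authors' M-O SiamRPN), the following PyTorch sketch cross-correlates the features of an aerial-image template with those of a larger satellite search region and takes the response peak as the matched location; the backbone, input sizes, and variable names are placeholder assumptions.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class SiamMatcher(nn.Module):
    """Minimal SiamFC-style matcher: shared backbone + cross-correlation."""
    def __init__(self):
        super().__init__()
        # Toy shared backbone; a real system would use AlexNet/ResNet-style features.
        self.backbone = nn.Sequential(
            nn.Conv2d(3, 32, 3, stride=2, padding=1), nn.ReLU(),
            nn.Conv2d(32, 64, 3, stride=2, padding=1), nn.ReLU(),
            nn.Conv2d(64, 64, 3, stride=1, padding=1),
        )

    def forward(self, template, search):
        z = self.backbone(template)   # (1, C, h, w): UAV aerial-patch features
        x = self.backbone(search)     # (1, C, H, W): satellite search-region features
        # Cross-correlation: use the template features as a convolution kernel.
        return F.conv2d(x, z)         # (1, 1, H-h+1, W-w+1) similarity map

matcher = SiamMatcher().eval()
template = torch.randn(1, 3, 127, 127)   # aerial image patch (placeholder data)
search = torch.randn(1, 3, 255, 255)     # pre-stored satellite reference crop
with torch.no_grad():
    resp = matcher(template, search)
# The response peak, mapped back through the backbone stride, gives the
# coordinates of the matched block in the reference image, i.e., the UAV location.
peak = torch.nonzero(resp[0, 0] == resp[0, 0].max())[0]
print("response peak (row, col):", peak.tolist())
```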

Although local-feature-based and semantic-driven approaches are capable of visual localization, the challenge of matching tiny targets within a large overall image has not been addressed. Obvious drawbacks remain, as follows:


To overcome these insufficiencies, we propose a stretched Wallis shadow compensation method and a multi-order Siamese region proposal network (M-O SiamRPN) with a weight-adaptive joint multiple intersection over union (MIoU) loss function. The former is used for aerial image preprocessing, and the latter improves the edge detection of tiny targets and the robustness to information imbalance. The contributions are summarized as follows:

• An improved automatic Wallis shadow compensation method is proposed. A pixel contrast-based stretching factor is constructed to increase the effectiveness of the Wallis filter's shadow compensation (the classical filter it builds on is sketched below). The recovered images are then used for searching and matching, reducing the effect of shadows on the localization results.
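
The exact stretching factor is introduced later in the paper; as a hedged point of reference, the sketch below implements only the classical Wallis filter that the proposed method builds on, pulling the local mean and standard deviation of a shadowed aerial image toward target values. Parameter values, window size, and function names are illustrative assumptions.

```python
import numpy as np
from scipy.ndimage import uniform_filter

def wallis_filter(img, target_mean=127.0, target_std=60.0, c=0.8, b=0.9, win=31):
    """Classical Wallis filter: pull local mean/std toward target values.

    img : grayscale image as an array with values in [0, 255]
    c   : contrast expansion constant in (0, 1]
    b   : brightness coefficient in [0, 1]
    win : side length of the window used for local statistics
    """
    img = img.astype(np.float64)
    local_mean = uniform_filter(img, size=win)
    local_sq = uniform_filter(img * img, size=win)
    local_std = np.sqrt(np.maximum(local_sq - local_mean ** 2, 1e-6))

    gain = (c * target_std) / (c * local_std + (1.0 - c) * target_std)
    offset = b * target_mean + (1.0 - b) * local_mean
    out = (img - local_mean) * gain + offset
    return np.clip(out, 0, 255).astype(np.uint8)

# Example: compensate a dark (shadowed) synthetic patch before matching.
shadowed = (np.random.rand(256, 256) * 80).astype(np.uint8)
compensated = wallis_filter(shadowed)
```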


The rest of this paper is organized as follows. Section 2 describes the proposed shadow compensation pre-processing and the M-O SiamRPN framework with the weight-adaptive joint MIoU loss function. In Section 3, the effectiveness of the proposed framework is verified on a dataset constructed from aerial images acquired by a self-built UAV platform together with satellite images. The discussion and conclusions are given in Sections 4 and 5, respectively.
