1. Introduction
With the development of remote sensing technology, sensors have gradually come to cover the full wavelength range of the electromagnetic spectrum, and applications in related fields are increasing accordingly. A single satellite image can cover more than 30,000 km² of land, and drones and satellites can work together to support search and rescue [1], terrain mapping [2], agricultural monitoring [3], UAV navigation [4], and so on. Since visible-light remote sensing can only be used during clear daylight hours, which severely limits its usefulness, researchers introduced infrared and microwave remote sensing, among others [5,6]; one characteristic these share is the ability to work day and night. LiDAR is one such technology. Given its advantages in data collection and discontinuous mapping, LiDAR is being applied more and more widely in geological exploration. For example, the characteristics of rock masses can be measured with LiDAR [7,8,9], and the resulting data can be used for further rock-mass analysis. Researchers can also obtain accurate 3D information through LiDAR: in [10], the authors achieved automatic road extraction by analyzing and processing LiDAR point cloud data, pointing toward automated analysis of remote sensing data. In addition, UAVs can carry LiDAR equipment and localize themselves autonomously by scanning 3D ground data [11,12]; however, the size of LiDAR equipment generally requires a large UAV to carry it.
The diversity of remote-sensing platforms also brings great convenience to these areas: people can obtain large-scale image data through satellites and clearer local images through UAV platforms. At present, UAVs mainly rely on satellite signals for navigation and localization during flight. In practice, however, the satellite signal becomes quite weak after long-distance transmission, so the signal received by the UAV is relatively easy to jam; in the military field in particular, satellite signal loss is common [13]. Achieving autonomous localization and navigation of UAVs in denied environments is therefore becoming increasingly important. With the rapid development of computer vision, geo-localization of UAVs based on satellite images has emerged. Just like the synergy of the human eyes and brain, this approach searches for the corresponding location in a search map (satellite image) using a query (UAV image). Once the location of the query in the search map is found, the current position of the UAV can be deduced from the latitude and longitude information of the search map.
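Deducing the UAV's position from the matched pixel amounts to a linear interpolation over the satellite tile's known corner coordinates. The following minimal sketch illustrates this final step; it assumes a north-up, approximately equirectangular tile, and the function and parameter names are ours, not from the paper:

```python
def pixel_to_latlon(px, py, width, height, top_lat, left_lon, bottom_lat, right_lon):
    """Map a pixel (px, py) in a north-up satellite tile to (lat, lon).

    Assumes latitude/longitude vary linearly across the tile, which is a
    reasonable approximation for the small area a single tile covers.
    """
    lon = left_lon + (right_lon - left_lon) * (px / (width - 1))
    lat = top_lat + (bottom_lat - top_lat) * (py / (height - 1))
    return lat, lon
```

For example, the top-left pixel (0, 0) maps to the tile's top-left corner coordinates, and the bottom-right pixel to the bottom-right corner.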
To solve the problem of autonomous UAV navigation in a denied environment, previous methods have mainly relied on image retrieval, where the device is localized by matching the UAV image against each image in a satellite image database. During training, metric learning is used to continuously shorten the distance between UAV images and satellite images of the same region. Image retrieval has achieved excellent results on some datasets, but it has several problems: (1) before practical application, an image database must be prepared in advance for retrieval, and every image in the database must be passed through the model for feature extraction; (2) to achieve more accurate positioning, the database needs to cover as large an area as possible, and the query image must be compared against all images in the database, which places great storage and computational pressure on the computer; (3) when the model is updated, the corresponding database also needs to be regenerated.
In summary, image retrieval requires a large amount of preprocessing and places heavy demands on storage capacity and computing power, which poses challenges for practical applications.
How can more accurate and faster positioning be achieved? In [14], the authors proposed Finding Point with Image (FPI for short), a brand-new end-to-end benchmark. It borrows from object tracking to find the corresponding position directly in the satellite image using the UAV image. First, the features of the UAV image and the satellite image are extracted separately by a Siamese network without shared weights; then, a similarity calculation between the UAV feature map and the satellite feature map produces a heat map; finally, the maximum value in the heat map is mapped back to the satellite image, directly giving the position of the UAV. FPI provides a new idea for cross-view geo-localization. However, the features are downsampled by a factor of 16 by the feature extraction network, which introduces an irreparable error at the source. A deeper network can learn more abstract semantic information, but its smaller feature maps lose a large amount of location information, which is highly detrimental to such a low-level task. In addition, the multi-scale problem is a hot research topic in practical applications: the height of the drone and the coverage of the satellite image both change during use, so considering a single scale is not enough. In FPI, only the last layer of the feature map is used for the similarity calculation that forms the model output, which undoubtedly limits model performance.
In this paper, we propose a Weight-Adaptive Multi-Feature fusion network for UAV localization. First, the first three stages of PCPVT-S [15] replace the Deit-S [16] in FPI as the feature extraction module, extracting features from satellite images and UAV images, respectively; the pyramid structure keeps the model's feature maps at a larger size. Next, the similarity between the satellite feature map and the UAV feature maps must be calculated. In practical applications, UAV images and satellite images have different scales, and improving model performance is inseparable from solving this multi-scale problem. In our experiments, we found that similarity maps computed from feature maps of different scales attend to different information, which naturally suggested merging the different features; after experimentation, we adopted a weight-adaptive method to fuse them. We elaborate on the specific steps in Section 3.4. In addition, in the final stage of training, we use nearest-neighbor (adjacent point) interpolation to restore the final output prediction map to the same size as the input satellite image, which proves effective in improving model performance.
The following is a summary of our contributions.
We propose a new end-to-end framework called WAMF-FPI, which considers positioning as a low-level task. It alleviates the loss of location information due to the feature map being compressed. In addition, we enhance the ability of the model to solve multi-scale problems through the WAMF module.
We develop a new Hanning loss that assigns different weights to positive samples to make the model pay more attention to the center of the target region, which proves to be effective in experiments.
With RDS as the evaluation metric, the performance of the proposed model improves from 57.22 to 65.33, a gain of 8.11 points over FPI. We also evaluated the model with MA: its positioning accuracy at the 5 m, 10 m, and 20 m levels reaches 26.99%, 52.63%, and 69.73%, respectively, achieving state-of-the-art (SOTA) results on the UL14 dataset.
3. Materials and Methods
In this section, we first describe the main points and problems of the work in FPI; then, in Section 3.2, we introduce the overall structure of WAMF-FPI. Section 3.3 explains how the model extracts features from UAV and satellite images. In Section 3.4, the Weight-Adaptive Multi-Feature fusion module (WAMF) is proposed, which introduces a weighted fusion mechanism to improve the model's ability to handle multi-scale problems. Finally, in Section 3.5, we explain how Hanning loss discriminates between different positive samples.
3.1. The Previous Methods
In object tracking, researchers track a target by calculating the similarity between a template and the search region in the current frame. The method of finding points with an image borrows from object tracking, but is more difficult: its template image (the UAV image) and search image (the satellite image) come from different platforms, so there is large variability between them.
As shown in Figure 1, the method of finding points with an image takes the satellite image as the search image and the UAV image as the query image. The image captured by the UAV and the satellite image of the corresponding area are fed into an end-to-end network, whose output is a heat map; the point with the maximum value in the heat map is the location of the UAV predicted by the model. Finally, this point is mapped back to the satellite image, and the position of the UAV can be obtained from the latitude and longitude information retained by the satellite image. In FPI, the authors use two Deit-S networks without shared weights as feature extraction modules for the vertical-view UAV images and the satellite images, respectively; the extracted features then undergo a similarity calculation to obtain the heat map, and the location of its maximum value is mapped to the satellite image to determine the location of the UAV. FPI creatively proposed a new way of visual positioning for UAVs, but there is room for improvement. In FPI, only the last layer of feature maps is used for the similarity calculation; since the final output prediction map is downsampled by a factor of 16, the model loses a lot of spatial information, which irreparably harms the final positioning accuracy, and there is also room to improve FPI's handling of multi-scale problems. It is worth mentioning that, to avoid these problems, WAMF-FPI restores the final prediction map to the original satellite image size to reduce the loss of spatial information; in addition, the WAMF module and Hanning loss further improve the performance of the model.
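The core of this pipeline — correlating UAV features against satellite features and taking the argmax of the resulting heat map — can be sketched in a few lines of PyTorch. This is an illustrative simplification, not FPI's exact implementation; the function name and tensor shapes are ours:

```python
import torch
import torch.nn.functional as F

def locate(sat_feat, uav_feat):
    """Correlate a UAV feature map against a satellite feature map and
    return the (row, col) of the strongest response in the heat map.

    sat_feat: (C, Hs, Ws) satellite features; uav_feat: (C, Hu, Wu) UAV
    features. The UAV features act as a convolution kernel, as in
    tracking-style similarity search.
    """
    heat = F.conv2d(sat_feat.unsqueeze(0), uav_feat.unsqueeze(0))
    heat = heat.squeeze()                      # (Hs-Hu+1, Ws-Wu+1)
    idx = torch.argmax(heat)                   # flat index of the peak
    row, col = divmod(idx.item(), heat.shape[1])
    return row, col, heat
```

The peak position in the heat map is then scaled back to satellite-image pixel coordinates (accounting for the backbone's downsampling stride) to read off the UAV's location.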
3.2. The Framework of WAMF-FPI
In this section, we introduce the structure of WAMF-FPI. To improve the localization performance of the model, we introduce the feature pyramid structure, the WAMF module, and Hanning loss; together, these components form WAMF-FPI. As shown in Figure 2, WAMF-FPI can be roughly divided into three parts: the feature extraction module, the WAMF module, and the prediction head. In the backbone, we use two more powerful PCPVT-S networks as the feature extraction modules for the UAV image and the satellite image, respectively. To better extract multi-scale information and retain more spatial information, the initially extracted features are sent to a feature pyramid network for further feature extraction; the WAMF module is then used for similarity calculation and multi-feature fusion. Finally, the fused feature maps are upsampled to generate the final output prediction map. It is worth mentioning that in WAMF-FPI the size of the final output prediction map is the same as the size of the input satellite image. Figure 2 shows the flowchart of the whole model.
3.3. Feature Extraction Module
WAMF-FPI also adopts a Siamese-like structure, but differs from traditional object tracking: satellite-view images and UAV-view images differ greatly because they come from different devices, so the UAV-view branch and the satellite-view branch in WAMF-FPI do not share weights. WAMF-FPI takes satellite images (400 × 400 × 3) and UAV images (128 × 128 × 3) as input, and the features of the images are extracted by PCPVT-S. Specifically, we remove the last stage of PCPVT-S and use only the first three stages for feature extraction. With input sizes of 400 × 400 × 3 and 128 × 128 × 3, the two branches produce feature maps of shape 25 × 25 × 320 and 8 × 8 × 320, respectively. Unlike the Deit-S used in FPI, PCPVT-S has a pyramid design, which is better suited to dense prediction tasks; this pyramid structure lays the foundation for the subsequent WAMF module. At the same time, a pyramid-structured network effectively reduces the amount of computation and improves speed, which is very important for practical deployment.
After PCPVT-S extracts information from the image, performing the similarity calculation directly on the last feature maps would let the low resolution of the output feature map directly harm the accuracy of the final result (mapping 25 × 25 to 400 × 400). For this purpose, we use a feature pyramid structure that fuses the original feature maps through upsampling and lateral connections, so the final output is compressed only by a factor of four compared to the input; the localization bias caused by low-resolution feature maps is thus avoided at the source. The high-resolution shallow feature maps carry more spatial information, while the lateral connections fuse in the deep feature maps rich in semantic information.
WAMF-FPI first uses 1 × 1 convolutions to adjust the channel dimension of the three stages of feature maps produced by PCPVT-S; in practice, we set the number of output channels to 64. Then, the feature maps of the last two stages are upsampled, and the resulting maps are fused with the same-scale feature maps output by the backbone. Finally, the features are further refined by a 3 × 3 convolution. The fused feature maps contain both shallow, high-resolution features and deep semantic features, which is conducive to improving the localization performance of the model.
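The pyramid fusion just described — 1 × 1 lateral convolutions to 64 channels, top-down upsampling with addition, and a 3 × 3 smoothing convolution — can be sketched as follows. This is a minimal illustration under our assumptions (the stage channel counts 64/128/320 match the PCPVT-S stages described above, but other details are not the paper's exact configuration):

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class SimpleFPN(nn.Module):
    """Top-down feature pyramid over three backbone stages: 1x1 convs
    unify channels, deeper maps are upsampled and added to shallower
    ones, and a 3x3 conv smooths each fused map."""
    def __init__(self, in_channels=(64, 128, 320), out_channels=64):
        super().__init__()
        self.lateral = nn.ModuleList(nn.Conv2d(c, out_channels, 1) for c in in_channels)
        self.smooth = nn.ModuleList(nn.Conv2d(out_channels, out_channels, 3, padding=1)
                                    for _ in in_channels)

    def forward(self, c1, c2, c3):
        # c1 is the highest-resolution (shallowest) stage, c3 the deepest.
        p3 = self.lateral[2](c3)
        p2 = self.lateral[1](c2) + F.interpolate(p3, size=c2.shape[-2:], mode="nearest")
        p1 = self.lateral[0](c1) + F.interpolate(p2, size=c1.shape[-2:], mode="nearest")
        return self.smooth[0](p1), self.smooth[1](p2), self.smooth[2](p3)
```

With a 400 × 400 satellite input, the finest output map is 100 × 100, i.e., compressed by a factor of four as stated above.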
After the feature extraction module, each pixel of the feature map no longer holds an RGB value, but highly abstract image information. Taking the UAV branch as an example, the UAV image yields three feature maps, U1, U2, and U3; the information they contain differs, and their sizes are inconsistent. We then need to fuse the UAV feature maps with the satellite feature map: on the one hand, fusion establishes the connection between the two branches, and on the other, processing feature maps of different scales enhances the model's ability to deal with multi-scale problems.
3.4. Weight-Adaptive Multi-Feature Fusion Module
In this section, we introduce the WAMF module. To improve the model's ability to deal with multi-scale problems, WAMF fuses multiple features instead of using a single feature map's similarity result as the model output. Just as different people focus on different information when comparing two pictures with their eyes, the outputs obtained by correlating different feature maps also differ in the information they attend to. In addition, the UL14 dataset contains satellite images of different scales, so relying on a single prediction result is not reliable. Based on this, we fuse different features. However, simply adding the feature maps together is not reasonable; for this purpose, we introduce learnable parameters into the module and implement a weighted fusion of the different feature maps.
As shown in Figure 2, we use the feature map S3 extracted from the satellite-view branch and the feature maps U1, U2, and U3 extracted from the UAV-view branch. First, the similarity between S3 and each of U1, U2, and U3 is calculated; here, we set the padding to half the scale of U1, U2, and U3 to reduce the loss of edge information. In this way, we obtain three feature maps, A1, A2, and A3, which focus on different information. We then perform a weighted fusion of these three feature maps; it is worth mentioning that the weighting coefficients are also normalized during training.
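The steps above can be sketched in PyTorch as follows. This is our illustrative reconstruction, not the paper's exact code: the names S3/U1/U2/U3/A1/A2/A3 mirror the text, padding is half the UAV-kernel size as described, and softmax is one plausible way to realize the normalized learnable weights:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class WAMFFusion(nn.Module):
    """Weight-adaptive multi-feature fusion sketch: correlate the
    satellite map S3 with the UAV maps U1-U3, then blend the three
    response maps A1-A3 with learnable, normalized weights."""
    def __init__(self):
        super().__init__()
        self.w = nn.Parameter(torch.ones(3))   # one learnable weight per response map

    @staticmethod
    def correlate(s, u):
        # s: (1, C, Hs, Ws); u: (1, C, Hu, Wu) used as a correlation kernel.
        pad = (u.shape[-2] // 2, u.shape[-1] // 2)   # half-kernel padding keeps edges
        return F.conv2d(s, u, padding=pad)

    def forward(self, s3, u1, u2, u3):
        maps = [self.correlate(s3, u) for u in (u1, u2, u3)]     # A1, A2, A3
        # Align response-map sizes before fusing (a no-op for even kernels).
        size = maps[0].shape[-2:]
        maps = [F.interpolate(m, size=size, mode="nearest") for m in maps]
        w = torch.softmax(self.w, dim=0)        # normalized fusion coefficients
        return sum(wi * m for wi, m in zip(w, maps))
```

Note that with a 100 × 100 satellite feature map and even-sized UAV kernels (32, 16, 8), half-kernel padding makes every response map 101 × 101, matching the output size reported in Section 5.5.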
3.5. Hanning Loss
Since the final output prediction map of WAMF-FPI has the same size as the input satellite image, the labels are also set to this size during training. In the model, Center-R is used to distinguish positive from negative samples. As shown in Figure 3c, when Center-R is set to 33, the area covered by positive samples is a square: the pixel closest to the true position is the center of the square, the side length is 33, and the rest of the area is set to negative samples.
In FPI, the output prediction map is 25 × 25 and Center-R is set to 1. In their experiments, the authors found that a larger Center-R reduced the localization accuracy of the model. We believe one reason is that all positive samples were given the same weight, so the model could not distinguish the importance of different regions; logically, the center position matters much more than the edge positions. However, since the FPI prediction map is downsampled by a factor of 16 relative to the original image, a larger Center-R cannot be used in practice, and different positive samples cannot be distinguished more finely.
WAMF-FPI restores the prediction map to the size of the input satellite image (400 × 400) through the WAMF module and upsampling. To balance the ratio of positive and negative samples and reduce training difficulty, we adjusted the size of Center-R; after experiments, it is finally set to 33 in practice. As shown in Figure 3b, when Center-R is set to 33, the red box represents the area covered by positive samples. With this larger Center-R, positive samples at different positions can be given different weights in a more refined way. Based on this, we improved the calculation of the loss function and propose the Hanning loss.
First, we keep the sum of the weights of positive samples and the sum of the weights of negative samples equal. Specifically, the weight of each negative sample is first set to 1/NN, where NN is the number of negative samples, and the weights of the positive samples are assigned using the normalized Hanning window (the value at position n is denoted HN(n)); that is, the weights of the positive samples and the weights of the negative samples each sum to 1. Since the number of negative samples is much larger than that of positive samples, each negative sample receives a very small weight, so we introduce a hyperparameter Negative Weight (NG) to adjust the weight of negative samples during training; finally, we normalize all the weights. The weight of each negative sample is therefore finally NG/(NN × (1 + NG)), and the weight of a positive sample at position n is HN(n)/(1 + NG). We refer to such weights as Hanning weights, and the loss function using them is called the Hanning loss. Equation (1) gives the Hanning window function:

HN(n) = 0.5 × (1 − cos(2πn/(N − 1))), 0 ≤ n ≤ N − 1,   (1)

where N is the window length.
Figure 3d shows the weight assignment of positive samples in different regions when Center-R is set to 33. The center pixel is given the largest weight because it is the point closest to the true position. In this way, the model pays more attention to the central area of the true location, achieving more accurate positioning.
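The weight map described above can be built in a few lines of NumPy. This is our reconstruction of the scheme, under the stated assumptions (a 2D Hanning window over the positive square, negatives starting at 1/NN, rescaling by NG, and a final global normalization); function and parameter names are ours:

```python
import numpy as np

def hanning_weights(size=400, center=(200, 200), center_r=33, ng=1.0):
    """Per-pixel Hanning weight map (reconstruction of the scheme above).

    Positive samples in a center_r x center_r square get 2D Hanning
    weights summing to 1; negatives start at 1/NN each (also summing
    to 1), are rescaled by NG, and all weights are renormalized.
    """
    win = np.hanning(center_r)
    win2d = np.outer(win, win)                 # 2D Hanning window
    win2d /= win2d.sum()                       # positive weights sum to 1

    w = np.zeros((size, size))
    pos = np.zeros((size, size), dtype=bool)
    r = center_r // 2
    cy, cx = center
    pos[cy - r:cy - r + center_r, cx - r:cx - r + center_r] = True
    w[pos] = win2d.ravel()
    w[~pos] = ng / (~pos).sum()                # negatives sum to NG
    return w / w.sum()                         # renormalize all weights to sum to 1
```

The resulting map peaks at the center pixel and decays smoothly toward the edge of the positive square, matching the behavior shown in Figure 3d.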
5. Ablation Experiment
5.1. The Effect of Feature Pyramid Structure
Shallow feature maps have larger resolution and more spatial information, while deep feature maps have smaller resolution but richer semantic information. If a low-resolution feature map is used for the similarity calculation that forms the final output, a lot of spatial information is lost, biasing UAV positioning at the source. Yet both the spatial information of the shallow layers and the complex semantic information of the deep layers are important for the UAV localization task. To this end, we introduce the feature pyramid structure, which enlarges the final prediction map: after adding it, the model obtains a feature map roughly four times larger in each dimension. When mapping from the prediction map to the satellite image, the location loss due to low resolution is thus reduced at the source compared with a network without a feature pyramid, while a large amount of semantic information from the deep feature maps is retained. This information is crucial for the UAV's visual localization, which is a low-level task.
For fairness, Center-R was set to about four percent of the width of the output feature map in each experiment, and all other settings were kept consistent. The results are shown in Figure 7: the model with the pyramid structure improves localization accuracy by 2.46% at the 3 m level and by 5.18% at the 5 m level. Table 3 shows that the model achieves a 3.31-point improvement in RDS after adding the pyramid structure.
5.2. The Effect of Multi-Feature Fusion
On the one hand, satellite images show complex scenes containing a great deal of environmental information, and finding the most relevant area from the UAV image is a difficult task. On the other hand, the UL14 dataset contains satellite images of different scales, which requires the model to handle multi-scale problems well. The WAMF module was designed for exactly this.
To verify the validity of the WAMF module, we compared different methods, including results that rely on a single scale and results from fusing feature maps at different scales (for fairness, we normalize the results of multi-feature fusion). As Table 4 shows, fusing different features significantly improves performance: the multi-feature fusion approach achieves the best localization accuracy at the 3 m, 5 m, and 10 m levels, reaching 11.56%, 24.94%, and 49.18%, respectively.
To further demonstrate the effectiveness of the WAMF module, we draw heat maps based on the single feature maps A1, A2, and A3 and on the fusion of all feature maps, as shown in Figure 8. The heat maps show that after the similarity calculation between UAV feature maps of different scales and the satellite feature map (S3), the model attends to different information. Comparing the positioning accuracy of the different results, we find that multi-feature fusion effectively improves the localization performance of the model on the UL14 dataset. Surprisingly, the localization accuracy of the model using multi-feature fusion is better than any single-scale result, even when an individual single-scale localization is not particularly accurate.
5.3. The Effect of Learnable Parameters
Does treating different features differently improve model performance? To test this idea, we introduce a set of learnable parameters for the weighted fusion of different features, and the experimental results confirm it. Table 5 shows the ablation results of the WAMF module before and after introducing the learnable parameters. The model with learnable parameters performs better, with a 2.05% improvement in accuracy at the 5 m level and a 3.45% improvement at the 10 m level compared with the direct fusion approach. For a UAV in flight, the buildings, vegetation, roads, and other objects on the ground change drastically, and different objects have different scales. Multi-feature fusion is one of the key techniques for alleviating the multi-scale problem, and on this basis, weighted fusion further improves the model's performance.
5.4. The Effect of Hanning Loss
In the UAV localization task, if all positive samples are given the same weight, the model cannot distinguish which position within the region is more accurate; yet if the number of positive samples is reduced, training becomes harder. Because the final output feature map of previous methods was too small, they could not finely distinguish between different positive samples. With the feature map scaled up, WAMF-FPI sets Center-R to 33, which makes such a distinction possible: the Hanning window assigns different weights to positive samples in different regions, with samples closer to the target location receiving more weight.
As shown in Figure 9, we compare four pairs of heat maps produced with Hanning weights and with average weights, respectively. The responses of the model using Hanning weights are more concentrated, while those of the model using average weights are diffuse, which explains its imprecise positioning. As shown in Table 6, the model with Hanning loss improves localization accuracy at the 3 m, 5 m, and 10 m levels by 2.28%, 4.78%, and 5.64%, respectively, and improves RDS by 1.77 points.
5.5. The Effect of Upsampling
After the satellite and UAV images are processed by the WAMF module, the output feature map is 101 × 101. Previous methods find the pixel with the largest value in this small map and then scale its position up to the original satellite image size; although the prediction map is compressed only by a factor of four, this still loses spatial information. Therefore, WAMF-FPI applies nearest-neighbor (adjacent point) interpolation during training to restore the feature map to the size of the satellite image: the prediction map is first upsampled to the same size as the original image, and the loss calculation and forward propagation are then performed on it. Experiments show that this further improves model performance. As shown in Figure 10, compared with the model that does not restore the feature map to the original satellite image size, our model improves localization accuracy at the 3 m, 5 m, and 10 m levels by 1.81%, 2.35%, and 3.97%, respectively.
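In PyTorch terms, this restoration step is a single nearest-neighbor interpolation applied before the loss; a minimal sketch (the function name and fixed 400 × 400 target size are illustrative assumptions):

```python
import torch
import torch.nn.functional as F

def restore_prediction(pred):
    """Upsample a (N, 1, 101, 101) prediction map to the 400 x 400
    satellite-image size with nearest-neighbor ("adjacent point")
    interpolation, so the loss is computed at full resolution."""
    return F.interpolate(pred, size=(400, 400), mode="nearest")
```

At inference time, the argmax of the restored map is already in satellite-image pixel coordinates, so no separate coordinate rescaling is needed.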
7. Conclusions
In this paper, we propose a simple and effective model, the Weight-Adaptive Multi-Feature fusion network (WAMF-FPI), for UAV localization in denied environments. Our model makes full use of the feature pyramid structure and adaptively fuses image information of different scales through weighted fusion. In addition, the feature map is restored to the original size by nearest-neighbor interpolation, which alleviates the loss of location information caused by feature map compression.
Furthermore, the new Hanning loss, through the introduction of Hanning weights, allows the model to focus more on the center of the target area and thus improves localization accuracy. Our method achieves excellent results on UL14: 12.50%, 26.99%, and 52.63% localization accuracy at the 3 m, 5 m, and 10 m levels, respectively, a significant improvement over previous models. In the future, we will try to apply this model to real UAVs and use the obtained positioning data for UAV navigation.
This project treats the UAV visual localization task as a low-level task and finds the location of the UAV image in the satellite map by combining object tracking with semantic segmentation. The experimental results show the feasibility of this method of finding points with images and, to a certain extent, promote the development of UAV visual positioning and navigation technology. However, this study has some shortcomings with regard to the dataset: the current dataset covers only urban areas and lacks data from mountainous and suburban areas, which constrains the generalization of the method. We also believe there is room to improve the positioning accuracy and running speed of the model; in particular, the final similarity calculation between the UAV and satellite images takes considerable time, which hinders practical application. In the future, we will expand the dataset to adequately cover a variety of scenarios and try to apply the model to real UAVs. In an increasingly complex environment, we believe our proposed approach will contribute to cross-view geo-localization.