1. Introduction
With the development of remote sensing technology, sensors have gradually come to cover the full wavelength range of the electromagnetic spectrum, and applications in related fields are increasing accordingly. A single satellite image can cover more than 30,000 km² of land, and drones and satellites can work together to support search and rescue [1], terrain mapping [2], agricultural monitoring [3], UAV navigation [4], and so on. Since visible-light remote sensing can only be used during clear daylight hours, which severely limits its usefulness, researchers introduced infrared and microwave remote sensing, among others [5,6]; one characteristic these share is the ability to work day and night. LiDAR is one such technology. Given its advantages in data collection and discontinuous mapping, LiDAR is being applied more and more widely in geological exploration. For example, the characteristics of rock masses can be measured with LiDAR [7,8,9], and the resulting data can be used for further rock-mass analysis. Researchers can also obtain accurate 3D information through LiDAR: in [10], the authors achieved automatic road extraction by analyzing and processing LiDAR point cloud data, pointing toward automated analysis of remote sensing data. In addition, UAVs can carry LiDAR equipment and localize themselves autonomously by scanning 3D ground data [11,12]; however, the size of LiDAR equipment generally requires a large UAV to carry it.
The diversity of remote-sensing platforms also brings great convenience to these areas: people can obtain large-scale image data through satellites and clearer local images through UAV platforms. At present, UAVs mainly rely on satellite signals for navigation and localization during flight. In practice, however, the satellite signal becomes quite weak after long-distance transmission, so the signal received by the UAV is relatively easy to jam; in the military field in particular, satellite signal loss is common [13]. Achieving autonomous localization and navigation of UAVs in denied environments is therefore becoming increasingly important. With the rapid development of computer vision, geo-localization of UAVs based on satellite images has emerged. Just like the synergy of the human eyes and brain, this approach searches for the corresponding location in a search map (satellite image) using a query (UAV image). Once the location of the query in the search map is found, the current position of the UAV can be deduced from the latitude and longitude information of the search map.
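Deducing the UAV's position from the matched pixel amounts to a linear interpolation over the satellite tile's known corner coordinates. The following minimal sketch illustrates this final step; it assumes a north-up, approximately equirectangular tile, and the function and parameter names are ours, not from the paper:

```python
def pixel_to_latlon(px, py, width, height, top_lat, left_lon, bottom_lat, right_lon):
    """Map a pixel (px, py) in a north-up satellite tile to (lat, lon).

    Assumes latitude/longitude vary linearly across the tile, which is a
    reasonable approximation for the small area a single tile covers.
    """
    lon = left_lon + (right_lon - left_lon) * (px / (width - 1))
    lat = top_lat + (bottom_lat - top_lat) * (py / (height - 1))
    return lat, lon
```

For example, the top-left pixel (0, 0) maps to the tile's top-left corner coordinates, and the bottom-right pixel to the bottom-right corner.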
To solve the problem of autonomous UAV navigation in a denied environment, previous methods have mainly relied on image retrieval, where the device is localized by matching the UAV image against each image in a satellite image database. During training, metric learning is used to continuously shorten the distance between UAV images and satellite images of the same region. Image retrieval has achieved excellent results on some datasets, but it has several problems: (1) before practical application, an image database must be prepared in advance for retrieval, and every image in the database must be passed through the model for feature extraction; (2) to achieve more accurate positioning, the database needs to cover as large an area as possible, and the query image must be compared against all images in the database, which places great storage and computational pressure on the computer; (3) when the model is updated, the corresponding database also needs to be regenerated.
In summary, image retrieval requires a large amount of preprocessing and places heavy demands on storage capacity and computing power, which poses challenges for practical applications.
How can more accurate and faster positioning be achieved? In [14], the authors proposed Finding Point with Image (FPI for short), a brand-new end-to-end benchmark. It borrows from object tracking to find the corresponding position directly in the satellite image using the UAV image. First, the features of the UAV image and the satellite image are extracted separately by a Siamese network without shared weights; then, a similarity calculation between the UAV feature map and the satellite feature map produces a heat map; finally, the maximum value in the heat map is mapped back to the satellite image, directly giving the position of the UAV. FPI provides a new idea for cross-view geo-localization. However, the features are downsampled by a factor of 16 by the feature extraction network, which introduces an irreparable error at the source. A deeper network can learn more abstract semantic information, but its smaller feature maps lose a large amount of location information, which is highly detrimental to such a low-level task. In addition, the multi-scale problem is a hot research topic in practical applications: the height of the drone and the coverage of the satellite image both change during use, so considering a single scale is not enough. In FPI, only the last layer of the feature map is used for the similarity calculation that forms the model output, which undoubtedly limits model performance.
In this paper, we propose a Weight-Adaptive Multi-Feature fusion network for UAV localization. First, the first three stages of PCPVT-S [15] replace the Deit-S [16] in FPI as the feature extraction module, extracting features from satellite images and UAV images, respectively; the pyramid structure keeps the model's feature maps at a larger size. Next, the similarity between the satellite feature map and the UAV feature maps must be calculated. In practical applications, UAV images and satellite images have different scales, and improving model performance is inseparable from solving this multi-scale problem. In our experiments, we found that similarity maps computed from feature maps of different scales attend to different information, which naturally suggested merging the different features; after experimentation, we adopted a weight-adaptive method to fuse them. We elaborate on the specific steps in Section 3.4. In addition, in the final stage of training, we use nearest-neighbor (adjacent point) interpolation to restore the final output prediction map to the same size as the input satellite image, which proves effective in improving model performance.
The following is a summary of our contributions.
We propose a new end-to-end framework called WAMF-FPI, which considers positioning as a low-level task. It alleviates the loss of location information due to the feature map being compressed. In addition, we enhance the ability of the model to solve multi-scale problems through the WAMF module.
We develop a new Hanning loss that assigns different weights to positive samples to make the model pay more attention to the center of the target region, which proves to be effective in experiments.
With RDS as the evaluation metric, the performance of the proposed model improves from 57.22 to 65.33, a gain of 8.11 points over FPI. We also evaluated the model with MA: its positioning accuracy at the 5 m, 10 m, and 20 m levels reaches 26.99%, 52.63%, and 69.73%, respectively, achieving state-of-the-art (SOTA) results on the UL14 dataset.
3. Materials and Methods
In this section, we first describe the main points and problems of the work in FPI; then, in Section 3.2, we introduce the overall structure of WAMF-FPI. Section 3.3 explains how the model extracts features from UAV and satellite images. In Section 3.4, the Weight-Adaptive Multi-Feature fusion module (WAMF) is proposed, which introduces a weighted fusion mechanism to improve the model's ability to handle multi-scale problems. Finally, in Section 3.5, we explain how Hanning loss discriminates between different positive samples.
3.1. The Previous Methods
In object tracking, researchers track a target by calculating the similarity between a template and the search region in the current frame. The method of finding points with an image borrows from object tracking, but is more difficult: its template image (the UAV image) and search image (the satellite image) come from different platforms, so there is large variability between them.
As shown in Figure 1, the method of finding points with an image takes the satellite image as the search image and the UAV image as the query image. The image captured by the UAV and the satellite image of the corresponding area are fed into an end-to-end network, whose output is a heat map; the point with the maximum value in the heat map is the location of the UAV predicted by the model. Finally, this point is mapped back to the satellite image, and the position of the UAV can be obtained from the latitude and longitude information retained by the satellite image. In FPI, the authors use two Deit-S networks without shared weights as feature extraction modules for the vertical-view UAV images and the satellite images, respectively; the extracted features then undergo a similarity calculation to obtain the heat map, and the location of its maximum value is mapped to the satellite image to determine the location of the UAV. FPI creatively proposed a new way of visual positioning for UAVs, but there is room for improvement. In FPI, only the last layer of feature maps is used for the similarity calculation; since the final output prediction map is downsampled by a factor of 16, the model loses a lot of spatial information, which irreparably harms the final positioning accuracy, and there is also room to improve FPI's handling of multi-scale problems. It is worth mentioning that, to avoid these problems, WAMF-FPI restores the final prediction map to the original satellite image size to reduce the loss of spatial information; in addition, the WAMF module and Hanning loss further improve the performance of the model.
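The core of this pipeline — correlating UAV features against satellite features and taking the argmax of the resulting heat map — can be sketched in a few lines of PyTorch. This is an illustrative simplification, not FPI's exact implementation; the function name and tensor shapes are ours:

```python
import torch
import torch.nn.functional as F

def locate(sat_feat, uav_feat):
    """Correlate a UAV feature map against a satellite feature map and
    return the (row, col) of the strongest response in the heat map.

    sat_feat: (C, Hs, Ws) satellite features; uav_feat: (C, Hu, Wu) UAV
    features. The UAV features act as a convolution kernel, as in
    tracking-style similarity search.
    """
    heat = F.conv2d(sat_feat.unsqueeze(0), uav_feat.unsqueeze(0))
    heat = heat.squeeze()                      # (Hs-Hu+1, Ws-Wu+1)
    idx = torch.argmax(heat)                   # flat index of the peak
    row, col = divmod(idx.item(), heat.shape[1])
    return row, col, heat
```

The peak position in the heat map is then scaled back to satellite-image pixel coordinates (accounting for the backbone's downsampling stride) to read off the UAV's location.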
3.2. The Framework of WAMF-FPI
In this section, we introduce the structure of WAMF-FPI. To improve the localization performance of the model, we introduce the feature pyramid structure, the WAMF module, and Hanning loss; together, these components form WAMF-FPI. As shown in Figure 2, WAMF-FPI can be roughly divided into three parts: the feature extraction module, the WAMF module, and the prediction head. In the backbone, we use two more powerful PCPVT-S networks as the feature extraction modules for the UAV image and the satellite image, respectively. To better extract multi-scale information and retain more spatial information, the initially extracted features are sent to a feature pyramid network for further feature extraction; the WAMF module is then used for similarity calculation and multi-feature fusion. Finally, the fused feature maps are upsampled to generate the final output prediction map. It is worth mentioning that in WAMF-FPI the size of the final output prediction map is the same as the size of the input satellite image. Figure 2 shows the flowchart of the whole model.
3.3. Feature Extraction Module
WAMF-FPI also adopts a Siamese-like structure, but differs from traditional object tracking: satellite-view images and UAV-view images differ greatly because they come from different devices, so the UAV-view branch and the satellite-view branch in WAMF-FPI do not share weights. WAMF-FPI takes satellite images (400 × 400 × 3) and UAV images (128 × 128 × 3) as input, and the features of the images are extracted by PCPVT-S. Specifically, we remove the last stage of PCPVT-S and use only the first three stages for feature extraction. With input sizes of 400 × 400 × 3 and 128 × 128 × 3, the two branches produce feature maps of shape 25 × 25 × 320 and 8 × 8 × 320, respectively. Unlike the Deit-S used in FPI, PCPVT-S has a pyramid design, which is better suited to dense prediction tasks; this pyramid structure lays the foundation for the subsequent WAMF module. At the same time, a pyramid-structured network effectively reduces the amount of computation and improves speed, which is very important for practical deployment.
After PCPVT-S extracts information from the image, performing the similarity calculation directly on the last feature maps would let the low resolution of the output feature map directly harm the accuracy of the final result (mapping 25 × 25 to 400 × 400). For this purpose, we use a feature pyramid structure that fuses the original feature maps through upsampling and lateral connections, so the final output is compressed only by a factor of four compared to the input; the localization bias caused by low-resolution feature maps is thus avoided at the source. The high-resolution shallow feature maps carry more spatial information, while the lateral connections fuse in the deep feature maps rich in semantic information.
WAMF-FPI first uses 1 × 1 convolutions to adjust the channel dimension of the three stages of feature maps produced by PCPVT-S; in practice, we set the number of output channels to 64. Then, the feature maps of the last two stages are upsampled, and the resulting maps are fused with the same-scale feature maps output by the backbone. Finally, the features are further refined by a 3 × 3 convolution. The fused feature maps contain both shallow, high-resolution features and deep semantic features, which is conducive to improving the localization performance of the model.
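The pyramid fusion just described — 1 × 1 lateral convolutions to 64 channels, top-down upsampling with addition, and a 3 × 3 smoothing convolution — can be sketched as follows. This is a minimal illustration under our assumptions (the stage channel counts 64/128/320 match the PCPVT-S stages described above, but other details are not the paper's exact configuration):

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class SimpleFPN(nn.Module):
    """Top-down feature pyramid over three backbone stages: 1x1 convs
    unify channels, deeper maps are upsampled and added to shallower
    ones, and a 3x3 conv smooths each fused map."""
    def __init__(self, in_channels=(64, 128, 320), out_channels=64):
        super().__init__()
        self.lateral = nn.ModuleList(nn.Conv2d(c, out_channels, 1) for c in in_channels)
        self.smooth = nn.ModuleList(nn.Conv2d(out_channels, out_channels, 3, padding=1)
                                    for _ in in_channels)

    def forward(self, c1, c2, c3):
        # c1 is the highest-resolution (shallowest) stage, c3 the deepest.
        p3 = self.lateral[2](c3)
        p2 = self.lateral[1](c2) + F.interpolate(p3, size=c2.shape[-2:], mode="nearest")
        p1 = self.lateral[0](c1) + F.interpolate(p2, size=c1.shape[-2:], mode="nearest")
        return self.smooth[0](p1), self.smooth[1](p2), self.smooth[2](p3)
```

With a 400 × 400 satellite input, the finest output map is 100 × 100, i.e., compressed by a factor of four as stated above.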
After the feature extraction module, each pixel of the feature map no longer holds an RGB value, but highly abstract image information. Taking the UAV branch as an example, the UAV image yields three feature maps, U1, U2, and U3; the information they contain differs, and their sizes are inconsistent. We then need to fuse the UAV feature maps with the satellite feature map: on the one hand, fusion establishes the connection between the two branches, and on the other, processing feature maps of different scales enhances the model's ability to deal with multi-scale problems.
3.4. Weight-Adaptive Multi-Feature Fusion Module
In this section, we introduce the WAMF module. To improve the model's ability to deal with multi-scale problems, WAMF fuses multiple features instead of using a single feature map's similarity result as the model output. Just as different people focus on different information when comparing two pictures with their eyes, the outputs obtained by correlating different feature maps also differ in the information they attend to. In addition, the UL14 dataset contains satellite images of different scales, so relying on a single prediction result is not reliable. Based on this, we fuse different features. However, simply adding the feature maps together is not reasonable; for this purpose, we introduce learnable parameters into the module and implement a weighted fusion of the different feature maps.
As shown in Figure 2, we use the feature map S3 extracted from the satellite-view branch and the feature maps U1, U2, and U3 extracted from the UAV-view branch. First, the similarity between S3 and each of U1, U2, and U3 is calculated; here, we set the padding to half the scale of U1, U2, and U3 to reduce the loss of edge information. In this way, we obtain three feature maps, A1, A2, and A3, which focus on different information. We then perform a weighted fusion of these three feature maps; it is worth mentioning that the weighting coefficients are also normalized during training.
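The steps above can be sketched in PyTorch as follows. This is our illustrative reconstruction, not the paper's exact code: the names S3/U1/U2/U3/A1/A2/A3 mirror the text, padding is half the UAV-kernel size as described, and softmax is one plausible way to realize the normalized learnable weights:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class WAMFFusion(nn.Module):
    """Weight-adaptive multi-feature fusion sketch: correlate the
    satellite map S3 with the UAV maps U1-U3, then blend the three
    response maps A1-A3 with learnable, normalized weights."""
    def __init__(self):
        super().__init__()
        self.w = nn.Parameter(torch.ones(3))   # one learnable weight per response map

    @staticmethod
    def correlate(s, u):
        # s: (1, C, Hs, Ws); u: (1, C, Hu, Wu) used as a correlation kernel.
        pad = (u.shape[-2] // 2, u.shape[-1] // 2)   # half-kernel padding keeps edges
        return F.conv2d(s, u, padding=pad)

    def forward(self, s3, u1, u2, u3):
        maps = [self.correlate(s3, u) for u in (u1, u2, u3)]     # A1, A2, A3
        # Align response-map sizes before fusing (a no-op for even kernels).
        size = maps[0].shape[-2:]
        maps = [F.interpolate(m, size=size, mode="nearest") for m in maps]
        w = torch.softmax(self.w, dim=0)        # normalized fusion coefficients
        return sum(wi * m for wi, m in zip(w, maps))
```

Note that with a 100 × 100 satellite feature map and even-sized UAV kernels (32, 16, 8), half-kernel padding makes every response map 101 × 101, matching the output size reported in Section 5.5.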
3.5. Hanning Loss
Since the final output prediction map of WAMF-FPI has the same size as the input satellite image, the labels are also set to this size during training. In the model, Center-R is used to distinguish positive from negative samples. As shown in Figure 3c, when Center-R is set to 33, the area covered by positive samples is a square: the pixel closest to the true position is the center of the square, the side length is 33, and the rest of the area is set to negative samples.
In FPI, the output prediction map is 25 × 25 and Center-R is set to 1. In their experiments, the authors found that a larger Center-R reduced the localization accuracy of the model. We believe one reason is that all positive samples were given the same weight, so the model could not distinguish the importance of different regions; logically, the center position matters much more than the edge positions. However, since the FPI prediction map is downsampled by a factor of 16 relative to the original image, a larger Center-R cannot be used in practice, and different positive samples cannot be distinguished more finely.
WAMF-FPI restores the prediction map to the size of the input satellite image (400 × 400) through the WAMF module and upsampling. To balance the ratio of positive and negative samples and reduce training difficulty, we adjusted the size of Center-R; after experiments, it is finally set to 33 in practice. As shown in Figure 3b, when Center-R is set to 33, the red box represents the area covered by positive samples. With this larger Center-R, positive samples at different positions can be given different weights in a more refined way. Based on this, we improved the calculation of the loss function and propose the Hanning loss.
First, we keep the sum of the weights of positive samples and the sum of the weights of negative samples equal. Specifically, the weight of each negative sample is first set to 1/NN, where NN is the number of negative samples, and the weights of the positive samples are assigned using the normalized Hanning window (the value at position n is denoted HN(n)); that is, the weights of the positive samples and the weights of the negative samples each sum to 1. Since the number of negative samples is much larger than that of positive samples, each negative sample receives a very small weight, so we introduce a hyperparameter Negative Weight (NG) to adjust the weight of negative samples during training; finally, we normalize all the weights. The weight of each negative sample is therefore finally NG/(NN × (1 + NG)), and the weight of a positive sample at position n is HN(n)/(1 + NG). We refer to such weights as Hanning weights, and the loss function using them is called the Hanning loss. Equation (1) gives the Hanning window function:

HN(n) = 0.5 × (1 − cos(2πn/(N − 1))), 0 ≤ n ≤ N − 1,   (1)

where N is the window length.
Figure 3d shows the weight assignment of positive samples in different regions when Center-R is set to 33. The center pixel is given the largest weight because it is the point closest to the true position. In this way, the model pays more attention to the central area of the true location, achieving more accurate positioning.
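The weight map described above can be built in a few lines of NumPy. This is our reconstruction of the scheme, under the stated assumptions (a 2D Hanning window over the positive square, negatives starting at 1/NN, rescaling by NG, and a final global normalization); function and parameter names are ours:

```python
import numpy as np

def hanning_weights(size=400, center=(200, 200), center_r=33, ng=1.0):
    """Per-pixel Hanning weight map (reconstruction of the scheme above).

    Positive samples in a center_r x center_r square get 2D Hanning
    weights summing to 1; negatives start at 1/NN each (also summing
    to 1), are rescaled by NG, and all weights are renormalized.
    """
    win = np.hanning(center_r)
    win2d = np.outer(win, win)                 # 2D Hanning window
    win2d /= win2d.sum()                       # positive weights sum to 1

    w = np.zeros((size, size))
    pos = np.zeros((size, size), dtype=bool)
    r = center_r // 2
    cy, cx = center
    pos[cy - r:cy - r + center_r, cx - r:cx - r + center_r] = True
    w[pos] = win2d.ravel()
    w[~pos] = ng / (~pos).sum()                # negatives sum to NG
    return w / w.sum()                         # renormalize all weights to sum to 1
```

The resulting map peaks at the center pixel and decays smoothly toward the edge of the positive square, matching the behavior shown in Figure 3d.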
5. Ablation Experiment
5.1. The Effect of Feature Pyramid Structure
Shallow feature maps have larger resolution and more spatial information, while deep feature maps have smaller resolution but richer semantic information. If a low-resolution feature map is used for the similarity calculation that forms the final output, a lot of spatial information is lost, biasing UAV positioning at the source. Yet both the spatial information of the shallow layers and the complex semantic information of the deep layers are important for the UAV localization task. To this end, we introduce the feature pyramid structure, which enlarges the final prediction map: after adding it, the model obtains a feature map roughly four times larger in each dimension. When mapping from the prediction map to the satellite image, the location loss due to low resolution is thus reduced at the source compared with a network without a feature pyramid, while a large amount of semantic information from the deep feature maps is retained. This information is crucial for the UAV's visual localization, which is a low-level task.
For fairness, Center-R was set to about four percent of the width of the output feature map in each experiment, and all other settings were kept consistent. The results are shown in Figure 7: the model with the pyramid structure improves localization accuracy by 2.46% at the 3 m level and by 5.18% at the 5 m level. Table 3 shows that the model achieves a 3.31-point improvement in RDS after adding the pyramid structure.
5.2. The Effect of Multi-Feature Fusion
On the one hand, satellite images show complex scenes containing a great deal of environmental information, and finding the most relevant area from the UAV image is a difficult task. On the other hand, the UL14 dataset contains satellite images of different scales, which requires the model to handle multi-scale problems well. The WAMF module was designed for exactly this.
To verify the validity of the WAMF module, we compared different methods, including results that rely on a single scale and results from fusing feature maps at different scales (for fairness, we normalize the results of multi-feature fusion). As Table 4 shows, fusing different features significantly improves performance: the multi-feature fusion approach achieves the best localization accuracy at the 3 m, 5 m, and 10 m levels, reaching 11.56%, 24.94%, and 49.18%, respectively.
To further demonstrate the effectiveness of the WAMF module, we draw heat maps based on the single feature maps A1, A2, and A3 and on the fusion of all feature maps, as shown in Figure 8. The heat maps show that after the similarity calculation between UAV feature maps of different scales and the satellite feature map (S3), the model attends to different information. Comparing the positioning accuracy of the different results, we find that multi-feature fusion effectively improves the localization performance of the model on the UL14 dataset. Surprisingly, the localization accuracy of the model using multi-feature fusion is better than any single-scale result, even when an individual single-scale localization is not particularly accurate.
5.3. The Effect of Learnable Parameters
Does treating different features differently improve model performance? To test this idea, we introduce a set of learnable parameters for the weighted fusion of different features, and the experimental results confirm it. Table 5 shows the ablation results of the WAMF module before and after introducing the learnable parameters. The model with learnable parameters performs better, with a 2.05% improvement in accuracy at the 5 m level and a 3.45% improvement at the 10 m level compared with the direct fusion approach. For a UAV in flight, the buildings, vegetation, roads, and other objects on the ground change drastically, and different objects have different scales. Multi-feature fusion is one of the key techniques for alleviating the multi-scale problem, and on this basis, weighted fusion further improves the model's performance.
5.4. The Effect of Hanning Loss
In the UAV localization task, if all positive samples are given the same weight, the model cannot distinguish which position within the region is more accurate; yet if the number of positive samples is reduced, training becomes harder. Because the final output feature map of previous methods was too small, they could not finely distinguish between different positive samples. With the feature map scaled up, WAMF-FPI sets Center-R to 33, which makes such a distinction possible: the Hanning window assigns different weights to positive samples in different regions, with samples closer to the target location receiving more weight.
As shown in Figure 9, we compare four pairs of heat maps produced with Hanning weights and with average weights, respectively. The responses of the model using Hanning weights are more concentrated, while those of the model using average weights are diffuse, which explains its imprecise positioning. As shown in Table 6, the model with Hanning loss improves localization accuracy at the 3 m, 5 m, and 10 m levels by 2.28%, 4.78%, and 5.64%, respectively, and improves RDS by 1.77 points.
5.5. The Effect of Upsampling
After the satellite and UAV images are processed by the WAMF module, the output feature map is 101 × 101. Previous methods find the pixel with the largest value in this small map and then scale its position up to the original satellite image size; although the prediction map is compressed only by a factor of four, this still loses spatial information. Therefore, WAMF-FPI applies nearest-neighbor (adjacent point) interpolation during training to restore the feature map to the size of the satellite image: the prediction map is first upsampled to the same size as the original image, and the loss calculation and forward propagation are then performed on it. Experiments show that this further improves model performance. As shown in Figure 10, compared with the model that does not restore the feature map to the original satellite image size, our model improves localization accuracy at the 3 m, 5 m, and 10 m levels by 1.81%, 2.35%, and 3.97%, respectively.
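In PyTorch terms, this restoration step is a single nearest-neighbor interpolation applied before the loss; a minimal sketch (the function name and fixed 400 × 400 target size are illustrative assumptions):

```python
import torch
import torch.nn.functional as F

def restore_prediction(pred):
    """Upsample a (N, 1, 101, 101) prediction map to the 400 x 400
    satellite-image size with nearest-neighbor ("adjacent point")
    interpolation, so the loss is computed at full resolution."""
    return F.interpolate(pred, size=(400, 400), mode="nearest")
```

At inference time, the argmax of the restored map is already in satellite-image pixel coordinates, so no separate coordinate rescaling is needed.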
7. Conclusions
In this paper, we propose a simple and effective model, the Weight-Adaptive Multi-Feature fusion network (WAMF-FPI), for UAV localization in denied environments. Our model makes full use of the feature pyramid structure and adaptively fuses image information of different scales through weighted fusion. In addition, the feature map is restored to the original size by nearest-neighbor interpolation, which alleviates the loss of location information caused by feature map compression.
Furthermore, the new Hanning loss, through the introduction of Hanning weights, allows the model to focus more on the center of the target area and thus improves localization accuracy. Our method achieves excellent results on UL14: 12.50%, 26.99%, and 52.63% localization accuracy at the 3 m, 5 m, and 10 m levels, respectively, a significant improvement over previous models. In the future, we will try to apply this model to real UAVs and use the obtained positioning data for UAV navigation.
This project treats the UAV visual localization task as a low-level task and finds the location of the UAV image in the satellite map by combining object tracking with semantic segmentation. The experimental results show the feasibility of this method of finding points with images and, to a certain extent, promote the development of UAV visual positioning and navigation technology. However, this study has some shortcomings with regard to the dataset: the current dataset covers only urban areas and lacks data from mountainous and suburban areas, which constrains the generalization of the method. We also believe there is room to improve the positioning accuracy and running speed of the model; in particular, the final similarity calculation between the UAV and satellite images takes considerable time, which hinders practical application. In the future, we will expand the dataset to adequately cover a variety of scenarios and try to apply the model to real UAVs. In an increasingly complex environment, we believe our proposed approach will contribute to cross-view geo-localization.