In this section, we first provide a detailed discussion of our proposed IML-Net. Subsequently, we illustrate the methodology and principles behind constructing cross-domain descriptors using the IRN. Finally, we present the architecture of the MUML module for building augmented training data.
3.1. IML-Net
The key focus and challenge of cross-view geo-localization lie in extracting features of the target building that are as complete as possible from drone-view images taken at different angles. Based on this, as illustrated in Figure 2, we propose the IML-Net to better accomplish cross-view geo-localization tasks.
The framework is primarily composed of the IRN and the MUML module, which together help the model achieve better cross-domain matching between multi-angle drone images and satellite images. Detailed introductions to these two modules follow.
The IRN. The IRN consists of a 2D encoder and a decoder composed of a deconvolutional network. As our research focuses on the multi-domain image-matching problem for geographic localization, the input images are captured from drone perspectives at different angles. The images reconstructed by the IRN from these multi-angle inputs are compared with the satellite images of the corresponding buildings, and a reconstruction loss is computed. This process is used to construct the feature descriptors extracted by the 2D encoder.
Our task is to construct a more robust and distinctive feature descriptor from input drone-view images taken at different angles. Such a descriptor can be obtained by increasing the similarity between the reconstructed images and the corresponding satellite views of the target buildings. In short, when the reconstructed images closely resemble the satellite images captured from the vertical perspective, the feature descriptors extracted from the original images encompass more comprehensive details. The IRN is influenced by the multi-layer deconvolution network proposed by Zeiler et al. [46], which reconstructs the original images through the reverse process of the backbone network. In the deconvolution step, the transpose of the convolution kernels and the reverse feature-map calculation are used to return to the previous layer. For the un-pooling step, we adopt an approximate method: the value at the coordinate of the maximum activation recorded during pooling is kept, and all other values are set to 0. Regarding the activation function, the ReLU function ensures that the activation values of each layer's output are positive; therefore, in the reverse process we also use the ReLU function to guarantee that the feature maps of each layer are positive. Our current backbone network is ResNet50. Because ResNet50 contains residual blocks, a completely symmetrical reverse process cannot be achieved, but the image reconstruction strategy still yields good results.
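To make the reverse process concrete, the following is a minimal PyTorch sketch of such a deconvolutional decoder; the layer widths, kernel sizes, and number of upsampling stages are illustrative assumptions rather than the exact IRN configuration.

```python
import torch
import torch.nn as nn

class DeconvDecoder(nn.Module):
    """Illustrative reverse path: transposed convolutions progressively upsample
    the encoder features back toward image resolution, and ReLU keeps every
    intermediate feature map positive, mirroring the forward activations."""

    def __init__(self, channels=(2048, 1024, 512, 256, 64)):
        super().__init__()
        layers = []
        for c_in, c_out in zip(channels[:-1], channels[1:]):
            layers += [
                nn.ConvTranspose2d(c_in, c_out, kernel_size=3, stride=2,
                                   padding=1, output_padding=1),
                nn.ReLU(inplace=True),
            ]
        # final transposed convolution maps back to 3 RGB channels
        layers.append(nn.ConvTranspose2d(channels[-1], 3, kernel_size=3, stride=2,
                                         padding=1, output_padding=1))
        self.decoder = nn.Sequential(*layers)

    def forward(self, feat):
        # feat: deep feature map produced by the 2D encoder (e.g. a ResNet50 stage)
        return torch.sigmoid(self.decoder(feat))  # reconstructed image in [0, 1]
```

An analogous max-unpooling step (e.g. nn.MaxUnpool2d fed with the argmax indices recorded during pooling) would realize the approximate un-pooling described above.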
The MUML module. Modern CNNs typically focus on extracting deep-level image features. However, we contend that the shallow-level features extracted by the network are equally important; by facilitating mutual learning between shallow-level and deep-level features, the network can acquire more valuable information. Meanwhile, a small training dataset makes it difficult to build a robust model, so we employ the MUML to leverage feature information at diverse depths as augmented data during training. To this end, we divide the backbone network into four units that categorize the feature layers of the MUML. The input image passes through the different units, generating feature maps of varying sizes. At each step (step i), regions of interest are identified in these feature maps and cropped from the original image; the cropped images are then reintroduced into training as augmented data. In principle, through the MUML module, features at diverse depths of the network learn from each other, constructing attention maps, capturing more noteworthy details, and thereby expanding the training data. This enhances the training process and copes with the challenges of limited training data, helping to build a more robust model.
Through these two modules, key details of the target buildings can be extracted effectively while the training data are expanded, which leads to good matching results.
Our final training loss consists of the reconstruction loss and the distance regularization constraint loss:
Reconstruction Loss. The reconstruction loss, defined by the mean squared error, represents the loss of the autoencoder network. Specifically, it is the mean squared error between the reconstructed multi-angle drone images and the corresponding target satellite images, formulated as follows:

$$\mathcal{L}_{rec} = \frac{1}{N}\sum_{i=1}^{N}\left(\hat{x}_i - y_i\right)^2,$$

where $\hat{x}_i$ and $y_i$ represent the $i$-th pixel in the reconstructed multi-angle drone image and the corresponding target satellite image, respectively, and $N$ is the number of pixels.
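For reference, a minimal PyTorch expression of this loss (assuming the reconstructed image and the target satellite image are tensors of identical shape) could be:

```python
import torch
import torch.nn.functional as F

def reconstruction_loss(reconstructed: torch.Tensor,
                        satellite: torch.Tensor) -> torch.Tensor:
    """Pixel-wise mean squared error between the reconstructed multi-angle
    drone image and its corresponding target satellite image."""
    return F.mse_loss(reconstructed, satellite)
```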
Distance Regularization Constraint Loss (DRCL). To enhance the similarity between target buildings in different domains, i.e., ensure that the multi-angle drone images and their corresponding target satellite images have similar embeddings, most researchers have employed the Triplet Loss function. This type of loss function minimizes the distance between positive samples and the anchor sample while maximizing the distance between negative samples and the anchor sample.
For incorrectly matched noisy samples, when different target buildings are erroneously treated as the same target building, the optimization objective of the Triplet Loss forces the model to learn an infinitely small distance between them. This can lead to overfitting to the noisy samples, resulting in deteriorated matching performance. To address this issue, inspired by [51], we introduce $L_2$ regularization and apply distance regularization constraints to optimize the Triplet Loss. We normalize the Triplet Loss feature descriptors under the $L_2$ norm so that they lie on a hypersphere with a fixed radius; the aim is to prevent the distance between positive samples and the anchor sample from being driven infinitely small and, likewise, to avoid the distance between negative samples and the anchor sample being maximized without bound, as illustrated in Figure 3:
The DRCL is expressed by the following formula:

$$\mathcal{L}_{DRCL} = \max\big(d(a, p) - d(a, n) + \alpha,\ 0\big),$$

where $\alpha$ is the margin, $d(\cdot,\cdot)$ is the distance function, and $d(a, p)$ and $d(a, n)$ are the distances from the anchor sample to the positive sample and to the hardest negative sample, respectively.
Training Loss. Our overall training loss is a weighted sum of the reconstruction loss and the DRCL, with each term multiplied by its own weight.
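A minimal sketch of how these two terms might be computed and combined in PyTorch follows; the hypersphere radius, margin, and loss weights are illustrative placeholders rather than the settings used in the paper.

```python
import torch
import torch.nn.functional as F

def drcl(anchor, positive, negative, margin=0.3, radius=1.0):
    """Triplet loss with a distance regularization constraint: descriptors are
    L2-normalized onto a hypersphere of fixed radius, so the anchor-positive
    distance cannot collapse to zero and the anchor-negative distance cannot
    grow without bound."""
    a = radius * F.normalize(anchor, dim=-1)
    p = radius * F.normalize(positive, dim=-1)
    n = radius * F.normalize(negative, dim=-1)
    d_ap = torch.norm(a - p, dim=-1)  # distance to the positive sample
    d_an = torch.norm(a - n, dim=-1)  # distance to the hardest negative sample
    return F.relu(d_ap - d_an + margin).mean()

def training_loss(rec_loss, drcl_loss, w_rec=1.0, w_drcl=1.0):
    """Overall loss: reconstruction loss and DRCL weighted by separate factors."""
    return w_rec * rec_loss + w_drcl * drcl_loss
```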
3.2. IRN for Cross-Domain Descriptors
In the context of multi-angle drone-view images, where the loss of building details is particularly severe, our main objective is to enable the model to learn and construct a more robust feature descriptor. This feature descriptor should contain more detailed information and be able to find the mapping relationship between different views of remote sensing images. Inspired by the deconvolutional network [
46], we believe that reconstructing the original input image with a deconvolutional network and calculating the reconstruction loss can help build descriptors that are more favorable for matching. The reconstruction is performed from these feature descriptors through the deconvolutional network; as the reconstructed image features become closer to the target image features, the feature descriptor contains more comprehensive target information.
The following introduces how the mapping relationship between cross-domain data and feature descriptors is constructed.
Let $X = \{x_1, x_2, \ldots, x_K\}$ be a set of multi-domain images of the same target location, where each $x_k$ is a color image block of size $W \times H$, represented in the traditional RGB color space; each point is represented by its coordinates $(u, v)$ and its RGB color. The goal of learning cross-domain descriptors is to find multiple mappings $f_k: X_k \rightarrow Z$, $k = 1, \ldots, K$, which map the data spaces of the different domains to a shared latent space $Z$. Here, $Z$ contains the common features of the different domains, ensuring that, for each set of corresponding relationships between different domains, their mappings are as similar as possible. Mathematically, given the distance function $d(\cdot,\cdot)$ and descriptors $z_i = f_i(x_i)$ and $z_j = f_j(x_j)$, where $z_i, z_j \in Z$, if $x_i$ and $x_j$ represent the same underlying geometry, then

$$d(z_i, z_j) \le \varepsilon,$$

where $\varepsilon$ is a pre-defined margin.
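As an illustration of this constraint, the sketch below maps a drone patch and a satellite patch into the shared latent space with two hypothetical per-domain encoders and checks whether their descriptor distance stays within the margin; the encoder modules and the margin value are placeholders, not components defined by the paper.

```python
import torch
import torch.nn as nn

def same_geometry(drone_patch: torch.Tensor, satellite_patch: torch.Tensor,
                  f_drone: nn.Module, f_satellite: nn.Module,
                  margin: float = 0.5) -> torch.Tensor:
    """Map each domain into the shared latent space Z and test whether the
    descriptor distance d(z_i, z_j) falls within the pre-defined margin."""
    z_i = f_drone(drone_patch)          # mapping from the drone-view domain
    z_j = f_satellite(satellite_patch)  # mapping from the satellite-view domain
    distance = torch.norm(z_i - z_j, dim=-1)
    return distance <= margin           # True where both depict the same geometry
```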
In addition to constructing the mapping relationship between data in different domains and descriptors, we also aim to learn the inverse functions $f_k^{-1}: Z \rightarrow X_k$. Since these inverse mappings can reconstruct data from descriptors, they prove beneficial in downstream applications, such as visualizing features extracted from diverse depths of the network. In this paper, we utilize the learned cross-domain feature descriptors to reconstruct the original images. Reconstruction is achieved by minimizing the reconstruction loss between the original images and the reconstructed images, serving the downstream task of cross-view geo-localization.
3.3. The Method of MUML for Constructing Augmented Training Data
Modern CNN architectures are typically composed of units [50,51,52], where a unit refers to a set of layers operating on feature maps with identical spatial dimensions. As depicted in Figure 4, we use units to divide the feature layers at diverse depths. The spatial dimensions of the feature maps progressively decrease from the shallow layers to the deep stages. As an illustration, the ResNet50 layers (excluding the fully connected classifier) are organized into four distinct units. Given an input image of size 256 × 256, the spatial dimensions of the output feature maps of the four units are 128 × 128, 64 × 64, 32 × 32, and 16 × 16, respectively. After generating discriminative regions through Class Activation Maps (CAMs) [52] on these feature maps of different sizes, attention maps are produced through bilinear sampling. By normalizing the attention maps of the four units, we can identify and crop the discriminative regions in the original image. These regions are then used as augmented data in the training process. The specific principles are as follows:
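The split below, built on torchvision's ResNet50, is one illustrative grouping of layers into four units that reproduces the feature-map sizes quoted above; the exact unit boundaries used by IML-Net follow Figure 4 and may differ from this grouping.

```python
import torch
import torch.nn as nn
from torchvision.models import resnet50

backbone = resnet50(weights=None)

# One possible grouping of the ResNet50 layers into four units; for a
# 256 x 256 input the unit outputs have spatial sizes 128, 64, 32 and 16.
units = nn.ModuleList([
    nn.Sequential(backbone.conv1, backbone.bn1, backbone.relu),  # -> 64   x 128 x 128
    nn.Sequential(backbone.maxpool, backbone.layer1),            # -> 256  x  64 x  64
    backbone.layer2,                                             # -> 512  x  32 x  32
    backbone.layer3,                                             # -> 1024 x  16 x  16
])

x = torch.randn(1, 3, 256, 256)
for unit in units:
    x = unit(x)
    print(tuple(x.shape))  # intermediate feature maps consumed by the MUML module
```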
Let $B$ denote the backbone of a convolutional neural network, which can be any CNN developed to date, such as ResNet50, ResNeXt, etc. $B$ has $M$ layers, where $l_1, l_2, \ldots, l_M$ represent the layers of $B$ from shallow to deep. $F_1, F_2, \ldots, F_S$ are features at diverse depths of the network, each built from a subset of these layers. Specifically, each feature $F_s$ is composed of the outputs of the layers from $l_1$ up to a certain layer $l_{m_s}$, where $m_s \le M$. The features $F_1, \ldots, F_S$ gradually cover the layers of $B$ from shallow to deep, and the deepest feature $F_S$ covers all the layers from $l_1$ to $l_M$.
Let $A_1, A_2, \ldots, A_S$ represent the intermediate feature maps generated by $B$ for the features $F_1, F_2, \ldots, F_S$, respectively, at diverse depths of the network, with $A_s \in \mathbb{R}^{H_s \times W_s \times C_s}$, where $H_s$, $W_s$, and $C_s$ represent the height, width, and number of channels of the feature map, respectively. We use a set of functions $g_1, g_2, \ldots, g_S$ to ensure the reliability of the process of generating the feature maps $A_s$. The function $g_s$ used to generate the feature map $A_s$ is defined as follows:

$$g_s(\cdot) = \mathrm{ELU}\big(\mathrm{BN}\big(\mathrm{Conv}_{3 \times 3 \times C_{in} \times C_{out}}(\cdot)\big)\big),$$

where $3 \times 3$ refers to the spatial dimension of the kernel, $C_{in}$ is the number of input channels, and $C_{out}$ is the number of output channels. $\mathrm{BN}$ represents the batch normalization operation, $\mathrm{ELU}$ represents the ELU operation, and $\mathrm{Conv}$ represents the convolution operation with different kernel sizes; for example, $\mathrm{Conv}_{3 \times 3}$ represents a two-dimensional convolution with a kernel size of $3 \times 3$. The method based on CAM can then be used to identify the discriminative regions of the image, denoted as $R_1, \ldots, R_S$.
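A possible PyTorch realization of such a function follows; the 3 × 3 kernel, batch normalization, and ELU come from the description above, but the exact composition and ordering are an assumption.

```python
import torch.nn as nn

def make_g(c_in: int, c_out: int) -> nn.Sequential:
    """Conv(3x3) -> BatchNorm -> ELU block used to generate a reliable
    intermediate feature map from a unit's output."""
    return nn.Sequential(
        nn.Conv2d(c_in, c_out, kernel_size=3, padding=1, bias=False),
        nn.BatchNorm2d(c_out),
        nn.ELU(inplace=True),
    )
```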
We define the discriminative region $R_s$ generated by the CAM method for the feature $F_s$ at a given depth of the network as follows:

$$R_s(i, j) = \sum_{c=1}^{C_s} w_c^s \, A_s^c(i, j),$$

where $A_s^c$ is the $c$-th channel of $A_s$, $w_c^s$ is the weight assigned to channel $c$, the coordinates $(i, j)$ represent the spatial positions of $A_s$ and $R_s$, and $R_s(i, j)$ explains the importance of the spatial position $(i, j)$.
At the same time, we further elaborate on the CAM, which is essentially a linear weighted combination of visual motifs occurring at various spatial positions. These visual patterns are obtained by activating every unit within the intermediate feature map $A_s$, contributing discriminative regions for image recognition. By up-sampling the CAM to regions consistent with the size of the input images, we can identify the most discriminative regions in the image at diverse depths of the feature layers.
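A compact sketch of this weighted combination follows; the classifier weights fc_weights and the class index are hypothetical inputs standing in for how CAM [52] obtains its channel weights.

```python
import torch

def class_activation_map(feature_map: torch.Tensor,
                         fc_weights: torch.Tensor,
                         class_idx: int) -> torch.Tensor:
    """Linearly combine the channels of an intermediate feature map to score
    the importance of every spatial position (i, j).

    feature_map: (C, H_s, W_s) output of one unit
    fc_weights:  (num_classes, C) weights of a linear classifier head
    """
    weights = fc_weights[class_idx]                         # (C,)
    return torch.einsum('c,chw->hw', weights, feature_map)  # R_s(i, j)
```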
Therefore, after obtaining $R_s$, we resample $R_s$ with a bilinear sampling kernel to generate an attention map $T_s \in \mathbb{R}^{H \times W}$, where $H$ and $W$ are the height and width of the input image. Subsequently, we apply min–max normalization to $T_s$, and every spatial element of the normalized attention map $\tilde{T}_s$ is defined as follows:

$$\tilde{T}_s(i, j) = \frac{T_s(i, j) - \min(T_s)}{\max(T_s) - \min(T_s)}.$$
We use the normalized attention map $\tilde{T}_s$ to find and crop the discriminative regions of the features, providing guidance for the matching task. First, we set the elements of $\tilde{T}_s$ exceeding a threshold $\theta$ to 1 and the remaining elements to 0, generating a mask $M_s$. In summary, each spatial element of this mask is given by the following equation:

$$M_s(i, j) = \begin{cases} 1, & \tilde{T}_s(i, j) > \theta, \\ 0, & \text{otherwise}. \end{cases}$$
Similar to the mutual learning mechanism used for fine-grained visual classification [53,54], we locate a bounding box that covers the region of interest highlighted by the mask $M_s$ and crop this region from the input image. Subsequently, we resize the cropped region to the dimensions of the input image through up-sampling. The attention region obtained in this way is treated as additional data introduced during the training process.
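Putting the last few steps together, the sketch below resamples a discriminative-region map to the input resolution, normalizes and thresholds it, and crops and resizes the highlighted region; the threshold value and interpolation settings are illustrative choices, not the paper's exact configuration.

```python
import torch
import torch.nn.functional as F

def attention_crop(image: torch.Tensor, region_map: torch.Tensor,
                   threshold: float = 0.5) -> torch.Tensor:
    """image: (3, H, W) input image; region_map: (H_s, W_s) CAM-based map R_s.
    Returns the cropped attention region, resized back to (3, H, W), to be
    used as augmented training data. Assumes the mask is non-empty."""
    H, W = image.shape[1:]

    # bilinear resampling of R_s to the input resolution -> attention map T_s
    att = F.interpolate(region_map[None, None], size=(H, W),
                        mode='bilinear', align_corners=False)[0, 0]

    # min-max normalization of the attention map
    att = (att - att.min()) / (att.max() - att.min() + 1e-8)

    # binary mask M_s: 1 where the normalized attention exceeds the threshold
    mask = att > threshold

    # bounding box covering the region of interest highlighted by the mask
    ys, xs = torch.nonzero(mask, as_tuple=True)
    y0, y1 = ys.min().item(), ys.max().item() + 1
    x0, x1 = xs.min().item(), xs.max().item() + 1

    # crop the region from the input image and resize it back to (H, W)
    crop = image[:, y0:y1, x0:x1]
    return F.interpolate(crop[None], size=(H, W),
                         mode='bilinear', align_corners=False)[0]
```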
This process not only assists training and helps the network extract more robust feature descriptors, but also expands the training data.