Article

Multi-Scale Feature Fusion Based on PVTv2 for Deep Hash Remote Sensing Image Retrieval

1 Key Laboratory of Mine Environmental Monitoring and Improving around Poyang Lake of Ministry of Natural Resources, East China University of Technology, Nanchang 330013, China
2 School of Surveying and Geoinformation Engineering, East China University of Technology, Nanchang 330013, China
3 Geographic Information Engineering Brigade of Geological Bureau of Jiangxi Provincial, Nanchang 330002, China
* Author to whom correspondence should be addressed.
Remote Sens. 2023, 15(19), 4729; https://doi.org/10.3390/rs15194729
Submission received: 10 July 2023 / Revised: 20 September 2023 / Accepted: 22 September 2023 / Published: 27 September 2023

Abstract

For high-resolution remote sensing image retrieval tasks, single-scale features cannot fully express the complexity of the image information. Due to the large volume of remote sensing images, retrieval requires extensive memory and time. Hence, the problem of how to organically fuse multi-scale features and enhance retrieval efficiency is yet to be resolved. We propose an end-to-end deep hash remote sensing image retrieval model (PVTA_MSF) by fusing multi-scale features based on the Pyramid Vision Transformer network (PVTv2). We construct a multi-scale feature fusion module (MSF) that uses a global attention mechanism and a multi-head self-attention mechanism to reduce background interference and enhance the representation capability of image features. Deformable convolution is introduced to address the challenge posed by varying target orientations. Moreover, an intra-class similarity (ICS) loss is proposed to enhance the discriminative capability of the hash feature by minimizing the distance among images of the same category. The experimental results show that, compared with other state-of-the-art methods, the proposed hash feature yields an excellent representation of remote sensing images and improves retrieval accuracy, with mAP gains of 4.2% and 1.6% on the UC Merced and NWPU-RESISC45 datasets, respectively.

Graphical Abstract

1. Introduction

With the rapid advancement of Earth observation technology, the number of remote sensing satellites has increased significantly, resulting in rapid growth in the volume of remote sensing images [1]. Effectively locating and retrieving the desired remote sensing images from massive databases, as well as efficiently managing and utilizing the remote sensing image data, pose formidable challenges [2]. Remote sensing image retrieval (RSIR) aims to retrieve the required remote sensing images accurately and efficiently from extensive databases and can be categorized into text-based RSIR and content-based RSIR (CBRSIR) [3]. Text-based RSIR retrieves tagged images from the remote sensing database based on query keywords or labels, but it requires extensive manual annotation of each image in the dataset during the initial phase. CBRSIR, on the other hand, performs image retrieval by searching for images in the database that closely resemble the query image. This approach aligns closely with human visual perception and is currently the dominant retrieval method. However, because remote sensing images contain complex scenes and rich background information, extracting effective retrieval features and accurately measuring feature similarity remain open problems.
CBRSIR comprises three main components: feature extraction, feature dimensionality reduction, and similarity calculation. Early CBRSIR features were limited to basic patterns in the images, such as lines, shapes, and textures. These manually designed features are known as low-level features; typical examples include SIFT [4], LBP [5], and HOG [6]. Low-level features describe local image content and are aggregated into mid-level features using descriptor aggregation techniques such as BoW [7], VLAD [8], FK [9], and EMK [10]. With the development of deep learning and its introduction to image retrieval, convolutional neural networks (CNNs) [11] are typically used as feature extractors to obtain abstract features of remote sensing images [12], referred to as high-level features. However, the high dimensionality of these deep features leads to high computational costs and storage requirements. Therefore, dimensionality reduction techniques are necessary to improve retrieval speed and minimize memory usage. Various studies have shown that encoding or pooling methods can achieve this reduction. One such technique is hashing, which produces binary hash codes through coding, significantly reducing retrieval time and memory use.
The primary challenge in CBRSIR is the vast area covered by remote sensing images, which depict multiple object categories and complex background information. Retrieval accuracy is affected by the high similarity between images of different categories, significant differences between images of the same category, and the diversity of target orientations. Remote sensing images can be represented from various perspectives using features at different scales. Multi-scale feature fusion methods have been applied in multiple domains [13,14,15], such as hyperspectral image classification [14] and pedestrian detection [15], and have demonstrated significant effectiveness. Inspired by this, some studies in CBRSIR use feature fusion techniques to overcome the limitations of single-feature expression capability by extracting multiple features from the same or different models. Nonetheless, in these methods, the feature fusion and feature extraction processes are usually separated, making it difficult to uniformly learn features at varying scales and perform end-to-end multi-feature fusion.
Recently, Transformer models have gained significant attention in the field of computer vision. Dosovitskiy et al. [16] proposed the Vision Transformer (ViT), which employs a pure Transformer-based approach and is suitable for image classification tasks. After being trained on large datasets, ViT outperformed traditional convolutional neural network (CNN) models and demonstrated stronger generalization capability. However, ViT only generates feature maps of a single resolution and has high computational complexity because global self-attention must be computed. To address these issues, Liu et al. [17] proposed the Swin Transformer, which adopts a hierarchical structure similar to CNNs and can process multi-scale images. Moreover, it employs a sliding window operation to compute attention within local windows, reducing the computational complexity from quadratic, as in ViT, to linear. Wang et al. [18] proposed the Pyramid Vision Transformer (PVT), the first Transformer-based architecture using a feature pyramid. PVT features a progressive shrinking pyramid structure and a spatial reduction attention (SRA) mechanism. Compared to ViT, PVT significantly reduces computational complexity. PVTv2 [19] further improves the original PVT by introducing overlapping patch embeddings and a linear spatial reduction attention mechanism, making the feature pyramid Transformer architecture a viable backbone network for visual tasks. Beyond image classification, Transformer models have demonstrated stronger feature extraction capabilities than CNNs in fields such as object detection, semantic segmentation, and image processing. Therefore, this paper proposes an end-to-end deep hash retrieval method for remote sensing images that fuses multi-scale features based on PVTv2. The main contributions and innovations of this paper are as follows:
(1) This paper proposes a multi-scale feature fusion (MSF) module to address the limitation of single-scale feature representation. The MSF module fuses high-level and low-level features extracted from different scales of the PVTv2 model. It uses both the global attention mechanism (GAM) and multi-head self-attention (MHSA) to organically fuse the features of four scales of the PVTv2 model. The MSF module also introduces a lightweight deformable convolution to overcome the issue of various target orientations and angle transformations in remote sensing images.
(2) To address the problem of larger computation and storage overheads caused by high-dimensional features, this paper introduces a hash encoding layer to extract deep hash features for remote sensing image retrieval in an end-to-end manner. Additionally, the multi-scale features and hash features are trained together to enhance the representation ability of hash features.
(3) To address the issue of large differences between images of the same class while achieving a balance between accuracy and computation, this paper proposes an intra-class similarity loss (ICS loss) inspired by deep metric learning (DML) loss. The ICS loss reduces the distance between same-class images and is computed in each batch of training samples to enhance the discriminative power of remote sensing image features.

2. Related Work

This section provides a comprehensive review of related works in CBRSIR, including methods that utilize CNN features and deep hashing features, feature fusion techniques to enhance feature representation, the use of deep metric learning (DML) to optimize convolutional neural networks and extract discriminative features, and an introduction to the Transformer network PVTv2.

2.1. CBRSIR Based on CNN Features

Deep features extracted from CNNs have been increasingly utilized in CBRSIR. For instance, Li et al. [20] designed four unsupervised convolutional neural networks that generate four types of deep features at different layers. By combining these deep features with traditional handcrafted features, they provided more effective features for CBRSIR. Raffaele et al. [21] extracted deep local convolutional features from fine-tuned CNN models and aggregated the local convolutional features into global descriptors using the vector of locally aggregated descriptors (VLAD). They utilized multiplication and addition attention mechanisms to overcome irrelevant background interference. Hou et al. [22] fine-tuned the MobileNet model to extract deep convolutional features and obtained low-dimensional feature representation by changing the dimension of the final fully connected layer. They compared the retrieval accuracy with the principal component analysis (PCA) method of dimensionality reduction. In cross-dataset remote sensing image retrieval, Wang et al. [23] proposed a learnable joint spatial and spectral transformation (JSST) model to correct spatial and spectral distortions in images. This model embedded the spatially and spectrally modified inputs at the front end of the ResNet34 network, thereby improving generalization and adaptability. Wu et al. [24] proposed two rotation-aware networks, namely the feature-map-transformation-based rotation-aware network (FMT-RAN) and spatial-transformer-based rotation-aware network (ST-RAN), to address the issue of images appearing at arbitrary rotation angles.
However, the aforementioned methods extract deep features from convolutional neural networks (CNNs) for retrieval without utilizing the features of Transformer models. In contrast to CNN models, Transformer models can perform global context modeling and better comprehend the semantic relationships of the entire input sequence. Therefore, they can capture global contextual information and extract richer features.

2.2. CBRSIR Based on Deep Hashing Features

Hashing has been widely used in large-scale remote sensing image retrieval due to its prominent advantages in storage and retrieval speed. Li et al. [25] proposed the deep hashing neural network (DHNN), which utilizes deep feature learning neural networks to learn high-dimensional embedding features and hash learning neural networks to learn low-dimensional hashing features. This model can be optimized end-to-end. To address the overfitting issue caused by a limited number of labeled images in remote sensing datasets, Roy et al. [26] proposed a deep hashing network based on metric learning. Liu et al. [27] introduced a deep supervised hashing model using a loss function composed of classification, similarity, and bit entropy terms based on the framework of generative adversarial networks (GANs) to learn compact and effective hash codes. Cheng et al. [28] proposed the semantic consistency deep hashing model, which applies deep hashing to multi-label remote sensing image retrieval. It introduces a paired label similarity loss that fully utilizes multi-label information, demonstrating the effectiveness of hashing methods in multi-label remote sensing image retrieval. Tan et al. [29] proposed deep contrastive self-supervised hashing for remote sensing image retrieval, which utilizes unlabeled images for training. This method assumes that hash codes generated from different views of the same image should be similar, while those generated from different images should be dissimilar. They designed a loss function to preserve the similarity of hash codes. Jing et al. [30] presented a deep unsupervised weighted hashing model that utilizes a pretrained Swin Transformer to extract feature representations. This model uses an adaptive weight-based loss function that assigns weights adaptively to positive and negative samples and combines it with quantization loss, resulting in improved model performance. Although these deep hashing methods have achieved good retrieval results, they extract single-layer features without employing methods for fusing multiple features. Single-feature extraction is insufficient to fully express the rich detailed information and semantic information of remote sensing images. The adoption of multi-feature fusion in hashing methods has the potential to improve the accuracy of remote sensing image retrieval.

2.3. Methods Based on Multi-Feature Fusion

Currently, several studies have focused on the limitations of using a single-feature representation to fully express both visual and semantic information about images. These studies employ feature fusion techniques to enhance feature discriminability. For example, Yang et al. [31] combined the benefits of mid-level and high-level features in a convolutional neural network (CNN) by fusing the convolutional and fully connected layer features, resulting in improved retrieval performance by simultaneously utilizing global and local image information. Li et al. [32] extracted high-level features from the ResNet50 and VGG16 networks and concatenated them, increasing feature representation proficiency and resulting in enhanced classification performance of remote sensing images by utilizing learned parameters from two networks. Yin et al. [33] introduced a mean-max pooling weighted fusion technique to merge high-level features, which effectively led to improved retrieval performance by enhancing feature representation proficiency. Li et al. [20] formulated four CNN models with different feature layers and developed collaborative affinity metric fusion (CAMF) to merge features from different layers and improve retrieval performance. Alhichri et al. [34] utilized three pre-trained SqueezeNet models that could take input images of various scales and fuse the outputs of three CNN models in a cascaded manner. Minakshi et al. [35] invented a fused CNN architecture that merged features extracted from VGG16, VGG19, and ResNet to obtain efficient and accurate features. Additionally, they presented an optimal feature selection model based on Joint MI_RFO to choose the best features for improved retrieval accuracy. Nevertheless, many current fusion methodologies rely on simple summation or concatenation operations, and most research focuses on merging CNN features. Moreover, current methods separate the multi-scale feature extraction process from the feature fusion process, hindering automatic adjustment and merging of multi-scale features based on remote sensing image retrieval requirements and limiting the representation ability of retrieval features. In addition, there is limited research on adaptive weighted fusion of multiple features extracted from Transformer models, which will be the focus of this study.

2.4. Methods Based on Deep Metric Learning

At present, deep metric learning (DML) methods are extensively utilized to enhance the retrieval functionality of networks. Contrastive loss [36] is a previous metric learning method that measures the distance between two samples, narrowing the gap between paired samples of the same category and increasing the gap between samples of different categories. Triplet loss [37] selects a sample as an anchor and two additional samples categorized as positive and negative. It requires distances between samples of the same class to become more compact and distances between samples of different classes to increase. N-pair loss [38] measures the correlation between samples using cosine similarity and matches each anchor with a positive sample and multiple negative samples. Proxy-NCA [39] is an initial proxy-based loss that connects each sample with a proxy assigned for each category. It aims to draw samples closer to proxies of the same category and encourage distance from proxies of different categories. SoftTriple loss [40] enhances softmax loss by connecting multiple proxies to each category, efficiently capturing the hidden distribution of samples and maintaining a wider intra-class spread. Liu et al. [41] substituted the Hinge function with the softmax function to obtain global optimization, successfully overcoming the local optimization problem of triplet loss. Xue et al. [42] proposed a hash retrieval method using proxy-based metric learning mixed with hash coding learning to enhance retrieval speed while maintaining accuracy and minimizing storage space. However, these methods also have notable limitations. For example, metric loss based on image pairs requires a growing number of sample pairs to be formed with more training samples. This leads to additional computation and longer convergence times for the network. Proxy-based losses are successful in addressing issues related to network convergence rate and time complexity. However, they cannot fully utilize sample information, and proxies assigned for each category have a fixed number and cannot be assigned adaptively, resulting in a lack of generalization ability. Therefore, the research direction of this paper will be focused on improving metric loss based on image pairs.

2.5. Transformer Network (PVTv2)

In this paper, the proposed method improves upon the b2 version of PVTv2 [19], which is pre-trained on ImageNet-1K. PVTv2 is an enhanced version of the pyramid-structured Transformer network PVT. First, it employs overlapping patch embedding to tokenize images, preserving the continuity of local image regions. Second, it replaces the fixed position encoding in PVT with a position encoding based on zero padding, enabling the network to handle images of arbitrary size more efficiently. Additionally, a linear spatial reduction attention mechanism replaces the original spatial reduction attention to reduce computational costs, keeping the computational complexity linear. The PVTv2 network is a multi-stage structure with four stages, each consisting of a patch embedding layer and a Transformer encoder, which produce feature maps at four different scales. As the network depth increases, the resolution of the feature maps gradually decreases, and the channel dimension gradually increases. The Transformer encoder mainly comprises LayerNorm, MLP, and linear spatial reduction attention layers. The output features of the four stages have different scales and channel numbers: Stage 1 outputs 64 × 56 × 56, Stage 2 outputs 128 × 28 × 28, Stage 3 outputs 320 × 14 × 14, and Stage 4 outputs 512 × 7 × 7. In the subsequent study, we also extracted feature matrices from the original PVTv2 model for retrieval, serving as a baseline in the comparative experiments to validate the effectiveness of the proposed method.
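For reference, the following minimal sketch shows how the four stage outputs quoted above can be obtained from a PVTv2-b2 backbone. The model name pvt_v2_b2 and the features_only flag are assumptions about the timm API and may differ in other PVTv2 implementations.

```python
import torch
import timm

# Assumes timm exposes 'pvt_v2_b2' with features_only support; adapt the
# name or the feature-extraction mechanism to your backbone if it differs.
backbone = timm.create_model("pvt_v2_b2", pretrained=False, features_only=True)

x = torch.randn(1, 3, 224, 224)            # one RGB image at 224 x 224
stages = backbone(x)                       # list of four feature maps, Stage 1..4

for i, f in enumerate(stages, start=1):
    print(f"Stage {i}: {tuple(f.shape)}")
# Expected, matching the shapes quoted above (batch dimension first):
# Stage 1: (1, 64, 56, 56)
# Stage 2: (1, 128, 28, 28)
# Stage 3: (1, 320, 14, 14)
# Stage 4: (1, 512, 7, 7)
```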

3. Proposed Methods

The method proposed in this study is an improvement upon the PVTv2 model. Firstly, we developed a multi-scale feature fusion (MSF) module that includes the global attention mechanism, multi-head self-attention mechanism, feature fusion module, and deformable convolution. Next, we constructed a hashing layer to transform the multi-scale fused features into compact binary hash codes. Finally, we proposed an intra-class similarity loss (ICS loss) function that reduces the distance between samples of the same class and designed a five-branch loss function to train the model. By employing the aforementioned approach, we have made enhancements to the original PVTv2 model, thereby obtaining a novel model, which we have designated as PVTA_MSF.

3.1. Multi-Scale Feature Fusion Module (MSF)

The multi-scale feature fusion module is illustrated in Figure 1a. Its specific construction and operation are as follows: Based on the original PVTv2 model, a global attention module is added after its first stage. The global attention module enhances cross-dimensional interactions, capturing correlations and importance between channels and spatial dimensions. This helps reduce background interference and improve feature representation capability. Multi-head self-attention modules are added after the second and third stages to simultaneously focus on different feature subspaces, extracting richer feature representations. Then, a feature fusion module is designed to fuse the feature maps extracted from all four stages. The principle behind this module is to use high-level features to weight and fuse the low-level features within the module. Subsequently, the output of the feature fusion module is downsampled by a factor of four and subjected to deformable convolutions to enhance the model’s ability to handle deformations, resulting in multi-scale fused features. The specific process is as follows:

3.1.1. Feature Extracting Based on Global Attention Mechanism (GAM) in Stage 1

In the PVTv2 model, Stage 1 possesses the minimum channel number and the maximum resolution of 64 × 56 × 56, making it abundant in low-level image features. Therefore, this paper introduces the global attention mechanism (GAM) that combines channel and spatial attention to process the features in this layer. The GAM [43] is an improvement in the CBAM module [44], which can capture significant features in three dimensions: channel, spatial height, and spatial width. In practical applications, channel attention is used to weight different channels of the input features, giving higher weights to important channels and lower weights to less important ones. This reduces the influence of background channels and focuses the attention on channels that contain useful information. On the other hand, spatial attention is used to weight different spatial positions of the input features, giving higher weights to important positions and lower weights to less important ones. This reduces the impact of background interference and noise, directing attention toward spatial positions that contain valuable information. Therefore, the global attention mechanism allows for interaction between channel attention and spatial attention, enhancing their expressive power across different dimensions. Through this cross-dimensional interaction, the mechanism can better capture the correlations and importance between channels and spatial positions, leading to reduced background interference and improved feature representation capability. The framework of GAM is shown in Figure 2.
The processing of the global attention mechanism can be summarized in (1) and (2) as follows: $F_1$ represents the input feature map, $F_2$ represents the output feature map of the channel attention submodule, $F_3$ represents the output feature map of the spatial attention submodule, $M_c$ represents the channel attention submodule, and $M_s$ represents the spatial attention submodule.
$$F_2 = M_c(F_1) \otimes F_1 \tag{1}$$
$$F_3 = M_s(F_2) \otimes F_2 \tag{2}$$
(1) Figure 3 illustrates the channel attention submodule, which uses a three-dimensional array to retain information across the three dimensions. Two multi-layer perceptrons (MLPs) facilitate interaction between the channel and spatial dimensions, with a ReLU activation function applied between the two layers to prevent issues such as gradient vanishing and exploding. The sigmoid function is then employed to generate the channel weight coefficients of the feature map, and the input feature map is multiplied by these coefficients to accomplish the weighting operation. The specific process is summarized in the equations below:
$$y = a_1 F_1^{T} + b_1 \tag{3}$$
$$F_2 = \mathrm{sigmoid}\big[\mathrm{ReLU}(a_2 y + b_2)\big]^{T} \otimes F_1 \tag{4}$$
In (3) and (4), $a_1$ and $a_2$ are the randomly initialized weight values of the two MLPs, and $b_1$ and $b_2$ are the bias terms of the two MLPs.
(2) Figure 4 illustrates the spatial attention submodule, which employs two convolutional layers with a kernel size of 7, padding of 3, and a stride of 1 to integrate spatial information. This process effectively reduces the number of feature map channels with a compression ratio r, thereby decreasing computational costs. Spatial attention weight coefficients are generated using the sigmoid function and multiplied with the input feature map to implement the weighted operation. The calculation process for the feature map size in the convolutional layer is summarized in (5) and (6), while that of the spatial attention submodule is summarized in (7):
$$W_{out} = \frac{W_{in} - k + 2p}{S} + 1 \tag{5}$$
$$H_{out} = \frac{H_{in} - k + 2p}{S} + 1 \tag{6}$$
$$F_3 = M_s(F_2) \otimes F_2 = \mathrm{sigmoid}\big[\mathrm{ConvBN}\big(\mathrm{ConvBNReLU}(F_2)\big)\big] \otimes F_2 \tag{7}$$
In (5) and (6), $W_{in}$ and $H_{in}$ represent the width and height of the input feature map, $W_{out}$ and $H_{out}$ represent the width and height of the output feature map, $p$ and $S$ denote the zero-padding and stride, and $k$ denotes the size of the convolutional kernel. In Figure 4, $r$ represents the compression ratio of the number of channels.
The introduction of channel attention effectively suppresses background information interference on the target. Spatial attention is also being applied to fully utilize the spatial data of the original image, assigning greater weight to significant features in a supervised manner. In this way, the discriminative power of the original image features is enhanced.
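To make the two submodules concrete, the following is a minimal PyTorch sketch of the GAM as described above: a channel-attention MLP corresponding to Eqs. (3) and (4), followed by a two-layer 7 × 7 convolutional spatial attention corresponding to Eq. (7). The reduction ratio r = 4 and the exact layer settings are assumptions, not the authors' precise configuration.

```python
import torch
import torch.nn as nn

class GAM(nn.Module):
    """Sketch of the global attention mechanism: channel-attention MLP
    (Eqs. (3)-(4)) followed by a 7x7 convolutional spatial attention (Eq. (7))."""
    def __init__(self, channels, r=4):
        super().__init__()
        self.channel_mlp = nn.Sequential(              # two-layer MLP with ReLU in between
            nn.Linear(channels, channels // r),
            nn.ReLU(inplace=True),
            nn.Linear(channels // r, channels),
        )
        self.spatial = nn.Sequential(                  # 7x7 conv bottleneck with ratio r
            nn.Conv2d(channels, channels // r, kernel_size=7, padding=3),
            nn.BatchNorm2d(channels // r),
            nn.ReLU(inplace=True),
            nn.Conv2d(channels // r, channels, kernel_size=7, padding=3),
            nn.BatchNorm2d(channels),
        )

    def forward(self, x):                              # x: (B, C, H, W)
        b, c, h, w = x.shape
        # channel attention: MLP over the channel dimension, then weight x (Eq. (1))
        y = x.permute(0, 2, 3, 1).reshape(b, -1, c)    # (B, H*W, C)
        y = self.channel_mlp(y).reshape(b, h, w, c).permute(0, 3, 1, 2)
        x = x * torch.sigmoid(y)
        # spatial attention: weight each spatial position (Eq. (2))
        return x * torch.sigmoid(self.spatial(x))

# e.g. applied to the Stage 1 output (64 channels, 56 x 56)
out = GAM(channels=64)(torch.randn(2, 64, 56, 56))     # -> (2, 64, 56, 56)
```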

3.1.2. Feature Extracted by MHSA in Stage 2 and Stage 3

In deep network structures, the local structural information in the extracted features of the backbone network tends to be lost with increasing depth. This problem negatively impacts the global semantic feature learning of remote sensing images. To address this issue, this paper introduces a multi-head self-attention mechanism (MHSA) [45] in Stage 2 and Stage 3. The MHSA establishes relationships between individual elements in the abstract feature map, enhances the interaction between channel dimension and spatial dimensions (height and width), and achieves a receptive field corresponding to the entire image. Moreover, the multi-head self-attention mechanism can simultaneously focus on different feature subspaces, thereby extracting more diverse and comprehensive feature representations. Each attention head can learn different feature attention patterns, enabling the capture of distinct information within the data. By combining the outputs of multiple attention heads, a more comprehensive and diversified feature representation can be obtained, thereby reducing the loss of local structural information and enhancing the expressive power of image features. Figure 5 illustrates the MHSA, which comprises multiple self-attention modules that take three inputs: query matrix Q, key matrix K, and value matrix V. The calculation formula for the self-attention modules is expressed as follows:
$$\mathrm{Attention}(Q, K, V) = \mathrm{softmax}\!\left(\frac{Q K^{T}}{\sqrt{d_k}}\right) V = A V \tag{8}$$
The three inputs of the self-attention module are different learned embeddings of the input feature map. After applying the softmax function to the scaled product of Q and the transpose of K, an attention coefficient matrix A is generated, in which each row corresponds to the similarity between an element of Q and all elements of K. The attention coefficient matrix A is then multiplied by the value matrix V to obtain the final attention feature output.
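The following sketch shows how such multi-head self-attention can be applied to a stage output: spatial positions are flattened into a token sequence, attended with PyTorch's nn.MultiheadAttention (Eq. (8) per head), and reshaped back into a feature map. The head count and the LayerNorm/residual placement are assumptions.

```python
import torch
import torch.nn as nn

class FeatureMapMHSA(nn.Module):
    """Sketch of multi-head self-attention over a (B, C, H, W) feature map."""
    def __init__(self, channels, num_heads=8):
        super().__init__()
        self.norm = nn.LayerNorm(channels)
        self.attn = nn.MultiheadAttention(channels, num_heads, batch_first=True)

    def forward(self, x):                              # x: (B, C, H, W)
        b, c, h, w = x.shape
        tokens = x.flatten(2).transpose(1, 2)          # (B, H*W, C): one token per position
        t = self.norm(tokens)
        attended, _ = self.attn(t, t, t)               # Q = K = V (self-attention, Eq. (8))
        attended = attended + tokens                   # residual connection
        return attended.transpose(1, 2).reshape(b, c, h, w)

# e.g. the Stage 2 output (128 channels, 28 x 28)
out = FeatureMapMHSA(channels=128)(torch.randn(2, 128, 28, 28))   # -> (2, 128, 28, 28)
```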

3.1.3. Feature Fusion Module (FFM)

As shown in Figure 6, this paper designs a feature fusion module that fuses the output features from the four stages of the PVTv2 model. The fundamental principle of the fusion module is to use high-level features to weight the low-level features within the module and then concatenate the high-level features with the weighted low-level features. To achieve a balance between accuracy and computational cost, we followed popular solutions [46,47,48] and employed four 1 × 1 convolution layers to reduce the dimensions of the output features from the four stages, decreasing the channel numbers from 512, 320, 128, and 64 to 64, 64, 16, and 8, respectively. For high-level features to multiply the low-level features, they must have the same size and channel number; similarly, when concatenating high-level and low-level features, they must be of the same size. Therefore, high-level features undergo upsampling and convolution before weighting and fusion. We defined a series of 3 × 3 convolution layers with batch normalization [49] and ReLU activation [50] for processing the high-level features. The specific parameters of the convolution layers $C_x$ are listed in Table 1.
The feature fusion module (FFM) comprises three stages; for convenience of reference, we denote the output feature maps of Stage 4 to Stage 1 as $X_4$, $X_3$, $X_2$, and $X_1$, respectively. The specific implementation process of feature fusion is as follows (a code sketch is given after the list):
  • The feature map $X_4$ is used to weight the feature map $X_3$, and the weighted $X_3$ is then fused with $X_4$. Specifically, $X_4$ is upsampled by a factor of 2 and processed by the $C_1$ convolution to obtain $C_1(X_4)$, so that both $X_3$ and the processed $X_4$ have 64 channels and a size of 14 × 14. Subsequently, $C_1(X_4)$ is used as the weight of $X_3$ and multiplied by $X_3$, and the weighted result is concatenated with $C_1(X_4)$ along the channel dimension. Finally, another convolution layer, $C_2$, is applied to smooth the concatenated feature map, resulting in a fused feature map $X_{34} \in \mathbb{R}^{\frac{H}{16} \times \frac{W}{16} \times 128}$. This process is summarized in (9):
$$X_{34} = C_2\big(\mathrm{Concat}\big(C_1(X_4) \times X_3,\; C_1(X_4)\big)\big) \tag{9}$$
  • The feature maps $X_4$ and $X_3$ are used to weight the feature map $X_2$, and the weighted $X_2$ is then fused with the feature map $X_{34}$. Specifically, $X_4$ is upsampled by a factor of 4 and processed by the $C_3$ convolution layer to reduce its dimensions, resulting in $C_3(X_4)$. Similarly, $X_3$ is upsampled by a factor of 2 and processed by the $C_4$ convolution layer, resulting in $C_4(X_3)$. $X_{34}$ is upsampled and processed by the $C_5$ convolution layer to obtain $C_5(X_{34})$. After this step, $X_4$ and $X_3$ have an increased size of 28 × 28 and are reduced to 16 channels, and $X_{34}$ has an increased size of 28 × 28. Subsequently, $C_3(X_4)$ and $C_4(X_3)$ are used to weight $X_2$. The weighted result is then concatenated with $C_5(X_{34})$ along the channel dimension. Finally, another convolution layer, $C_6$, is applied to smooth the concatenated feature map, resulting in a fused feature map $X_{234} \in \mathbb{R}^{\frac{H}{8} \times \frac{W}{8} \times 144}$. This process is summarized in (10):
$$X_{234} = C_6\big(\mathrm{Concat}\big(C_3(X_4) \times C_4(X_3) \times X_2,\; C_5(X_{34})\big)\big) \tag{10}$$
  • The feature maps $X_4$, $X_3$, and $X_2$ are used to weight the feature map $X_1$, and the weighted $X_1$ is then fused with the second-stage fused feature map $X_{234}$. Specifically, $X_1$ is downsampled by a factor of 2 to reduce its size to 28 × 28. $X_4$ is upsampled by a factor of 4 and processed by the $C_7$ convolution layer to reduce its dimensions, resulting in $C_7(X_4)$. Similarly, $X_3$ is upsampled by a factor of 2 and processed by the $C_8$ convolution layer, resulting in $C_8(X_3)$. $X_2$ is processed by the $C_9$ convolution layer to reduce its dimensions, resulting in $C_9(X_2)$. After this step, $X_4$, $X_3$, and $X_2$ have a size of 28 × 28 and are reduced to 8 channels. Subsequently, $C_7(X_4)$, $C_8(X_3)$, and $C_9(X_2)$ are used to weight $X_1$. The weighted result is then concatenated with $X_{234}$ along the channel dimension. Finally, another convolution layer, $C_{10}$, is applied to smooth the concatenated feature map, resulting in a fused feature map $X_{1234} \in \mathbb{R}^{\frac{H}{8} \times \frac{W}{8} \times 152}$. This process is summarized in (11):
$$X_{1234} = C_{10}\big(\mathrm{Concat}\big(C_7(X_4) \times C_8(X_3) \times C_9(X_2) \times X_1,\; X_{234}\big)\big) \tag{11}$$
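The sketch below follows the three steps and channel widths described above (Eqs. (9)–(11)). The exact settings of the $C_x$ layers come from Table 1, which is not reproduced here, so the 3 × 3 conv–BN–ReLU blocks and the pooling used for downsampling are assumptions.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

def cbr(cin, cout):
    """3x3 conv + BN + ReLU block standing in for the C_x layers (Table 1 details assumed)."""
    return nn.Sequential(nn.Conv2d(cin, cout, 3, padding=1),
                         nn.BatchNorm2d(cout), nn.ReLU(inplace=True))

def up(t, s):
    """Bilinear upsampling by scale factor s."""
    return F.interpolate(t, scale_factor=s, mode="bilinear", align_corners=False)

class FFM(nn.Module):
    """Sketch of the three-step feature fusion module (Eqs. (9)-(11))."""
    def __init__(self):
        super().__init__()
        # 1x1 reductions: 512/320/128/64 -> 64/64/16/8 channels
        self.r4, self.r3 = nn.Conv2d(512, 64, 1), nn.Conv2d(320, 64, 1)
        self.r2, self.r1 = nn.Conv2d(128, 16, 1), nn.Conv2d(64, 8, 1)
        self.c1, self.c2 = cbr(64, 64), cbr(128, 128)                     # step 1
        self.c3, self.c4 = cbr(64, 16), cbr(64, 16)                       # step 2
        self.c5, self.c6 = cbr(128, 128), cbr(144, 144)
        self.c7, self.c8 = cbr(64, 8), cbr(64, 8)                         # step 3
        self.c9, self.c10 = cbr(16, 8), cbr(152, 152)

    def forward(self, x1, x2, x3, x4):
        # x1..x4: Stage 1..4 outputs, e.g. (B,64,56,56) ... (B,512,7,7)
        x1, x2, x3, x4 = self.r1(x1), self.r2(x2), self.r3(x3), self.r4(x4)
        # Step 1 (Eq. 9): X4 weights X3, concatenate -> X34 (128 ch, 14x14)
        w4 = self.c1(up(x4, 2))
        x34 = self.c2(torch.cat([w4 * x3, w4], dim=1))
        # Step 2 (Eq. 10): X4 and X3 weight X2, fuse with upsampled X34 -> X234 (144 ch, 28x28)
        x234 = self.c6(torch.cat([self.c3(up(x4, 4)) * self.c4(up(x3, 2)) * x2,
                                  self.c5(up(x34, 2))], dim=1))
        # Step 3 (Eq. 11): X4, X3, X2 weight the downsampled X1 -> X1234 (152 ch, 28x28)
        x1 = F.avg_pool2d(x1, 2)
        return self.c10(torch.cat([self.c7(up(x4, 4)) * self.c8(up(x3, 2)) * self.c9(x2) * x1,
                                   x234], dim=1))

ffm = FFM()
fused = ffm(torch.randn(2, 64, 56, 56), torch.randn(2, 128, 28, 28),
            torch.randn(2, 320, 14, 14), torch.randn(2, 512, 7, 7))
# fused.shape -> (2, 152, 28, 28), i.e. H/8 x W/8 x 152 for a 224 x 224 input
```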

3.1.4. Deformable Convolution (DCN Layer)

To enhance the network’s ability to handle deformations caused by different target orientations and angle transformations in CBRSIR, we introduce a deformable convolution layer [51] after the feature fusion module. Deformable convolution is an improved convolutional operation that adjusts the sampling positions of traditional convolution by introducing learnable offsets, in order to adapt to the deformations and directional variations of different targets. Its principle involves adding offsets to the traditional convolution operation and redefining the sampling positions of the convolution kernel, whereas traditional convolution operates on fixed sampling positions. With the learned offsets, deformable convolution can adjust the sampling positions to better accommodate target deformations. This allows the convolution kernel to sample input features at different locations, thereby better capturing the details and shape variations of the target. This means that regardless of whether the target is horizontal, vertical, or inclined, deformable convolution can adapt to and capture the features of the target, enabling the network to have better robustness and expressive power when dealing with targets with significant shape changes.
The output of the FFM, $X_{1234}$, is downsampled by a factor of 4 to reduce its size and subsequently passed through a DCN layer. This process is summarized in (12):
$$X_{DCN} = \mathrm{DCN}(X_{1234}) \tag{12}$$
Figure 7a shows a standard convolution with blue dots representing the regular sampling grid. Figure 7b–d show deformable convolutions with red dots representing the sampling locations of the deformable convolution and orange arrows representing the added offsets. It can be observed that the deformable convolution layer utilizes trainable offsets, enabling the sampling grid to deform independently and capture crucial features. The training process expands the range of the convolution kernel, thereby increasing the receptive field.
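A minimal sketch of such a DCN layer using torchvision's DeformConv2d is shown below: a plain convolution predicts two offsets (dx, dy) per kernel position, and the deformable convolution samples the input at the shifted locations. The kernel size and channel width are assumptions.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F
from torchvision.ops import DeformConv2d

class DCNLayer(nn.Module):
    """Sketch of the DCN layer applied after the FFM."""
    def __init__(self, channels=152, k=3):
        super().__init__()
        # 2 offsets (dx, dy) for each of the k*k kernel positions
        self.offset = nn.Conv2d(channels, 2 * k * k, kernel_size=k, padding=k // 2)
        self.dcn = DeformConv2d(channels, channels, kernel_size=k, padding=k // 2)

    def forward(self, x):
        return self.dcn(x, self.offset(x))

# X_1234 (152 x 28 x 28) downsampled by a factor of 4, then the DCN layer (Eq. (12))
x = F.avg_pool2d(torch.randn(2, 152, 28, 28), 4)       # -> (2, 152, 7, 7)
x_dcn = DCNLayer()(x)                                  # -> (2, 152, 7, 7)
```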

3.2. Hashing Layer

To enable fast retrieval in large-scale remote sensing databases, we apply a binary hashing method to the feature matrices used in CBRSIR. The multi-scale feature fusion module (MSF) is followed by an additional hash coding layer. The MSF module generates a fused feature map with 152 channel dimensions, and this channel dimension can be adjusted through the deformable convolution layer to produce hash codes of varying lengths, which facilitates comparison with other retrieval methods in the subsequent experiments. The hash coding layer uses the sigmoid activation function, and the resulting output values are binarized to exactly 0 or 1 using the following threshold:
$$h(x) = \begin{cases} 0, & \text{if } x < 0.5 \\ 1, & \text{otherwise} \end{cases} \tag{13}$$
Here, $x$ represents an element of the feature vector input to the hash coding layer, and $h(x)$ represents the hash code bit at the corresponding position.
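The following sketch illustrates the binarization in Eq. (13). The linear projection that sets the code length is a simplification of the approach described above, where the code length is controlled by the channel dimension of the deformable convolution layer.

```python
import torch
import torch.nn as nn

class HashLayer(nn.Module):
    """Sketch of the hash coding layer: sigmoid activation squashes the input
    to (0, 1), and Eq. (13) thresholds each element at 0.5."""
    def __init__(self, in_dim, n_bits):
        super().__init__()
        self.proj = nn.Linear(in_dim, n_bits)          # assumed stand-in for adjusting the DCN width

    def forward(self, feat):                           # feat: (B, in_dim)
        continuous = torch.sigmoid(self.proj(feat))    # used while training
        binary = (continuous >= 0.5).float()           # Eq. (13), used for retrieval
        return continuous, binary

pooled = torch.randn(4, 152)                           # e.g. globally pooled fused features
soft_codes, hash_codes = HashLayer(152, n_bits=64)(pooled)
# hash_codes is a (4, 64) tensor containing exact 0/1 bits
```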

3.3. Intra-Class Similarity Loss Function for Mini-Batches (ICS Loss)

Remote sensing images face a challenge in which identical category samples exhibit noticeable differences, while samples from different categories demonstrate high similarity, affecting the retrieval performance of remote sensing image features. Similarity-based metric learning algorithms aim to address this problem by reducing the distance between similar-category samples and increasing the distance between different-category samples.
Metric losses based on image pairs are effective at increasing the distance between features of different categories while reducing the distance between features of the same category. However, as the number of training samples increases, their computational complexity grows significantly. To balance accuracy and computational complexity and to address the problem of large differences between samples of the same category and high similarity between samples of different categories, we propose an intra-class similarity loss function that reduces the distance between samples of the same category within each mini-batch.
The calculation process of this loss function is as follows (a code sketch is given after the list):
  • As shown in Figure 8, images in a mini-batch are grouped by category according to their labels. The intra-class distance $Dis_m$ for category $m$ is calculated as follows:
$$Dis_m = \begin{cases} \dfrac{1}{n \times (n - 1)} \sum\limits_{i, j \in Class_m} D_{i,j}, & \text{if } n > 1 \\ 0, & \text{otherwise} \end{cases} \tag{14}$$
    Here, images $i$ and $j$ belong to the $m$-th category, $n$ is the number of images in the current batch whose category is $m$, and $D_{i,j}$ represents the Euclidean distance between images $i$ and $j$.
  • The intra-class similarity loss function $L_{ICS}$ is calculated as follows:
$$L_{ICS} = \frac{1}{k} \sum_{Dis_m > 0} Dis_m \tag{15}$$
    Here, $k$ represents the number of categories whose intra-class distance $Dis_m$ is greater than 0.
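A minimal sketch of the ICS loss over a mini-batch, following Eqs. (14) and (15), is given below; the feature dimensionality and batch composition are illustrative only.

```python
import torch

def ics_loss(features, labels):
    """Intra-class similarity loss for a mini-batch (Eqs. (14)-(15)):
    mean pairwise Euclidean distance within each class that has more than
    one sample, averaged over those classes."""
    dists = torch.cdist(features, features)            # (B, B) pairwise Euclidean distances
    class_means = []
    for c in labels.unique():
        idx = (labels == c).nonzero(as_tuple=True)[0]
        n = idx.numel()
        if n > 1:                                       # Eq. (14): otherwise Dis_m = 0
            sub = dists[idx][:, idx]                    # distances within class c
            class_means.append(sub.sum() / (n * (n - 1)))
    if not class_means:                                 # no class with more than one sample
        return features.new_zeros(())
    return torch.stack(class_means).mean()              # Eq. (15), k = len(class_means)

# toy mini-batch: 8 feature vectors, 4 classes
feats = torch.randn(8, 128)
labels = torch.tensor([0, 0, 1, 1, 1, 2, 3, 3])
loss = ics_loss(feats, labels)
```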

3.4. Design of Multiple Loss Functions

During the network learning stage, we adopt a multi-loss function learning method, which combines the intra-class similarity loss function and the cross-entropy loss function to update the model parameters.
Because hash features change slowly, computing losses directly on them produces slowly changing loss values, which slows the convergence of the model during training. Therefore, we attach a fully connected head module after the output of the MSF module to classify the images and compute the cross-entropy loss used to train the entire model.
The pyramid structure of PVTA_MSF consists of four stages. To reduce the effect of gradient vanishing during error backpropagation, we add an intra-class similarity loss for each stage. First, a fully connected head module reduces the dimensions of the outputs of the four stages, and the ICS loss is then calculated on the dimensionality-reduced features.
Finally, we construct a five-branch loss, as shown in Figure 1b. The overall loss is given in (16):
$$L = L_{CE} + L_{ICS}(F_4) + L_{ICS}(F_3) + L_{ICS}(F_2) + L_{ICS}(F_1) \tag{16}$$
In (16), $L_{CE}$ denotes the cross-entropy loss, and $L_{ICS}(F_4)$, $L_{ICS}(F_3)$, $L_{ICS}(F_2)$, and $L_{ICS}(F_1)$ denote the intra-class similarity losses calculated for Stage 4, Stage 3, Stage 2, and Stage 1, respectively.
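The five-branch objective in Eq. (16) can be sketched as follows. The classification head producing the logits and the per-stage dimensionality-reduced embeddings $F_1$–$F_4$ are assumed inputs; ics_loss restates the sketch from Section 3.3.

```python
import torch
import torch.nn.functional as F

def ics_loss(features, labels):
    """Mean intra-class pairwise Euclidean distance (Eqs. (14)-(15));
    identical in spirit to the Section 3.3 sketch."""
    d = torch.cdist(features, features)
    terms = []
    for c in labels.unique():
        idx = (labels == c).nonzero(as_tuple=True)[0]
        n = idx.numel()
        if n > 1:
            terms.append(d[idx][:, idx].sum() / (n * (n - 1)))
    return torch.stack(terms).mean() if terms else features.new_zeros(())

def total_loss(logits, labels, stage_embeddings):
    """Eq. (16): cross-entropy on the classification head plus one ICS loss
    per stage. `stage_embeddings` holds the four dimensionality-reduced
    stage features F1..F4 (head layout assumed)."""
    loss = F.cross_entropy(logits, labels)
    for emb in stage_embeddings:
        loss = loss + ics_loss(emb, labels)
    return loss

# toy usage: 21 classes (UCMD), batch of 8, four stage embeddings of width 128
logits = torch.randn(8, 21, requires_grad=True)
labels = torch.randint(0, 21, (8,))
stages = [torch.randn(8, 128) for _ in range(4)]
total_loss(logits, labels, stages).backward()
```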

4. Experiments and Analysis

This section presents the implementation details of our experiments. We extensively tested the effectiveness of the proposed method on two commonly used remote sensing image datasets: the UC Merced dataset [52] and the NWPU-RESISC45 dataset [53]. This section comprises eight parts: the first part describes the experimental settings and datasets; parts two to six present comparative analyses of retrieval performance; part seven reports ablation experiments; and part eight presents visualization experiments.

4.1. Dataset and Experimental Settings

We conducted experiments on two widely utilized remote sensing image databases, namely the UC Merced dataset and the NWPU-RESISC45 dataset.
The UC Merced dataset comprises a collection of remote sensing images downloaded from the U.S. Geological Survey (USGS) by a team from the University of California, Merced. It is widely used for retrieval and classification tasks in the field of remote sensing images and contains 2100 images, 21 geographic categories, and 100 remote sensing images per category. Each image has a size of 256 × 256 and a spatial resolution of 0.3 m. In the following sections, we refer to this dataset as UCMD.
The NWPU-RESISC45 dataset is a large-scale remote sensing image dataset proposed by Northwestern Polytechnical University for scene classification, comprising a total of 31,500 remote sensing images. Each image has a size of 256 × 256, with spatial resolutions ranging from 30 m to 0.2 m, and the dataset covers 45 geographic categories with 700 remote sensing images per category. We refer to this dataset as NWPU for clarity and ease of reference. For fine-tuning the model, we randomly selected 100 images from each category of the NWPU dataset and, similarly, 50 images from each category of the UCMD dataset. The remaining images in each dataset were used as the query test set.
Retrieval performance was assessed using several metrics: mean average precision (mAP), average normalized modified retrieval rate (ANMRR), Precision@N, and the precision–recall (P-R) curve. The mAP first computes the average retrieval precision within each category and then averages it over all categories; the higher the mAP, the more accurate the retrieval. ANMRR evaluates how highly the correct images are ranked in the retrieval results; the more accurate the retrieval, the smaller the ANMRR. Precision@N evaluates the accuracy of the top N returned images, whereas the P-R curve considers both precision and recall.
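For concreteness, the sketch below computes Precision@N and the average precision (the per-query quantity that is averaged into mAP) from a ranked list of returned labels; it is a minimal illustration of the metrics, not the exact evaluation tooling used in the experiments.

```python
import numpy as np

def precision_at_n(retrieved_labels, query_label, n):
    """Fraction of the top-n returned images sharing the query's label."""
    return float(np.mean(np.asarray(retrieved_labels[:n]) == query_label))

def average_precision(retrieved_labels, query_label):
    """AP for one query: precision averaged over the ranks of the relevant hits;
    mAP averages this over all queries (and categories)."""
    rel = (np.asarray(retrieved_labels) == query_label).astype(float)
    if rel.sum() == 0:
        return 0.0
    cum_precision = np.cumsum(rel) / (np.arange(len(rel)) + 1)
    return float((cum_precision * rel).sum() / rel.sum())

# toy ranking: query is class 3, top-5 returned labels
print(precision_at_n([3, 3, 1, 3, 2], query_label=3, n=5))   # 0.6
print(average_precision([3, 3, 1, 3, 2], query_label=3))     # ~0.92
```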
Fine-tuning was performed on both the UC Merced dataset and the NWPU-RESISC45 dataset. The experiments were run on a single NVIDIA GeForce RTX 3060 GPU with 32 GB of RAM, using the PyTorch framework. The software environment consisted of Windows 10, CUDA 11.6, PyTorch 1.12.1, and Python 3.9. During fine-tuning, all input images were resized uniformly to 224 × 224. We employed a learning rate decay strategy with a decay rate of 0.1, an initial learning rate of 5 × 10⁻⁴, and a batch size of 60. The optimizer was Adam, and the number of iterations was set to 200.

4.2. Comparative Analysis of Retrieval Performance

This study compares seven state-of-the-art remote sensing image retrieval methods. Liu et al.'s method is referred to as GOSLm [41], while the methods introduced by Tang et al. and Raffaele et al. are called DBOW [54] and V-DELF [21], respectively. ResNet50 [55] and D-CNN [56], two strong baselines reported in the V-DELF study, are also compared: the ResNet50 features are obtained through VLAD aggregation of the features extracted by ResNet50, while the D-CNN features are extracted with VGG16, and both were fine-tuned on the UCMD and NWPU datasets. Hou et al. [22] utilized MobileNets to extract features and changed the final fully connected layer to obtain low-dimensional features for retrieval; their method is referred to as MobileNets. In addition, Wang et al. [23] proposed the joint spatial and spectral transformation model, which we refer to as JSST. Li et al. [57] proposed an adaptive multi-proxy method that can allocate multiple proxies for samples; we refer to this method as Multi-Proxy. The datasets for the aforementioned methods were divided as follows: GOSLm, DBOW, V-DELF, ResNet50, and D-CNN split each dataset into a training set and a test set at a 4:1 ratio; MobileNets selected 50 images from each class as query images and randomly split the remaining images into training and validation sets; JSST trained the network on the PatternNet dataset and evaluated it on other datasets; and Multi-Proxy used 50% of the UCMD dataset for training and the remaining 50% for validation. To demonstrate the effectiveness of PVTA_MSF, we conducted a series of experiments on the UCMD and NWPU datasets. As shown in Table 2, our improved PVTA_MSF model achieved the best performance on both datasets. Specifically, on the UCMD dataset, PVTA_MSF achieved a 4.2% mAP improvement over the best-performing comparison method, Multi-Proxy. On the NWPU dataset, PVTA_MSF attained a 1.6% mAP improvement over the best-performing comparison method, GOSLm. These results confirm the efficacy of the proposed PVTA_MSF method in enhancing remote sensing image retrieval performance.

4.3. Comparison of Multi-Feature Fusion Method

In order to validate the effectiveness of our multi-scale feature fusion module, we compared it with other multi-feature-fusion-based remote sensing image retrieval methods. Specifically, we referred to Yang et al.’s [31] method as VD16, Yin et al.’s [33] technique as Fused mean-max, Li et al.’s [20] approach as IRMFRCAMF, and Minakshi et al.’s [35] approach as Fused CNN_RFM. Our method achieved the highest accuracy in both mAP and ANMRR evaluation metrics on the UCMD dataset, as illustrated in Table 3, and outperformed the other methods by a significant margin. These results explicitly demonstrate that our multi-scale feature fusion method is more effective than other multi-feature-fusion-based methods.

4.4. Class mAP Comparative Analysis

In order to further validate the effectiveness of our PVTA_MSF model, we calculated the retrieval accuracy of each specific semantic category following the methods of Tang et al. [54], Raffaele et al. [21], and Liu et al. [41]. Table 4 and Table 5 display the per-category accuracy on the two remote sensing datasets, and we used the experimental results reported by the aforementioned methods as our comparative benchmarks. The best results are highlighted in bold. Table 4 demonstrates that the proposed PVTA_MSF method achieves the best retrieval accuracy in 19 out of 21 categories of the UCMD dataset, with only the Chaparral and Forest categories having slightly lower accuracy. Our method also attains 100% retrieval accuracy in 15 categories. Specifically, while the GOSLm method only achieves 55% and 59% accuracy in the Dense and Medium-density categories, respectively, our method achieves 100% accuracy in both. In addition, while the DBOW and D-CNN methods attain only 50% and 60% accuracy in the Storage category, our method achieves 100% accuracy in that category. Table 5 indicates that our proposed PVTA_MSF method achieves the best retrieval accuracy in 21 out of 45 categories of the NWPU dataset. In contrast, the GOSLm method achieves the best retrieval accuracy in only 12 categories, the V-DELF method in 11 categories, and the DBOW method in 8 categories. While other methods may achieve the best accuracy in some individual categories, their retrieval performance is poor in certain categories, indicating noticeable shortcomings. For example, the D-CNN method only achieves low accuracies of 39%, 30%, 38%, and 47% in the Church, Palace, Tennis Court, and Thermal Power Station categories, respectively. The DBOW method only attains low accuracies of 57% and 48% in the Stadium and Storage Tank categories, respectively. The V-DELF method only obtains 64% and 56% accuracy in the Baseball Diamond and Palace categories, respectively. The GOSLm method only attains a poor accuracy of 51% in the Palace category. In contrast, our PVTA_MSF method achieves high retrieval accuracy in most categories, exceeding 80% in 43 out of 45 categories. Only the Church and Palace categories have lower accuracies of 74% and 68%, respectively, but their retrieval accuracy is surpassed only by the DBOW method, achieving the second-best result and exceeding that of the other five methods.

4.5. Analysis of the Influence of Feature Dimensionality

Raffaele et al. [21] utilized principal component analysis (PCA) to reduce the dimensionality of VLAD vectors, resulting in low-dimensional descriptors, and compared the retrieval accuracy of VLAD descriptors of different sizes. Liu et al. [41] assessed the impact of various embedding sizes on retrieval performance by altering the embedding size during training and concluded that an embedding size of 512 achieved the best performance. Hou et al. [22] obtained low-dimensional features by fine-tuning the final fully connected layer of MobileNets, evaluated the effect of different dimensions on retrieval performance, and found that the optimal low dimension was 32. Following these works, we fine-tuned the deformable convolution layers of the model to output hash features of different dimensional sizes {32, 64, 128, 256, 512, 1024} and compared our experimental results with those of the three aforementioned methods. Table 6 shows that on the UCMD dataset, PVTA_MSF achieves its best accuracy at a hash feature dimension of 1024, while accuracy drops noticeably as the dimensionality decreases to 32 bits. According to Table 7, on the NWPU dataset, PVTA_MSF also achieves its highest accuracy with a hash feature dimension of 1024, and accuracy declines continuously as the dimensionality decreases, whereas PVT_v2 obtains its optimal accuracy at 256 dimensions. The results in Table 6 and Table 7 show that our PVTA_MSF method outperforms the other approaches at every feature dimension and also outperforms the initial PVT_v2 model in all feature dimensions.

4.6. Precision@N and Precision–Recall (P-R) Curve Analysis

To further verify the effectiveness of the PVTA_MSF method, we utilized precision@N and precision–recall (P-R) curves as comparative experimental evaluation metrics. Figure 9 illustrates the comparison of precision@N between the initial model PVT_v2 and PVTA_MSF for varying numbers of returned images in both the UCMD and NWPU datasets. Precision@N was calculated using the top 35 returned images for the UCMD dataset, while the NWPU dataset utilized the top 350 returned images. As the number of returned images increased, both methods showed a gradual decrease in precision rates. However, PVTA_MSF outperformed PVT_v2 in both datasets in terms of precision@N. Figure 10 compares the P-R curves between PVT_v2 and PVTA_MSF. As the recall rate increased, the precision rate of both methods decreased. Nevertheless, at the same recall rate, PVTA_MSF exhibited higher precision rates than PVT_v2. Similarly, at the same precision rate, PVTA_MSF demonstrated higher recall rates than PVT_v2. Both evaluation metrics indicate that the PVTA_MSF method excels in retrieval performance, demonstrating its effectiveness and versatility in large-scale remote sensing image retrieval tasks.

4.7. Ablation Experiments

We conducted a series of ablation experiments on the UCMD and NWPU datasets to further evaluate the effectiveness of FFM, GAM, MHSA, DCN, and the ICS loss. The fifth row of Table 8 displays the retrieval accuracy with all improvements applied, which produced the highest mAP and the smallest ANMRR on both datasets. As shown in the fourth row of Table 8, removing the ICS loss decreases the mAP by 0.37% and increases the ANMRR by 0.003581 on the NWPU dataset, and decreases the mAP by 0.22% and increases the ANMRR by 0.002126 on the UCMD dataset. The third row shows that removing both the ICS loss and DCN leads to a decrease in mAP of 0.73% and an increase in ANMRR of 0.006136 on the NWPU dataset, and a decrease in mAP of 0.29% and an increase in ANMRR of 0.00316 on the UCMD dataset. The second row shows the retrieval accuracy without the ICS loss, DCN, MHSA, and GAM: on the NWPU dataset, this results in a decrease in mAP of 0.98% and an increase in ANMRR of 0.008631, while on the UCMD dataset, it leads to a decrease in mAP of 0.47% and an increase in ANMRR of 0.003877. The first row of Table 8 shows the retrieval accuracy without FFM, the ICS loss, DCN, MHSA, and GAM, which results in a decrease in mAP of 2.62% and an increase in ANMRR of 0.021852 on the NWPU dataset, and a reduction in mAP of 1.22% and an increase in ANMRR of 0.009013 on the UCMD dataset. The results in Table 8 demonstrate that excluding any improvement module lowers the retrieval accuracy, with the mAP decreasing and the ANMRR increasing progressively. Notably, removing the feature fusion module results in the largest drop in accuracy, providing evidence of the effectiveness of our proposed method. Integrating the aforementioned improvements enhances the retrieval performance of the model features.

4.8. Visualization Experiment

In this research, we utilized the Grad-CAM visualization approach [58] to obtain feature heatmaps of remote sensing images, including categories such as airplane, island, ship, stadium, thermal power station, church, dense residential, and roundabout. In Figure 11, the first row illustrates the input image, the second row displays the feature heatmap extracted using the PVT_v2 original model, and the third row shows the feature heatmap obtained through the PVTA_MSF method. The feature heatmap indicates higher feature response values in parts closer to red and lower feature response values in parts closer to blue. The feature heatmap extracted by PVT_v2 failed to accurately distinguish between targets and backgrounds. The widespread red areas on the heatmap represented the model paying attention to irrelevant background areas around the target, instead of focusing exclusively on the target. Conversely, PVTA_MSF, our improved method, outperformed PVT_v2 by effectively distinguishing targets from backgrounds, delimiting target contours more clearly, and minimizing the effect of complex background information on the targets. These effects were particularly noticeable in categories such as island and stadium, where the PVTA_MSF feature heatmap accurately focused on the target by suppressing most of the interference from background information that would otherwise contaminate the heatmap. This evidence supports the conclusion that our proposed improvement method offers better retrieval performance than the initial model.

5. Discussion

In this paper, we proposed a multi-scale feature fusion module to enhance the feature representation ability of remote sensing images. Our approach is in line with previous studies on multi-scale feature fusion for CBRSIR, such as IRMFRCAMF [20] and CNN_RFM [35], which have demonstrated that fusing low-level and high-level features can improve remote sensing retrieval performance. Our proposed feature fusion module is based on the Transformer network and performs feature extraction and fusion simultaneously, enabling end-to-end learning of the fused features. We used high-level features with rich semantic information to weight the low-level features and then concatenated the weighted low-level features with the high-level features to maximize the use of detailed information while filtering out irrelevant information. The experimental results in Table 3 show that our proposed feature fusion module achieves better retrieval performance than other fusion methods that separate the feature extraction process from the feature fusion process. Although we propose this feature fusion module for CBRSIR tasks, it can be applied to other downstream tasks, such as image classification, object detection, and semantic segmentation, because multi-scale fused features can better describe images.
Although the method proposed in this paper demonstrated good performance for CBRSIR, some limitations remain. First, we expect the proposed multi-scale fusion feature to retrieve quickly because a hash feature can be matched faster than a real-valued feature of the same dimension; however, we did not conduct an experimental analysis of retrieval speed. In future research, we will compare the retrieval speed of our model with that of other models in detail. Second, the PVTv2 model is available in several sizes, but we analyzed only one version in this study; the other versions require further analysis.
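To make the retrieval-speed argument concrete, the sketch below shows how binary hash codes can be packed into bytes and ranked with XOR plus bit counting, which is typically much cheaper than floating-point distance computation at the same dimensionality. This is illustrative code under those assumptions, not the benchmarking setup of a future speed study.

```python
import numpy as np

def pack_codes(features):
    """Binarize real-valued hash outputs by sign and pack bits into uint8."""
    bits = (np.asarray(features) > 0).astype(np.uint8)      # (N, n_bits)
    return np.packbits(bits, axis=1)                        # (N, n_bits // 8)

def hamming_rank(query_code, db_codes):
    """Rank database items by Hamming distance to the query (ascending)."""
    xor = np.bitwise_xor(db_codes, query_code)              # broadcast over N items
    dist = np.unpackbits(xor, axis=1).sum(axis=1)           # popcount per item
    return np.argsort(dist, kind="stable")

db = pack_codes(np.random.randn(10000, 64))   # 10k database images, 64-bit codes
q = pack_codes(np.random.randn(1, 64))[0]
order = hamming_rank(q, db)                   # indices of the nearest codes first
```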

6. Conclusions

This paper proposes PVTA_MSF, an end-to-end deep hash feature extraction model for remote sensing image retrieval. The proposed multi-scale feature fusion module weights and fuses the multi-layer features of the Transformer network to enhance the representation ability of the fused remote sensing image features. The experimental results show that, compared with other methods, the improved PVTA_MSF model increased the mAP from 94.6% to 98.8% on the UCMD dataset, an increase of 4.2%, and from 90.3% to 91.9% on the NWPU dataset, an increase of 1.6%. In addition, the intra-class similarity loss function strengthens the discriminative power of the hash features. The ablation experiments show that incorporating the multi-scale feature fusion module and the intra-class similarity loss function raised the mAP from 97.54% to 98.76% on the UCMD dataset and from 89.31% to 91.93% on the NWPU dataset. By combining these improvements, we achieved a significant gain in retrieval performance. The use of hashing improves retrieval efficiency and reduces memory consumption while maintaining accuracy, making the method suitable for large-scale remote sensing image retrieval. Although the retrieval performance is improved, the method requires a large number of manually labeled training samples; future research should therefore focus on reducing the required training samples through approaches such as self-supervision and few-shot learning.
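As a purely illustrative companion to this conclusion (the exact ICS loss formulation is defined earlier in the paper and is not reproduced here), a loss that minimizes pairwise distances among same-category hash features within a batch could be sketched as follows; the function name and batch setup are hypothetical.

```python
import torch

def intra_class_distance_penalty(features, labels):
    """Mean squared pairwise distance between features sharing a label.
    Illustrative only; not the exact ICS loss of this paper."""
    dists = torch.cdist(features, features, p=2) ** 2          # (B, B)
    same = labels.unsqueeze(0) == labels.unsqueeze(1)           # same-class mask
    same = same & ~torch.eye(len(labels), dtype=torch.bool, device=labels.device)
    if same.any():
        return dists[same].mean()
    return (features * 0.0).sum()                               # no same-class pair

feats = torch.randn(8, 64, requires_grad=True)   # batch of hash-layer outputs
labels = torch.randint(0, 3, (8,))
loss = intra_class_distance_penalty(feats, labels)
loss.backward()
```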

Author Contributions

Conceptualization, F.Y.; methodology, F.Y. and K.W.; software, K.W.; validation, M.W., X.M. and R.Z.; resources, F.Y.; writing, original draft preparation, K.W.; writing, review and editing, F.Y. and D.L.; supervision, F.Y. and D.L.; funding acquisition, D.L. and F.Y. All authors have read and agreed to the published version of the manuscript.

Funding

This research was supported by the National Natural Science Foundation of China (No. 41801288), the Natural Science Foundation of Jiangxi Province (No. 20202BABL202030), and the Key Laboratory of Mine Environmental Monitoring and Improving around Poyang Lake of the Ministry of Natural Resources (No. MEMI-2021-2022-22).

Data Availability Statement

Not applicable.

Acknowledgments

The authors would like to show their gratitude to the editors and the anonymous reviewers for their comments and suggestions.

Conflicts of Interest

The authors declare no conflict of interest.

References

  1. Tang, X.; Yang, Y.; Ma, J.; Cheung, Y.M.; Liu, C.; Liu, F.; Zhang, X.; Jiao, L. Meta-Hashing for Remote Sensing Image Retrieval. IEEE Trans. Geosci. Remote Sens. 2022, 60, 5615419. [Google Scholar] [CrossRef]
  2. Yuan, Z.; Zhang, W.; Fu, K.; Li, X.; Deng, C.; Wang, H.; Sun, X. Exploring a fine-grained multiscale method for cross-modal remote sensing image retrieval. arXiv 2022, arXiv:2204.09868. [Google Scholar] [CrossRef]
  3. Ye, F.; Luo, W.; Dong, M.; He, H.; Min, W. SAR Image retrieval based on unsupervised domain adaptation and clustering. IEEE Geosci. Remote Sens. Lett. 2019, 16, 1482–1486. [Google Scholar] [CrossRef]
  4. Sumbul, G.; Ravanbakhsh, M.; Demir, B. Informative and Representative Triplet Selection for Multilabel Remote Sensing Image Retrieval. IEEE Trans. Geosci. Remote Sens. 2022, 60, 5405811. [Google Scholar] [CrossRef]
  5. Zhuo, Z.; Zhou, Z. Remote Sensing Image Retrieval with Gabor-CA-ResNet and Split-Based Deep Feature Transform Network. Remote Sens. 2021, 13, 869. [Google Scholar] [CrossRef]
  6. Mehmood, M.; Shahzad, A.; Zafar, B.; Shabbir, A.; Ali, N. Remote sensing image classification: A comprehensive review and application. Math. Probl. Eng. 2022, 2022, 5880959. [Google Scholar] [CrossRef]
  7. Ma, J.; Shi, D.; Tang, X.; Zhang, X.; Jiao, L. Dual Modality Collaborative Learning for Cross-Source Remote Sensing Retrieval. Remote Sens. 2022, 14, 1319. [Google Scholar] [CrossRef]
  8. Shabbir, A.; Ali, N.; Ahmed, J.; Zafar, B.; Rasheed, A.; Sajid, M.; Ahmed, A.; Dar, S.H. Satellite and scene image classification based on transfer learning and fine tuning of ResNet50. Math. Probl. Eng. 2021, 2021, 5843816. [Google Scholar] [CrossRef]
  9. Wang, Y.; Ji, S.; Lu, M.; Zhang, Y. Attention boosted bilinear pooling for remote sensing image retrieval. Int. J. Remote Sens. 2020, 41, 2704–2724. [Google Scholar] [CrossRef]
  10. Bo, L.; Sminchisescu, C. Efficient match kernel between sets of features for visual recognition. Adv. Neural Inf. Process. Syst. 2009, 22, 135–143. [Google Scholar]
  11. Ye, F.; Su, Y.; Xiao, H.; Zhao, X.; Min, W. Remote Sensing Image Registration Using Convolutional Neural Network Features. IEEE Geosci. Remote Sens. Lett. 2018, 15, 232–236. [Google Scholar] [CrossRef]
  12. Ye, F.; Luo, W.; Dong, M.; Li, D.; Min, W. Content-based Remote Sensing Image Retrieval Based on Fuzzy Rules and a Fuzzy Distance. IEEE Geosci. Remote Sens. Lett. 2022, 19, 8002505. [Google Scholar] [CrossRef]
  13. Kumar, A.; Yadav, D.P.; Kumar, D.; Pant, M.; Pant, G. Multi-scale feature fusion-based lightweight dual stream transformer for detection of paddy leaf disease. Environ. Monit. Assess. 2023, 195, 1020. [Google Scholar] [CrossRef]
  14. Ghaderizadeh, S.; Abbasi-Moghadam, D.; Sharifi, A.; Tariq, A.; Qin, S. Multiscale Dual-Branch Residual Spectral-Spatial Network With Attention for Hyperspectral Image Classification. IEEE J. Sel. Top. Appl. Earth Obs. Remote Sens. 2022, 15, 5455–5467. [Google Scholar] [CrossRef]
  15. Chen, H.; Guo, X. Multi-scale feature fusion pedestrian detection algorithm based on Transformer. In Proceedings of the 2023 4th International Conference on Computer Vision, Image and Deep Learning (CVIDL), Zhuhai, China, 12–14 May 2023; pp. 536–540. [Google Scholar]
  16. Dosovitskiy, A.; Beyer, L.; Kolesnikov, A.; Weissenborn, D.; Zhai, X.; Unterthiner, T.; Dehghani, M.; Minderer, M.; Heigold, G.; Gelly, S.; et al. An Image is Worth 16x16 Words: Transformers for Image Recognition at Scale. arXiv 2020, arXiv:2010.11929. [Google Scholar]
  17. Liu, Z.; Lin, Y.; Cao, Y.; Hu, H.; Wei, Y.; Zhang, Z.; Lin, S.; Guo, B. Swin Transformer: Hierarchical Vision Transformer using Shifted Windows. arXiv 2021, arXiv:2103.14030. [Google Scholar]
  18. Wang, W.; Xie, E.; Li, X.; Fan, D.; Song, K.; Liang, D.; Lu, T.; Luo, P.; Shao, L. Pyramid Vision Transformer: A Versatile Backbone for Dense Prediction without Convolutions. arXiv 2021, arXiv:2102.12122. [Google Scholar]
  19. Wang, W.; Xie, E.; Li, X.; Fan, D.; Song, K.; Liang, D.; Lu, T.; Luo, P.; Shao, L. PVT v2: Improved baselines with Pyramid Vision Transformer. Comput. Vis. Media 2022, 8, 415–424. [Google Scholar] [CrossRef]
  20. Li, Y.; Zhang, Y.; Tao, C.; Zhu, H. Content-Based High-Resolution Remote Sensing Image Retrieval via Unsupervised Feature Learning and Collaborative Affinity Metric Fusion. Remote Sens. 2016, 8, 709. [Google Scholar] [CrossRef]
  21. Imbriaco, R.; Sebastian, C.; Bondarev, E. Aggregated Deep Local Features for Remote Sensing Image Retrieval. Remote Sens. 2019, 11, 493. [Google Scholar] [CrossRef]
  22. Hou, D.; Miao, Z.; Xing, H.; Wu, H. Exploiting low dimensional features from the MobileNets for remote sensing image retrieval. Earth Sci. Inform. 2020, 13, 1437–1443. [Google Scholar] [CrossRef]
  23. Wang, Y.; Ji, S.; Zhang, Y. A learnable joint spatial and spectral transformation for high resolution remote sensing image retrieval. IEEE J. Sel. Top. Appl. Earth Obs. Remote Sens. 2021, 14, 8100–8112. [Google Scholar] [CrossRef]
  24. Wu, Z.; Zou, C.; Wang, Y.; Tan, M.; Weise, T. Rotation-Aware Representation Learning for Remote Sensing Image Retrieval. Inf. Sci. 2021, 572, 404–423. [Google Scholar] [CrossRef]
  25. Li, Y.; Zhang, Y.; Xin, H.; Hu, Z.; Ma, J. Large-Scale Remote Sensing Image Retrieval by Deep Hashing Neural Networks. IEEE Trans. Geosci. Remote Sens. 2017, 56, 950–965. [Google Scholar] [CrossRef]
  26. Roy, S.; Sangineto, E.; Demir, B.; Sebe, N. Metric-Learning based Deep Hashing Network for Content Based Retrieval of Remote Sensing Images; Cornell University: Ithaca, NY, USA, 2019. [Google Scholar]
  27. Liu, C.; Ma, J.; Tang, X.; Zhang, X.; Jiao, L. Adversarial hash-code learning for remote sensing image retrieval. In Proceedings of the IGARSS 2019—2019 IEEE International Geoscience and Remote Sensing Symposium, Yokohama, Japan, 28 July–2 August 2019; pp. 4324–4327. [Google Scholar]
  28. Cheng, Q.; Huang, H.; Ye, L.; Fu, P.; Gan, D.; Zhou, Y. A Semantic-Preserving Deep Hashing Model for Multi-Label Remote Sensing Image Retrieval. Remote Sens. 2021, 13, 4965. [Google Scholar] [CrossRef]
  29. Tan, X.; Zou, Y.; Guo, Z.; Zhou, K.; Yuan, Q. Deep Contrastive Self-Supervised Hashing for Remote Sensing Image Retrieval. Remote Sens. 2022, 14, 3643. [Google Scholar] [CrossRef]
  30. Jing, W.; Xu, Z.; Li, L.; Wang, J.; He, Y.; Chen, G. Deep Unsupervised Weighted Hashing for Remote Sensing Image Retrieval. J. Database Manag. (JDM) 2022, 33, 1–19. [Google Scholar] [CrossRef]
  31. Yang, K.; Li, C.; Zhou, W.; Cheng, Q.; Ren, Y. Remote sensing image retrieval based on multi-layer feature integration of convolution neural networks. Sci. Surv. Mapp. 2019, 44, 9–15. [Google Scholar]
  32. Li, Y.; Wang, Q.; Liang, X.; Jiao, L. A Novel Deep Feature Fusion Network for Remote Sensing Scene Classification. In Proceedings of the IGARSS 2019—2019 IEEE International Geoscience and Remote Sensing Symposium, Yokohama, Japan, 28 July–2 August 2019; pp. 5484–5487. [Google Scholar]
  33. Yin, W.; Zhang, Y.; Sun, X.; Fu, K. A Image Retrieval Method in High-resolution Remote Sensing Images based on Deep Descriptor Fusion. In Proceedings of the Fifth Annual Symposium on High Resolution Earth Observation, Xian, China, 17 October–18 October 2018; pp. 893–908. [Google Scholar]
  34. Alhichri, H.; Alajlan, N.; Bazi, Y.; Rabczuk, T. Multi-Scale Convolutional Neural Network for Remote Sensing Scene Classification. In Proceedings of the 2018 IEEE International Conference on Electro/Information Technology (EIT), Rochester, MI, USA, 3–5 May 2018; pp. 1–5. [Google Scholar]
  35. Vharkate, M.N.; Musande, V.B. Fusion Based Feature Extraction and Optimal Feature Selection in Remote Sensing Image Retrieval. Multimed. Tools Appl. 2022, 81, 31787–31814. [Google Scholar] [CrossRef]
  36. Hadsell, R.; Chopra, S.; LeCun, Y. Dimensionality reduction by learning an invariant mapping. In Proceedings of the 2006 IEEE Computer Society Conference on Computer Vision and Pattern Recognition (CVPR’06), New York, NY, USA, 17–22 June 2006; pp. 1735–1742. [Google Scholar]
  37. Hoffer, E.; Ailon, N. Deep metric learning using triplet network. In Proceedings of the International Workshop on Similarity-Based Pattern Recognition, Copenhagen, Denmark, 12–14 October 2015; pp. 84–92. [Google Scholar]
  38. Sohn, K. Improved deep metric learning with multi-class n-pair loss objective. In Proceedings of the Advances in Neural Information Processing Systems, Barcelona, Spain, 5–10 December 2016; pp. 1857–1865. [Google Scholar]
  39. Movshovitz-Attias, Y.; Toshev, A.; Leung, T.K.; Ioffe, S.; Singh, S. No Fuss Distance Metric Learning Using Proxies. In Proceedings of the 2017 IEEE International Conference on Computer Vision (ICCV), Venice, Italy, 22–29 October 2017; pp. 360–368. [Google Scholar]
  40. Qian, Q.; Shang, L.; Sun, B.; Hu, J.; Tacoma, T.; Li, H.; Jin, R. SoftTriple Loss: Deep Metric Learning Without Triplet Sampling. In Proceedings of the 2019 IEEE/CVF International Conference on Computer Vision (ICCV), Seoul, Republic of Korea, 27 October–3 November 2019. [Google Scholar]
  41. Liu, P.; Gou, G.; Shan, X.; Tao, D.; Zhou, Q. Global Optimal Structured Embedding Learning for Remote Sensing Image Retrieval. Sensors 2020, 20, 291. [Google Scholar] [CrossRef]
  42. Shan, X.; Liu, P.; Wang, Y.; Zhou, Q.; Wang, Z. Deep Hashing Using Proxy Loss on Remote Sensing Image Retrieval. Remote Sens. 2021, 13, 2924. [Google Scholar] [CrossRef]
  43. Liu, Y.; Shao, Z.; Hoffmann, N. Global Attention Mechanism: Retain Information to Enhance Channel-Spatial Interactions. arXiv 2021, arXiv:2112.05561. [Google Scholar]
  44. Woo, S.; Park, J.; Lee, J.Y.; Kweon, I.S. CBAM: Convolutional Block Attention Module. arXiv 2018, arXiv:1807.06521. [Google Scholar]
  45. Vaswani, A.; Shazeer, N.; Parmar, N.; Uszkoreit, J.; Jones, L.; Gomez, A.N.; Kaiser, Ł.; Polosukhin, I. Attention is all you need. In Advances in Neural Information Processing Systems; MIT Press: Cambridge, MA, USA, 2017; pp. 5998–6008. [Google Scholar]
  46. Fan, D.-P.; Ji, G.-P.; Zhou, T.; Chen, G.; Fu, H.; Shen, J.; Shao, L. Pranet: Parallel Reverse Attention Network for Polyp Segmentation. In International Conference on Medical Image Computing and Computer-Assisted Intervention; Springer: Berlin/Heidelberg, Germany, 2020; pp. 263–273. [Google Scholar]
  47. Wu, Z.; Su, L.; Huang, Q. Cascaded partial decoder for fast and accurate salient object detection. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Long Beach, CA, USA, 15–20 June 2019; pp. 3907–3916. [Google Scholar]
  48. Dong, B.; Wang, W.; Fan, D.P.; Li, J.; Fu, H.; Shao, L. Polyp-pvt: Polyp segmentation with pyramid vision transformers. arXiv 2021, arXiv:2108.06932. [Google Scholar] [CrossRef]
  49. Ioffe, S.; Szegedy, C. Batch Normalization: Accelerating Deep Network Training by Reducing Internal Covariate Shift. arXiv 2015, arXiv:1502.03167. [Google Scholar]
  50. Glorot, X.; Bordes, A.; Bengio, Y. Deep Sparse Rectifier Neural Networks. J. Mach. Learn. Res. 2011, 15, 315–323. [Google Scholar]
  51. Dai, J.; Qi, H.; Xiong, Y.; Li, Y.; Zhang, G.; Hu, H.; Wei, Y. Deformable convolutional networks. In Proceedings of the IEEE International Conference on Computer Vision, Venice, Italy, 22–29 October 2017; pp. 764–773. [Google Scholar]
  52. Yang, Y.; Newsam, S. Geographic image retrieval using local invariant features. IEEE Trans. Geosci. Remote Sens. 2013, 51, 818–832. [Google Scholar] [CrossRef]
  53. Cheng, G.; Han, J.; Lu, X. Remote sensing image scene classification: Benchmark and state of the art. Proc. IEEE 2017, 105, 1865–1883. [Google Scholar] [CrossRef]
  54. Tang, X.; Zhang, X.; Liu, F.; Jiao, L. Unsupervised deep feature learning for remote sensing image retrieval. Remote Sens. 2018, 10, 1243. [Google Scholar] [CrossRef]
  55. He, K.; Zhang, X.; Ren, S.; Sun, J. Deep Residual Learning for Image Recognition. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), Las Vegas, NV, USA, 27–30 June 2016. [Google Scholar]
  56. Cheng, G.; Yang, C.; Yao, X.; Guo, L.; Han, J. When deep learning meets metric learning: Remote sensing image scene classification via learning discriminative CNNs. IEEE Trans. Geosci. Remote Sens. 2018, 56, 2811–2821. [Google Scholar] [CrossRef]
  57. Li, X.; Wei, S.; Wang, J.; Du, Y.; Ge, M. Adaptive Multi-Proxy for Remote Sensing Image Retrieval. Remote Sens. 2022, 14, 5615. [Google Scholar] [CrossRef]
  58. Selvaraju, R.R.; Cogswell, M.; Das, A.; Vedantam, R.; Parikh, D.; Batra, D. Grad-cam: Visual explanations from deep networks via gradient-based localization. In Proceedings of the IEEE International Conference on Computer Vision, Venice, Italy, 22–29 October 2017; pp. 618–626. [Google Scholar]
Figure 1. The network structure of the model. (a) the multi-scale feature fusion module, and (b) the loss function module.
Figure 2. Global attention mechanism.
Figure 3. Channel attention submodule.
Figure 4. Spatial attention submodule.
Figure 5. Multi-head self-attention mechanism.
Figure 6. Feature fusion module.
Figure 7. Sampling methods of the 3 × 3 standard convolution kernel and deformable convolution. (a) A standard convolution; (b–d) deformable convolutions.
Figure 8. Schematic diagram of the calculation process of ICS loss.
Figure 9. Comparison of Precision@N on the UCMD dataset and NWPU dataset.
Figure 10. Comparison of P-R curves on the UCMD dataset and NWPU dataset.
Figure 11. Feature heatmaps of images from multiple categories. The regions in the image that are closer to red have higher feature response values, while the regions closer to blue have lower feature response values.
Table 1. Parameters of the convolution layers in the feature fusion module (FFM).

| Layer | Parameters | Layer | Parameters |
|---|---|---|---|
| C1 | (64, 64, 3, 1) | C6 | (144, 144, 3, 1) |
| C2 | (128, 128, 3, 1) | C7 | (64, 8, 3, 1) |
| C3 | (64, 16, 3, 1) | C8 | (64, 8, 3, 1) |
| C4 | (64, 16, 3, 1) | C9 | (16, 8, 3, 1) |
| C5 | (128, 128, 3, 1) | C10 | (152, 152, 3, 1) |
Table 2. Comparison of the mAP (%) of different methods on the UCMD and NWPU datasets. The best results are highlighted in bold.

| Method | UCMD | NWPU |
|---|---|---|
| ResNet50 [55] | 81.6 | 79.8 |
| DBOW [54] | 83.0 | 82.1 |
| GOSLm [41] | 85.8 | 90.3 |
| D-CNN [56] | 87.4 | 73.6 |
| V-DELF [21] | 91.6 | 85.7 |
| JSST [23] | 92.9 | 80.4 |
| MobileNets [22] | - | 83.1 |
| Multi-Proxy [57] | 94.6 | - |
| PVTA_MSF | 98.8 | 91.9 |
Table 3. Comparison with other multi-feature fusion methods on the UCMD dataset. The best results are highlighted in bold.

| Method | mAP (%) | ANMRR |
|---|---|---|
| VD16 [31] | 51.05 | 0.4057 |
| Fused mean-max [33] | 71.20 | 0.5169 |
| IRMFRCAMF [20] | 76.57 | - |
| Fused CNN_RFM [35] | 95.07 | 0.0820 |
| PVTA_MSF | 98.76 | 0.0111 |
Table 4. Comparison of class mAP with other methods on the UCMD dataset. The best results are highlighted in bold.

| Categories | ResNet50 [55] | D-CNN [56] | DBOW [54] | V-DELF [21] | GOSLm [41] | PVT_v2 | PVTA_MSF |
|---|---|---|---|---|---|---|---|
| Agriculture | 85 | 99 | 92 | 80 | 95 | 100 | 100 |
| Airplane | 93 | 98 | 95 | 97 | 82 | 100 | 100 |
| Baseball | 73 | 82 | 87 | 77 | 90 | 97 | 100 |
| Beach | 99 | 99 | 88 | 94 | 92 | 96 | 99 |
| Buildings | 74 | 80 | 93 | 85 | 78 | 100 | 100 |
| Chaparral | 95 | 100 | 94 | 100 | 95 | 88 | 95 |
| Dense | 62 | 65 | 96 | 90 | 55 | 100 | 100 |
| Forest | 87 | 99 | 99 | 98 | 95 | 98 | 96 |
| Freeway | 69 | 92 | 78 | 99 | 83 | 100 | 100 |
| Golf | 73 | 93 | 85 | 83 | 92 | 94 | 95 |
| Harbor | 97 | 100 | 95 | 100 | 95 | 99 | 100 |
| Intersection | 81 | 79 | 77 | 86 | 80 | 93 | 100 |
| Medium-density | 80 | 69 | 74 | 92 | 59 | 100 | 100 |
| Mobile | 74 | 89 | 76 | 94 | 80 | 100 | 100 |
| Overpass | 97 | 82 | 86 | 99 | 78 | 100 | 100 |
| Parking | 92 | 99 | 67 | 99 | 95 | 100 | 100 |
| River | 66 | 88 | 74 | 87 | 86 | 94 | 97 |
| Runway | 93 | 98 | 66 | 99 | 91 | 100 | 100 |
| Sparse | 69 | 83 | 79 | 79 | 91 | 91 | 92 |
| Storage | 86 | 60 | 50 | 93 | 95 | 100 | 100 |
| Tennis | 70 | 83 | 94 | 94 | 95 | 98 | 100 |
| Average | 81.6 | 87.4 | 83.0 | 91.6 | 85.8 | 97.5 | 98.8 |
Table 5. Comparison of class mAP with other methods on the NWPU dataset. The best results are highlighted in bold.

| Categories | ResNet50 [55] | D-CNN [56] | DBOW [54] | V-DELF [21] | GOSLm [41] | PVT_v2 | PVTA_MSF |
|---|---|---|---|---|---|---|---|
| Airplane | 88 | 82 | 98 | 93 | 96 | 94 | 97 |
| Airport | 72 | 64 | 95 | 81 | 90 | 93 | 93 |
| Baseball Diamond | 69 | 61 | 86 | 64 | 93 | 92 | 98 |
| Basketball Court | 61 | 59 | 83 | 71 | 90 | 89 | 96 |
| Beach | 77 | 78 | 85 | 83 | 96 | 91 | 96 |
| Bridge | 73 | 79 | 95 | 81 | 93 | 92 | 92 |
| Chaparral | 98 | 99 | 96 | 99 | 98 | 98 | 99 |
| Church | 56 | 39 | 80 | 64 | 64 | 73 | 74 |
| Circular Farmland | 97 | 88 | 94 | 99 | 97 | 97 | 97 |
| Cloud | 92 | 93 | 98 | 98 | 98 | 98 | 99 |
| Commercial Area | 82 | 59 | 79 | 88 | 78 | 77 | 88 |
| Dense Residential | 89 | 76 | 90 | 95 | 92 | 78 | 88 |
| Desert | 87 | 89 | 97 | 92 | 90 | 90 | 93 |
| Forest | 95 | 94 | 95 | 97 | 94 | 90 | 94 |
| Freeway | 65 | 64 | 64 | 86 | 88 | 73 | 80 |
| Golf Course | 96 | 86 | 82 | 97 | 96 | 98 | 99 |
| Ground Track Field | 63 | 62 | 80 | 77 | 96 | 93 | 97 |
| Harbor | 93 | 90 | 88 | 97 | 99 | 99 | 99 |
| Industrial Area | 75 | 65 | 85 | 88 | 90 | 79 | 86 |
| Intersection | 64 | 58 | 80 | 72 | 97 | 91 | 89 |
| Island | 88 | 90 | 88 | 94 | 93 | 89 | 90 |
| Lake | 80 | 75 | 85 | 85 | 89 | 92 | 92 |
| Meadow | 84 | 88 | 90 | 93 | 93 | 84 | 90 |
| Medium Residential | 78 | 64 | 94 | 77 | 82 | 77 | 83 |
| Mobile Home Park | 93 | 85 | 83 | 97 | 94 | 94 | 96 |
| Mountain | 88 | 85 | 95 | 96 | 86 | 90 | 90 |
| Overpass | 87 | 67 | 74 | 90 | 95 | 94 | 93 |
| Palace | 41 | 30 | 80 | 56 | 51 | 58 | 68 |
| Parking Lot | 95 | 90 | 70 | 97 | 98 | 95 | 97 |
| Railway | 88 | 81 | 84 | 89 | 77 | 86 | 88 |
| Railway Station | 62 | 55 | 86 | 73 | 81 | 81 | 81 |
| Rectangular Farmland | 82 | 79 | 66 | 88 | 86 | 85 | 90 |
| River | 70 | 59 | 76 | 75 | 90 | 85 | 83 |
| Roundabout | 72 | 75 | 83 | 90 | 95 | 96 | 96 |
| Runway | 80 | 81 | 78 | 89 | 90 | 87 | 92 |
| Sea Ice | 98 | 97 | 90 | 99 | 99 | 99 | 99 |
| Ship | 61 | 73 | 65 | 69 | 95 | 95 | 98 |
| Snowberg | 97 | 96 | 83 | 98 | 99 | 98 | 98 |
| Sparse Residential | 69 | 71 | 84 | 70 | 93 | 91 | 94 |
| Stadium | 81 | 62 | 57 | 86 | 92 | 95 | 95 |
| Storage Tank | 88 | 86 | 48 | 94 | 98 | 98 | 99 |
| Tennis Court | 80 | 38 | 72 | 78 | 95 | 95 | 95 |
| Terrace | 88 | 83 | 76 | 90 | 89 | 91 | 90 |
| Thermal Power Station | 68 | 47 | 72 | 78 | 89 | 93 | 95 |
| Wetland | 82 | 71 | 70 | 80 | 85 | 86 | 91 |
| Average | 79.8 | 73.6 | 82.1 | 85.7 | 90.3 | 89.3 | 91.9 |
Table 6. Comparison of mAP (%) at different feature dimensions on the UCMD dataset. The best results are highlighted in bold.

| Method | 1024 | 512 | 256 | 128 | 64 | 32 |
|---|---|---|---|---|---|---|
| GOSLm [41] | 85.6 | 85.8 | 85.1 | 85.0 | 84.4 | - |
| V-DELF [21] | 89.4 | 89.8 | 90.3 | 90.7 | 91.2 | 90.7 |
| PVT_v2 | 97.5 | 97.5 | 97.7 | 97.6 | 97.4 | 97.3 |
| PVTA_MSF | 98.8 | 98.6 | 98.7 | 98.5 | 98.6 | 98.1 |
Table 7. Comparison of mAP (%) at different feature dimensions on the NWPU dataset. The best results are highlighted in bold.

| Method | 1024 | 512 | 256 | 128 | 64 | 32 |
|---|---|---|---|---|---|---|
| GOSLm [41] | 88.8 | 90.3 | 88.6 | 88.2 | 87.9 | - |
| V-DELF [21] | 85.8 | 86.2 | 86.5 | 86.6 | 85.9 | 81.8 |
| MobileNets [22] | 71.5 | 73.1 | 75.4 | 77.8 | 80.4 | 83.1 |
| PVT_v2 | 89.3 | 89.3 | 89.4 | 89.2 | 89.1 | 88.7 |
| PVTA_MSF | 91.9 | 91.7 | 91.6 | 91.2 | 90.5 | 89.1 |
Table 8. The ablation results of PVTA_MSF on the UCMD and NWPU datasets. The best results are highlighted in bold.

| Method | UCMD mAP (%) | UCMD ANMRR | NWPU mAP (%) | NWPU ANMRR |
|---|---|---|---|---|
| Backbone | 97.54 | 0.020115 | 89.31 | 0.082980 |
| Backbone + FFM | 98.29 | 0.014979 | 90.95 | 0.069759 |
| Backbone + FFM + GAM + MHSA | 98.47 | 0.014262 | 91.20 | 0.067264 |
| Backbone + FFM + GAM + MHSA + DCN | 98.54 | 0.013228 | 91.56 | 0.064709 |
| Backbone + FFM + GAM + MHSA + DCN + ICS loss | 98.76 | 0.011102 | 91.93 | 0.061128 |
Disclaimer/Publisher’s Note: The statements, opinions and data contained in all publications are solely those of the individual author(s) and contributor(s) and not of MDPI and/or the editor(s). MDPI and/or the editor(s) disclaim responsibility for any injury to people or property resulting from any ideas, methods, instructions or products referred to in the content.
