Article

Multi-Scale Feature Fusion Based on PVTv2 for Deep Hash Remote Sensing Image Retrieval

1 Key Laboratory of Mine Environmental Monitoring and Improving around Poyang Lake of Ministry of Natural Resources, East China University of Technology, Nanchang 330013, China
2 School of Surveying and Geoinformation Engineering, East China University of Technology, Nanchang 330013, China
3 Geographic Information Engineering Brigade of Geological Bureau of Jiangxi Provincial, Nanchang 330002, China
* Author to whom correspondence should be addressed.
Remote Sens. 2023, 15(19), 4729; https://doi.org/10.3390/rs15194729
Submission received: 10 July 2023 / Revised: 20 September 2023 / Accepted: 22 September 2023 / Published: 27 September 2023

Abstract

For high-resolution remote sensing image retrieval tasks, single-scale features cannot fully express the complexity of the image information. Due to the large volume of remote sensing images, retrieval requires extensive memory and time. Hence, the problem of how to organically fuse multi-scale features and enhance retrieval efficiency is yet to be resolved. We propose an end-to-end deep hash remote sensing image retrieval model (PVTA_MSF) by fusing multi-scale features based on the Pyramid Vision Transformer network (PVTv2). We construct a multi-scale feature fusion module (MSF) that uses a global attention mechanism and a multi-head self-attention mechanism to reduce background interference and enhance the representation capability of image features. Deformable convolution is introduced to address the challenge posed by varying target orientations. Moreover, an intra-class similarity (ICS) loss is proposed to enhance the discriminative capability of the hash feature by minimizing the distance among images of the same category. The experimental results show that, compared with other state-of-the-art methods, the proposed hash feature yields an excellent representation of remote sensing images and improves retrieval accuracy, with mAP gains of 4.2% and 1.6% on the UC Merced and NWPU-RESISC45 datasets, respectively.

Graphical Abstract

1. Introduction

With the rapid advancement of Earth observation technology, the number of remote sensing satellites has increased significantly, resulting in rapid growth in the volume of remote sensing images [1]. Effectively locating and retrieving the desired remote sensing images from massive databases, as well as efficiently managing and utilizing the remote sensing image data, pose formidable challenges [2]. Remote sensing image retrieval (RSIR) aims to retrieve the required remote sensing images accurately and efficiently from extensive databases and can be categorized into text-based RSIR and content-based RSIR (CBRSIR) [3]. Text-based RSIR retrieves tagged images from the remote sensing database based on query keywords or labels, but it requires extensive manual annotation of each image in the dataset during the initial phase. CBRSIR, on the other hand, performs image retrieval by searching for images in the database that closely resemble the query image. This approach aligns closely with human visual perception and is currently the dominant retrieval method. However, because remote sensing images contain complex scenes and rich background information, extracting effective retrieval features and accurately measuring feature similarity remain open problems.
CBRSIR comprises three main components: feature extraction, feature dimensionality reduction, and similarity calculation. Early CBRSIR features were limited to basic patterns in the images, such as lines, shapes, and textures. These manually designed features are known as low-level features; typical examples include SIFT [4], LBP [5], and HOG [6]. Low-level features describe local image content and are aggregated into mid-level features using descriptor aggregation techniques such as BoW [7], VLAD [8], FK [9], and EMK [10]. With the development of deep learning and its introduction to image retrieval, convolutional neural networks (CNNs) [11] are typically used as feature extractors to obtain abstract features of remote sensing images [12], referred to as high-level features. However, the high dimensionality of these deep features leads to high computational costs and storage requirements. Therefore, dimensionality reduction techniques are necessary to improve retrieval speed and minimize memory usage. Various studies have shown that encoding or pooling methods can achieve this reduction. One such technique is hashing, which produces binary hash codes through coding, significantly reducing retrieval time and memory use.
The primary challenge in CBRSIR is the vast area covered by remote sensing images, which depict multiple object categories and complex background information. Retrieval accuracy is affected by the high similarity between images of different categories, significant differences between images of the same category, and the diversity of target orientations. Remote sensing images can be represented from various perspectives using features at different scales. Multi-scale feature fusion methods have been applied in multiple domains [13,14,15], such as hyperspectral image classification [14] and pedestrian detection [15], and have demonstrated significant effectiveness. Inspired by this, some studies in CBRSIR use feature fusion techniques to overcome the limitations of single-feature expression capability by extracting multiple features from the same or different models. Nonetheless, in these methods, the feature fusion and feature extraction processes are usually separated, making it difficult to uniformly learn features at varying scales and perform end-to-end multi-feature fusion.
Recently, Transformer models have gained significant attention in the field of computer vision. Dosovitskiy et al. [16] proposed the Vision Transformer (ViT), which employs a pure Transformer-based approach and is suitable for image classification tasks. After being trained on large datasets, ViT outperformed traditional convolutional neural network (CNN) models and demonstrated stronger generalization capability. However, ViT only generates feature maps of a single resolution and has high computational complexity because global self-attention must be computed. To address these issues, Liu et al. [17] proposed the Swin Transformer, which adopts a hierarchical structure similar to CNNs and can process multi-scale images. Moreover, it employs a sliding window operation to compute attention within local windows, reducing the computational complexity from quadratic, as in ViT, to linear. Wang et al. [18] proposed the Pyramid Vision Transformer (PVT), the first Transformer-based architecture using a feature pyramid. PVT features a progressive shrinking pyramid structure and a spatial reduction attention (SRA) mechanism. Compared to ViT, PVT significantly reduces computational complexity. PVTv2 [19] further improves the original PVT by introducing overlapping patch embeddings and a linear spatial reduction attention mechanism, making the feature pyramid Transformer architecture a viable backbone network for visual tasks. Beyond image classification, Transformer models have demonstrated stronger feature extraction capabilities than CNNs in fields such as object detection, semantic segmentation, and image processing. Therefore, this paper proposes an end-to-end deep hash retrieval method for remote sensing images that fuses multi-scale features based on PVTv2. The main contributions and innovations of this paper are as follows:
(1) This paper proposes a multi-scale feature fusion (MSF) module to address the limitation of single-scale feature representation. The MSF module fuses high-level and low-level features extracted from different scales of the PVTv2 model. It uses both the global attention mechanism (GAM) and multi-head self-attention (MHSA) to organically fuse the features of four scales of the PVTv2 model. The MSF module also introduces a lightweight deformable convolution to overcome the issue of various target orientations and angle transformations in remote sensing images.
(2) To address the problem of larger computation and storage overheads caused by high-dimensional features, this paper introduces a hash encoding layer to extract deep hash features for remote sensing image retrieval in an end-to-end manner. Additionally, the multi-scale features and hash features are trained together to enhance the representation ability of hash features.
(3) To address the issue of large differences between images of the same class while achieving a balance between accuracy and computation, this paper proposes an intra-class similarity loss (ICS loss) inspired by deep metric learning (DML) loss. The ICS loss reduces the distance between same-class images and is computed in each batch of training samples to enhance the discriminative power of remote sensing image features.

2. Related Work

This section provides a comprehensive review of related works in CBRSIR, including methods that utilize CNN features and deep hashing features, feature fusion techniques to enhance feature representation, the use of deep metric learning (DML) to optimize convolutional neural networks and extract discriminative features, and an introduction to the Transformer network PVTv2.

2.1. CBRSIR Based on CNN Features

Deep features extracted from CNNs have been increasingly utilized in CBRSIR. For instance, Li et al. [20] designed four unsupervised convolutional neural networks that generate four types of deep features at different layers. By combining these deep features with traditional handcrafted features, they provided more effective features for CBRSIR. Raffaele et al. [21] extracted deep local convolutional features from fine-tuned CNN models and aggregated the local convolutional features into global descriptors using the vector of locally aggregated descriptors (VLAD). They utilized multiplication and addition attention mechanisms to overcome irrelevant background interference. Hou et al. [22] fine-tuned the MobileNet model to extract deep convolutional features and obtained low-dimensional feature representation by changing the dimension of the final fully connected layer. They compared the retrieval accuracy with the principal component analysis (PCA) method of dimensionality reduction. In cross-dataset remote sensing image retrieval, Wang et al. [23] proposed a learnable joint spatial and spectral transformation (JSST) model to correct spatial and spectral distortions in images. This model embedded the spatially and spectrally modified inputs at the front end of the ResNet34 network, thereby improving generalization and adaptability. Wu et al. [24] proposed two rotation-aware networks, namely the feature-map-transformation-based rotation-aware network (FMT-RAN) and spatial-transformer-based rotation-aware network (ST-RAN), to address the issue of images appearing at arbitrary rotation angles.
However, the aforementioned methods extract deep features from convolutional neural networks (CNNs) for retrieval without utilizing the features of Transformer models. In contrast to CNN models, Transformer models can perform global context modeling and better comprehend the semantic relationships of the entire input sequence. Therefore, they can capture global contextual information and extract richer features.

2.2. CBRSIR Based on Deep Hashing Features

Hashing has been widely used in large-scale remote sensing image retrieval due to its prominent advantages in storage and retrieval speed. Li et al. [25] proposed the deep hashing neural network (DHNN), which utilizes deep feature learning neural networks to learn high-dimensional embedding features and hash learning neural networks to learn low-dimensional hashing features. This model can be optimized end-to-end. To address the overfitting issue caused by a limited number of labeled images in remote sensing datasets, Roy et al. [26] proposed a deep hashing network based on metric learning. Liu et al. [27] introduced a deep supervised hashing model using a loss function composed of classification, similarity, and bit entropy terms based on the framework of generative adversarial networks (GANs) to learn compact and effective hash codes. Cheng et al. [28] proposed the semantic consistency deep hashing model, which applies deep hashing to multi-label remote sensing image retrieval. It introduces a paired label similarity loss that fully utilizes multi-label information, demonstrating the effectiveness of hashing methods in multi-label remote sensing image retrieval. Tan et al. [29] proposed deep contrastive self-supervised hashing for remote sensing image retrieval, which utilizes unlabeled images for training. This method assumes that hash codes generated from different views of the same image should be similar, while those generated from different images should be dissimilar. They designed a loss function to preserve the similarity of hash codes. Jing et al. [30] presented a deep unsupervised weighted hashing model that utilizes a pretrained Swin Transformer to extract feature representations. This model uses an adaptive weight-based loss function that assigns weights adaptively to positive and negative samples and combines it with quantization loss, resulting in improved model performance. Although these deep hashing methods have achieved good retrieval results, they extract single-layer features without employing methods for fusing multiple features. Single-feature extraction is insufficient to fully express the rich detailed information and semantic information of remote sensing images. The adoption of multi-feature fusion in hashing methods has the potential to improve the accuracy of remote sensing image retrieval.

2.3. Methods Based on Multi-Feature Fusion

Currently, several studies have focused on the limitations of using a single-feature representation to fully express both visual and semantic information about images. These studies employ feature fusion techniques to enhance feature discriminability. For example, Yang et al. [31] combined the benefits of mid-level and high-level features in a convolutional neural network (CNN) by fusing the convolutional and fully connected layer features, resulting in improved retrieval performance by simultaneously utilizing global and local image information. Li et al. [32] extracted high-level features from the ResNet50 and VGG16 networks and concatenated them, increasing feature representation proficiency and resulting in enhanced classification performance of remote sensing images by utilizing learned parameters from two networks. Yin et al. [33] introduced a mean-max pooling weighted fusion technique to merge high-level features, which effectively led to improved retrieval performance by enhancing feature representation proficiency. Li et al. [20] formulated four CNN models with different feature layers and developed collaborative affinity metric fusion (CAMF) to merge features from different layers and improve retrieval performance. Alhichri et al. [34] utilized three pre-trained SqueezeNet models that could take input images of various scales and fuse the outputs of three CNN models in a cascaded manner. Minakshi et al. [35] invented a fused CNN architecture that merged features extracted from VGG16, VGG19, and ResNet to obtain efficient and accurate features. Additionally, they presented an optimal feature selection model based on Joint MI_RFO to choose the best features for improved retrieval accuracy. Nevertheless, many current fusion methodologies rely on simple summation or concatenation operations, and most research focuses on merging CNN features. Moreover, current methods separate the multi-scale feature extraction process from the feature fusion process, hindering automatic adjustment and merging of multi-scale features based on remote sensing image retrieval requirements and limiting the representation ability of retrieval features. In addition, there is limited research on adaptive weighted fusion of multiple features extracted from Transformer models, which will be the focus of this study.

2.4. Methods Based on Deep Metric Learning

At present, deep metric learning (DML) methods are extensively utilized to enhance the retrieval functionality of networks. Contrastive loss [36] is a previous metric learning method that measures the distance between two samples, narrowing the gap between paired samples of the same category and increasing the gap between samples of different categories. Triplet loss [37] selects a sample as an anchor and two additional samples categorized as positive and negative. It requires distances between samples of the same class to become more compact and distances between samples of different classes to increase. N-pair loss [38] measures the correlation between samples using cosine similarity and matches each anchor with a positive sample and multiple negative samples. Proxy-NCA [39] is an initial proxy-based loss that connects each sample with a proxy assigned for each category. It aims to draw samples closer to proxies of the same category and encourage distance from proxies of different categories. SoftTriple loss [40] enhances softmax loss by connecting multiple proxies to each category, efficiently capturing the hidden distribution of samples and maintaining a wider intra-class spread. Liu et al. [41] substituted the Hinge function with the softmax function to obtain global optimization, successfully overcoming the local optimization problem of triplet loss. Xue et al. [42] proposed a hash retrieval method using proxy-based metric learning mixed with hash coding learning to enhance retrieval speed while maintaining accuracy and minimizing storage space. However, these methods also have notable limitations. For example, metric loss based on image pairs requires a growing number of sample pairs to be formed with more training samples. This leads to additional computation and longer convergence times for the network. Proxy-based losses are successful in addressing issues related to network convergence rate and time complexity. However, they cannot fully utilize sample information, and proxies assigned for each category have a fixed number and cannot be assigned adaptively, resulting in a lack of generalization ability. Therefore, the research direction of this paper will be focused on improving metric loss based on image pairs.

2.5. Transformer Network (PVTv2)

In this paper, the proposed method improves upon the b2 version of PVTv2 [19], which is pre-trained on ImageNet-1K. PVTv2 is an enhanced version of the pyramid-structured Transformer network PVT. First, it employs overlapping patch embedding to tokenize images, preserving the continuity of local image regions. Second, it replaces the fixed position encoding in PVT with a position encoding based on zero padding, enabling the network to handle images of arbitrary size more efficiently. Additionally, a linear spatial reduction attention mechanism replaces the original spatial reduction attention to reduce computational costs, keeping the computational complexity linear. The PVTv2 network is a multi-stage structure with four stages, each consisting of a patch embedding layer and a Transformer encoder, which produce feature maps at four different scales. As the network depth increases, the resolution of the feature maps gradually decreases, and the channel dimension gradually increases. The Transformer encoder mainly comprises LayerNorm, MLP, and linear spatial reduction attention layers. The output features of the four stages have different scales and channel numbers: Stage 1 outputs 64 × 56 × 56, Stage 2 outputs 128 × 28 × 28, Stage 3 outputs 320 × 14 × 14, and Stage 4 outputs 512 × 7 × 7. In the subsequent study, we also extracted feature matrices from the original PVTv2 model for retrieval, serving as a baseline in the comparative experiments to validate the effectiveness of the proposed method.
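For reference, the following minimal sketch shows how the four stage outputs quoted above can be obtained from a PVTv2-b2 backbone. The model name pvt_v2_b2 and the features_only flag are assumptions about the timm API and may differ in other PVTv2 implementations.

```python
import torch
import timm

# Assumes timm exposes 'pvt_v2_b2' with features_only support; adapt the
# name or the feature-extraction mechanism to your backbone if it differs.
backbone = timm.create_model("pvt_v2_b2", pretrained=False, features_only=True)

x = torch.randn(1, 3, 224, 224)            # one RGB image at 224 x 224
stages = backbone(x)                       # list of four feature maps, Stage 1..4

for i, f in enumerate(stages, start=1):
    print(f"Stage {i}: {tuple(f.shape)}")
# Expected, matching the shapes quoted above (batch dimension first):
# Stage 1: (1, 64, 56, 56)
# Stage 2: (1, 128, 28, 28)
# Stage 3: (1, 320, 14, 14)
# Stage 4: (1, 512, 7, 7)
```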

3. Proposed Methods

The method proposed in this study is an improvement upon the PVTv2 model. Firstly, we developed a multi-scale feature fusion (MSF) module that includes the global attention mechanism, multi-head self-attention mechanism, feature fusion module, and deformable convolution. Next, we constructed a hashing layer to transform the multi-scale fused features into compact binary hash codes. Finally, we proposed an intra-class similarity loss (ICS loss) function that reduces the distance between samples of the same class and designed a five-branch loss function to train the model. By employing the aforementioned approach, we have made enhancements to the original PVTv2 model, thereby obtaining a novel model, which we have designated as PVTA_MSF.

3.1. Multi-Scale Feature Fusion Module (MSF)

The multi-scale feature fusion module is illustrated in Figure 1a. Its specific construction and operation are as follows: Based on the original PVTv2 model, a global attention module is added after its first stage. The global attention module enhances cross-dimensional interactions, capturing correlations and importance between channels and spatial dimensions. This helps reduce background interference and improve feature representation capability. Multi-head self-attention modules are added after the second and third stages to simultaneously focus on different feature subspaces, extracting richer feature representations. Then, a feature fusion module is designed to fuse the feature maps extracted from all four stages. The principle behind this module is to use high-level features to weight and fuse the low-level features within the module. Subsequently, the output of the feature fusion module is downsampled by a factor of four and subjected to deformable convolutions to enhance the model’s ability to handle deformations, resulting in multi-scale fused features. The specific process is as follows:

3.1.1. Feature Extracting Based on Global Attention Mechanism (GAM) in Stage 1

In the PVTv2 model, Stage 1 possesses the minimum channel number and the maximum resolution of 64 × 56 × 56, making it abundant in low-level image features. Therefore, this paper introduces the global attention mechanism (GAM) that combines channel and spatial attention to process the features in this layer. The GAM [43] is an improvement in the CBAM module [44], which can capture significant features in three dimensions: channel, spatial height, and spatial width. In practical applications, channel attention is used to weight different channels of the input features, giving higher weights to important channels and lower weights to less important ones. This reduces the influence of background channels and focuses the attention on channels that contain useful information. On the other hand, spatial attention is used to weight different spatial positions of the input features, giving higher weights to important positions and lower weights to less important ones. This reduces the impact of background interference and noise, directing attention toward spatial positions that contain valuable information. Therefore, the global attention mechanism allows for interaction between channel attention and spatial attention, enhancing their expressive power across different dimensions. Through this cross-dimensional interaction, the mechanism can better capture the correlations and importance between channels and spatial positions, leading to reduced background interference and improved feature representation capability. The framework of GAM is shown in Figure 2.
The processing of the global attention mechanism can be summarized in (1) and (2) as follows: $F_1$ represents the input feature map, $F_2$ represents the output feature map of the channel attention submodule, $F_3$ represents the output feature map of the spatial attention submodule, $M_c$ represents the channel attention submodule, and $M_s$ represents the spatial attention submodule.
$$F_2 = M_c(F_1) \otimes F_1 \tag{1}$$
$$F_3 = M_s(F_2) \otimes F_2 \tag{2}$$
(1) Figure 3 illustrates the channel attention submodule, which uses a three-dimensional array to retain information across the three dimensions. Two multi-layer perceptrons (MLPs) facilitate interaction between the channel and spatial dimensions, with a ReLU activation function applied between the two layers to prevent issues such as gradient vanishing and exploding. The sigmoid function is then employed to generate the channel weight coefficients of the feature map, and the input feature map is multiplied by these coefficients to accomplish the weighting operation. The specific process is summarized in the equations below:
$$y = a_1 F_1^{T} + b_1 \tag{3}$$
$$F_2 = \mathrm{sigmoid}\big[\mathrm{ReLU}(a_2 y + b_2)\big]^{T} \otimes F_1 \tag{4}$$
In (3) and (4), $a_1$ and $a_2$ are the randomly initialized weight values of the two MLPs, and $b_1$ and $b_2$ are the bias terms of the two MLPs.
(2) Figure 4 illustrates the spatial attention submodule, which employs two convolutional layers with a kernel size of 7, padding of 3, and a stride of 1 to integrate spatial information. This process effectively reduces the number of feature map channels with a compression ratio r, thereby decreasing computational costs. Spatial attention weight coefficients are generated using the sigmoid function and multiplied with the input feature map to implement the weighted operation. The calculation process for the feature map size in the convolutional layer is summarized in (5) and (6), while that of the spatial attention submodule is summarized in (7):
$$W_{out} = \frac{W_{in} - k + 2p}{S} + 1 \tag{5}$$
$$H_{out} = \frac{H_{in} - k + 2p}{S} + 1 \tag{6}$$
$$F_3 = M_s(F_2) \otimes F_2 = \mathrm{sigmoid}\big[\mathrm{ConvBN}\big(\mathrm{ConvBNReLU}(F_2)\big)\big] \otimes F_2 \tag{7}$$
In (5) and (6), $W_{in}$ and $H_{in}$ represent the width and height of the input feature map, $W_{out}$ and $H_{out}$ represent the width and height of the output feature map, $p$ and $S$ denote the zero-padding and stride, and $k$ denotes the size of the convolutional kernel. In Figure 4, $r$ represents the compression ratio of the number of channels.
The introduction of channel attention effectively suppresses background information interference on the target. Spatial attention is also being applied to fully utilize the spatial data of the original image, assigning greater weight to significant features in a supervised manner. In this way, the discriminative power of the original image features is enhanced.
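To make the two submodules concrete, the following is a minimal PyTorch sketch of the GAM as described above: a channel-attention MLP corresponding to Eqs. (3) and (4), followed by a two-layer 7 × 7 convolutional spatial attention corresponding to Eq. (7). The reduction ratio r = 4 and the exact layer settings are assumptions, not the authors' precise configuration.

```python
import torch
import torch.nn as nn

class GAM(nn.Module):
    """Sketch of the global attention mechanism: channel-attention MLP
    (Eqs. (3)-(4)) followed by a 7x7 convolutional spatial attention (Eq. (7))."""
    def __init__(self, channels, r=4):
        super().__init__()
        self.channel_mlp = nn.Sequential(              # two-layer MLP with ReLU in between
            nn.Linear(channels, channels // r),
            nn.ReLU(inplace=True),
            nn.Linear(channels // r, channels),
        )
        self.spatial = nn.Sequential(                  # 7x7 conv bottleneck with ratio r
            nn.Conv2d(channels, channels // r, kernel_size=7, padding=3),
            nn.BatchNorm2d(channels // r),
            nn.ReLU(inplace=True),
            nn.Conv2d(channels // r, channels, kernel_size=7, padding=3),
            nn.BatchNorm2d(channels),
        )

    def forward(self, x):                              # x: (B, C, H, W)
        b, c, h, w = x.shape
        # channel attention: MLP over the channel dimension, then weight x (Eq. (1))
        y = x.permute(0, 2, 3, 1).reshape(b, -1, c)    # (B, H*W, C)
        y = self.channel_mlp(y).reshape(b, h, w, c).permute(0, 3, 1, 2)
        x = x * torch.sigmoid(y)
        # spatial attention: weight each spatial position (Eq. (2))
        return x * torch.sigmoid(self.spatial(x))

# e.g. applied to the Stage 1 output (64 channels, 56 x 56)
out = GAM(channels=64)(torch.randn(2, 64, 56, 56))     # -> (2, 64, 56, 56)
```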

3.1.2. Feature Extracted by MHSA in Stage 2 and Stage 3

In deep network structures, the local structural information in the extracted features of the backbone network tends to be lost with increasing depth. This problem negatively impacts the global semantic feature learning of remote sensing images. To address this issue, this paper introduces a multi-head self-attention mechanism (MHSA) [45] in Stage 2 and Stage 3. The MHSA establishes relationships between individual elements in the abstract feature map, enhances the interaction between channel dimension and spatial dimensions (height and width), and achieves a receptive field corresponding to the entire image. Moreover, the multi-head self-attention mechanism can simultaneously focus on different feature subspaces, thereby extracting more diverse and comprehensive feature representations. Each attention head can learn different feature attention patterns, enabling the capture of distinct information within the data. By combining the outputs of multiple attention heads, a more comprehensive and diversified feature representation can be obtained, thereby reducing the loss of local structural information and enhancing the expressive power of image features. Figure 5 illustrates the MHSA, which comprises multiple self-attention modules that take three inputs: query matrix Q, key matrix K, and value matrix V. The calculation formula for the self-attention modules is expressed as follows:
$$\mathrm{Attention}(Q, K, V) = \mathrm{softmax}\!\left(\frac{Q K^{T}}{\sqrt{d_k}}\right) V = A V \tag{8}$$
The three inputs of the self-attention module are different learned embeddings of the input feature map. After applying the softmax function to the scaled product of Q and the transpose of K, an attention coefficient matrix A is generated, in which each row corresponds to the similarity between an element of Q and all elements of K. The attention coefficient matrix A is then multiplied by the value matrix V to obtain the final attention feature output.
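The following sketch shows how such multi-head self-attention can be applied to a stage output: spatial positions are flattened into a token sequence, attended with PyTorch's nn.MultiheadAttention (Eq. (8) per head), and reshaped back into a feature map. The head count and the LayerNorm/residual placement are assumptions.

```python
import torch
import torch.nn as nn

class FeatureMapMHSA(nn.Module):
    """Sketch of multi-head self-attention over a (B, C, H, W) feature map."""
    def __init__(self, channels, num_heads=8):
        super().__init__()
        self.norm = nn.LayerNorm(channels)
        self.attn = nn.MultiheadAttention(channels, num_heads, batch_first=True)

    def forward(self, x):                              # x: (B, C, H, W)
        b, c, h, w = x.shape
        tokens = x.flatten(2).transpose(1, 2)          # (B, H*W, C): one token per position
        t = self.norm(tokens)
        attended, _ = self.attn(t, t, t)               # Q = K = V (self-attention, Eq. (8))
        attended = attended + tokens                   # residual connection
        return attended.transpose(1, 2).reshape(b, c, h, w)

# e.g. the Stage 2 output (128 channels, 28 x 28)
out = FeatureMapMHSA(channels=128)(torch.randn(2, 128, 28, 28))   # -> (2, 128, 28, 28)
```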

3.1.3. Feature Fusion Module (FFM)

As shown in Figure 6, this paper designs a feature fusion module that fuses the output features from the four stages of the PVTv2 model. The fundamental principle of the fusion module is to use high-level features to weight the low-level features within the module and then concatenate the high-level features with the weighted low-level features. To achieve a balance between accuracy and computational cost, we followed popular solutions [46,47,48] and employed four 1 × 1 convolution layers to reduce the dimensions of the output features from the four stages, decreasing the channel numbers from 512, 320, 128, and 64 to 64, 64, 16, and 8, respectively. For high-level features to multiply the low-level features, they must have the same size and channel number; similarly, when concatenating high-level and low-level features, they must be of the same size. Therefore, high-level features undergo upsampling and convolution before weighting and fusion. We defined a series of 3 × 3 convolution layers with batch normalization [49] and ReLU activation [50] for processing the high-level features. The specific parameters of the convolution layers $C_x$ are listed in Table 1.
The feature fusion module (FFM) comprises three stages; for convenience of reference, we denote the output feature maps of Stage 4 to Stage 1 as $X_4$, $X_3$, $X_2$, and $X_1$, respectively. The specific implementation process of feature fusion is as follows (a code sketch is given after the list):
  • The feature map $X_4$ is used to weight the feature map $X_3$, and the weighted $X_3$ is then fused with $X_4$. Specifically, $X_4$ is upsampled by a factor of 2 and processed by the $C_1$ convolution to obtain $C_1(X_4)$, so that both $X_3$ and the processed $X_4$ have 64 channels and a size of 14 × 14. Subsequently, $C_1(X_4)$ is used as the weight of $X_3$ and multiplied by $X_3$, and the weighted result is concatenated with $C_1(X_4)$ along the channel dimension. Finally, another convolution layer, $C_2$, is applied to smooth the concatenated feature map, resulting in a fused feature map $X_{34} \in \mathbb{R}^{\frac{H}{16} \times \frac{W}{16} \times 128}$. This process is summarized in (9):
$$X_{34} = C_2\big(\mathrm{Concat}\big(C_1(X_4) \times X_3,\; C_1(X_4)\big)\big) \tag{9}$$
  • The feature maps $X_4$ and $X_3$ are used to weight the feature map $X_2$, and the weighted $X_2$ is then fused with the feature map $X_{34}$. Specifically, $X_4$ is upsampled by a factor of 4 and processed by the $C_3$ convolution layer to reduce its dimensions, resulting in $C_3(X_4)$. Similarly, $X_3$ is upsampled by a factor of 2 and processed by the $C_4$ convolution layer, resulting in $C_4(X_3)$. $X_{34}$ is upsampled and processed by the $C_5$ convolution layer to obtain $C_5(X_{34})$. After this step, $X_4$ and $X_3$ have an increased size of 28 × 28 and are reduced to 16 channels, and $X_{34}$ has an increased size of 28 × 28. Subsequently, $C_3(X_4)$ and $C_4(X_3)$ are used to weight $X_2$. The weighted result is then concatenated with $C_5(X_{34})$ along the channel dimension. Finally, another convolution layer, $C_6$, is applied to smooth the concatenated feature map, resulting in a fused feature map $X_{234} \in \mathbb{R}^{\frac{H}{8} \times \frac{W}{8} \times 144}$. This process is summarized in (10):
$$X_{234} = C_6\big(\mathrm{Concat}\big(C_3(X_4) \times C_4(X_3) \times X_2,\; C_5(X_{34})\big)\big) \tag{10}$$
  • The feature maps $X_4$, $X_3$, and $X_2$ are used to weight the feature map $X_1$, and the weighted $X_1$ is then fused with the second-stage fused feature map $X_{234}$. Specifically, $X_1$ is downsampled by a factor of 2 to reduce its size to 28 × 28. $X_4$ is upsampled by a factor of 4 and processed by the $C_7$ convolution layer to reduce its dimensions, resulting in $C_7(X_4)$. Similarly, $X_3$ is upsampled by a factor of 2 and processed by the $C_8$ convolution layer, resulting in $C_8(X_3)$. $X_2$ is processed by the $C_9$ convolution layer to reduce its dimensions, resulting in $C_9(X_2)$. After this step, $X_4$, $X_3$, and $X_2$ have a size of 28 × 28 and are reduced to 8 channels. Subsequently, $C_7(X_4)$, $C_8(X_3)$, and $C_9(X_2)$ are used to weight $X_1$. The weighted result is then concatenated with $X_{234}$ along the channel dimension. Finally, another convolution layer, $C_{10}$, is applied to smooth the concatenated feature map, resulting in a fused feature map $X_{1234} \in \mathbb{R}^{\frac{H}{8} \times \frac{W}{8} \times 152}$. This process is summarized in (11):
$$X_{1234} = C_{10}\big(\mathrm{Concat}\big(C_7(X_4) \times C_8(X_3) \times C_9(X_2) \times X_1,\; X_{234}\big)\big) \tag{11}$$
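The sketch below follows the three steps and channel widths described above (Eqs. (9)–(11)). The exact settings of the $C_x$ layers come from Table 1, which is not reproduced here, so the 3 × 3 conv–BN–ReLU blocks and the pooling used for downsampling are assumptions.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

def cbr(cin, cout):
    """3x3 conv + BN + ReLU block standing in for the C_x layers (Table 1 details assumed)."""
    return nn.Sequential(nn.Conv2d(cin, cout, 3, padding=1),
                         nn.BatchNorm2d(cout), nn.ReLU(inplace=True))

def up(t, s):
    """Bilinear upsampling by scale factor s."""
    return F.interpolate(t, scale_factor=s, mode="bilinear", align_corners=False)

class FFM(nn.Module):
    """Sketch of the three-step feature fusion module (Eqs. (9)-(11))."""
    def __init__(self):
        super().__init__()
        # 1x1 reductions: 512/320/128/64 -> 64/64/16/8 channels
        self.r4, self.r3 = nn.Conv2d(512, 64, 1), nn.Conv2d(320, 64, 1)
        self.r2, self.r1 = nn.Conv2d(128, 16, 1), nn.Conv2d(64, 8, 1)
        self.c1, self.c2 = cbr(64, 64), cbr(128, 128)                     # step 1
        self.c3, self.c4 = cbr(64, 16), cbr(64, 16)                       # step 2
        self.c5, self.c6 = cbr(128, 128), cbr(144, 144)
        self.c7, self.c8 = cbr(64, 8), cbr(64, 8)                         # step 3
        self.c9, self.c10 = cbr(16, 8), cbr(152, 152)

    def forward(self, x1, x2, x3, x4):
        # x1..x4: Stage 1..4 outputs, e.g. (B,64,56,56) ... (B,512,7,7)
        x1, x2, x3, x4 = self.r1(x1), self.r2(x2), self.r3(x3), self.r4(x4)
        # Step 1 (Eq. 9): X4 weights X3, concatenate -> X34 (128 ch, 14x14)
        w4 = self.c1(up(x4, 2))
        x34 = self.c2(torch.cat([w4 * x3, w4], dim=1))
        # Step 2 (Eq. 10): X4 and X3 weight X2, fuse with upsampled X34 -> X234 (144 ch, 28x28)
        x234 = self.c6(torch.cat([self.c3(up(x4, 4)) * self.c4(up(x3, 2)) * x2,
                                  self.c5(up(x34, 2))], dim=1))
        # Step 3 (Eq. 11): X4, X3, X2 weight the downsampled X1 -> X1234 (152 ch, 28x28)
        x1 = F.avg_pool2d(x1, 2)
        return self.c10(torch.cat([self.c7(up(x4, 4)) * self.c8(up(x3, 2)) * self.c9(x2) * x1,
                                   x234], dim=1))

ffm = FFM()
fused = ffm(torch.randn(2, 64, 56, 56), torch.randn(2, 128, 28, 28),
            torch.randn(2, 320, 14, 14), torch.randn(2, 512, 7, 7))
# fused.shape -> (2, 152, 28, 28), i.e. H/8 x W/8 x 152 for a 224 x 224 input
```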

3.1.4. Deformable Convolution (DCN Layer)

To enhance the network’s ability to handle deformations caused by different target orientations and angle transformations in CBRSIR, we introduce a deformable convolution layer [51] after the feature fusion module. Deformable convolution is an improved convolutional operation that adjusts the sampling positions of traditional convolution by introducing learnable offsets, in order to adapt to the deformations and directional variations of different targets. Its principle involves adding offsets to the traditional convolution operation and redefining the sampling positions of the convolution kernel, whereas traditional convolution operates on fixed sampling positions. With the learned offsets, deformable convolution can adjust the sampling positions to better accommodate target deformations. This allows the convolution kernel to sample input features at different locations, thereby better capturing the details and shape variations of the target. This means that regardless of whether the target is horizontal, vertical, or inclined, deformable convolution can adapt to and capture the features of the target, enabling the network to have better robustness and expressive power when dealing with targets with significant shape changes.
The output of the FFM, $X_{1234}$, is downsampled by a factor of 4 to reduce its size and subsequently passed through a DCN layer. This process is summarized in (12):
$$X_{DCN} = \mathrm{DCN}(X_{1234}) \tag{12}$$
Figure 7a shows a standard convolution with blue dots representing the regular sampling grid. Figure 7b–d show deformable convolutions with red dots representing the sampling locations of the deformable convolution and orange arrows representing the added offsets. It can be observed that the deformable convolution layer utilizes trainable offsets, enabling the sampling grid to deform independently and capture crucial features. The training process expands the range of the convolution kernel, thereby increasing the receptive field.
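A minimal sketch of such a DCN layer using torchvision's DeformConv2d is shown below: a plain convolution predicts two offsets (dx, dy) per kernel position, and the deformable convolution samples the input at the shifted locations. The kernel size and channel width are assumptions.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F
from torchvision.ops import DeformConv2d

class DCNLayer(nn.Module):
    """Sketch of the DCN layer applied after the FFM."""
    def __init__(self, channels=152, k=3):
        super().__init__()
        # 2 offsets (dx, dy) for each of the k*k kernel positions
        self.offset = nn.Conv2d(channels, 2 * k * k, kernel_size=k, padding=k // 2)
        self.dcn = DeformConv2d(channels, channels, kernel_size=k, padding=k // 2)

    def forward(self, x):
        return self.dcn(x, self.offset(x))

# X_1234 (152 x 28 x 28) downsampled by a factor of 4, then the DCN layer (Eq. (12))
x = F.avg_pool2d(torch.randn(2, 152, 28, 28), 4)       # -> (2, 152, 7, 7)
x_dcn = DCNLayer()(x)                                  # -> (2, 152, 7, 7)
```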

3.2. Hashing Layer

To enable fast retrieval in large-scale remote sensing databases, we apply a binary hashing method to the feature matrices used in CBRSIR. The multi-scale feature fusion module (MSF) is followed by an additional hash coding layer. The MSF module generates a fused feature map with 152 channel dimensions, and this channel dimension can be adjusted through the deformable convolution layer to produce hash codes of varying lengths, which facilitates comparison with other retrieval methods in the subsequent experiments. The hash coding layer uses the sigmoid activation function, and the resulting output values are binarized to exactly 0 or 1 using the following threshold:
$$h(x) = \begin{cases} 0, & \text{if } x < 0.5 \\ 1, & \text{otherwise} \end{cases} \tag{13}$$
Here, $x$ represents an element of the feature vector input to the hash coding layer, and $h(x)$ represents the hash code bit at the corresponding position.
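The following sketch illustrates the binarization in Eq. (13). The linear projection that sets the code length is a simplification of the approach described above, where the code length is controlled by the channel dimension of the deformable convolution layer.

```python
import torch
import torch.nn as nn

class HashLayer(nn.Module):
    """Sketch of the hash coding layer: sigmoid activation squashes the input
    to (0, 1), and Eq. (13) thresholds each element at 0.5."""
    def __init__(self, in_dim, n_bits):
        super().__init__()
        self.proj = nn.Linear(in_dim, n_bits)          # assumed stand-in for adjusting the DCN width

    def forward(self, feat):                           # feat: (B, in_dim)
        continuous = torch.sigmoid(self.proj(feat))    # used while training
        binary = (continuous >= 0.5).float()           # Eq. (13), used for retrieval
        return continuous, binary

pooled = torch.randn(4, 152)                           # e.g. globally pooled fused features
soft_codes, hash_codes = HashLayer(152, n_bits=64)(pooled)
# hash_codes is a (4, 64) tensor containing exact 0/1 bits
```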

3.3. Intra-Class Similarity Loss Function for Mini-Batches (ICS Loss)

Remote sensing images face a challenge in which identical category samples exhibit noticeable differences, while samples from different categories demonstrate high similarity, affecting the retrieval performance of remote sensing image features. Similarity-based metric learning algorithms aim to address this problem by reducing the distance between similar-category samples and increasing the distance between different-category samples.
Metric losses based on image pairs are effective at increasing the distance between features of different categories while reducing the distance between features of the same category. However, as the number of training samples increases, their computational complexity grows significantly. To balance accuracy and computational complexity and to address the problem of large differences between samples of the same category and high similarity between samples of different categories, we propose an intra-class similarity loss function that reduces the distance between samples of the same category within each mini-batch.
The calculation process of this loss function is as follows (a code sketch is given after the list):
  • As shown in Figure 8, images in a mini-batch are grouped by category according to their labels. The intra-class distance $Dis_m$ for category $m$ is calculated as follows:
$$Dis_m = \begin{cases} \dfrac{1}{n \times (n - 1)} \sum\limits_{i, j \in Class_m} D_{i,j}, & \text{if } n > 1 \\ 0, & \text{otherwise} \end{cases} \tag{14}$$
    Here, images $i$ and $j$ belong to the $m$-th category, $n$ is the number of images in the current batch whose category is $m$, and $D_{i,j}$ represents the Euclidean distance between images $i$ and $j$.
  • The intra-class similarity loss function $L_{ICS}$ is calculated as follows:
$$L_{ICS} = \frac{1}{k} \sum_{Dis_m > 0} Dis_m \tag{15}$$
    Here, $k$ represents the number of categories whose intra-class distance $Dis_m$ is greater than 0.
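A minimal sketch of the ICS loss over a mini-batch, following Eqs. (14) and (15), is given below; the feature dimensionality and batch composition are illustrative only.

```python
import torch

def ics_loss(features, labels):
    """Intra-class similarity loss for a mini-batch (Eqs. (14)-(15)):
    mean pairwise Euclidean distance within each class that has more than
    one sample, averaged over those classes."""
    dists = torch.cdist(features, features)            # (B, B) pairwise Euclidean distances
    class_means = []
    for c in labels.unique():
        idx = (labels == c).nonzero(as_tuple=True)[0]
        n = idx.numel()
        if n > 1:                                       # Eq. (14): otherwise Dis_m = 0
            sub = dists[idx][:, idx]                    # distances within class c
            class_means.append(sub.sum() / (n * (n - 1)))
    if not class_means:                                 # no class with more than one sample
        return features.new_zeros(())
    return torch.stack(class_means).mean()              # Eq. (15), k = len(class_means)

# toy mini-batch: 8 feature vectors, 4 classes
feats = torch.randn(8, 128)
labels = torch.tensor([0, 0, 1, 1, 1, 2, 3, 3])
loss = ics_loss(feats, labels)
```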

3.4. Design of Multiple Loss Functions

During the network learning stage, we adopt a multi-loss function learning method, which combines the intra-class similarity loss function and the cross-entropy loss function to update the model parameters.
Because hash features change slowly, computing losses directly on them produces slowly changing loss values, which slows the convergence of the model during training. Therefore, we attach a fully connected head module after the output of the MSF module to classify the images and compute the cross-entropy loss used to train the entire model.
The pyramid structure of PVTA_MSF consists of four stages. To reduce the effect of gradient vanishing during error backpropagation, we add an intra-class similarity loss for each stage. First, a fully connected head module reduces the dimensions of the outputs of the four stages, and the ICS loss is then calculated on the dimensionality-reduced features.
Finally, we construct a five-branch loss, as shown in Figure 1b. The overall loss is given in (16):
$$L = L_{CE} + L_{ICS}(F_4) + L_{ICS}(F_3) + L_{ICS}(F_2) + L_{ICS}(F_1) \tag{16}$$
In (16), $L_{CE}$ denotes the cross-entropy loss, and $L_{ICS}(F_4)$, $L_{ICS}(F_3)$, $L_{ICS}(F_2)$, and $L_{ICS}(F_1)$ denote the intra-class similarity losses calculated for Stage 4, Stage 3, Stage 2, and Stage 1, respectively.
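The five-branch objective in Eq. (16) can be sketched as follows. The classification head producing the logits and the per-stage dimensionality-reduced embeddings $F_1$–$F_4$ are assumed inputs; ics_loss restates the sketch from Section 3.3.

```python
import torch
import torch.nn.functional as F

def ics_loss(features, labels):
    """Mean intra-class pairwise Euclidean distance (Eqs. (14)-(15));
    identical in spirit to the Section 3.3 sketch."""
    d = torch.cdist(features, features)
    terms = []
    for c in labels.unique():
        idx = (labels == c).nonzero(as_tuple=True)[0]
        n = idx.numel()
        if n > 1:
            terms.append(d[idx][:, idx].sum() / (n * (n - 1)))
    return torch.stack(terms).mean() if terms else features.new_zeros(())

def total_loss(logits, labels, stage_embeddings):
    """Eq. (16): cross-entropy on the classification head plus one ICS loss
    per stage. `stage_embeddings` holds the four dimensionality-reduced
    stage features F1..F4 (head layout assumed)."""
    loss = F.cross_entropy(logits, labels)
    for emb in stage_embeddings:
        loss = loss + ics_loss(emb, labels)
    return loss

# toy usage: 21 classes (UCMD), batch of 8, four stage embeddings of width 128
logits = torch.randn(8, 21, requires_grad=True)
labels = torch.randint(0, 21, (8,))
stages = [torch.randn(8, 128) for _ in range(4)]
total_loss(logits, labels, stages).backward()
```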

4. Experiments and Analysis

This section presents the implementation details of our experiments. We extensively tested the effectiveness of the proposed method on two commonly used remote sensing image datasets: the UC Merced dataset [52] and the NWPU-RESISC45 dataset [53]. This section comprises eight parts: the first part describes the experimental settings and datasets; parts two to six present comparative analyses of retrieval performance; part seven reports ablation experiments; and part eight presents visualization experiments.

4.1. Dataset and Experimental Settings

We conducted experiments on two widely utilized remote sensing image databases, namely the UC Merced dataset and the NWPU-RESISC45 dataset.
The UC Merced dataset comprises a collection of remote sensing images downloaded from the U.S. Geological Survey (USGS) by a team from the University of California, Merced. It is widely used for retrieval and classification tasks in the field of remote sensing images and contains 2100 images, 21 geographic categories, and 100 remote sensing images per category. Each image has a size of 256 × 256 and a spatial resolution of 0.3 m. In the following sections, we refer to this dataset as UCMD.
The NWPU-RESISC45 dataset is a large-scale remote sensing image dataset proposed by Northwestern Polytechnical University for scene classification, comprising a total of 31,500 remote sensing images. Each image has a size of 256 × 256, with spatial resolutions ranging from 30 m to 0.2 m, and the dataset covers 45 geographic categories with 700 remote sensing images per category. We refer to this dataset as NWPU for clarity and ease of reference. For fine-tuning the model, we randomly selected 100 images from each category of the NWPU dataset and, similarly, 50 images from each category of the UCMD dataset. The remaining images in each dataset were used as the query test set.
Retrieval performance was assessed using several metrics: mean average precision (mAP), average normalized modified retrieval rate (ANMRR), Precision@N, and the precision–recall (P-R) curve. The mAP first computes the average retrieval precision within each category and then averages it over all categories; the higher the mAP, the more accurate the retrieval. ANMRR evaluates how highly the correct images are ranked in the retrieval results; the more accurate the retrieval, the smaller the ANMRR. Precision@N evaluates the accuracy of the top N returned images, whereas the P-R curve considers both precision and recall.
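For concreteness, the sketch below computes Precision@N and the average precision (the per-query quantity that is averaged into mAP) from a ranked list of returned labels; it is a minimal illustration of the metrics, not the exact evaluation tooling used in the experiments.

```python
import numpy as np

def precision_at_n(retrieved_labels, query_label, n):
    """Fraction of the top-n returned images sharing the query's label."""
    return float(np.mean(np.asarray(retrieved_labels[:n]) == query_label))

def average_precision(retrieved_labels, query_label):
    """AP for one query: precision averaged over the ranks of the relevant hits;
    mAP averages this over all queries (and categories)."""
    rel = (np.asarray(retrieved_labels) == query_label).astype(float)
    if rel.sum() == 0:
        return 0.0
    cum_precision = np.cumsum(rel) / (np.arange(len(rel)) + 1)
    return float((cum_precision * rel).sum() / rel.sum())

# toy ranking: query is class 3, top-5 returned labels
print(precision_at_n([3, 3, 1, 3, 2], query_label=3, n=5))   # 0.6
print(average_precision([3, 3, 1, 3, 2], query_label=3))     # ~0.92
```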
Fine-tuning was performed on both the UC Merced dataset and the NWPU-RESISC45 dataset. The experiments were run on a single NVIDIA GeForce RTX 3060 GPU with 32 GB of RAM, using the PyTorch framework. The software environment consisted of Windows 10, CUDA 11.6, PyTorch 1.12.1, and Python 3.9. During fine-tuning, all input images were resized uniformly to 224 × 224. We employed a learning rate decay strategy with a decay rate of 0.1, an initial learning rate of 5 × 10⁻⁴, and a batch size of 60. The optimizer was Adam, and the number of iterations was set to 200.

4.2. Comparative Analysis of Retrieval Performance

This study compares seven state-of-the-art remote sensing image retrieval methods. Liu et al.'s method is referred to as GOSLm [41], while the methods introduced by Tang et al. and Raffaele et al. are called DBOW [54] and V-DELF [21], respectively. ResNet50 [55] and D-CNN [56], two strong baselines reported in the V-DELF study, are also compared: the ResNet50 features are obtained through VLAD aggregation of the features extracted by ResNet50, while the D-CNN features are extracted with VGG16, and both were fine-tuned on the UCMD and NWPU datasets. Hou et al. [22] utilized MobileNets to extract features and changed the final fully connected layer to obtain low-dimensional features for retrieval; their method is referred to as MobileNets. In addition, Wang et al. [23] proposed the joint spatial and spectral transformation model, which we refer to as JSST. Li et al. [57] proposed an adaptive multi-proxy method that can allocate multiple proxies for samples; we refer to this method as Multi-Proxy. The datasets for the aforementioned methods were divided as follows: GOSLm, DBOW, V-DELF, ResNet50, and D-CNN split each dataset into a training set and a test set at a 4:1 ratio; MobileNets selected 50 images from each class as query images and randomly split the remaining images into training and validation sets; JSST trained the network on the PatternNet dataset and evaluated it on other datasets; and Multi-Proxy used 50% of the UCMD dataset for training and the remaining 50% for validation. To demonstrate the effectiveness of PVTA_MSF, we conducted a series of experiments on the UCMD and NWPU datasets. As shown in Table 2, our improved PVTA_MSF model achieved the best performance on both datasets. Specifically, on the UCMD dataset, PVTA_MSF achieved a 4.2% mAP improvement over the best-performing comparison method, Multi-Proxy. On the NWPU dataset, PVTA_MSF attained a 1.6% mAP improvement over the best-performing comparison method, GOSLm. These results confirm the efficacy of the proposed PVTA_MSF method in enhancing remote sensing image retrieval performance.

4.3. Comparison of Multi-Feature Fusion Method

In order to validate the effectiveness of our multi-scale feature fusion module, we compared it with other multi-feature-fusion-based remote sensing image retrieval methods. Specifically, we referred to Yang et al.’s [31] method as VD16, Yin et al.’s [33] technique as Fused mean-max, Li et al.’s [20] approach as IRMFRCAMF, and Minakshi et al.’s [35] approach as Fused CNN_RFM. Our method achieved the highest accuracy in both mAP and ANMRR evaluation metrics on the UCMD dataset, as illustrated in Table 3, and outperformed the other methods by a significant margin. These results explicitly demonstrate that our multi-scale feature fusion method is more effective than other multi-feature-fusion-based methods.

4.4. Class mAP Comparative Analysis

In order to further validate the effectiveness of our PVTA_MSF model, we calculated the retrieval accuracy of each specific semantic category following the methods of Tang et al. [54], Raffaele et al. [21], and Liu et al. [41]. Table 4 and Table 5 display the per-category accuracy on the two remote sensing datasets, and we used the experimental results reported by the aforementioned methods as our comparative benchmarks. The best results are highlighted in bold. Table 4 demonstrates that the proposed PVTA_MSF method achieves the best retrieval accuracy in 19 out of 21 categories of the UCMD dataset, with only the Chaparral and Forest categories having slightly lower accuracy. Our method also attains 100% retrieval accuracy in 15 categories. Specifically, while the GOSLm method only achieves 55% and 59% accuracy in the Dense and Medium-density categories, respectively, our method achieves 100% accuracy in both. In addition, while the DBOW and D-CNN methods attain only 50% and 60% accuracy in the Storage category, our method achieves 100% accuracy in that category. Table 5 indicates that our proposed PVTA_MSF method achieves the best retrieval accuracy in 21 out of 45 categories of the NWPU dataset. In contrast, the GOSLm method achieves the best retrieval accuracy in only 12 categories, the V-DELF method in 11 categories, and the DBOW method in 8 categories. While other methods may achieve the best accuracy in some individual categories, their retrieval performance is poor in certain categories, indicating noticeable shortcomings. For example, the D-CNN method only achieves low accuracies of 39%, 30%, 38%, and 47% in the Church, Palace, Tennis Court, and Thermal Power Station categories, respectively. The DBOW method only attains low accuracies of 57% and 48% in the Stadium and Storage Tank categories, respectively. The V-DELF method only obtains 64% and 56% accuracy in the Baseball Diamond and Palace categories, respectively. The GOSLm method only attains a poor accuracy of 51% in the Palace category. In contrast, our PVTA_MSF method achieves high retrieval accuracy in most categories, exceeding 80% in 43 out of 45 categories. Only the Church and Palace categories have lower accuracies of 74% and 68%, respectively, but their retrieval accuracy is surpassed only by the DBOW method, achieving the second-best result and exceeding that of the other five methods.

4.5. Analysis of the Influence of Feature Dimensionality

Raffaele et al. [21] utilized principal component analysis (PCA) to reduce the dimensionality of VLAD vectors, resulting in low-dimensional descriptors, and compared the retrieval accuracy of VLAD descriptors of different sizes. Liu et al. [41] assessed the impact of various embedding sizes on retrieval performance by altering the embedding size during training and concluded that an embedding size of 512 achieved the best performance. Hou et al. [22] obtained low-dimensional features by fine-tuning the final fully connected layer of MobileNets, evaluated the effect of different dimensions on retrieval performance, and found that the optimal low dimension was 32. Following these works, we fine-tuned the deformable convolution layers of the model to output hash features of different dimensional sizes {32, 64, 128, 256, 512, 1024} and compared our experimental results with those of the three aforementioned methods. Table 6 shows that on the UCMD dataset, PVTA_MSF achieves its best accuracy at a hash feature dimension of 1024, while accuracy drops noticeably as the dimensionality decreases to 32 bits. According to Table 7, on the NWPU dataset, PVTA_MSF also achieves its highest accuracy with a hash feature dimension of 1024, and accuracy declines continuously as the dimensionality decreases, whereas PVT_v2 obtains its optimal accuracy at 256 dimensions. The results in Table 6 and Table 7 show that our PVTA_MSF method outperforms the other approaches at every feature dimension and also outperforms the initial PVT_v2 model in all feature dimensions.

4.6. Precision@N and Precision–Recall (P-R) Curve Analysis

To further verify the effectiveness of the PVTA_MSF method, we utilized precision@N and precision–recall (P-R) curves as comparative experimental evaluation metrics. Figure 9 illustrates the comparison of precision@N between the initial model PVT_v2 and PVTA_MSF for varying numbers of returned images in both the UCMD and NWPU datasets. Precision@N was calculated using the top 35 returned images for the UCMD dataset, while the NWPU dataset utilized the top 350 returned images. As the number of returned images increased, both methods showed a gradual decrease in precision rates. However, PVTA_MSF outperformed PVT_v2 in both datasets in terms of precision@N. Figure 10 compares the P-R curves between PVT_v2 and PVTA_MSF. As the recall rate increased, the precision rate of both methods decreased. Nevertheless, at the same recall rate, PVTA_MSF exhibited higher precision rates than PVT_v2. Similarly, at the same precision rate, PVTA_MSF demonstrated higher recall rates than PVT_v2. Both evaluation metrics indicate that the PVTA_MSF method excels in retrieval performance, demonstrating its effectiveness and versatility in large-scale remote sensing image retrieval tasks.

4.7. Ablation Experiments

We conducted a series of ablation experiments on the UCMD and NWPU datasets to further evaluate the effectiveness of FFM, GAM, MHSA, DCN, and the ICS loss. The fifth row of Table 8 displays the retrieval accuracy with all improvements applied, which produced the highest mAP and the smallest ANMRR on both datasets. As shown in the fourth row of Table 8, removing the ICS loss decreases the mAP by 0.37% and increases the ANMRR by 0.003581 on the NWPU dataset, and decreases the mAP by 0.22% and increases the ANMRR by 0.002126 on the UCMD dataset. The third row shows that removing both the ICS loss and DCN leads to a decrease in mAP of 0.73% and an increase in ANMRR of 0.006136 on the NWPU dataset, and a decrease in mAP of 0.29% and an increase in ANMRR of 0.00316 on the UCMD dataset. The second row shows the retrieval accuracy without the ICS loss, DCN, MHSA, and GAM: on the NWPU dataset, this results in a decrease in mAP of 0.98% and an increase in ANMRR of 0.008631, while on the UCMD dataset, it leads to a decrease in mAP of 0.47% and an increase in ANMRR of 0.003877. The first row of Table 8 shows the retrieval accuracy without FFM, the ICS loss, DCN, MHSA, and GAM, which results in a decrease in mAP of 2.62% and an increase in ANMRR of 0.021852 on the NWPU dataset, and a reduction in mAP of 1.22% and an increase in ANMRR of 0.009013 on the UCMD dataset. The results in Table 8 demonstrate that excluding any improvement module lowers the retrieval accuracy, with the mAP decreasing and the ANMRR increasing progressively. Notably, removing the feature fusion module results in the largest drop in accuracy, providing evidence of the effectiveness of our proposed method. Integrating the aforementioned improvements enhances the retrieval performance of the model features.

4.8. Visualization Experiment

In this research, we utilized the Grad-CAM visualization approach [58] to obtain feature heatmaps of remote sensing images, including categories such as airplane, island, ship, stadium, thermal power station, church, dense residential, and roundabout. In Figure 11, the first row illustrates the input image, the second row displays the feature heatmap extracted using the PVT_v2 original model, and the third row shows the feature heatmap obtained through the PVTA_MSF method. The feature heatmap indicates higher feature response values in parts closer to red and lower feature response values in parts closer to blue. The feature heatmap extracted by PVT_v2 failed to accurately distinguish between targets and backgrounds. The widespread red areas on the heatmap represented the model paying attention to irrelevant background areas around the target, instead of focusing exclusively on the target. Conversely, PVTA_MSF, our improved method, outperformed PVT_v2 by effectively distinguishing targets from backgrounds, delimiting target contours more clearly, and minimizing the effect of complex background information on the targets. These effects were particularly noticeable in categories such as island and stadium, where the PVTA_MSF feature heatmap accurately focused on the target by suppressing most of the interference from background information that would otherwise contaminate the heatmap. This evidence supports the conclusion that our proposed improvement method offers better retrieval performance than the initial model.

5. Discussion

In this paper, we proposed a multi-scale feature fusion module to enhance the feature representation ability of remote sensing images. Our approach is in line with previous studies on multi-scale feature fusion for CBRSIR, such as IRMFRCAMF [20] and CNN_RFM [35], which have demonstrated that fusing low-level and high-level features can improve remote sensing retrieval performance. Our proposed feature fusion module is based on the Transformer network and performs feature extraction and fusion simultaneously, enabling end-to-end learning of the fused features. We used high-level features with rich semantic information to weight the low-level features and then concatenated the weighted low-level features with the high-level features to maximize the use of detailed information while filtering out irrelevant information. The experimental results in Table 3 show that our proposed feature fusion module achieves better retrieval performance than other fusion methods that separate the feature extraction process from the feature fusion process. Although we propose this feature fusion module for CBRSIR tasks, it can be applied to other downstream tasks, such as image classification, object detection, and semantic segmentation, because multi-scale fused features can better describe images.
Although the method proposed in this paper demonstrated good performance for CBRSIR, some limitations remain. First, we expect the proposed multi-scale fusion feature to retrieve quickly because a hash feature can be matched faster than a real-valued feature of the same dimension; however, we did not conduct an experimental analysis of retrieval speed. In future research, we will compare the retrieval speed of our model with that of other models in detail. Second, the PVTv2 model is available in several sizes, but we analyzed only one version in this study; the other versions require further analysis.
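To make the retrieval-speed argument concrete, the sketch below shows how binary hash codes can be packed into bytes and ranked with XOR plus bit counting, which is typically much cheaper than floating-point distance computation at the same dimensionality. This is illustrative code under those assumptions, not the benchmarking setup of a future speed study.

```python
import numpy as np

def pack_codes(features):
    """Binarize real-valued hash outputs by sign and pack bits into uint8."""
    bits = (np.asarray(features) > 0).astype(np.uint8)      # (N, n_bits)
    return np.packbits(bits, axis=1)                        # (N, n_bits // 8)

def hamming_rank(query_code, db_codes):
    """Rank database items by Hamming distance to the query (ascending)."""
    xor = np.bitwise_xor(db_codes, query_code)              # broadcast over N items
    dist = np.unpackbits(xor, axis=1).sum(axis=1)           # popcount per item
    return np.argsort(dist, kind="stable")

db = pack_codes(np.random.randn(10000, 64))   # 10k database images, 64-bit codes
q = pack_codes(np.random.randn(1, 64))[0]
order = hamming_rank(q, db)                   # indices of the nearest codes first
```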

6. Conclusions

This paper proposes PVTA_MSF, an end-to-end deep hash feature extraction model for remote sensing image retrieval. The proposed multi-scale feature fusion module weights and fuses the multi-layer features of the Transformer network to enhance the representation ability of the fused remote sensing image features. The experimental results show that, compared with other methods, the improved PVTA_MSF model increased the mAP from 94.6% to 98.8% on the UCMD dataset, an increase of 4.2%, and from 90.3% to 91.9% on the NWPU dataset, an increase of 1.6%. In addition, the intra-class similarity loss function strengthens the discriminative power of the hash features. The ablation experiments show that incorporating the multi-scale feature fusion module and the intra-class similarity loss function raised the mAP from 97.54% to 98.76% on the UCMD dataset and from 89.31% to 91.93% on the NWPU dataset. By combining these improvements, we achieved a significant gain in retrieval performance. The use of hashing improves retrieval efficiency and reduces memory consumption while maintaining accuracy, making the method suitable for large-scale remote sensing image retrieval. Although the retrieval performance is improved, the method requires a large number of manually labeled training samples; future research should therefore focus on reducing the required training samples through approaches such as self-supervision and few-shot learning.
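As a purely illustrative companion to this conclusion (the exact ICS loss formulation is defined earlier in the paper and is not reproduced here), a loss that minimizes pairwise distances among same-category hash features within a batch could be sketched as follows; the function name and batch setup are hypothetical.

```python
import torch

def intra_class_distance_penalty(features, labels):
    """Mean squared pairwise distance between features sharing a label.
    Illustrative only; not the exact ICS loss of this paper."""
    dists = torch.cdist(features, features, p=2) ** 2          # (B, B)
    same = labels.unsqueeze(0) == labels.unsqueeze(1)           # same-class mask
    same = same & ~torch.eye(len(labels), dtype=torch.bool, device=labels.device)
    if same.any():
        return dists[same].mean()
    return (features * 0.0).sum()                               # no same-class pair

feats = torch.randn(8, 64, requires_grad=True)   # batch of hash-layer outputs
labels = torch.randint(0, 3, (8,))
loss = intra_class_distance_penalty(feats, labels)
loss.backward()
```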

Author Contributions

Conceptualization, F.Y.; methodology, F.Y. and K.W.; software, K.W.; validation, M.W., X.M. and R.Z.; resources, F.Y.; writing, original draft preparation, K.W.; writing, review and editing, F.Y. and D.L.; supervision, F.Y. and D.L.; funding acquisition, D.L. and F.Y. All authors have read and agreed to the published version of the manuscript.

Funding

This research was supported by the National Natural Science Foundation of China (No. 41801288), the Natural Science Foundation of Jiangxi Province (No. 20202BABL202030), and the Key Laboratory of Mine Environmental Monitoring and Improving around Poyang Lake of the Ministry of Natural Resources (No. MEMI-2021-2022-22).

Data Availability Statement

Not applicable.

Acknowledgments

The authors would like to show their gratitude to the editors and the anonymous reviewers for their comments and suggestions.

Conflicts of Interest

The authors declare no conflict of interest.

References

  1. Tang, X.; Yang, Y.; Ma, J.; Cheung, Y.M.; Liu, C.; Liu, F.; Zhang, X.; Jiao, L. Meta-Hashing for Remote Sensing Image Retrieval. IEEE Trans. Geosci. Remote Sens. 2022, 60, 5615419. [Google Scholar] [CrossRef]
  2. Yuan, Z.; Zhang, W.; Fu, K.; Li, X.; Deng, C.; Wang, H.; Sun, X. Exploring a fine-grained multiscale method for cross-modal remote sensing image retrieval. arXiv 2022, arXiv:2204.09868. [Google Scholar] [CrossRef]
  3. Ye, F.; Luo, W.; Dong, M.; He, H.; Min, W. SAR Image retrieval based on unsupervised domain adaptation and clustering. IEEE Geosci. Remote Sens. Lett. 2019, 16, 1482–1486. [Google Scholar] [CrossRef]
  4. Sumbul, G.; Ravanbakhsh, M.; Demir, B. Informative and Representative Triplet Selection for Multilabel Remote Sensing Image Retrieval. IEEE Trans. Geosci. Remote Sens. 2022, 60, 5405811. [Google Scholar] [CrossRef]
  5. Zhuo, Z.; Zhou, Z. Remote Sensing Image Retrieval with Gabor-CA-ResNet and Split-Based Deep Feature Transform Network. Remote Sens. 2021, 13, 869. [Google Scholar] [CrossRef]
  6. Mehmood, M.; Shahzad, A.; Zafar, B.; Shabbir, A.; Ali, N. Remote sensing image classification: A comprehensive review and application. Math. Probl. Eng. 2022, 2022, 5880959. [Google Scholar] [CrossRef]
  7. Ma, J.; Shi, D.; Tang, X.; Zhang, X.; Jiao, L. Dual Modality Collaborative Learning for Cross-Source Remote Sensing Retrieval. Remote Sens. 2022, 14, 1319. [Google Scholar] [CrossRef]
  8. Shabbir, A.; Ali, N.; Ahmed, J.; Zafar, B.; Rasheed, A.; Sajid, M.; Ahmed, A.; Dar, S.H. Satellite and scene image classification based on transfer learning and fine tuning of ResNet50. Math. Probl. Eng. 2021, 2021, 5843816. [Google Scholar] [CrossRef]
  9. Wang, Y.; Ji, S.; Lu, M.; Zhang, Y. Attention boosted bilinear pooling for remote sensing image retrieval. Int. J. Remote Sens. 2020, 41, 2704–2724. [Google Scholar] [CrossRef]
  10. Bo, L.; Sminchisescu, C. Efficient match kernel between sets of features for visual recognition. Adv. Neural Inf. Process. Syst. 2009, 22, 135–143. [Google Scholar]
  11. Ye, F.; Su, Y.; Xiao, H.; Zhao, X.; Min, W. Remote Sensing Image Registration Using Convolutional Neural Network Features. IEEE Geosci. Remote Sens. Lett. 2018, 15, 232–236. [Google Scholar] [CrossRef]
  12. Ye, F.; Luo, W.; Dong, M.; Li, D.; Min, W. Content-based Remote Sensing Image Retrieval Based on Fuzzy Rules and a Fuzzy Distance. IEEE Geosci. Remote Sens. Lett. 2022, 19, 8002505. [Google Scholar] [CrossRef]
  13. Kumar, A.; Yadav, D.P.; Kumar, D.; Pant, M.; Pant, G. Multi-scale feature fusion-based lightweight dual stream transformer for detection of paddy leaf disease. Environ. Monit. Assess. 2023, 195, 1020. [Google Scholar] [CrossRef]
  14. Ghaderizadeh, S.; Abbasi-Moghadam, D.; Sharifi, A.; Tariq, A.; Qin, S. Multiscale Dual-Branch Residual Spectral-Spatial Network With Attention for Hyperspectral Image Classification. IEEE J. Sel. Top. Appl. Earth Obs. Remote Sens. 2022, 15, 5455–5467. [Google Scholar] [CrossRef]
  15. Chen, H.; Guo, X. Multi-scale feature fusion pedestrian detection algorithm based on Transformer. In Proceedings of the 2023 4th International Conference on Computer Vision, Image and Deep Learning (CVIDL), Zhuhai, China, 12–14 May 2023; pp. 536–540. [Google Scholar]
  16. Dosovitskiy, A.; Beyer, L.; Kolesnikov, A.; Weissenborn, D.; Zhai, X.; Unterthiner, T.; Dehghani, M.; Minderer, M.; Heigold, G.; Gelly, S.; et al. An Image is Worth 16x16 Words: Transformers for Image Recognition at Scale. arXiv 2020, arXiv:2010.11929. [Google Scholar]
  17. Liu, Z.; Lin, Y.; Cao, Y.; Hu, H.; Wei, Y.; Zhang, Z.; Lin, S.; Guo, B. Swin Transformer: Hierarchical Vision Transformer using Shifted Windows. arXiv 2021, arXiv:2103.14030. [Google Scholar]
  18. Wang, W.; Xie, E.; Li, X.; Fan, D.; Song, K.; Liang, D.; Lu, T.; Luo, P.; Shao, L. Pyramid Vision Transformer: A Versatile Backbone for Dense Prediction without Convolutions. arXiv 2021, arXiv:2102.12122. [Google Scholar]
  19. Wang, W.; Xie, E.; Li, X.; Fan, D.; Song, K.; Liang, D.; Lu, T.; Luo, P.; Shao, L. PVT v2: Improved baselines with Pyramid Vision Transformer. Comput. Vis. Media 2022, 8, 415–424. [Google Scholar] [CrossRef]
  20. Li, Y.; Zhang, Y.; Tao, C.; Zhu, H. Content-Based High-Resolution Remote Sensing Image Retrieval via Unsupervised Feature Learning and Collaborative Affinity Metric Fusion. Remote Sens. 2016, 8, 709. [Google Scholar] [CrossRef]
  21. Imbriaco, R.; Sebastian, C.; Bondarev, E. Aggregated Deep Local Features for Remote Sensing Image Retrieval. Remote Sens. 2019, 11, 493. [Google Scholar] [CrossRef]
  22. Hou, D.; Miao, Z.; Xing, H.; Wu, H. Exploiting low dimensional features from the MobileNets for remote sensing image retrieval. Earth Sci. Inform. 2020, 13, 1437–1443. [Google Scholar] [CrossRef]
  23. Wang, Y.; Ji, S.; Zhang, Y. A learnable joint spatial and spectral transformation for high resolution remote sensing image retrieval. IEEE J. Sel. Top. Appl. Earth Obs. Remote Sens. 2021, 14, 8100–8112. [Google Scholar] [CrossRef]
  24. Wu, Z.; Zou, C.; Wang, Y.; Tan, M.; Weise, T. Rotation-Aware Representation Learning for Remote Sensing Image Retrieval. Inf. Sci. 2021, 572, 404–423. [Google Scholar] [CrossRef]
  25. Li, Y.; Zhang, Y.; Xin, H.; Hu, Z.; Ma, J. Large-Scale Remote Sensing Image Retrieval by Deep Hashing Neural Networks. IEEE Trans. Geosci. Remote Sens. 2017, 56, 950–965. [Google Scholar] [CrossRef]
  26. Roy, S.; Sangineto, E.; Demir, B.; Sebe, N. Metric-Learning based Deep Hashing Network for Content Based Retrieval of Remote Sensing Images; Cornell University: Ithaca, NY, USA, 2019. [Google Scholar]
  27. Liu, C.; Ma, J.; Tang, X.; Zhang, X.; Jiao, L. Adversarial hash-code learning for remote sensing image retrieval. In Proceedings of the IGARSS 2019—2019 IEEE International Geoscience and Remote Sensing Symposium, Yokohama, Japan, 28 July–2 August 2019; pp. 4324–4327. [Google Scholar]
  28. Cheng, Q.; Huang, H.; Ye, L.; Fu, P.; Gan, D.; Zhou, Y. A Semantic-Preserving Deep Hashing Model for Multi-Label Remote Sensing Image Retrieval. Remote Sens. 2021, 13, 4965. [Google Scholar] [CrossRef]
  29. Tan, X.; Zou, Y.; Guo, Z.; Zhou, K.; Yuan, Q. Deep Contrastive Self-Supervised Hashing for Remote Sensing Image Retrieval. Remote Sens. 2022, 14, 3643. [Google Scholar] [CrossRef]
  30. Jing, W.; Xu, Z.; Li, L.; Wang, J.; He, Y.; Chen, G. Deep Unsupervised Weighted Hashing for Remote Sensing Image Retrieval. J. Database Manag. (JDM) 2022, 33, 1–19. [Google Scholar] [CrossRef]
  31. Yang, K.; Li, C.; Zhou, W.; Cheng, Q.; Ren, Y. Remote sensing image retrieval based on multi-layer feature integration of convolution neural networks. Sci. Surv. Mapp. 2019, 44, 9–15. [Google Scholar]
  32. Li, Y.; Wang, Q.; Liang, X.; Jiao, L. A Novel Deep Feature Fusion Network for Remote Sensing Scene Classification. In Proceedings of the IGARSS 2019—2019 IEEE International Geoscience and Remote Sensing Symposium, Yokohama, Japan, 28 July–2 August 2019; pp. 5484–5487. [Google Scholar]
  33. Yin, W.; Zhang, Y.; Sun, X.; Fu, K. A Image Retrieval Method in High-resolution Remote Sensing Images based on Deep Descriptor Fusion. In Proceedings of the Fifth Annual Symposium on High Resolution Earth Observation, Xian, China, 17 October–18 October 2018; pp. 893–908. [Google Scholar]
  34. Alhichri, H.; Alajlan, N.; Bazi, Y.; Rabczuk, T. Multi-Scale Convolutional Neural Network for Remote Sensing Scene Classification. In Proceedings of the 2018 IEEE International Conference on Electro/Information Technology (EIT), Rochester, MI, USA, 3–5 May 2018; pp. 1–5. [Google Scholar]
  35. Vharkate, M.N.; Musande, V.B. Fusion Based Feature Extraction and Optimal Feature Selection in Remote Sensing Image Retrieval. Multimed. Tools Appl. 2022, 81, 31787–31814. [Google Scholar] [CrossRef]
  36. Hadsell, R.; Chopra, S.; LeCun, Y. Dimensionality reduction by learning an invariant mapping. In Proceedings of the 2006 IEEE Computer Society Conference on Computer Vision and Pattern Recognition (CVPR’06), New York, NY, USA, 17–22 June 2006; pp. 1735–1742. [Google Scholar]
  37. Hoffer, E.; Ailon, N. Deep metric learning using triplet network. In Proceedings of the International Workshop on Similarity-Based Pattern Recognition, Copenhagen, Denmark, 12–14 October 2015; pp. 84–92. [Google Scholar]
  38. Sohn, K. Improved deep metric learning with multi-class n-pair loss objective. In Proceedings of the Advances in Neural Information Processing Systems, Barcelona, Spain, 5–10 December 2016; pp. 1857–1865. [Google Scholar]
  39. Movshovitz-Attias, Y.; Toshev, A.; Leung, T.K.; Ioffe, S.; Singh, S. No Fuss Distance Metric Learning Using Proxies. In Proceedings of the 2017 IEEE International Conference on Computer Vision (ICCV), Venice, Italy, 22–29 October 2017; pp. 360–368. [Google Scholar]
  40. Qian, Q.; Shang, L.; Sun, B.; Hu, J.; Tacoma, T.; Li, H.; Jin, R. SoftTriple Loss: Deep Metric Learning Without Triplet Sampling. In Proceedings of the 2019 IEEE/CVF International Conference on Computer Vision (ICCV), Seoul, Republic of Korea, 27 October–3 November 2019. [Google Scholar]
  41. Liu, P.; Gou, G.; Shan, X.; Tao, D.; Zhou, Q. Global Optimal Structured Embedding Learning for Remote Sensing Image Retrieval. Sensors 2020, 20, 291. [Google Scholar] [CrossRef]
  42. Shan, X.; Liu, P.; Wang, Y.; Zhou, Q.; Wang, Z. Deep Hashing Using Proxy Loss on Remote Sensing Image Retrieval. Remote Sens. 2021, 13, 2924. [Google Scholar] [CrossRef]
  43. Liu, Y.; Shao, Z.; Hoffmann, N. Global Attention Mechanism: Retain Information to Enhance Channel-Spatial Interactions. arXiv 2021, arXiv:2112.05561. [Google Scholar]
  44. Woo, S.; Park, J.; Lee, J.Y.; Kweon, I.S. CBAM: Convolutional Block Attention Module. arXiv 2018, arXiv:1807.06521. [Google Scholar]
  45. Vaswani, A.; Shazeer, N.; Parmar, N.; Uszkoreit, J.; Jones, L.; Gomez, A.N.; Kaiser, Ł.; Polosukhin, I. Attention is all you need. In Advances in Neural Information Processing Systems; MIT Press: Cambridge, MA, USA, 2017; pp. 5998–6008. [Google Scholar]
  46. Fan, D.-P.; Ji, G.-P.; Zhou, T.; Chen, G.; Fu, H.; Shen, J.; Shao, L. Pranet: Parallel Reverse Attention Network for Polyp Segmentation. In International Conference on Medical Image Computing and Computer-Assisted Intervention; Springer: Berlin/Heidelberg, Germany, 2020; pp. 263–273. [Google Scholar]
  47. Wu, Z.; Su, L.; Huang, Q. Cascaded partial decoder for fast and accurate salient object detection. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Long Beach, CA, USA, 15–20 June 2019; pp. 3907–3916. [Google Scholar]
  48. Dong, B.; Wang, W.; Fan, D.P.; Li, J.; Fu, H.; Shao, L. Polyp-pvt: Polyp segmentation with pyramid vision transformers. arXiv 2021, arXiv:2108.06932. [Google Scholar] [CrossRef]
  49. Ioffe, S.; Szegedy, C. Batch Normalization: Accelerating Deep Network Training by Reducing Internal Covariate Shift. arXiv 2015, arXiv:1502.03167. [Google Scholar]
  50. Glorot, X.; Bordes, A.; Bengio, Y. Deep Sparse Rectifier Neural Networks. J. Mach. Learn. Res. 2011, 15, 315–323. [Google Scholar]
  51. Dai, J.; Qi, H.; Xiong, Y.; Li, Y.; Zhang, G.; Hu, H.; Wei, Y. Deformable convolutional networks. In Proceedings of the IEEE International Conference on Computer Vision, Venice, Italy, 22–29 October 2017; pp. 764–773. [Google Scholar]
  52. Yang, Y.; Newsam, S. Geographic image retrieval using local invariant features. IEEE Trans. Geosci. Remote Sens. 2013, 51, 818–832. [Google Scholar] [CrossRef]
  53. Cheng, G.; Han, J.; Lu, X. Remote sensing image scene classification: Benchmark and state of the art. Proc. IEEE 2017, 105, 1865–1883. [Google Scholar] [CrossRef]
  54. Tang, X.; Zhang, X.; Liu, F.; Jiao, L. Unsupervised deep feature learning for remote sensing image retrieval. Remote Sens. 2018, 10, 1243. [Google Scholar] [CrossRef]
  55. He, K.; Zhang, X.; Ren, S.; Sun, J. Deep Residual Learning for Image Recognition. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), Las Vegas, NV, USA, 27–30 June 2016. [Google Scholar]
  56. Cheng, G.; Yang, C.; Yao, X.; Guo, L.; Han, J. When deep learning meets metric learning: Remote sensing image scene classification via learning discriminative CNNs. IEEE Trans. Geosci. Remote Sens. 2018, 56, 2811–2821. [Google Scholar] [CrossRef]
  57. Li, X.; Wei, S.; Wang, J.; Du, Y.; Ge, M. Adaptive Multi-Proxy for Remote Sensing Image Retrieval. Remote Sens. 2022, 14, 5615. [Google Scholar] [CrossRef]
  58. Selvaraju, R.R.; Cogswell, M.; Das, A.; Vedantam, R.; Parikh, D.; Batra, D. Grad-cam: Visual explanations from deep networks via gradient-based localization. In Proceedings of the IEEE International Conference on Computer Vision, Venice, Italy, 22–29 October 2017; pp. 618–626. [Google Scholar]
Figure 1. The network structure of the model. (a) the multi-scale feature fusion module, and (b) the loss function module.
Figure 2. Global attention mechanism.
Figure 3. Channel attention submodule.
Figure 4. Spatial attention submodule.
Figure 5. Multi-head self-attention mechanism.
Figure 6. Feature fusion module.
Figure 7. Sampling methods of the 3 × 3 standard convolution kernel and deformable convolution. (a) A standard convolution; (b–d) deformable convolutions.
Figure 8. Schematic diagram of the calculation process of ICS loss.
Figure 9. Comparison of Precision@N on the UCMD dataset and NWPU dataset.
Figure 10. Comparison of P-R curves on the UCMD dataset and NWPU dataset.
Figure 11. Feature heatmaps of images from multiple categories. The regions in the image that are closer to red have higher feature response values, while the regions closer to blue have lower feature response values.
Table 1. Parameters of the convolution layers in the feature fusion module (FFM).

| Layer | Parameters | Layer | Parameters |
|---|---|---|---|
| C1 | (64, 64, 3, 1) | C6 | (144, 144, 3, 1) |
| C2 | (128, 128, 3, 1) | C7 | (64, 8, 3, 1) |
| C3 | (64, 16, 3, 1) | C8 | (64, 8, 3, 1) |
| C4 | (64, 16, 3, 1) | C9 | (16, 8, 3, 1) |
| C5 | (128, 128, 3, 1) | C10 | (152, 152, 3, 1) |
Table 2. Comparison of the mAP (%) of different methods on the UCMD and NWPU datasets. The best results are highlighted in bold.

| Method | UCMD | NWPU |
|---|---|---|
| ResNet50 [55] | 81.6 | 79.8 |
| DBOW [54] | 83.0 | 82.1 |
| GOSLm [41] | 85.8 | 90.3 |
| D-CNN [56] | 87.4 | 73.6 |
| V-DELF [21] | 91.6 | 85.7 |
| JSST [23] | 92.9 | 80.4 |
| MobileNets [22] | - | 83.1 |
| Multi-Proxy [57] | 94.6 | - |
| PVTA_MSF | 98.8 | 91.9 |
Table 3. Comparison with other multi-feature fusion methods on the UCMD dataset. The best results are highlighted in bold.

| Method | mAP (%) | ANMRR |
|---|---|---|
| VD16 [31] | 51.05 | 0.4057 |
| Fused mean-max [33] | 71.20 | 0.5169 |
| IRMFRCAMF [20] | 76.57 | - |
| Fused CNN_RFM [35] | 95.07 | 0.0820 |
| PVTA_MSF | 98.76 | 0.0111 |
Table 4. Comparison of class mAP with other methods on the UCMD dataset. The best results are highlighted in bold.

| Categories | ResNet50 [55] | D-CNN [56] | DBOW [54] | V-DELF [21] | GOSLm [41] | PVT_v2 | PVTA_MSF |
|---|---|---|---|---|---|---|---|
| Agriculture | 85 | 99 | 92 | 80 | 95 | 100 | 100 |
| Airplane | 93 | 98 | 95 | 97 | 82 | 100 | 100 |
| Baseball | 73 | 82 | 87 | 77 | 90 | 97 | 100 |
| Beach | 99 | 99 | 88 | 94 | 92 | 96 | 99 |
| Buildings | 74 | 80 | 93 | 85 | 78 | 100 | 100 |
| Chaparral | 95 | 100 | 94 | 100 | 95 | 88 | 95 |
| Dense | 62 | 65 | 96 | 90 | 55 | 100 | 100 |
| Forest | 87 | 99 | 99 | 98 | 95 | 98 | 96 |
| Freeway | 69 | 92 | 78 | 99 | 83 | 100 | 100 |
| Golf | 73 | 93 | 85 | 83 | 92 | 94 | 95 |
| Harbor | 97 | 100 | 95 | 100 | 95 | 99 | 100 |
| Intersection | 81 | 79 | 77 | 86 | 80 | 93 | 100 |
| Medium-density | 80 | 69 | 74 | 92 | 59 | 100 | 100 |
| Mobile | 74 | 89 | 76 | 94 | 80 | 100 | 100 |
| Overpass | 97 | 82 | 86 | 99 | 78 | 100 | 100 |
| Parking | 92 | 99 | 67 | 99 | 95 | 100 | 100 |
| River | 66 | 88 | 74 | 87 | 86 | 94 | 97 |
| Runway | 93 | 98 | 66 | 99 | 91 | 100 | 100 |
| Sparse | 69 | 83 | 79 | 79 | 91 | 91 | 92 |
| Storage | 86 | 60 | 50 | 93 | 95 | 100 | 100 |
| Tennis | 70 | 83 | 94 | 94 | 95 | 98 | 100 |
| Average | 81.6 | 87.4 | 83.0 | 91.6 | 85.8 | 97.5 | 98.8 |
Table 5. Comparison of class mAP with other methods on the NWPU dataset. The best results are highlighted in bold.

| Categories | ResNet50 [55] | D-CNN [56] | DBOW [54] | V-DELF [21] | GOSLm [41] | PVT_v2 | PVTA_MSF |
|---|---|---|---|---|---|---|---|
| Airplane | 88 | 82 | 98 | 93 | 96 | 94 | 97 |
| Airport | 72 | 64 | 95 | 81 | 90 | 93 | 93 |
| Baseball Diamond | 69 | 61 | 86 | 64 | 93 | 92 | 98 |
| Basketball Court | 61 | 59 | 83 | 71 | 90 | 89 | 96 |
| Beach | 77 | 78 | 85 | 83 | 96 | 91 | 96 |
| Bridge | 73 | 79 | 95 | 81 | 93 | 92 | 92 |
| Chaparral | 98 | 99 | 96 | 99 | 98 | 98 | 99 |
| Church | 56 | 39 | 80 | 64 | 64 | 73 | 74 |
| Circular Farmland | 97 | 88 | 94 | 99 | 97 | 97 | 97 |
| Cloud | 92 | 93 | 98 | 98 | 98 | 98 | 99 |
| Commercial Area | 82 | 59 | 79 | 88 | 78 | 77 | 88 |
| Dense Residential | 89 | 76 | 90 | 95 | 92 | 78 | 88 |
| Desert | 87 | 89 | 97 | 92 | 90 | 90 | 93 |
| Forest | 95 | 94 | 95 | 97 | 94 | 90 | 94 |
| Freeway | 65 | 64 | 64 | 86 | 88 | 73 | 80 |
| Golf Course | 96 | 86 | 82 | 97 | 96 | 98 | 99 |
| Ground Track Field | 63 | 62 | 80 | 77 | 96 | 93 | 97 |
| Harbor | 93 | 90 | 88 | 97 | 99 | 99 | 99 |
| Industrial Area | 75 | 65 | 85 | 88 | 90 | 79 | 86 |
| Intersection | 64 | 58 | 80 | 72 | 97 | 91 | 89 |
| Island | 88 | 90 | 88 | 94 | 93 | 89 | 90 |
| Lake | 80 | 75 | 85 | 85 | 89 | 92 | 92 |
| Meadow | 84 | 88 | 90 | 93 | 93 | 84 | 90 |
| Medium Residential | 78 | 64 | 94 | 77 | 82 | 77 | 83 |
| Mobile Home Park | 93 | 85 | 83 | 97 | 94 | 94 | 96 |
| Mountain | 88 | 85 | 95 | 96 | 86 | 90 | 90 |
| Overpass | 87 | 67 | 74 | 90 | 95 | 94 | 93 |
| Palace | 41 | 30 | 80 | 56 | 51 | 58 | 68 |
| Parking Lot | 95 | 90 | 70 | 97 | 98 | 95 | 97 |
| Railway | 88 | 81 | 84 | 89 | 77 | 86 | 88 |
| Railway Station | 62 | 55 | 86 | 73 | 81 | 81 | 81 |
| Rectangular Farmland | 82 | 79 | 66 | 88 | 86 | 85 | 90 |
| River | 70 | 59 | 76 | 75 | 90 | 85 | 83 |
| Roundabout | 72 | 75 | 83 | 90 | 95 | 96 | 96 |
| Runway | 80 | 81 | 78 | 89 | 90 | 87 | 92 |
| Sea Ice | 98 | 97 | 90 | 99 | 99 | 99 | 99 |
| Ship | 61 | 73 | 65 | 69 | 95 | 95 | 98 |
| Snowberg | 97 | 96 | 83 | 98 | 99 | 98 | 98 |
| Sparse Residential | 69 | 71 | 84 | 70 | 93 | 91 | 94 |
| Stadium | 81 | 62 | 57 | 86 | 92 | 95 | 95 |
| Storage Tank | 88 | 86 | 48 | 94 | 98 | 98 | 99 |
| Tennis Court | 80 | 38 | 72 | 78 | 95 | 95 | 95 |
| Terrace | 88 | 83 | 76 | 90 | 89 | 91 | 90 |
| Thermal Power Station | 68 | 47 | 72 | 78 | 89 | 93 | 95 |
| Wetland | 82 | 71 | 70 | 80 | 85 | 86 | 91 |
| Average | 79.8 | 73.6 | 82.1 | 85.7 | 90.3 | 89.3 | 91.9 |
Table 6. Comparison of mAP (%) at different feature dimensions on the UCMD dataset. The best results are highlighted in bold.

| Method | 1024 | 512 | 256 | 128 | 64 | 32 |
|---|---|---|---|---|---|---|
| GOSLm [41] | 85.6 | 85.8 | 85.1 | 85.0 | 84.4 | - |
| V-DELF [21] | 89.4 | 89.8 | 90.3 | 90.7 | 91.2 | 90.7 |
| PVT_v2 | 97.5 | 97.5 | 97.7 | 97.6 | 97.4 | 97.3 |
| PVTA_MSF | 98.8 | 98.6 | 98.7 | 98.5 | 98.6 | 98.1 |
Table 7. Comparison of mAP (%) at different feature dimensions on the NWPU dataset. The best results are highlighted in bold.

| Method | 1024 | 512 | 256 | 128 | 64 | 32 |
|---|---|---|---|---|---|---|
| GOSLm [41] | 88.8 | 90.3 | 88.6 | 88.2 | 87.9 | - |
| V-DELF [21] | 85.8 | 86.2 | 86.5 | 86.6 | 85.9 | 81.8 |
| MobileNets [22] | 71.5 | 73.1 | 75.4 | 77.8 | 80.4 | 83.1 |
| PVT_v2 | 89.3 | 89.3 | 89.4 | 89.2 | 89.1 | 88.7 |
| PVTA_MSF | 91.9 | 91.7 | 91.6 | 91.2 | 90.5 | 89.1 |
Table 8. The ablation results of PVTA_MSF on the UCMD and NWPU datasets. The best results are highlighted in bold.

| Method | UCMD mAP (%) | UCMD ANMRR | NWPU mAP (%) | NWPU ANMRR |
|---|---|---|---|---|
| Backbone | 97.54 | 0.020115 | 89.31 | 0.082980 |
| Backbone + FFM | 98.29 | 0.014979 | 90.95 | 0.069759 |
| Backbone + FFM + GAM + MHSA | 98.47 | 0.014262 | 91.20 | 0.067264 |
| Backbone + FFM + GAM + MHSA + DCN | 98.54 | 0.013228 | 91.56 | 0.064709 |
| Backbone + FFM + GAM + MHSA + DCN + ICS loss | 98.76 | 0.011102 | 91.93 | 0.061128 |
Disclaimer/Publisher’s Note: The statements, opinions and data contained in all publications are solely those of the individual author(s) and contributor(s) and not of MDPI and/or the editor(s). MDPI and/or the editor(s) disclaim responsibility for any injury to people or property resulting from any ideas, methods, instructions or products referred to in the content.
