1. Introduction
With the rapid development of remote sensing technology, the number of high-resolution remote sensing images has increased dramatically. How to manage and analyze these images effectively has become a pressing issue. In this context, content-based remote sensing image retrieval (CBRSIR) [1,2] is a key problem in the effective use of remote sensing big data. CBRSIR has two main components: feature extraction and similarity measurement. It automatically extracts feature representations of images and measures the similarity between them. Its performance depends mainly on the representation power of the feature embedding. Therefore, research on CBRSIR focuses mainly on feature extraction [3,4,5,6].
With regard to feature extraction, existing methods can be divided into those based on handcrafted features and those based on learned features [7]. Handcrafted features include global features such as color, texture, and shape, and local features based on SIFT [8] and SURF [9]; these are low-level features. The Bag of Words (BOW) model [10,11,12] and the Vector of Locally Aggregated Descriptors (VLAD) [13] were proposed to encode local features and further improve the representation power; the resulting encodings are mid-level features. Whether global or local, handcrafted features have difficulty expressing image semantics precisely. That is, there is a “semantic gap” between low-level features and high-level semantics. With the progress of deep learning, and especially the excellent performance of Convolutional Neural Networks (CNNs) in computer vision tasks such as classification [14,15,16], detection [17,18,19], and segmentation [20,21], CNNs have been widely applied to automatic image feature extraction. CNNs achieve better performance than handcrafted features because they can extract high-level semantic features by stacking a large number of convolutional layers with non-linearities. Ge et al. [22] transfer CNNs pre-trained on ImageNet to remote sensing image datasets and compare the features extracted by the pre-trained CNNs with handcrafted features for CBRSIR, concluding that the CNN features outperform the traditional features by a large margin.
However, when facing large-scale, high-resolution remote sensing image datasets, it is difficult to achieve satisfactory retrieval results with pre-trained or fine-tuned CNNs alone. We think that this is mainly due to two reasons. The first is the insufficient representation power of the feature embedding. Figure 1 shows some examples from the Aerial Image Data (AID) dataset [23] that are used for remote sensing image retrieval. Figure 1a,b suggest that these two types of remote sensing images are characterized by color and texture features. In Figure 1c,d, by contrast, the images are characterized by the local shape of the aircraft and the bridge rather than by the land and water that occupy most of the image area. Therefore, the features used for CBRSIR should account for both the global and the local salient features of the image. However, pre-trained CNN features may not meet this requirement, because of the large difference between the ImageNet dataset and remote sensing image datasets, and because it is difficult to cover an image's global and local salient features simultaneously with a single convolutional layer or fully connected layer. Some works [24,25,26] try to address these problems by improving the dataset, introducing multi-label and densely labeled remote sensing datasets for training. This alleviates the problem to a certain extent, but the disadvantage is also obvious: multi-label annotation is time-consuming and costly. The second reason is the inconsistency between the objective of the training process and that of the retrieval process. The feature representation used in CBRSIR is obtained by training for classification. Classification accuracy can even reach 99% on the validation and test sets, but the accuracy is far lower when the same features are used for image retrieval. This is mainly due to the difference between classification and retrieval. In classification, softmax loss is usually used for training; it encourages features to be separable, that is, it disperses the features of different classes. In CBRSIR, however, the similarity between images is measured by the Euclidean distance or the cosine distance, which requires the feature representation to be not only separable but also discriminative. Discriminative features are dispersed between classes and as compact as possible within each class.
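The distance-based similarity measurement described above can be sketched as follows. This is a minimal illustration, assuming that each image has already been mapped to a fixed-length feature vector; the function name `rank_by_distance` and the toy vectors are hypothetical and are not taken from this paper.

```python
import numpy as np

def rank_by_distance(query, gallery, metric="euclidean"):
    """Return gallery indices ordered from most to least similar to the query."""
    query = np.asarray(query, dtype=float)      # shape (D,)
    gallery = np.asarray(gallery, dtype=float)  # shape (N, D)
    if metric == "euclidean":
        dist = np.linalg.norm(gallery - query, axis=1)
    elif metric == "cosine":
        # Cosine distance = 1 - cosine similarity of the L2-normalized vectors.
        q = query / np.linalg.norm(query)
        g = gallery / np.linalg.norm(gallery, axis=1, keepdims=True)
        dist = 1.0 - g @ q
    else:
        raise ValueError(f"unknown metric: {metric}")
    return np.argsort(dist, kind="stable")

# Toy embeddings (hypothetical): items 0 and 1 are near the query, item 2 is not.
gallery = np.array([[1.0, 0.0], [0.9, 0.1], [0.0, 1.0]])
query = np.array([1.0, 0.05])
euclidean_order = rank_by_distance(query, gallery, "euclidean")
cosine_order = rank_by_distance(query, gallery, "cosine")
```

Because retrieval ranks images purely by such distances, features of the same class need to lie close together in the embedding space, not merely on the correct side of a decision boundary.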
In this paper, we propose two schemes in response to the above problems. We first propose a new attention module for feature extraction in CBRSIR, which pays more attention to the salient features and suppresses the less useful ones. We then provide a center loss-based multi-task learning network structure to further boost the discriminative power of the features. The framework of our proposed method is shown in Figure 2.
The main contributions of this paper can be summarized as follows:
To obtain salient and effective features, we propose a new attention module, which can be easily connected to the last convolutional layer of any pre-trained CNN and is applied along two dimensions, channel and spatial, to emphasize the meaningful features along these two axes.
We propose a multi-task learning network structure that introduces center loss as a network branch in the training phase, to penalize the intra-class distances of the features and to improve the discriminative ability of the deep features.
The two schemes that we propose can be combined and integrated into the same training network to further improve performance.
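As a rough illustration of the second contribution, the sketch below combines a softmax classification loss with a center loss term that penalizes the distance between each feature and the center of its class. It follows the commonly used center loss formulation; the weight `lam`, the averaging over the batch, and all function names are assumptions made for illustration and are not details taken from this paper. Updating the class centers during training is omitted.

```python
import numpy as np

def softmax_cross_entropy(logits, labels):
    """Mean softmax cross-entropy over a batch of (N, K) logits."""
    shifted = logits - logits.max(axis=1, keepdims=True)  # numerical stability
    log_probs = shifted - np.log(np.exp(shifted).sum(axis=1, keepdims=True))
    return -log_probs[np.arange(len(labels)), labels].mean()

def center_loss(features, labels, centers):
    """Half the mean squared distance between each feature and its class center."""
    diff = features - centers[labels]  # (N, D)
    return 0.5 * (diff ** 2).sum(axis=1).mean()

def joint_loss(logits, features, labels, centers, lam=0.01):
    """Classification loss plus a weighted intra-class compactness penalty.

    lam is a hypothetical balancing weight, not a value from the paper.
    """
    return softmax_cross_entropy(logits, labels) + lam * center_loss(
        features, labels, centers
    )
```

The classification term keeps classes separable, while the center term shrinks the spread of each class around its center, which is the intra-class compactness that distance-based retrieval relies on.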
The rest of this paper is organized as follows. Section 2 presents published work related to feature extraction for CBRSIR. Section 3 discusses our two proposed schemes for generating discriminative feature representations. Section 4 presents the experimental results and analysis. Section 5 provides a discussion, and Section 6 draws some conclusions.
2. Related Work
In this section, we present related work on feature extraction, attention mechanisms, and loss functions.
2.1. Learning-Based Feature Representation for CBRSIR
CNNs have become dominant in feature extraction and have gradually replaced traditional methods in the field of computer vision. Their success is mainly due to the fact that deep network structures provide a large number of nonlinear functions whose weight parameters can be learned automatically from the training data. However, remote sensing image datasets cannot provide enough data to train CNNs from scratch. CNNs pre-trained on massive datasets have therefore been used to extract feature embeddings, which has proven to be effective and efficient even when the pre-training dataset differs considerably from remote sensing images. There are two main ways to exploit pre-trained CNNs: taking the outputs of the fully connected layers or those of the convolutional layers as the feature representation. Many works [3,22,27,28] have compared the performance of feature representations extracted from different networks and different layers. Ge, Jiang, Xu, Jiang and Ye [22] exploit representations from pre-trained CNNs and adopt feature combination and compression to improve the feature representation. The experimental results demonstrate that the pre-trained features and aggregated features are simple and are able to improve retrieval performance. Zhou, Newsam, Li and Shao [28] propose to fine-tune the pre-trained CNNs on a remote sensing dataset, and they also propose a novel CNN architecture based on a three-layer perceptron that has fewer parameters and can learn low-dimensional features. The results show that both the fine-tuned CNNs and the novel CNN are effective. Li et al. [29] propose a novel approach based on deep hashing neural networks for large-scale RSIR, in which deep feature learning networks and hashing learning networks are combined in an end-to-end network. Zhou, Deng and Shao [26] propose a novel multi-label RSIR method using fully convolutional networks (FCN). A pixel-wise labeled dataset is used to train the FCN. The segmentation map of each remote sensing image is predicted, and region convolutional features are extracted based on the segmentation maps. The experimental results show that the method achieves state-of-the-art performance. While these methods mainly focus on the depth and the width of the network architecture, we pay more attention to “attention”.
2.2. Attention Mechanism
The attention mechanism is an important part of human perception. It focuses on a specific area of the image in “high resolution”, perceives the surrounding area in “low resolution”, and then continuously adjusts its focus point. In practice, an attention mechanism learns a weight distribution over different parts of the input, so that different parts receive different degrees of concentration. The benefits of this property have been demonstrated in many tasks, ranging from machine translation and text summarization in sequence-based tasks to classification and segmentation in computer vision.
References [30,31,32] apply the learned weights to the original image, and Wang et al. [33] apply them to feature maps. In Hu et al. [34], the weights are applied along the channel dimension, to reweight the features of different channels. Closer to our work, Woo et al. [35] exploit both channel-wise and spatial attention, so that the network can learn “what” and “where” to focus on. All of these works were proposed for natural image processing, and they have shown excellent performance in classification, detection, and other tasks. To the best of our knowledge, no attention model has been designed specifically for remote sensing images.
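The idea of weighting features along the channel and spatial dimensions can be sketched as below. This is only an illustration in the spirit of the channel weighting of [34] and the channel-plus-spatial design of [35]; it is not the attention module proposed in this paper. The weight matrices `w1` and `w2` and the scalars `a` and `b` are hypothetical stand-ins for learned parameters, and the spatial branch replaces the learned convolution of [35] with a simple weighted sum.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def channel_attention(fmap, w1, w2):
    """Reweight each channel of a (C, H, W) feature map ("what" to focus on)."""
    squeezed = fmap.mean(axis=(1, 2))        # global average pooling -> (C,)
    hidden = np.maximum(w1 @ squeezed, 0.0)  # bottleneck layer with ReLU
    weights = sigmoid(w2 @ hidden)           # one weight in (0, 1) per channel
    return fmap * weights[:, None, None]

def spatial_attention(fmap, a=1.0, b=1.0):
    """Reweight each location of a (C, H, W) feature map ("where" to focus)."""
    avg_map = fmap.mean(axis=0)              # average over channels -> (H, W)
    max_map = fmap.max(axis=0)               # maximum over channels -> (H, W)
    # Assumption: a and b stand in for a learned convolution over the two maps.
    mask = sigmoid(a * avg_map + b * max_map)
    return fmap * mask[None, :, :]

# Toy feature map and randomly initialized (untrained) weights.
rng = np.random.default_rng(0)
fmap = rng.random((8, 4, 4))                 # C=8 channels, 4x4 spatial grid
w1 = rng.standard_normal((2, 8))             # reduce 8 channels to 2
w2 = rng.standard_normal((8, 2))             # expand back to 8 channels
out = spatial_attention(channel_attention(fmap, w1, w2))
```

The output has the same shape as the input, so a module of this kind can be placed after a convolutional layer without changing the layers that follow it.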
2.3. The Loss Function
The performance of CNNs has improved continuously, owing not only to improvements in network structure but also to the development of loss functions.
Softmax loss is the most commonly used loss function for supervising the learning process in classification. A model trained with softmax loss takes one image as input and outputs its identity, and is therefore called an identification model. Siamese networks, proposed in [36], take a pair of images as input and are called verification models. Such a model pulls positive pairs closer together and pushes negative pairs further apart. Later, models combining identification and verification were adopted in References [37,38], making the features more discriminative. In addition, Schroff et al. [39] propose triplet loss, which has proven effective on many datasets. A model with triplet loss takes three images as input (an anchor, a positive, and a negative) and minimizes the distance between the anchor and the positive while maximizing the distance between the anchor and the negative.
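The triplet objective described above can be written as a hinge on the difference between the two distances. The sketch below is a minimal illustration using squared Euclidean distances; the function name and the default `margin` value are illustrative assumptions rather than settings used in this paper.

```python
import numpy as np

def triplet_loss(anchor, positive, negative, margin=0.2):
    """Mean hinge loss over a batch of (N, D) anchor/positive/negative features."""
    d_pos = np.sum((anchor - positive) ** 2, axis=-1)  # anchor-positive distance
    d_neg = np.sum((anchor - negative) ** 2, axis=-1)  # anchor-negative distance
    # Zero loss once the negative is farther than the positive by the margin.
    return np.maximum(d_pos - d_neg + margin, 0.0).mean()
```

A triplet that already satisfies the margin contributes nothing to the loss, so training is driven only by triplets in which the negative is still too close to the anchor.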
6. Conclusions
In this paper, we proposed two schemes to acquire discriminative features for remote sensing image retrieval. Our first scheme, an attention module, is a simple module with a small computational cost that captures the salient local features and suppresses the less useful ones. By operating along both the channel and the spatial dimensions, the attention module emphasizes the important features along those two axes. Our second scheme, center loss, is adopted to improve the network structure of the original classification training. The advantage of center loss is that it makes the deep features dispersed between classes and as compact as possible within each class, which is more suitable for remote sensing image retrieval. To validate the approach, we built a more challenging dataset, which consists of multiple published datasets for remote sensing image retrieval and scene classification. Finally, extensive experiments on this dataset and comparisons with baselines demonstrate the effectiveness and superiority of our two schemes; the combination of the two schemes achieves the best performance.
Although our proposed feature learning approach achieves better performance, it still has some shortcomings that cannot be neglected. As described in Section 3.1, our attention module can only be connected to a convolutional layer of a CNN. However, both the fully connected layers and the convolutional layers can be used as feature representations. According to Reference [27], the fully connected layers of some CNNs obtain better retrieval performance than the convolutional layers under certain conditions. Overcoming this limitation of the attention module is therefore one focus of our future work. In addition, although the attention module is proposed for remote sensing image retrieval, it could also be used for other remote sensing image processing tasks, such as object detection and scene classification.